Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many industrial and scientific fields. Recently, several research communities have begun to converge on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications. This book collects recent research on representing, reasoning, and learning with belief networks. Belief networks (also known as graphical models and Bayesian networks) are a widely applicable formalism for compactly representing the joint probability distribution over a set of random variables. Belief networks have revolutionized the development of intelligent systems in many areas. They are now poised to revolutionize the development of learning systems. The papers in this volume reveal the many ways in which ideas from belief networks can be applied to understand and analyze existing learning algorithms (especially for neural networks). They also show how methods from machine learning can be extended to learn the structure and parameters of belief networks. This book is an exciting illustration of the convergence of many disciplines in the study of learning and adaptive computation.
Preface
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering - uncertainty and complexity - and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity - a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition, and statistical mechanics are special cases of the general graphical model formalism - examples include mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This has many advantages - in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems. This book presents an in-depth exploration of issues related to learning within the graphical model formalism. Four of the chapters are tutorial articles (those by Cowell, MacKay, Jordan et al., and Heckerman). The remaining articles cover a wide spectrum of topics of current research interest. The book is divided into four main sections: Inference, Independence, Foundations for Learning, and Learning from Data. While the sections can be read independently of each other and the articles are to a large extent self-contained, there also is a logical flow to the material. A full appreciation of the material in later sections requires an understanding
of the material in the earlier sections. The book begins with the topic of probabilistic inference. Inference refers to the problem of calculating the conditional probability distribution of a subset of the nodes in a graph given another subset of the nodes. Much effort has gone into the design of efficient and accurate inference algorithms. The book covers three categories of inference algorithms - exact algorithms, variational algorithms, and Monte Carlo algorithms. The first chapter, by Cowell, is a tutorial chapter that covers the basics of exact inference, with particular focus on the popular junction tree algorithm. This material should be viewed as basic for the understanding of graphical models. A second chapter by Cowell picks up where the former leaves off and covers advanced issues arising in exact inference. Kjærulff presents a method for increasing the efficiency of the junction tree algorithm. The basic idea is to take advantage of additional independencies which arise due to the particular messages arriving at a clique; this leads to a data structure known as a "nested junction tree." Dechter presents an alternative perspective on exact inference, based on the notion of "bucket elimination." This is a unifying perspective that provides insight into the relationship between junction tree and conditioning algorithms, and insight into space/time tradeoffs. Variational methods provide a framework for the design of approximate inference algorithms. Variational algorithms are deterministic algorithms that provide bounds on probabilities of interest. The chapter by Jordan, Ghahramani, Jaakkola, and Saul is a tutorial chapter that provides a general overview of the variational approach, emphasizing the important role of convexity. The ensuing article by Jaakkola and Jordan proposes a new method for improving the mean field approximation (a particular form of variational approximation). In particular, the authors propose to use mixture distributions as approximating distributions within the mean field formalism. The inference section closes with two chapters on Monte Carlo methods. Monte Carlo provides a general approach to the design of approximate algorithms based on stochastic sampling. MacKay's chapter is a tutorial presentation of Monte Carlo algorithms, covering simple methods such as rejection sampling and importance sampling, as well as more sophisticated methods based on Markov chain sampling. A key problem that arises with the Markov chain Monte Carlo approach is the tendency of the algorithms to exhibit random-walk behavior; this slows the convergence of the algorithms. Neal presents a new approach to this problem, showing how a sophisticated form of overrelaxation can cause the chain to move more systematically along surfaces of high probability. The second section of the book addresses the issue of Independence. Much of the aesthetic appeal of the graphical model formalism comes from
3 the "Markov properties " that graphical models embody . A Markov prop erty is a relationship between the separation properties of nodes in a graph
(e.g., the notion that a subset of nodes is separated from another subset of nodes, given a third subset of nodes) and conditional independenciesin the family of probability distributions associated with the graph (e.g., A is independent of B given C , where A , Band C are subsets of random
variables). In the case of directed graphs and undirected graphs the relationships are well understood (cf . Lauritzen , 1997) . Chain graphs , however, which are mixed graphs containing both directed and undirected edges, are less well understood . The chapter by Richardson explores two of the Markov properties that have been proposed for chain graphs and identifies natural "spatial " conditions on Markov properties that distinguish between these Markov
properties
and those for both directed
and undirected
graphs .
Chain graphs appear to have a richer conditional independence semantics than directed and undirected graphs The chapter by Studeny and Vejnarova addresses the problem of characterizing stochastic dependence. Studeny and Vejnarova discuss the proper ties of the multiinformation function , a general information -theoretic func tion from which many useful quantities can be computed , including the conditional
mutual
information
for all disjoint
subsets of nodes in a graph .
The book then turns to the topic of learning. The section on Foundations for Learning contains two articles that cover fundamental concepts that are used in many of the following articles. The chapter by Heckerman is a tutorial article that covers many of the basic ideas associated with learning in graphical models. The focus is on Bayesian methods, both for parameter learning and for structure learning. Neal and Hinton discuss the expectation-maximization (EM) algorithm. EM plays an important role in the graphical model literature, tying together inference and learning problems. In particular, EM is a method for finding maximum likelihood (or maximum a posteriori) parameter values, by making explicit use of a probabilistic inference (the "E step"). Thus EM-based approaches to learning generally make use of inference algorithms as subroutines. Neal and Hinton describe the EM algorithm as coordinate ascent in an appropriately-defined cost function. This point of view allows them to consider algorithms that take partial E steps, and provides an important justification for the use of approximate inference algorithms in learning. The section on Learning from Data contains a variety of papers concerned with the learning of parameters and structure in graphical models. Bishop provides an overview of latent variable models, focusing on probabilistic principal component analysis, mixture models, topographic maps, and time series analysis. EM algorithms are developed for each case. The
article by Buhmann complements the Bishop article , describing methods
for dimensionality reduction, clustering, and data visualization, again with the EM algorithm providing the conceptual framework for the design of the algorithms. Buhmann also presents learning algorithms based on approximate inference and deterministic annealing. Friedman and Goldszmidt focus on the problem of representing and learning the local conditional probabilities for graphical models. In particular, they are concerned with representations for these probabilities that make explicit the notion of "context-specific independence," where, for example, A is independent of B for some values of C but not for others. This representation can lead to significantly more parsimonious models than standard techniques. Geiger, Heckerman, and Meek are concerned with the problem of model selection for graphical models with hidden (unobserved) nodes. They develop asymptotic methods for approximating the marginal likelihood and demonstrate how to carry out the calculations for several cases of practical interest. The paper by Hinton, Sallans, and Ghahramani describes a graphical model called the "hierarchical community of experts" in which a collection of local linear models are used to fit data. As opposed to mixture models, in which each data point is assumed to be generated from a single local model, their model allows a data point to be generated from an arbitrary subset of the available local models. Kearns, Mansour, and Ng provide a careful analysis of the relationships between EM and the K-means algorithm. They discuss an "information-modeling tradeoff," which characterizes the ability of an algorithm to both find balanced assignments of data to model components, and to find a good overall fit to the data. Monti and Cooper discuss the problem of structural learning in networks with both discrete and continuous nodes. They are particularly concerned with the issue of the discretization of continuous data, and how this impacts the performance of a learning algorithm. Saul and Jordan present a method for unsupervised learning in layered neural networks based on mean field theory. They discuss a mean field approximation that is tailored to the case of large networks in which each node has a large number of parents. Smith and Whittaker discuss tests for conditional independence in graphical Gaussian models. They show that several of the appropriate statistics turn out to be functions of the sample partial correlation coefficient. They also develop asymptotic expansions for the distributions of the test statistics and compare their accuracy as a function of the dimensionality of the model. Spiegelhalter, Best, Gilks, and Inskip describe an application of graphical models to the real-life problem of assessing the effectiveness of an immunization program. They demonstrate the use of the graphical model formalism to represent statistical hypotheses of interest and show how Monte Carlo methods can be used for inference. Finally,
Williams provides an overview of Gaussian processes, deriving the Gaussian process approach from a Bayesian point of view, and showing how it can be applied to problems in nonlinear regression, classification, and hierarchical modeling. This volume arose from the proceedings of the International School on Neural Nets "E.R. Caianiello," held at the Ettore Maiorana Centre for Scientific Culture in Erice, Italy, in September 1996. Lecturers from the school contributed chapters to the volume, and additional authors were asked to contribute chapters to provide a more complete and authoritative coverage of the field. All of the chapters have been carefully edited, following a review process in which each chapter was scrutinized by two anonymous reviewers and returned to authors for improvement. There are a number of people to thank for their role in organizing the Erice meeting. First I would like to thank Maria Marinaro, who initiated the ongoing series of Schools to honor the memory of E.R. Caianiello, and who co-organized the first meeting. David Heckerman was also a co-organizer of the school, providing helpful advice and encouragement throughout. Anna Esposito at the University of Salerno also deserves sincere thanks for her help in organizing the meeting. The staff at the Ettore Maiorana Centre were exceedingly professional and helpful, initiating the attendees of the school into the wonders of Erice. Funding for the School was provided by the NATO Advanced Study Institute program; this program provided generous support that allowed nearly 80 students to attend the meeting. I would also like to thank Jon Heiner, Thomas Hofmann, Nuria Oliver, Barbara Rosario, and Jon Yi for their help with preparing the final document.
Finally, I would like to thank Barbara Rosario, whose fortuitous attendance as a participant at the Erice meeting rendered the future conditionally independent of the past.
Michael I. Jordan
INTRODUCTION TO INFERENCE FOR BAYESIAN NETWORKS
ROBERT COWELL
City University , London .
The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E OHT
1. Introduction

The field of Bayesian networks, and graphical models in general, has grown enormously over the last few years, with theoretical and computational developments in many areas. As a consequence there is now a fairly large set of theoretical concepts and results for newcomers to the field to learn. This tutorial aims to give an overview of some of these topics, which hopefully will provide such newcomers with a conceptual framework for following the more detailed and advanced work. It begins with a revision of some of the basic axioms of probability theory.
2. Basic axioms of probability

Probability theory is a system of inductive logic for reasoning under uncertainty: the degree of belief in a proposition, or event, A being true, in the absence of certainty, is encapsulated by a numerical measure of probability, and consistency is ensured by requiring that such measures obey a small set of basic axioms. Early expert systems in the AI community used sets of production rules within a framework of deductive Boolean logic. Attempts were made to cope with uncertainty by using probability theory within such systems, but the number of calculations required became computationally prohibitive and its use was largely abandoned. Probability theory has had a revival in recent years with the development of efficient algorithms for inference in the Bayesian belief network expert systems that are the subject of this chapter.

Let us begin with the basic axioms. The probability of an event A, denoted by P(A), is a number in the interval [0, 1] which obeys the following:

1 P(A) = 1 if and only if A is certain.
2 If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).

We will be dealing exclusively with discrete random variables and their probability distributions. Capital letters will denote a variable, or perhaps a set of variables; lower case letters will denote values of variables. Thus suppose A is a random variable having a finite number of mutually exclusive states (a_1, ..., a_n). Then P(A) will be represented by a vector of non-negative real numbers P(A) = (x_1, ..., x_n), where P(A = a_i) = x_i is a scalar and Σ_i x_i = 1. A basic concept is that of conditional probability, a statement of which takes the form: given the event B = b, the probability of the event A = a is x, written P(A = a | B = b) = x. It is important to understand that this is not saying: "If B = b is true then the probability of A = a is x". Instead it says: "If B = b is true, and any other information to hand is irrelevant to A, then P(A = a) = x". (To see this, consider what the probabilities would be if the state of A was part of the extra information.) Conditional probabilities are important for building Bayesian networks, as we shall see. But Bayesian networks are also built to facilitate the calculation of conditional probabilities, namely the conditional probabilities for variables of interest given the data (also called evidence) at hand. The fundamental rule of probability calculus is the product rule¹

P(A and B) = P(A | B) P(B).     (1)
This equation tells us how to combine conditional probabilities for individual variables to define joint probabilities for sets of variables.

¹ Or more generally, P(A and B | C) = P(A | B, C) P(B | C).

3. Bayes' theorem

The simplest form of Bayes' theorem relates the joint probability P(A, B) of two events or hypotheses A and B to its marginal and conditional probabilities:

P(A, B) = P(A | B) P(B) = P(B | A) P(A).     (2)

By rearrangement we easily obtain

P(A | B) = P(B | A) P(A) / P(B),     (3)

which is Bayes' theorem. This can be interpreted as follows. We are interested in A, and we begin with a prior probability P(A) for our belief about A, and then we observe
B. Then Bayes' theorem, (3), tells us that our revised belief for A, the posterior probability P(A | B), is obtained by multiplying the prior P(A) by the ratio P(B | A)/P(B). The quantity P(B | A), as a function of varying A for fixed B, is called the likelihood of A. We can express this relationship in the form:

posterior ∝ prior × likelihood,
P(A | B) ∝ P(A) P(B | A).

Figure 1 illustrates this prior-to-posterior inference process. Each diagram
Figure 1. Bayesian inference as reversing the arrows.
represents in different ways the joint distribution P(A, B); the first represents the prior beliefs while the third represents the posterior beliefs. Often we will think of A as a possible "cause" of the "effect" B; the downward arrow represents such a causal interpretation. The "inferential" upwards arrow then represents an "argument against the causal flow", from the observed effect to the inferred cause. (We will not go into a definition of "causality" here.) Bayesian networks are generally more complicated than the ones in Figure 1, but the general principles are the same in the following sense. A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that the conditional probabilities for these particular orientations are relatively straightforward to specify (from data or by elicitation from an expert). When data are observed, an inference procedure is typically required. This involves calculating marginal probabilities conditional on the observed data using Bayes' theorem, which is diagrammatically equivalent to reversing one or more of the Bayesian network arrows. The algorithms which have been developed in recent years
allow these calculations to be performed in an efficient and straightforward manner.

4. Simple inference problems

Let us now consider some simple examples of inference. The first is simply Bayes' theorem with evidence included on a simple two-node network; the remaining examples treat a simple three-node problem.

4.1. PROBLEM I
Suppose we have the simple model X → Y, and are given P(X), P(Y | X), and the evidence Y = y. The problem is to calculate P(X | Y = y). Now from P(X) and P(Y | X) we can calculate the marginal distribution P(Y) and hence P(Y = y). Applying Bayes' theorem we obtain
P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y).     (4)
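As a concrete illustration of equation (4), here is a minimal sketch in Python of the posterior calculation for binary X and Y; the numerical tables are hypothetical values invented purely for this example.

import numpy as np

# Hypothetical numbers for the two-node model X -> Y (states indexed 0, 1).
p_x = np.array([0.7, 0.3])            # P(X)
p_y_given_x = np.array([[0.9, 0.1],   # P(Y | X = 0)
                        [0.2, 0.8]])  # P(Y | X = 1); rows sum to 1

y_obs = 1                              # observed evidence Y = y

# P(Y = y) = sum_X P(Y = y | X) P(X)
p_y = p_y_given_x[:, y_obs] @ p_x

# Bayes' theorem: P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y)
p_x_given_y = p_y_given_x[:, y_obs] * p_x / p_y
print(p_x_given_y)                     # posterior over X, sums to 1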
4.2. PROBLEM II
Suppose now we have a more complicated model in which X is a parent of both Y and Z: Z ← X → Y, with specified probabilities P(X), P(Y | X), and P(Z | X), and we observe Y = y. The problem is to calculate P(Z | Y = y). Note that the joint distribution is given by P(X, Y, Z) = P(Y | X) P(Z | X) P(X). A 'brute force' method is to calculate:

1. The joint distribution P(X, Y, Z).
2. The marginal distribution P(Y) and thence P(Y = y).
3. The marginal distribution P(Z, Y) and thence P(Z, Y = y).
4. P(Z | Y = y) = P(Z, Y = y) / P(Y = y).

An alternative method is to exploit the given factorization:

1. Calculate P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y) using Bayes' theorem, where P(Y = y) = Σ_X P(Y = y | X) P(X).
2. Find P(Z | Y = y) = Σ_X P(Z | X) P(X | Y = y).

Note that the first step essentially reverses the arrow between X and Y. Although the two methods give the same answer, the second is generally more efficient. For example, suppose that all three variables have 10 states. Then the first method, in explicitly calculating P(X, Y, Z), requires a table of 1000 states, whereas the largest table required by the second method has size 100.
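A small sketch of the two methods, under the assumption of hypothetical random tables with 10 states per variable; it checks that the factorized route reproduces the brute-force answer while never needing more than a 10 by 10 table.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # states per variable, as in the text

# Hypothetical random conditional tables for Z <- X -> Y.
p_x = rng.dirichlet(np.ones(n))                    # P(X)
p_y_given_x = rng.dirichlet(np.ones(n), size=n)    # P(Y | X), shape (n, n)
p_z_given_x = rng.dirichlet(np.ones(n), size=n)    # P(Z | X), shape (n, n)
y_obs = 3

# Brute force: build the full joint P(X, Y, Z) -- a 10x10x10 table.
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]
p_z_brute = joint[:, y_obs, :].sum(axis=0) / joint[:, y_obs, :].sum()

# Factorized: reverse the X -> Y arrow, then sum over X; largest table is 10x10.
p_x_given_y = p_y_given_x[:, y_obs] * p_x
p_x_given_y /= p_x_given_y.sum()
p_z_fact = p_x_given_y @ p_z_given_x

print(np.allclose(p_z_brute, p_z_fact))  # True: same answer, smaller tables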
This gain in computational efficiency from exploiting the given factorization is the basis of the arc-reversal method for solving influence diagrams, and is also the basis of the junction tree propagation algorithms, as the next example shows.

4.3. PROBLEM III

Suppose now that we have the same network and probabilities as in the previous problem, and that again we wish to calculate P(Z | Y = y), but this time working with joint tables on the pairs of variables (Y, X) and (Z, X). The calculational steps are:

1. Find P(Y, X) = P(Y | X) P(X) and P(Z, X) = P(Z | X) P(X).
2. Find P(X | Y = y) = P(Y = y, X) / P(Y = y), where P(Y = y) = Σ_X P(Y = y, X).
3. Find P(Z, X | Y = y) = P(Z, X) P(X | Y = y) / P(X), where P(X) = Σ_Z P(Z, X).
4. Find P(Z | Y = y) = Σ_X P(Z, X | Y = y).

Note the 'message' P(X | Y = y) which is 'sent' from the table on (Y, X) to the table on (Z, X) in step 3. This calculational structure, in which the joint distribution is handled through smaller tables on the cliques XY and ZX of an undirected graph joined through the common variable X, is the one exploited by the junction tree propagation algorithms; it works because Z and Y are conditionally independent given X.

5. Conditional independence

In the previous example we used the fact that Z is conditionally independent of Y given X, written Z ⊥⊥ Y | X (Dawid, 1979). The joint distribution can be factorized according to any of the three directed graphs which express this conditional independence:

Z ← X → Y:   P(X, Y, Z) = P(X) P(Y | X) P(Z | X),
Z → X → Y:   P(X, Y, Z) = P(Y | X) P(X | Z) P(Z),
Z ← X ← Y:   P(X, Y, Z) = P(Z | X) P(X | Y) P(Y).

Thus the directed graph associated with a factorization of the joint distribution is not unique: here we obtain three distinct graphs.
Each of these factorizations follows from the conditional independence properties which each graph expresses, viz Z ⊥⊥ Y | X (which is to be read as "Z is conditionally independent of Y given X"), and by using the general factorization property:

P(X_1, ..., X_n) = P(X_1 | X_2, ..., X_n) P(X_2, ..., X_n)
                 = P(X_1 | X_2, ..., X_n) P(X_2 | X_3, ..., X_n) P(X_3, ..., X_n)
                 = ...
                 = P(X_1 | X_2, ..., X_n) ... P(X_{n-1} | X_n) P(X_n).

Thus for the third example P(X, Y, Z) = P(Z | X, Y) P(X | Y) P(Y) = P(Z | X) P(X | Y) P(Y). Note that the graph Z → X ← Y does not obey the conditional independence property Z ⊥⊥ Y | X and is thus excluded from the list; it factorizes as P(X, Y, Z) = P(X | Y, Z) P(Z) P(Y). This example shows several features of general Bayesian networks. Firstly, the conditional independence properties can be used to simplify the general factorization formula for the joint probability. Secondly, the result is a factorization that can be expressed by the use of directed acyclic graphs (DAGs).

6. General specification in DAGs
It is these features which work together nicely for the general specification of Bayesian networks. Thus a Bayesian network is a directed acyclic graph, whose structure defines a set of conditional independence properties. These properties can be found using graphical manipulations, eg d-separation (see eg Pearl (1988)). To each node is associated a conditional probability distribution, conditioning being on the parents of the node: P(X | pa(X)). The joint density over the set of all variables U is then given by the product of such terms over all nodes:

P(U) = Π_X P(X | pa(X)).
This is called a recursive factorization according to the DAG; we also talk of the distribution being graphical over the DAG. This factorization is equivalent to the general factorization but takes into account the conditional independence properties of the DAG in simplifying individual terms in the product of the general factorization. Only if the DAG is complete will this formula and the general factorization coincide (but even then only for one ordering of the random variables in the factorization).
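The recursive factorization is easy to evaluate numerically. The following sketch assumes a hypothetical three-node chain A → B → C with made-up binary tables, and verifies that the product of conditional distributions defines a proper joint distribution.

import itertools
import numpy as np

# A minimal sketch: the joint P(U) as a product of P(V | pa(V)) terms.
# Hypothetical three-node DAG A -> B -> C with binary variables.
p_a = np.array([0.6, 0.4])
p_b_given_a = np.array([[0.7, 0.3], [0.1, 0.9]])   # rows indexed by A
p_c_given_b = np.array([[0.8, 0.2], [0.5, 0.5]])   # rows indexed by B

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the recursive factorization."""
    return p_a[a] * p_b_given_a[a, b] * p_c_given_b[b, c]

# The terms are proper conditional distributions, so the joint sums to 1.
total = sum(joint(a, b, c) for a, b, c in itertools.product(range(2), repeat=3))
print(total)   # 1.0 (up to rounding)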
6.1. EXAMPLE

Consider the graph of Figure 2.
P(A, B, C, D, E, F, G, H, I) = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(H | B, E, F) P(I | C, F).

Figure 2. Nine node example.

It is useful to note that marginalising over a childless node is simply equivalent to removing it and the edges to it from its parents. Thus, for example, marginalising over the variable H in the above gives:

P(A, B, C, D, E, F, G, I) = Σ_H P(A, B, C, D, E, F, G, H, I)
  = Σ_H P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(H | B, E, F) P(I | C, F)
  = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(I | C, F) Σ_H P(H | B, E, F)
  = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(I | C, F),

which can be represented by Figure 2 with H and its incident edges removed.
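The reason the factor for H disappears is simply that Σ_H P(H | B, E, F) = 1 for every configuration of its parents. A one-line numerical check, using a hypothetical random table for P(H | B, E, F):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary table P(H | B, E, F): axes (B, E, F, H), rows over H sum to 1.
p_h_given_bef = rng.dirichlet(np.ones(2), size=(2, 2, 2))

# Summing a childless node out of its own conditional table gives all ones,
# so the factor simply disappears from the product.
print(np.allclose(p_h_given_bef.sum(axis=-1), 1.0))   # True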
Directed acyclic graphs can always have their nodes linearly ordered so
that for each node X all of its parents pa(X) precede it in the ordering. Such an ordering is called a topological ordering of the nodes. Thus for example (A, B, C, D, E, F, G, H, I) and (B, A, E, D, G, C, F, I, H) are two of the many topological orderings of the nodes of Figure 2. A simple algorithm to find a topological ordering is as follows. Start with the graph and an empty list. Then successively delete from the graph any node which does not have any parents, and add it to the end of the list. Note that if the graph is not acyclic, then at some stage a graph will be obtained in which every remaining node has at least one parent; hence this algorithm can also be used as an efficient way of checking that the graph is acyclic.
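A sketch of this delete-a-parentless-node algorithm, applied to the nine-node example of Figure 2 (the parent sets below are read off its factorization); returning None when no parentless node can be found doubles as the acyclicity check just described.

def topological_order(parents):
    """Repeatedly remove a parentless node, as described above.

    `parents` maps each node to the set of its parents.  Returns a
    topological ordering, or None if the graph contains a directed cycle.
    """
    remaining = {v: set(ps) for v, ps in parents.items()}
    order = []
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:              # every remaining node has a parent: a cycle
            return None
        v = roots[0]
        order.append(v)
        del remaining[v]
        for ps in remaining.values():
            ps.discard(v)
    return order

# The nine-node example of Figure 2 (parent sets taken from its factorization).
parents = {'A': set(), 'B': set(), 'C': set(),
           'D': {'A'}, 'E': {'A', 'B'}, 'F': {'B', 'C'},
           'G': {'A', 'D', 'E'}, 'H': {'B', 'E', 'F'}, 'I': {'C', 'F'}}
print(topological_order(parents))   # e.g. ['A', 'B', 'C', 'D', 'E', ...]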
Another equivalent way is to start with the graph and an empty list , and successively delete nodes which have no children and add them to the
beginning of the list (cf. the marginalisation of childless nodes).

6.2. DIRECTED MARKOV PROPERTY
An important property is the directed Markov property. This is a conditional independence property which states that a variable is conditionally independent of its non-descendants given its parents:

X ⊥⊥ nd(X) | pa(X).

Now recall that the conditional probability P(X | pa(X)) did not necessarily mean that if pa(X) = π*, say, then P(X = x) = P(x | π*), but included the caveat that any other information to hand is irrelevant to X for this to hold. For DAGs this 'other information' means, from the directed Markov property, knowledge about the node itself or any of its descendants. For if all of the parents of X are observed, but additionally observed are one or more descendants D_X of X, then because X influences D_X, knowing D_X and pa(X) is more informative than simply knowing about pa(X) alone. However, having information about a non-descendant does not tell us anything more about X, because either it cannot influence or be influenced by X, directly or indirectly, or, if it can influence X indirectly, it does so only through influencing the parents, which are all known anyway. For example, consider again Figure 2. Using the second topological ordering given previously, we may write the general factorization as:
P(A, B, C, D, E, F, G, I, H) = P(B) P(A | B) P(E | B, A) P(D | B, A, E) P(G | B, A, E, D) P(C | B, A, E, D, G) P(F | B, A, E, D, G, C) P(I | B, A, E, D, G, C, F) P(H | B, A, E, D, G, C, F, I),     (5)

but now we can use A ⊥⊥ B, from the directed Markov property, to simplify P(A | B) → P(A), and similarly for the other factors in (5), to obtain the factorization
in Figure 2. We can write the general pseudo-algorithm of what we have just done for this example as:
Topological ordering + General factorization + Directed Markov property ⇒ Recursive factorization.

7. Making the inference engine
We shall now move on to building the so-called "inference engine", to introduce new concepts and to show how they relate to the conditional independence and recursive factorization ideas that have already been touched upon. Detailed justification of the results will be omitted; the aim here is to give an overview, using the fictional ASIA example of Lauritzen and Spiegelhalter.

7.1. ASIA: SPECIFICATION
Lauritzen and Spiegelhalter describe their fictional problem domain as follows:
Shortness-of-breath (Dyspnoea) may be due to Tuberculosis, Lung cancer or Bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of Tuberculosis, while Smoking is known to be a risk factor for both Lung cancer and Bronchitis. The results of a single X-ray do not discriminate between Lung cancer and Tuberculosis, as neither does the presence or absence of Dyspnoea.
P(U) = P(A) P(S) P(T | A) P(L | S) P(B | S) P(E | L, T) P(D | B, E) P(X | E)

Figure 3. ASIA.
The network for this fictional example is shown in Figure 3. Each variable is binary with the states ("yes", "no"). The E node is a logical node taking the value "yes" if either of its parents takes a "yes" value, and "no" otherwise; its introduction facilitates modelling the relationship of the X-ray to Lung cancer and Tuberculosis.
Having specified the relevant variables, and defined their dependence with the graph, we must now assign (conditional) probabilities to the nodes. In real-life examples such probabilities may be elicited either from some large database (if one is available) as frequency ratios, or subjectively from the expert from whom the structure has been elicited (eg using a fictitious gambling scenario or probability wheel), or a combination of both. However, as this is a fictional example, we can follow the third route and use made-up values. (Specific values will be omitted here.)

7.2. CONSTRUCTING THE INFERENCE ENGINE
With our specified graphical model we have a representation of the joint density in terms of a factorization:

P(U) = Π_V P(V | pa(V))     (6)
     = P(A) ... P(X | E).     (7)
Recall that our motivation is to use the model specified by the joint distribution to calculate marginal distributions conditional on some observation of one or more variables. In general the full distribution will be computationally difficult to use to calculate these marginals directly. We will now proceed to outline the various stages that are performed to find a representation of P(U) which makes the calculations more tractable. (The process of constructing the inference engine from the model specification is sometimes called compiling the model.) The manipulations required are almost all graphical. There are five stages in the graphical manipulations. Let us first list them, and then go back and define new terms which are introduced.

1. Add undirected edges to all co-parents which are not currently joined (a process called marrying parents).
2. Drop all directions in the graph obtained from Stage 1. The result is the so-called moral graph (a small sketch of these first two stages follows the list).
3. Triangulate the moral graph, that is, add sufficient additional undirected links between nodes such that there are no cycles (ie closed paths) of length 4 or more distinct nodes without a short-cut.
4. Identify the cliques of this triangulated graph.
5. Join the cliques together to form the junction tree.
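As promised above, a minimal sketch of stages 1 and 2 for the Asia network of Figure 3; the parent sets below are read off its factorization, and the moral graph is represented simply as a set of undirected edges (all other detail is ignored).

import itertools

def moralise(parents):
    """Stages 1-2: marry co-parents, then drop directions.

    Returns the set of undirected edges (as frozensets) of the moral graph.
    """
    edges = set()
    for child, ps in parents.items():
        # original directed edges, with direction dropped
        for p in ps:
            edges.add(frozenset((p, child)))
        # marry every pair of co-parents
        for p1, p2 in itertools.combinations(ps, 2):
            edges.add(frozenset((p1, p2)))
    return edges

# The Asia network of Figure 3.
asia_parents = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
                'B': {'S'}, 'E': {'L', 'T'}, 'D': {'B', 'E'}, 'X': {'E'}}
moral = moralise(asia_parents)
print(frozenset(('L', 'T')) in moral)   # True: T and L are married (co-parents of E)
print(frozenset(('E', 'B')) in moral)   # True: E and B are married (co-parents of D)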
Now let us go through these steps, supplying some justification and defining the new terms just introduced as we go along. Consider first the joint density again. By a change of notation this can be written in the form
P(U) = Π_V a(V, pa(V))     (8)
     = a(A) ... a(X, E),     (9)
where a(V, pa(V)) = P(V | pa(V)). That is, the conditional probability factor for V can be considered as a function of V and its parents. We call such functions potentials. Now after steps 1 and 2 we have an undirected graph, in which for each node both it and its set of parents in the original graph form a complete subgraph in the moral graph. (A complete graph is one in which every pair of nodes is joined together by an edge.) Hence, the original factorization of P(U) on the DAG G goes over to an equivalent factorization on these complete subsets in the moral graph G^m. Technically we say that the distribution is graphical on the undirected graph G^m. Figure 4 illustrates the moralisation process for the Asia network.
Figure 4. Moralising Asia: two extra links are required, T - L and E - B. Directionality is dropped after all moral edges have been added.

Now let us denote the set of cliques of the moral graph by C^m. (A clique is a complete subgraph which is not itself a proper subgraph of a complete subgraph, so it is a maximal complete subgraph.) Then each of the complete subgraphs formed from {V} ∪ pa(V) is contained within at least one clique. Hence we can form functions a_C such that
P(U) = Π_{C ∈ C^m} a_C(V_C),

where a_C(V_C) is a function of the variables in the clique C. Such a factorization can be constructed as follows. Initially define each factor as unity, i.e.,
a_C(V_C) = 1 for every clique C. Then for each node V take one, and only one, clique C that contains the complete subgraph formed from {V} ∪ pa(V), and multiply its potential a_C by the factor P(V | pa(V)). When every node has been dealt with in this way, the result is a potential representation of the joint distribution P on the cliques of the moral graph. Note that the conditional probability distributions of the original specification are now 'buried' within the clique functions: they are no longer explicitly visible, though they are of course still present in the joint distribution.

8. Aside: conditional independence and the moral graph

When the extra edges are added in the moralisation process, some of the conditional independence properties that can be read from the original DAG are no longer visible in the moral graph, though they still hold in the joint distribution. Those which do remain visible are the ones exploited for the efficient local computations described later. Before elucidating this we require some further definitions.

A node A is an ancestor of a node B if either (i) A is a parent of B, or (ii) A is a parent of some node which is itself an ancestor of B. A set of nodes is called ancestral if it contains all of the parents (and hence all of the ancestors) of each of its nodes; the union of ancestral sets is again ancestral, so for any set of nodes there is a smallest ancestral set containing it. Finally, in an undirected graph a set of nodes S separates A from B if every path between a node in A and a node in B passes through S.

With these definitions we can state the following lemmas, which tell us which conditional independences of the original distribution can be read from moral graphs.

Lemma 1. Let P factorize recursively according to the DAG G. Then A ⊥⊥ B | S whenever A and B are separated by S in (G_{An(A∪B∪S)})^m, the moral graph of the smallest ancestral set containing A ∪ B ∪ S.

Lemma 2. A is d-separated from B by S in the DAG G if and only if S separates A from B in (G_{An(A∪B∪S)})^m.

Thus d-separation, an alternative way of reading conditional independences directly from the directed graph, picks out the same independences. To find the smallest ancestral set containing A ∪ B ∪ S we can use a simple algorithm: successively delete from the graph any node which has no children, provided it is not in A ∪ B ∪ S; the nodes which are left when no more deletions are possible form the required ancestral set.
Now recall that deleting a childless node is equivalent to marginalising over that node. Hence the marginal distribution on the minimal ancestral set containing A ∪ B ∪ S factorizes according to the corresponding sub-factors of the original joint distribution. So these lemmas are saying that, rather than go through the numerical exercise of actually calculating such marginals, we can read the result off from the graphical structure instead, and use that to test conditional independences. (Note also that the directed Markov property is lurking behind the scenes here.) The "moral" is that when ancestral sets appear in theorems like this it is likely that such marginals are being considered.

9. Making the junction tree
The remaining three steps of the inference-engine construction algorithm seem more mysterious, but are required to ensure we can formulate a consistent and efficient message passing scheme. Consider first step 3: adding edges to the moral graph G^m to form a triangulated graph G^t. Note that adding edges to the graph does not stop a clique of the moral graph from being a complete subgraph in G^t. Thus for each clique of the moral graph G^m there is at least one clique in the triangulated graph which contains it. Hence we can form a potential representation of the joint probability in terms of products of functions of the cliques in the triangulated graph:

P(U) = Π_{C ∈ C^t} a_C(X_C),
by analogy with the previous method outlined for the moral graph. The point is that after moralisation and triangulation there exists, for each node-parent set, at least one clique which contains it, and thus a potential representation can be formed on the cliques of the triangulated graph. While the moralisation of a graph is unique, there are in general many alternative triangulations of a moral graph. In the extreme, we can always add edges to make the moral graph complete; there is then one large clique. The key to the success of the computational algorithms is to form triangulated graphs which have small cliques, in terms of their state-space size. Thus after finding the cliques of the triangulated graph - stage 4 - we are left with joining them up to form a junction tree. The important property of the junction tree is the running intersection property, which means that if a variable V is contained in two cliques, then it is contained in every clique along the path connecting those two cliques. The edge joining two cliques is called a separator. This joining up can always be done, though not necessarily uniquely for each triangulated graph.
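The chapter does not spell out how the joining up is done in practice; one standard construction (an assumption here, not a claim about the text) is to connect the cliques by a maximum-weight spanning tree, weighting each candidate edge by the size of the separator, which for the cliques of a triangulated graph is known to yield the running intersection property. A sketch, using one common triangulation of the Asia moral graph:

import itertools

def junction_tree(cliques):
    """Join cliques into a tree by a maximum-weight spanning tree,
    weighting each candidate edge by the size of the separator
    (the intersection of the two cliques)."""
    cliques = [frozenset(c) for c in cliques]
    candidate = sorted(
        ((len(a & b), a, b) for a, b in itertools.combinations(cliques, 2)),
        key=lambda t: -t[0])
    component = {c: c for c in cliques}        # trivial union-find

    def find(c):
        while component[c] != c:
            c = component[c]
        return c

    tree = []
    for w, a, b in candidate:
        ra, rb = find(a), find(b)
        if ra != rb and w > 0:
            component[ra] = rb
            tree.append((a, b, a & b))          # (clique, clique, separator)
    return tree

# Cliques of one common triangulation of the Asia moral graph (chord L-B added).
asia_cliques = [{'A', 'T'}, {'T', 'L', 'E'}, {'L', 'E', 'B'},
                {'E', 'B', 'D'}, {'E', 'X'}, {'S', 'L', 'B'}]
for a, b, sep in junction_tree(asia_cliques):
    print(sorted(a), sorted(b), 'separator:', sorted(sep))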
However the choice of junction tree is immaterial except for considerations of computational efficiency. The junction tree retains some, but not necessarily all, of the conditional independence properties of the original DAG: the extra edges added by the moralisation and triangulation processes lose some conditional independences, but the conditional independence of the cliques given the separators between them is retained, and it is this fact that ensures that local message passing between neighbouring cliques gives consistent results for the joint probability distribution. The cliques become the basic units of the local computation, so that if the cliques become too large the granularity of the computation becomes too coarse and local computation loses its efficiency. Figure 5 shows a possible junction tree of cliques and separators for the Asia example.

Figure 5. A possible junction tree for Asia.

10. Inference on the junction tree

We summarise the results so far. We have seen that we can form a potential representation of the joint probability on the cliques of the junction tree, defined by functions a_C on the cliques:

P(U) = Π_{C ∈ C^t} a_C(X_C).

This potential representation can be generalized to include functions on the separators (the intersections of neighbouring cliques), to form the following so-called generalized potential representation:

P(U) = Π_{C ∈ C^t} a_C(X_C) / Π_{S ∈ S^t} b_S(X_S).

The previous representation is recovered as a special case
(for instance by making the separator functions the identity). Now, by sending messages between neighbouring cliques consisting of functions of the separator variables only, which modify the intervening separator and the clique receiving the message, but in such a way that the overall ratio of products remains invariant, we can arrive at the following marginal representation:

P(U) = Π_{C ∈ C} P(C) / Π_{S ∈ S} P(S).     (10)

Marginals for individual variables can be obtained from these clique (or separator) marginals by further marginalisation. Suppose that we observe "evidence" E: X_A = x*_A. Define a new function P* by

P*(x) = P(x) if x_A = x*_A, and 0 otherwise.     (11)

Then P*(U) = P(U, E) = P(E) P(U | E). We can rewrite (11) as

P*(U) = P(U) Π_{v ∈ A} l(v),     (12)

where l(v) is 1 if x_v = x*_v and 0 otherwise. Thus l(v) is the likelihood function based on the partial evidence X_v = x*_v. Clearly this also factorizes on the junction tree, and by message passing we may obtain the following clique marginal representation

P(U | E) = Π_{C ∈ C} P(C | E) / Π_{S ∈ S} P(S | E),     (13)

or, by omitting the normalization stage,

P(U, E) = Π_{C ∈ C} P(C, E) / Π_{S ∈ S} P(S, E).     (14)

Again marginal distributions for individual variables, conditional upon the evidence, can be obtained by further marginalisation of individual clique tables, as can the probability (according to the model) of the evidence, P(E).
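Equations (11) and (12) amount to multiplying tables by 0/1 likelihood indicators. The following sketch applies equation (11) directly to a small explicit joint table (hypothetical random numbers over three binary variables), then recovers P(E) and a conditional marginal by summation; on a junction tree the same zeroing is applied to a clique table containing the observed variable before propagation.

import numpy as np

# A minimal sketch of equation (11): entering evidence X_A = x*_A into an
# explicit joint table by zeroing every non-matching entry.
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # hypothetical P(A, B, C)

a_obs = 1                                        # evidence: A = 1
p_star = p.copy()
p_star[1 - a_obs, :, :] = 0.0                    # P*(x) = 0 wherever A != a_obs

p_evidence = p_star.sum()                        # P(E) = P(A = 1)
p_b_given_e = p_star.sum(axis=(0, 2)) / p_evidence   # P(B | E) by marginalisation
print(p_evidence, p_b_given_e)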
11. Why the junction tree?

Given that the moral graph has nice properties, why is it necessary to go on to form the junction tree? This is best illustrated by an example, Figure 6:

Figure 6. A non-triangulated graph.

The cliques are (A, B, C), (A, C, D), (C, D, F), (C, E, F) and (B, C, E), with successive intersections (A, C), (C, D), (C, F), (C, E) and (B, C). Suppose we have clique marginals P(A, B, C) etc. We cannot express P(A, B, C, D) in terms of P(A, B, C) and P(A, C, D) - the graphical structure does not imply B ⊥⊥ D | (A, C). In general there is no closed-form expression for the joint distribution of all six variables in terms of its clique marginals.

12. Those extra edges again
L aCt(Y UZl)aC2 {Y UZ2) ==f {Zl UZ2), y a function of the combined variables of the two cliques minus Y . Now this function cannot be accommodated by a clique in the moral graph because the variables ul and U2 are not joined (and there may be others ) .
INTRODUCTION TOINFERENCE FORBAYESIAN NETWORKS25 Hence
we cannot
P (U the
Y ) on
missing
is
can
P (U why
such
one
adds
Pearl
is one
reasoning book
properties
good
collection
which
covers
for
be
doing
scheme
in this
representation extra
able
out
fill
, then
edges
to
that
. This
accommodate one
so results
must
fill - in
in being
able
.
Bayesian intelligence
of
its
. This
in
junction
expert
systems
for
graphical
connected
DAGs with
is Shafer but
also
and
other
overviews
selected
by
papers
number
text
-
prob
-
models
;
( ie prior them
Pearl
.) A
( 1990 ) ,
formalisms
for
the
ex -
. An
is ( Spiegelhalter
a large
. His
introducing
propagating
good
of the
contain
community , from
reasoning
contains
uncertain
and
reasoning
also
for
singly
trees
uncertain
significance
references
material
use ; axiomatics
propagation
probabilistic
methods
editors
introductory
et ai . , 1993 ) . Each
of
references
for
further
. ( 1979 ) introduced
dependence phasis
. More
on
showed
Asia how
' s using
J unction in
do
relational
on this
given
in other
also
the
by
for
treating
conditional
independence
properties
latter
Lauritzen
also
are
in -
with
given
contains
discussion
textbook
reprinted
and
) ; see
by
proofs
Spiegelhalter
em Whit of
-
the
( Lauritzen section
Bayesian
( 1988 ) , who
in multiply
in
are known
in junction on
and
calculations
is also
areas
of propagation
introductory
Markov
probability , ( it
databases
and
their
basis conditional
8 .)
was
consistent
arise
of
( 1996 ) . ( The
section
propagation
trees
formulation
A recent
in
to
and
Lauritzen
example
axiomatic
accounts
models
and
stated
The
the
recent
graphical
( 1990 )
lemmas
more
, and
artificial
for
on
probabilistic
Dawid
eral
these
, to
a wealth
of making
only
a potential
of
, if we
cliques
find
turns
helped
in the
; etc , to
historical
three
reading
who
arguments
uncertainty
these
taker
popular
not
two
having
. It
distribution
. However
of the
graph
graph
joint
reading
of papers
the
moral
passing
pioneers
and
removed
graph
expressions
, 1988 ) contains
development
review
the
Y
we can
moral
message
the
to the
node
of the
of variables
, and
to
further
theory
plaining
pairs
a triangulated
become
handling
with
the
marginal
Markov
trees
edges
of
( Pearl
ability
graph
reduced
a consistent
Suggested
DAG
the
to form
up
representation
be accommodated
13 .
of
moral
intermediate
set
a potential
between
Y ) on
sufficiently to
the
edges
marginal for
form
( Shafer
and
by different and
connected Pearl
, 1990 ) ) .
names
( eg join
Spiegelhalter
of that
paper
trees
is given
networks
, 1988 ) for
. A recent by
and
Dawid
is ( Jensen
gen -
( 1992 ) .
, 1996 ) .
References

Dawid, A. P. (1979). Conditional independence in statistical theory (with discussion). Journal of the Royal Statistical Society, Series B, 41, pp. 1-31.
Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, pp. 25-36.
Jensen, F. V. (1996). An Introduction to Bayesian Networks. UCL Press, London.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford.
Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, pp. 157-224.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, California.
Shafer, G. R. and Pearl, J. (eds.) (1990). Readings in Uncertain Reasoning. Morgan Kaufmann, San Mateo, California.
Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8, pp. 219-247.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley and Sons, Chichester.
ADVANCED INFERENCE IN BAYESIAN NETWORKS
ROBERT COWELL
City University , London .
The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E OHT
1. Introduction

The previous chapter introduced inference in discrete-variable Bayesian networks. This used evidence propagation on the junction tree to find marginal distributions of interest. This chapter presents a tutorial introduction to some of the various types of calculations which can also be performed with the junction tree, specifically:

- Sampling.
- Most likely configurations.
- Fast retraction.
- Gaussian and conditional Gaussian models.
A common theme of these methods is that of a localized message-passing algorithm, but with different 'marginalisation' methods and potentials taking part in the message passing operations.

2. Sampling

Let us begin with the simple simulation problem. Given some evidence E on a (possibly empty) set of variables X_E, we might wish to simulate one or more values for the unobserved variables.
2.1. SAMPLING IN DAGS

Henrion proposed an algorithm called probabilistic logic sampling for DAGs, which works as follows. One first finds a topological ordering of the nodes of the DAG G. Let us denote the ordering by (X_1, X_2, ..., X_n), say, after relabelling the nodes, so that all parents of a node precede it in the ordering; hence any parent of X_j will have an index i < j.
Assume at first that there is no evidence. We begin by sampling a state for X_1 from P(X_1); suppose we obtain the state x*_1. Next we sample a state for X_2: if X_1 is a parent of X_2 we sample from P(X_2 | X_1 = x*_1), otherwise we sample from P(X_2); in either case we obtain a state x*_2. We proceed through the nodes in the topological ordering, sampling a state for each X_j from P(X_j | pa(X_j)), where the states of the parents of X_j will already have been sampled because they have smaller index. When all of the nodes have been sampled, the result (x*_1, x*_2, ..., x*_n) is one complete case sampled from the joint distribution P(U); repeating the procedure generates further independent samples.

Now suppose that we have evidence on one or more of the nodes. Henrion's probabilistic logic sampling deals with this by rejection: cases are generated exactly as before, but any case whose sampled states are not consistent with the evidence is discarded. The rejection step ensures that the retained cases are drawn from the correct conditional distribution P(U | E). However, the probability of rejecting a case increases with the amount of evidence, in general exponentially in the number of evidence nodes, so that even generating a small set of samples can become prohibitive. Instead we can sample more efficiently using the junction tree, as we now describe.

2.2. SAMPLING USING THE JUNCTION TREE

Suppose that the evidence E has been propagated on the junction tree, so that we have the clique-marginal representation

P(U | E) = Π_C P(C | E) / Π_S P(S | E).

To draw an analogy with direct sampling from the DAG, let us now make the junction tree 'directed'. Choose a clique C_0 to act as a root, and direct all of the edges of the junction tree so that they point away from the root. Label the cliques (C_0, C_1, ..., C_m) and the separators (S_1, ..., S_m) so that the edges point away from the root and S_i is the parent separator of the clique C_i; the result is a directed junction tree, shown in Figure 1.

Figure 1. A directed junction tree.

(Note that this notation has a subtlety in that it assumes the tree is connected, but the disconnected case is easily dealt with.) Then we may divide the contents of the parent separator S_i into the clique table of C_i to obtain the following representation:
P(U | E) = P(X_{C_0} | E) Π_{i=1}^{m} P(X_{C_i \ S_i} | X_{S_i}, E).
This is called the set-chain representation; it is now in a form similar to the recursive factorization on the DAG discussed earlier, and can be sampled from in a similar manner. The difference is that instead of sampling individual variables one at a time, one samples groups of variables in the cliques. Thus one begins by sampling a configuration in the root clique, drawing from P(X_{C_0} | E) to obtain x_{C_0}, say. Next one samples from P(X_{C_1 \ S_1} | X_{S_1}, E), where the states of the variables in X_{S_1} are fixed by x_{C_0} because X_{S_1} ⊂ X_{C_0}. One continues in this way, so that when sampling in clique C_i the variables X_{S_i} will already have been fixed by earlier sampling, as in direct sampling from the DAG. Thus one can sample directly from the correct distribution and avoid the inefficiencies of the rejection method.
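For comparison with the junction tree scheme, here is a small sketch of the rejection approach of Section 2.1 on a tiny two-node network X → Y with hypothetical tables: forward-sample in topological order, discard samples inconsistent with the evidence, and estimate P(X | Y = 1) from the retained fraction. The clique-by-clique sampler described above would instead draw directly from the P(C | E) tables and needs no rejection.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical tables for X -> Y.
p_x = np.array([0.7, 0.3])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
y_obs = 1

kept = []
for _ in range(20000):
    x = rng.choice(2, p=p_x)                    # sample X from P(X)
    y = rng.choice(2, p=p_y_given_x[x])         # then Y from P(Y | X = x)
    if y == y_obs:                              # rejection step: keep only
        kept.append(x)                          # samples that match the evidence

kept = np.array(kept)
print(np.bincount(kept, minlength=2) / len(kept))   # ~ [0.226, 0.774]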
3. Most likely configurations

One contributing reason why local propagation on the junction tree to find marginals "works" is that there is a "commutation behaviour" between the operation of summation and the product form of the joint density on the tree, which allows one to move summation operations through the terms in the product, for example:
Σ_{A,B,C} f(A, B) f(B, C) = Σ_{A,B} f(A, B) Σ_C f(B, C).

However, summation is not the only operation which has this property; another very useful operation is maximization, for example:

max_{A,B,C} f(A, B) f(B, C) = max_{A,B} ( f(A, B) max_C f(B, C) ),

provided the factors are non-negative, a condition which will hold for clique and separator potentials representing probability distributions.
3.1. MAX-PROPAGATION

So suppose we have a junction tree representation of a probability distribution, Π_C a(C) / Π_S b(S), and we perform a propagation in which the usual sum-marginalisation used to form the messages is replaced by max-marginalisation,

b*(S) = max_{C\S} a(C)

(which can be performed locally through the commutation property above). What do we get? The answer is the max-marginal representation of the joint density:
P(U, E) = Π_C P^max_C(C, E) / Π_S P^max_S(S, E),    where P^max_C(C, E) = max_{U\C} P(U, E), and similarly for the separators.

The interpretation is that for each configuration c* of the variables in the clique C, the value P^max_C(c*) is the highest probability value that any configuration of all the variables can take subject to the constraint that the variables of the clique have states c*. (One simple consequence is that this most likely value appears at least once in every clique and separator.) To see how this can come about, consider a simple tree with two sets of variables in each clique:
$$P(A,B,C,\mathcal{E}) \;=\; a(A,B)\,\frac{1}{b(B)}\,a(B,C).$$
Now recall that the message passing leaves invariant the overall distribution. So take the clique $[A B]$ to be the root clique, and send the first message, a maximization over $C$:

$$b^*(B) \;=\; \max_{C} a(B,C).$$

After "collecting" this message we have the representation:
$$P(A,B,C,\mathcal{E}) \;=\; \Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,a(B,C).$$

The root clique now holds the table obtained by maximizing over $C$, because

$$\begin{aligned}
P^{\max}(A,B,\mathcal{E}) \;:=\; \max_{C} P(A,B,C,\mathcal{E})
&= \max_{C}\Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,a(B,C) \\
&= \Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,\max_{C} a(B,C) \\
&= a(A,B)\,\frac{b^*(B)}{b(B)}.
\end{aligned}$$
By symmetry, the distribute message results in the second clique table holding the max-marginal value $\max_A P(A,B,C,\mathcal{E})$ and the intervening separator holding $\max_{A,C} P(A,B,C,\mathcal{E})$. The more general result can be obtained by induction on the number of cliques in the junction tree. (Note that one can pass back to the sum-marginal representation from the max-marginal representation by a sum-propagation.) A separate but related task is to find the configuration of the variables which takes this highest probability. The procedure is as follows: first, from a general potential representation with some clique $C_0$ chosen as root, perform a collect operation using maximization instead of summation. Then search the root clique for the configuration of its variables, $\hat{c}_0$ say, which has the highest probability. Distribute this as extra "evidence", fixing successively the remaining variables in the cliques further from the root by finding a maximal configuration consistent with the neighbouring clique which has already been fixed, and including the states of the newly fixed variables as evidence, until all cliques have been so processed. The union of the "evidence" yields the most likely configuration. If there is "real" evidence then this is incorporated in the usual way in the collect operation. The interpretation is that the resulting configuration acts as a most likely explanation for the data.
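As an illustration of the two-clique example above, the following Python sketch (my own, not from the chapter) performs the max-collect to the root clique $\{A,B\}$, reads off the best root configuration, and then extends it by choosing the best consistent state of $C$; the brute-force check at the end confirms the result. The numerical potentials are invented for the example.

```python
import itertools

# Clique potentials a(A,B) and a(B,C) with binary variables; separator b(B) = 1.
a_AB = {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.30, (1, 1): 0.20}
a_BC = {(0, 0): 0.60, (0, 1): 0.40, (1, 0): 0.15, (1, 1): 0.85}

# Collect: max-marginalise a(B,C) over C to get the separator message b*(B).
b_star = {b: max(a_BC[(b, c)] for c in (0, 1)) for b in (0, 1)}

# Root clique now holds the max-marginal a(A,B) * b*(B).
root = {(A, B): a_AB[(A, B)] * b_star[B] for A, B in a_AB}
(A_hat, B_hat) = max(root, key=root.get)

# Distribute: fix (A,B) and pick the best consistent state of C.
C_hat = max((0, 1), key=lambda c: a_BC[(B_hat, c)])
print("max-propagation:", (A_hat, B_hat, C_hat))

# Brute-force check over all joint configurations.
best = max(itertools.product((0, 1), repeat=3),
           key=lambda abc: a_AB[(abc[0], abc[1])] * a_BC[(abc[1], abc[2])])
print("brute force:   ", best)
```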
Note the similarity to simulation, where one first does a collect to the root using ordinary marginalisation, then does a distribute by first randomly selecting a configuration from the root, and then randomly selecting configurations from cliques successively further out.
3.2. DEGENERACY OF MAXIMUM

It is possible to find the degeneracy of the most likely configuration, that is, the total number of distinct configurations which have the same maximum probability $P^{\max}(U \mid \mathcal{E}) = p^*$, by a simple trick. (For most realistic applications there is unlikely to be any degeneracy, although this might not be true for e.g. genetic-pedigree problems.) First one performs a max-propagation to obtain the max-marginal representation. Then one sets each value in each clique and separator to either 0 or 1, depending on whether or not it has attained the maximum probability value, thus:
$$I_C(x_C \mid \mathcal{E}) = \begin{cases} 1 & \text{if } P^{\max}_C(x_C \mid \mathcal{E}) = p^* \\ 0 & \text{otherwise,} \end{cases}
\qquad
I_S(x_S \mid \mathcal{E}) = \begin{cases} 1 & \text{if } P^{\max}_S(x_S \mid \mathcal{E}) = p^* \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$I(U \mid \mathcal{E}) \;=\; \frac{\prod_{C} I_C(x_C \mid \mathcal{E})}{\prod_{S} I_S(x_S \mid \mathcal{E})}$$

is a potential representation of the indicator function of most likely configurations; a simple sum-propagation on this will yield the degeneracy as the normalization.
3.3. TOP N CONFIGURATIONS

In finding the degeneracy of the most likely configuration in the previous section, we performed a max-propagation and then set to zero those clique elements which did not have the value of the highest probability. One might be tempted to think that if instead we set to zero all those elements which are below a certain threshold $p < 1$, then we will obtain the number of configurations having probability $\geq p$. It turns out that one can indeed find these configurations after one max-propagation, but unfortunately not by such a simple method. We will discuss a simplified version of an algorithm by Dennis Nilsson which allows one to calculate the top $N$ configurations
by a sequence of max-propagations. (Nilsson (1997) has recently shown how they can be found after a single max-propagation.)

To begin, assume we have an ordering $X_1, X_2, \ldots, X_n$ of the nodes, and do a max-propagation to find the most likely configuration, denoted $M^1 = (x_1^1, \ldots, x_n^1)$. The second most likely configuration, $M^2$, necessarily must differ from $M^1$ in the state of at least one variable. So we now perform a further $n$ max-propagations, using as "pseudo-evidence" the sets

$$\begin{aligned}
\mathcal{E}_1 &= \{X_1 \neq x_1^1\} \\
\mathcal{E}_2 &= \{X_1 = x_1^1 \text{ and } X_2 \neq x_2^1\} \\
\mathcal{E}_3 &= \{X_1 = x_1^1,\; X_2 = x_2^1 \text{ and } X_3 \neq x_3^1\} \\
&\;\;\vdots \\
\mathcal{E}_n &= \{X_1 = x_1^1,\; \ldots,\; X_{n-1} = x_{n-1}^1 \text{ and } X_n \neq x_n^1\}.
\end{aligned}$$

By this procedure we partition the set of configurations, excluding the most likely one already found, into $n$ sets; $M^2$ is the most likely of the $n$ configurations found (one from each max-propagation, read off from the max-normalizations). Suppose that $M^2$ was found by propagating the $j$-th set, so that it first disagrees with $M^1$ in the state of $X_j$ and we can write $M^2 = (x_1^1, \ldots, x_{j-1}^1, x_j^2, \ldots, x_n^2)$ with $x_j^2 \neq x_j^1$. To find the third most likely configuration $M^3$, we need only partition further the $j$-th set, excluding $M^2$, and perform an additional $n - j + 1$ max-propagations with pseudo-evidence

$$\begin{aligned}
\mathcal{E}_{j,1} &= \mathcal{E}_j \cup \{X_j \neq x_j^2\} \\
\mathcal{E}_{j,2} &= \mathcal{E}_j \cup \{X_j = x_j^2 \text{ and } X_{j+1} \neq x_{j+1}^2\} \\
&\;\;\vdots \\
\mathcal{E}_{j,\,n-j+1} &= \mathcal{E}_j \cup \{X_j = x_j^2,\; \ldots,\; X_{n-1} = x_{n-1}^2 \text{ and } X_n \neq x_n^2\}.
\end{aligned}$$

This further partitions the configurations of the $j$-th set, excluding $M^2$. After propagating these, $M^3$ can be found by looking at the most likely configuration of each partition found so far and taking the one of highest probability. We then partition the set in which $M^3$ was found, and so on: the fourth, fifth, etc. most likely configurations are found in essentially the same way. The main problem is to develop a suitable notation to keep track of which partitions have been searched; the idea itself is quite simple.
If we have prior evidence, then we simply take this into account at the beginning, and ensure that the partitions do not violate the evidence. Thus, for example, if we have evidence about $m$ nodes being in definite states, then instead of $n$ propagations being required to find $M^2$ after having found $M^1$, we require only $n - m$ further propagations. One application of finding a set of such most likely explanations is to explanation, i.e., answering what the states of the unobserved variables are likely to be for a particular case. We have already seen that the most likely configuration offers such an explanation. If instead we have the top 10 or 20 configurations, then in most applications most of these will have most variables in the same state. This can confirm the diagnosis for most variables, but also shows up where the diagnosis is not so certain (in those variables which differ between these top configurations). This means that if one is looking for a more accurate explanation one could pay attention to those variables which differ between the top configurations; they hence serve to guide one to what could be the most informative test to do (cf. value of information). The use of partitioned "dummy evidence" is a neat and quite general idea, and will probably find other applications.¹
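To make the bookkeeping concrete, here is a small Python sketch of the partitioning search (my own illustration; it uses brute-force maximization over a tiny explicit joint table in place of a real max-propagation, so it demonstrates only the partition logic, not the efficiency). The variable and function names are assumptions for the example.

```python
import heapq
import itertools

# A tiny joint distribution over three binary variables, as an explicit table.
states = list(itertools.product((0, 1), repeat=3))
probs = (0.02, 0.08, 0.10, 0.05, 0.20, 0.25, 0.12, 0.18)
P = dict(zip(states, probs))

def constrained_max(fixed, forbidden):
    """Stand-in for one max-propagation: the most likely configuration with
    clamps `fixed[i] = value` and exclusions `x[i] not in forbidden[i]`."""
    best = None
    for x, p in P.items():
        if any(x[i] != v for i, v in fixed.items()):
            continue
        if any(x[i] in vals for i, vals in forbidden.items()):
            continue
        if best is None or p > best[1]:
            best = (x, p)
    return best

def top_n(n):
    n_vars = len(states[0])
    m1 = max(P, key=P.get)
    results = [(m1, P[m1])]
    heap = []   # entries: (-prob, config, fixed, forbidden, first free position)

    def split(m, fixed, forbidden, j0):
        """Partition the current set minus {m}, pushing each sub-partition's maximum."""
        for t in range(j0, n_vars):
            new_fixed = dict(fixed)
            new_fixed.update({i: m[i] for i in range(j0, t)})
            new_forb = {i: set(v) for i, v in forbidden.items()}
            new_forb.setdefault(t, set()).add(m[t])
            cand = constrained_max(new_fixed, new_forb)
            if cand is not None:
                heapq.heappush(heap, (-cand[1], cand[0], new_fixed, new_forb, t))

    split(m1, {}, {}, 0)
    while heap and len(results) < n:
        negp, cfg, fixed, forb, j0 = heapq.heappop(heap)
        results.append((cfg, -negp))
        split(cfg, fixed, forb, j0)   # only the partition where cfg was found is split further
    return results

print(top_n(4))
```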
4. A unification

One simple comment to make is that minimization can be performed in a similar way to maximization. (In applications with logical dependencies the minimal configuration will have zero probability and there will be many such configurations; for example, in the ASIA example half of the 256 configurations have zero probability.) Another less obvious observation is that sum-, max- and min-propagation are all special cases of a more general propagation based upon $L^p$ norms, as used in functional analysis. Recall that the $L^p$ norm of a non-negative real-valued function is defined to be
$$L^p(f) \;=\; \Bigl( \int_{x \in \mathcal{X}} f^p(x)\, dx \Bigr)^{1/p}.$$
For $p = 1$ this gives the usual integral, for $p \to \infty$ this gives the maximum of the function over the region of integration, and for $p \to -\infty$ we obtain the minimum of $f$. We can use this in the message propagation in our junction tree: the marginal message we pass from clique to separator is the $L^p$ marginal, defined by

$$b_S(X_S) \;=\; \Bigl( \sum_{X_C \setminus X_S} a_C^p(X_C) \Bigr)^{1/p}.$$

¹See for example (Cowell, 1997) for sampling without replacement from the junction tree.
In this way we obtain the $L^p$-marginal representation:
$$P(U \mid \mathcal{E}) \;=\; \frac{\prod_{C \in \mathcal{C}} P^{L^p}_C(X_C \mid \mathcal{E})}{\prod_{S \in \mathcal{S}} P^{L^p}_S(X_S \mid \mathcal{E})},$$
which is an infinite family of representations. Apart from the $L^2$ norm, which may have an application to quadratic scoring of models, it is not clear if this general result is of much practical applicability, though it may have theoretical uses.

5. Fast retraction
Suppose that for a network of variables $X$ we have evidence on a subset of $k$ variables $U^*$, $\mathcal{E}_{U^*} = \{\mathcal{E}_u : u \in U^*\}$, with $\mathcal{E}_u$ of the form "$X_u = x_u^*$". Then it can be useful to compare each item of evidence with the probabilistic prediction given by the system for $X_u$ on the basis of the remaining evidence $\mathcal{E}_{\setminus \{u\}}$: "$X_v = x_v^*$ for $v \in U^* \setminus \{u\}$", as expressed in the conditional density of $X_u$ given $\mathcal{E}_{\setminus \{u\}}$. If we find that abnormally low probabilities are being predicted by the model, this can highlight deficiencies of the model which could need attention, or may indicate that a rare case is being observed.

Now one "brute force" method to calculate such probabilities is to perform $k$ separate propagations, in which one takes out in turn the evidence on each variable in question and propagates the evidence for all of the remaining variables. However, it turns out that yet another variation of the propagation algorithm allows one to calculate all of these predictive probabilities in one propagation, at least for the case in which the joint probability is strictly positive, which is the case we shall restrict ourselves to here. (For probabilities with zeros it may still be possible to apply the following algorithm; the matter depends upon the network and junction tree. For the Shafer-Shenoy message passing scheme the problem does not arise, because divisions are not necessary.) Because of the computational savings implied, the method is called fast-retraction.

5.1. OUT-MARGINALISATION
The basic idea is to work with a potential representation of the prior joint probability even when there is evidence. This means that , unlike the earlier
sections , we do not modify the clique potentials by multiplying them by the evidence likelihoods . Instead we incorporate the evidence only into forming the messages, by a new marginalisation method called out-marginalisation , which will be illustrated for a simple two- clique example :
[A B] —— (B) —— [B C]
Here $A$, $B$ and $C$ are disjoint sets of variables, and the clique and separator potentials are all positive. Suppose we have evidence on variables $\alpha \in A$, $\beta \in B$ and $\gamma \in C$. Let us denote the evidence functions by $h_\alpha$, $h_\beta$ and $h_\gamma$, where $h_\alpha$ is the product of the evidence likelihoods for the variables $\alpha \in A$, etc. Then we have
$$\begin{aligned}
P(ABC) &= g(AB)\,\frac{1}{g(B)}\,g(BC) \\
P(ABC, \mathcal{E}_\alpha) &= P(ABC)\,h_\alpha \\
P(ABC, \mathcal{E}_\gamma) &= P(ABC)\,h_\gamma \\
P(ABC, \mathcal{E}_\alpha, \mathcal{E}_\gamma) &= P(ABC)\,h_\alpha\, h_\gamma,
\end{aligned}$$
where the $g$'s are the clique and separator potentials. We take the clique $[A B]$ as root. Our first step is to send an out-marginal message from the clique $[B C]$ to $[A B]$, defined as

$$g^*(B) \;=\; \sum_{C} g(BC)\, h_\gamma.$$
That is, we only incorporate into the message that subset of evidence about the variables in $C$, thus excluding any evidence that may be relevant to the separator variables $B$. Note that because we are using the restriction that the joint probability is nonzero for every configuration, the potentials and messages are also nonzero. Sending this message leaves the overall product of junction tree potentials invariant as usual:
$$P(ABC) \;=\; \Bigl( g(AB)\,\frac{g^*(B)}{g(B)} \Bigr)\,\frac{1}{g^*(B)}\,g(BC).$$
Now let us use this representation to look at the content of the root clique. The clique $[A B]$ now holds the out-margin

$$P^{\text{out}}(AB, \mathcal{E}_{\setminus A \cup B}) \;:=\; \sum_{C} P(ABC, \mathcal{E}_\gamma) \;=\; \sum_{C} P(ABC)\, h_\gamma
\;=\; \Bigl( g(AB)\,\frac{g^*(B)}{g(B)} \Bigr)\,\frac{1}{g^*(B)}\,\sum_{C} g(BC)\, h_\gamma
\;=\; g(AB)\,\frac{g^*(B)}{g(B)},$$

which is the joint probability of the clique variables together with the evidence on the variables outside the clique, but with no evidence on $A$ or $B$ itself included. By symmetry, sending back from the root an out-marginal message which incorporates only the evidence $h_\alpha$ (thus excluding any evidence about the separator variables $B$), the clique $[B C]$ comes to hold the out-margin $P^{\text{out}}(BC, \mathcal{E}_{\setminus B \cup C})$ and the separator holds $P^{\text{out}}(B, \mathcal{E}_{\setminus B})$. In the general case, after propagating with out-marginalisation the junction tree holds the out-marginal representation

$$P(U) \;=\; \frac{\prod_{C} P^{\text{out}}(X_C, \mathcal{E}_{\setminus C})}{\prod_{S} P^{\text{out}}(X_S, \mathcal{E}_{\setminus S})},$$

where $P^{\text{out}}(X_C, \mathcal{E}_{\setminus C})$ denotes the joint probability of the clique variables and of the evidence on the variables not in the clique. Note also that multiplying a clique out-margin by the evidence likelihoods of some of its own variables recovers out-margins with respect to smaller sets, for example $P^{\text{out}}(AB, \mathcal{E}_{\setminus A \cup B})\, h_\alpha = P^{\text{out}}(AB, \mathcal{E}_{\setminus B})$. From these clique out-margins the desired predictive probabilities follow by simple marginalisation: for each variable on which there is evidence, marginalise the out-margin of a clique containing it down to that variable to obtain its predictive density given all of the remaining evidence. Fast retraction thus yields, in a single propagation, all of the predictive probabilities needed for comparing each item of evidence against the prediction made by the rest of the evidence.

There is also another use, besides comparing evidence against predictive probabilities. Because the clique potentials retain a representation of the prior joint probability, there is no need to re-initialise the junction tree when one wishes to deal with another case having different evidence; with the previous propagation schemes, a re-initialisation of the potential representation would be required before propagating the evidence for a new case.
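As a check on the two-clique derivation above, here is a small Python sketch (an illustration of mine, not code from the chapter) that builds clique potentials for the tree $[A B]$–$(B)$–$[B C]$, applies an evidence likelihood only on $C$, and confirms that the out-marginal message leaves the root clique holding $\sum_C P(ABC)\,h_\gamma$. The numbers are invented for the example.

```python
# Clique potentials for [A,B] -- (B) -- [B,C]; binary variables, g(B) = 1.
g_AB = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}   # = P(A,B)
g_BC = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}       # = P(C|B)
h_gamma = {0: 1.0, 1: 0.2}   # evidence likelihood on C only

# Out-marginal message from [B,C] to the root [A,B]: g*(B) = sum_C g(B,C) h_gamma(C).
g_star = {b: sum(g_BC[(b, c)] * h_gamma[c] for c in (0, 1)) for b in (0, 1)}

# Root clique now holds g(A,B) * g*(B) / g(B)  (here g(B) = 1).
root = {(a, b): g_AB[(a, b)] * g_star[b] for a, b in g_AB}

# Direct computation of the out-margin: sum_C P(A,B,C) * h_gamma(C).
direct = {(a, b): sum(g_AB[(a, b)] * g_BC[(b, c)] * h_gamma[c] for c in (0, 1))
          for a, b in g_AB}

for ab in sorted(root):
    assert abs(root[ab] - direct[ab]) < 1e-12
print("root clique out-margin:", root)
```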
6. Modelling with continuous variables

All examples and discussion so far have been restricted to the special case of discrete random variables. In principle, however, there is no reason why we should not build models having continuous random variables as well as, or instead of, discrete random variables, with more general conditional probability densities to represent the joint density, and use local message passing to simplify the calculations. In practice the barrier to such general applicability is the inability to perform the required integrations in closed form representable by a computer. (Such general models can be analyzed by simulation, for example Gibbs sampling.) However, there is a case for which such message passing is tractable, and that is when the random variables are such that the overall distribution is multivariate Gaussian. This further extends to the situation where both discrete and continuous random variables coexist within a model having a so-called conditional-Gaussian joint distribution. We will first discuss Gaussian models, and then discuss the necessary adjustments to the theory enabling analysis of mixed models with local computation.

7. Gaussian models
Structurally, the directed Gaussian model looks very much like the discrete models we have already seen. The novel aspect is in their numerical specification. Essentially, the conditional distribution of a node given its parents is a Gaussian distribution with expectation linear in the values of the parent nodes, and variance independent of the parent nodes. Let us take a familiar example, the chain $[Y] \to [X] \to [Z]$. Node $Y$, which has no parents, has a normal distribution given by

$$N_Y(\mu_Y; \sigma_Y^2) \;\propto\; \exp\Bigl( -\frac{(y - \mu_Y)^2}{2\sigma_Y^2} \Bigr),$$

where $\mu_Y$ and $\sigma_Y$ are constants. Node $X$ has node $Y$ as a parent, and has the conditional density

$$N_X(\mu_X + \beta_{X,Y}\, y;\; \sigma_X^2) \;\propto\; \exp\Bigl( -\frac{(x - \mu_X - \beta_{X,Y}\, y)^2}{2\sigma_X^2} \Bigr),$$
where $\mu_X$, $\beta_{X,Y}$ and $\sigma_X$ are constants. Finally, node $Z$ has only $X$ as a parent; its conditional density is given by

$$N_Z(\mu_Z + \beta_{Z,X}\, x;\; \sigma_Z^2) \;\propto\; \exp\Bigl( -\frac{(z - \mu_Z - \beta_{Z,X}\, x)^2}{2\sigma_Z^2} \Bigr).$$

In general, if a node $X$ has parents $\{Y_1, \ldots, Y_n\}$ it has a conditional density

$$N_X\Bigl(\mu_X + \sum_i \beta_{X,Y_i}\, y_i;\; \sigma_X^2\Bigr) \;\propto\; \exp\Bigl( -\frac{(x - \mu_X - \sum_i \beta_{X,Y_i}\, y_i)^2}{2\sigma_X^2} \Bigr).$$
Now the joint density is obtained by multiplying together the separate component Gaussian distributions:

$$\begin{aligned}
P(X, Y, Z) &= N_Y(\mu_Y; \sigma_Y^2)\, N_X(\mu_X + \beta_{X,Y}\, y; \sigma_X^2)\, N_Z(\mu_Z + \beta_{Z,X}\, x; \sigma_Z^2) \\
&\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x - \mu_X,\; y - \mu_Y,\; z - \mu_Z)\, K\, (x - \mu_X,\; y - \mu_Y,\; z - \mu_Z)^T \Bigr),
\end{aligned}$$

where $K$ is a symmetric (positive definite) $3 \times 3$ matrix, and $T$ denotes transpose. In a more general model with $n$ nodes, one obtains a similar expression with an $n \times n$ symmetric (positive definite) matrix. Expanding the exponential, the joint density can be written (up to a constant factor) as

$$\exp\left( (x\;\; y\;\; z) \begin{pmatrix} h_x \\ h_y \\ h_z \end{pmatrix} \;-\; \frac{1}{2}\, (x\;\; y\;\; z) \begin{pmatrix} K_{xx} & K_{xy} & K_{xz} \\ K_{yx} & K_{yy} & K_{yz} \\ K_{zx} & K_{zy} & K_{zz} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \right),$$

where the
$h$'s and the elements of $K$ consist of functions of the $\mu$'s, $\beta$'s and $\sigma$'s of the conditional specifications (for example $h_x$ involves $\mu_X/\sigma_X^2$ and $\beta_{Z,X}\,\mu_Z/\sigma_Z^2$), etc. This form of the joint density is the most useful for constructing local messages, and indeed we shall be using potential functions of this type. Let us now define them and list the properties we shall be using.

7.1. GAUSSIAN POTENTIALS

Suppose we have $n$ continuous random variables $X_1, \ldots, X_n$. A Gaussian potential on a subset $\{Y_1, \ldots, Y_k\}$ of the variables is a function of the form

$$\phi(y_1, \ldots, y_k) \;=\; \exp\left( g + (y_1 \cdots y_k) \begin{pmatrix} h_1 \\ \vdots \\ h_k \end{pmatrix} \;-\; \frac{1}{2}\,(y_1 \cdots y_k) \begin{pmatrix} K_{1,1} & \cdots & K_{1,k} \\ \vdots & & \vdots \\ K_{k,1} & \cdots & K_{k,k} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix} \right),$$
where $K$ is a constant positive definite $k \times k$ matrix, $h$ is a $k$-dimensional constant vector and $g$ is a number. For shorthand we write this as a triple, $(g, h, K)$. Gaussian potentials can be multiplied together by adding their respective triples:
$$\phi_1 * \phi_2 \;=\; (g_1 + g_2,\; h_1 + h_2,\; K_1 + K_2).$$

Similarly, division is easily handled:

$$\phi_1 / \phi_2 \;=\; (g_1 - g_2,\; h_1 - h_2,\; K_1 - K_2).$$
These operations will be used in passing the "update factor" from separator to clique. To initialize cliques we shall require the extension operation combined with multiplication. A Gaussian potential defined on a set of variables $Y$ is extended to a larger set of variables by enlarging the vector $h$ and matrix $K$ to the appropriate size and setting the new slots to zero. Thus for example $\phi(x) = \exp(g + x^T h - \tfrac{1}{2} x^T K x)$ extends to
$$\phi(x, y) \;=\; \phi(x) \;=\; \exp\left( g + (x\;\; y)\begin{pmatrix} h \\ 0 \end{pmatrix} - \frac{1}{2}\,(x\;\; y)\begin{pmatrix} K & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \right).$$

Finally, to form the messages we must define marginalisation, which is now an integration. Let us take $Y_1$ and $Y_2$ to be two sets of distinct variables, and

$$\phi(y_1, y_2) \;=\; \exp\left( g + (y_1\;\; y_2)\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} - \frac{1}{2}\,(y_1\;\; y_2)\begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right),$$

so that the $h$ and $K$ are in blocks. Then integrating over $y_1$ yields a new vector $\tilde{h}$ and matrix $\tilde{K}$ as follows:

$$\tilde{h} \;=\; h_2 - K_{2,1} K_{1,1}^{-1} h_1, \qquad \tilde{K} \;=\; K_{2,2} - K_{2,1} K_{1,1}^{-1} K_{1,2}.$$

(Discussion of the normalization will be omitted, because it is not required except for calculating probability densities of evidence.) Thus integration has a simple algebraic structure.

7.2. JUNCTION TREES FOR GAUSSIAN NETWORKS

Having defined the directed Gaussian model, the construction of the junction tree proceeds exactly as for the discrete case, as far as the structure is concerned. The difference is with the initialization. A Gaussian potential of the correct size is allocated to each clique and separator. They are initialized with all elements equal to zero.
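Before continuing, here is a short numpy sketch (my own check, not from the chapter) verifying the integration formulas above: it marginalises a Gaussian potential in canonical form and compares with the route through moment characteristics (mean $K^{-1}h$, covariance $K^{-1}$). The random test matrix is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
K = A @ A.T + 4 * np.eye(4)          # positive definite precision matrix
h = rng.normal(size=4)
p = 2                                 # integrate out the first p coordinates (y1)

K11, K12 = K[:p, :p], K[:p, p:]
K21, K22 = K[p:, :p], K[p:, p:]
h1, h2 = h[:p], h[p:]

# Canonical-characteristics formulas for integrating out y1.
h_tilde = h2 - K21 @ np.linalg.solve(K11, h1)
K_tilde = K22 - K21 @ np.linalg.solve(K11, K12)

# Check via moment characteristics: the marginal of y2 keeps the
# corresponding blocks of the mean K^{-1}h and covariance K^{-1}.
Sigma = np.linalg.inv(K)
mu = Sigma @ h
K_check = np.linalg.inv(Sigma[p:, p:])
h_check = K_check @ mu[p:]

assert np.allclose(K_tilde, K_check)
assert np.allclose(h_tilde, h_check)
print("marginalisation formulas agree")
```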
Next, for each conditional density of the DAG model, a Gaussian potential is constructed to represent it and multiplied into any one clique which contains the node and its parents, using extension if required. The result is a junction tree representation of the joint density. Assuming no evidence, sending the clique marginals as messages then results in the clique-marginal representation, as for the discrete case:
$$P(U) \;=\; \prod_{C} P(X_C) \Big/ \prod_{S} P(X_S).$$
Care must be taken in propagating evidence. By evidence $\mathcal{E}$ on a set of nodes $Y$ we mean that each node in $Y$ is observed to take a definite value. (This is unlike the discrete case, in which some states of a variable could be excluded but more than one could still be entertained.) Evidence about a variable must be entered into every clique and separator in which it occurs. This is because when evidence is entered on a variable it reduces the dimensions of every $h$ vector and $K$ matrix in the cliques and separators in which it occurs. Thus, for example, let us again take $Y_1$ and $Y_2$ to be two sets of distinct variables, and
$$\phi(y_1, y_2) \;\propto\; \exp\left( (y_1\;\; y_2)\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} - \frac{1}{2}\,(y_1\;\; y_2)\begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right),$$

so that the $h$ and $K$ are in blocks. Suppose we now observe the variables $Y_2$ to take the values $y_2^*$. Then entering this evidence modifies the potential to one over $Y_1$ alone, with

$$\tilde{h} \;=\; h_1 - K_{1,2}\, y_2^*, \qquad \tilde{K} \;=\; K_{1,1}.$$

After such evidence has been entered on the observed variables in every clique and separator in which they are included, a propagation with standard marginalisation will yield the clique-marginal representation of the density given the evidence: the individual clique and separator potentials become Gaussian densities for their remaining variables, and further marginalisation within a clique then gives the marginal density of any individual node.
7.3. EXAMPLE

Let us take our three-node example $[Y] \to [X] \to [Z]$ again, with initial conditional distributions as follows:

$$N(Y) = N(0, 1), \qquad N(X \mid Y) = N(y, 1), \qquad N(Z \mid X) = N(x, 1).$$
The cliques for this tree are $[X Y]$ and $[X Z]$. After initializing and propagating, the clique potentials are
$$\phi(x, y) \;\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x\;\; y)\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \Bigr),
\qquad
\phi(x, z) \;\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x\;\; z)\begin{pmatrix} \tfrac{3}{2} & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} x \\ z \end{pmatrix} \Bigr),$$

with separator $\phi(x) \propto \exp(-x^2/4)$. Now if we enter evidence $X = 1.5$, say, then the potentials reduce to
$$\phi(X{=}1.5,\, y) \;\propto\; \exp(1.5\,y - y^2)
\qquad\text{and}\qquad
\phi(X{=}1.5,\, z) \;\propto\; \exp(1.5\,z - \tfrac{1}{2} z^2),$$

because in this example $X$ makes up the separator
between the two cliques .
The marginal densities are then :
$$P(Y) = N(0.75,\, 0.5) \qquad\text{and}\qquad P(Z) = N(1.5,\, 1).$$

Alternatively, suppose that we first enter evidence that $Z = 1.5$. As we have seen, the message from the clique $[X Z]$ to the root clique $[X Y]$ is then given by

$$\phi(x) \;\propto\; \exp(1.5\,x - 0.75\,x^2),$$

so that after propagating, the root clique potential is

$$\phi(x, y) \;\propto\; \exp\Bigl( (x\;\; y)\begin{pmatrix} 1.5 \\ 0 \end{pmatrix} - \tfrac{1}{2}\,(x\;\; y)\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \Bigr),$$

and the marginal densities are $P(X) = N(1,\, 2/3)$ and $P(Y) = N(1/2,\, 2/3)$.
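The example can be checked by ordinary multivariate-Gaussian conditioning. The following numpy sketch (my own verification, not from the chapter) builds the joint covariance of $(Y, X, Z)$ for this model and conditions on $X = 1.5$ and on $Z = 1.5$, reproducing the marginals quoted above.

```python
import numpy as np

# Joint covariance of (Y, X, Z) for Y ~ N(0,1), X | Y ~ N(y,1), Z | X ~ N(x,1).
Sigma = np.array([[1.0, 1.0, 1.0],
                  [1.0, 2.0, 2.0],
                  [1.0, 2.0, 3.0]])
mu = np.zeros(3)

def condition(mu, Sigma, idx, value):
    """Mean and covariance of the remaining variables given variable `idx` = value."""
    keep = [i for i in range(len(mu)) if i != idx]
    S_kk = Sigma[np.ix_(keep, keep)]
    S_ko = Sigma[np.ix_(keep, [idx])]
    S_oo = Sigma[idx, idx]
    mu_c = mu[keep] + (S_ko[:, 0] / S_oo) * (value - mu[idx])
    Sigma_c = S_kk - S_ko @ S_ko.T / S_oo
    return mu_c, Sigma_c

print(condition(mu, Sigma, idx=1, value=1.5))   # given X=1.5: Y ~ N(0.75, 0.5), Z ~ N(1.5, 1)
print(condition(mu, Sigma, idx=2, value=1.5))   # given Z=1.5: Y ~ N(0.5, 2/3), X ~ N(1.0, 2/3)
```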
8. Conditional Gaussian models

The treatment of Gaussian networks has been much the same as for the discrete networks. The minor differences are (1) the nature of the potentials employed, and (2) that evidence has to be entered into every clique and separator. Conditional-Gaussian models, for mixed discrete and continuous variables, also proceed in much the same way, but with some more important differences. The first is a restriction at the modelling stage: continuous variables
are not allowed to be parents of discrete variables; that is, discrete nodes may have only discrete parents, while continuous nodes may have both discrete and continuous parents. The conditional distributions of the continuous variables are Gaussian, as before, with means linear in the continuous parent values, but the constants, regression coefficients and variances are now allowed to differ for each configuration of the discrete parents; the discrete variables have conditional probability tables indexed, as in the discrete case, only by configurations of discrete parents. We give only a brief guide to the theory here, which follows closely the original paper by Lauritzen; the reader should consult it for more careful details.

8.1. CG-POTENTIALS

Let $\Delta$ denote the set of discrete variables and $\Gamma$ the set of continuous variables, with $i$ denoting a typical configuration of the discrete variables and $y$ the values of the continuous variables. The joint density of a CG (conditional-Gaussian) distribution has the form

$$f(i, y) \;=\; \chi(i)\,\exp\Bigl( g(i) + y^T h(i) - \tfrac{1}{2}\, y^T K(i)\, y \Bigr),$$

where $\chi(i) \in \{0, 1\}$ is an indicator function denoting whether the density is positive for the discrete configuration $i$, and, for each such $i$, $g(i)$ is a number, $h(i)$ a vector and $K(i)$ a positive definite matrix. The triple $(g, h, K)$ of functions of $i$ is called the canonical characteristics of the distribution. Inverting, we obtain the moment characteristics $\{p(i), \xi(i), \Sigma(i)\}$, consisting of the probabilities of the discrete configurations and the conditional means and covariances of the continuous variables:

$$\Sigma(i) = K(i)^{-1}, \qquad \xi(i) = K(i)^{-1} h(i), \qquad
p(i) = (2\pi)^{|\Gamma|/2}\,\bigl(\det K(i)\bigr)^{-1/2}\,\exp\Bigl( g(i) + \tfrac{1}{2}\, h(i)^T K(i)^{-1} h(i) \Bigr).$$

As in the pure Gaussian case, the local computations are carried out not on densities but on potentials. A CG potential has the same functional form,

$$\phi(i, y) \;=\; \chi(i)\,\exp\Bigl( g(i) + y^T h(i) - \tfrac{1}{2}\, y^T K(i)\, y \Bigr),$$

with canonical characteristics $(g, h, K)$, except that now
$K(i)$ is restricted only to be symmetric, though not necessarily invertible. However, we still call the triple $(g, h, K)$ the canonical characteristics, and if for all $i$, $\chi(i) > 0$ and $K(i)$ is positive definite, then the moment characteristics are given as before. Multiplication, division and extension proceed as for the Gaussian potentials already discussed. Marginalisation is, however, different,
because adding two CG potentials in general will result in a mixture of CG potentials - a function of a different algebraic structure . Thus we need to distinguish two types of marginalisation - strong and weak.
8.2. MARGINALISATION
Marginalising continuous variables corresponds to integration. Let

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad h = \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}, \qquad K = \begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix},$$

with $y_1$ having dimension $p$ and $y_2$ dimension $q$, and with $K_{1,1}$ positive definite. Then the integral $\int \phi(i, y_1, y_2)\, dy_1$ is finite and equal to a CG potential $\tilde{\phi}$ with canonical characteristics given by

$$\begin{aligned}
\tilde{g}(i) &= g(i) + \tfrac{1}{2}\Bigl\{ p \log(2\pi) - \log\det K_{1,1}(i) + h_1(i)^T K_{1,1}(i)^{-1} h_1(i) \Bigr\} \\
\tilde{h}(i) &= h_2(i) - K_{2,1}(i)\, K_{1,1}(i)^{-1} h_1(i) \\
\tilde{K}(i) &= K_{2,2}(i) - K_{2,1}(i)\, K_{1,1}(i)^{-1} K_{1,2}(i).
\end{aligned}$$

Marginalising discrete variables corresponds to summation. Since in general the addition of CG potentials results in a mixture of CG potentials, an alternative definition based upon the moment characteristics $\{p, \xi, \Sigma\}$ is used which does result in a CG potential; however, it is only well defined for $K(i, j)$ positive definite. Specifically, the marginal over the discrete states $j$ of $\phi$ is defined as the CG potential with moment characteristics $\{\tilde{p}, \tilde{\xi}, \tilde{\Sigma}\}$, where
$$\tilde{p}(i) = \sum_{j} p(i,j), \qquad
\tilde{\xi}(i) = \sum_{j} \xi(i,j)\, p(i,j) \Big/ \tilde{p}(i),$$

$$\tilde{\Sigma}(i) = \sum_{j} \Sigma(i,j)\, p(i,j)\Big/\tilde{p}(i) \;+\; \sum_{j} \bigl(\xi(i,j) - \tilde{\xi}(i)\bigr)\bigl(\xi(i,j) - \tilde{\xi}(i)\bigr)^T p(i,j)\Big/\tilde{p}(i).$$
Note that the latter can be written as
$$\tilde{p}(i)\,\bigl(\tilde{\Sigma}(i) + \tilde{\xi}(i)\,\tilde{\xi}(i)^T\bigr) \;=\; \sum_{j} p(i,j)\,\bigl(\Sigma(i,j) + \xi(i,j)\,\xi(i,j)^T\bigr),$$
so that if $\Sigma(i,j)$ and $\xi(i,j)$ are independent of $j$ then they can be taken through the summations as constants. This observation is used to define a marginalisation over both continuous and discrete variables: first marginalise over the continuous variables and then over the discrete variables. If, after marginalising over the continuous variables, the resulting pair $(h, K)$ is independent of the discrete variables to be marginalised over (summation over these discrete variables then leaves the pair $(h, K)$
unmodified ) , we say that we have a strong marginalisation . Otherwise one sums over the discrete variables using the moment characteristics , and the overall marginalisation is called a weak marginalisation . Weak and strong marginalisation satisfy composition :
$$\sum_{A} \sum_{B} \phi_{A \cup B \cup C} \;=\; \sum_{A \cup B} \phi_{A \cup B \cup C},$$

but in general only the strong marginalisation satisfies

$$\sum_{A} \bigl( \phi_{A \cup B}\, \psi_B \bigr) \;=\; \psi_B \sum_{A} \phi_{A \cup B}.$$
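To illustrate the weak marginalisation of discrete states, the following Python sketch (my own, using numpy) applies the moment-matching formulas above to collapse a two-component Gaussian mixture over a discrete index $j$ into a single Gaussian with matching mean and covariance; the example numbers are assumptions.

```python
import numpy as np

def weak_marginal(p, xi, sigma):
    """Collapse a mixture over the discrete index j.

    p:     array of shape (m,)       -- probabilities p(j)
    xi:    array of shape (m, d)     -- component means xi(j)
    sigma: array of shape (m, d, d)  -- component covariances Sigma(j)
    Returns (p_tilde, xi_tilde, sigma_tilde) of the collapsed CG potential.
    """
    p_tilde = p.sum()
    w = p / p_tilde
    xi_tilde = (w[:, None] * xi).sum(axis=0)
    diff = xi - xi_tilde
    sigma_tilde = (w[:, None, None] * sigma).sum(axis=0) \
                + (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return p_tilde, xi_tilde, sigma_tilde

# Two components over one continuous variable.
p = np.array([0.3, 0.7])
xi = np.array([[0.0], [2.0]])
sigma = np.array([[[1.0]], [[0.5]]])
pt, xt, st = weak_marginal(p, xi, sigma)
print(pt, xt, st)   # matches the mixture's mean 1.4 and variance 1.49
```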
Under both types of marginalisation the result has the correct moments; that is, for the marginal of a CG distribution we have $P(I = i) = \tilde{p}(i)$, $E(Y \mid I = i) = \tilde{\xi}(i)$ and $V(Y \mid I = i) = \tilde{\Sigma}(i)$. A weak marginalisation does not in general preserve the full density, only these first two moments.

8.3. MAKING THE JUNCTION TREE

The construction of the junction tree from the DAG proceeds as in the previous cases: we moralize, triangulate, and form a tree of cliques. Unlike the purely discrete case, however, we cannot triangulate using just any elimination ordering. Instead the ordering is restricted: we must eliminate all of the continuous variables first, and only then eliminate the discrete variables.² The resulting junction tree then has a distinguished clique $R$, called a strong root, with the property that for any pair $(A, B)$ of neighbouring cliques with $A$ closer to $R$ than $B$, either the separator is purely discrete, $B \cap A \subseteq \Delta$, or the variables of $B$ beyond the separator are all continuous, $B \setminus A \subseteq \Gamma$. When propagating we choose such a strong root as the root clique towards which the collect operation is performed.

²One way to construct a triangulation with the restricted ordering is the usual elimination procedure: take an ordering of the nodes (with all continuous nodes before any discrete node), work through the nodes in turn, join all pairs of as-yet unmarked neighbours of the current node, and then mark it. Finding a good restricted ordering is then essentially equivalent to finding a good elimination ordering in the pure cases.
8.4. PROPAGATION ON THE JUNCTION TREE

The point of this restriction is that on the collect operation, only strong marginalisations are required to be performed. This is because our restricted elimination ordering (getting rid of the continuous variables first) is equivalent to doing the integrations over the continuous variables before marginalising any of the discrete variables. Thus our message passing algorithm takes the form:

1. Initialization: set all clique and separator potentials to zero with unit indicators, and multiply in the model-specifying potentials, using the extension operation where appropriate.
2. Enter evidence into all clique and separator potentials, reducing vector and matrix sizes as necessary.
3. Perform a collect operation to the strong root, where the messages are formed by strong marginalisation: first integrating out the redundant continuous variables, and then summing over discrete variables.
4. Perform a distribute operation, using weak marginalisation where appropriate, i.e. when mixtures might be formed on marginalising over the discrete variables.
The result is a representation of the joint CG-distribution including evidence, because of the invariant nature of the message passing algorithm. Furthermore, because of the use of weak marginalisation for the distribute operation, the marginals on the cliques will themselves be CG-distributions whose first two moments match those of the full distribution. The following is an outline sketch of why this should be. First, by the construction of the junction tree, all collect operations are strong marginals, so that after a collect-to-root operation the root clique contains a strong marginal. Now suppose, for simplicity, that before the
distribute operation we move to a set-chain representation (cf. Section 2.2). Then, apart from the strong root, each clique will have the correct joint density $P(X_{C_i \setminus S_i} \mid X_{S_i})$, where $S_i$ is the separator adjacent to the clique $C_i$ on the path between it and the strong root. Now on the distribute operation the clique $C_i$ will be multiplied by a CG-potential which will be either a strong marginal or a weak marginal. If the former, then the clique potential will be the correct marginal joint density. If the latter, then we may write the clique potential as the product $P(X_{C_i \setminus S_i} \mid X_{S_i}) * Q(X_{S_i})$, where $Q$ is the correct weak marginal for the variables $X_{S_i}$. Now consider taking an expectation of any linear or quadratic function of the $X_{C_i}$ with respect to this "density". We are free to carry out the integrations in stages. Choosing to integrate with respect to $X_{C_i \setminus S_i}$ first means that we form the expectation with respect to the correct CG-density $P(X_{C_i \setminus S_i} \mid X_{S_i})$, and thus end up with a correct expectation (which will be a linear or quadratic function in the $X_{S_i}$) multiplied by the correct weak marginal $Q(X_{S_i})$. Hence, performing these integrations, we obtain the correct expectation of the original function with respect to the true joint density. For brevity some details have been skipped over here, such as showing that the separator messages sent are correct weak marginals. Detailed justifications and proofs use induction combined with a careful analysis of the messages sent from the strong root on the distribute operation. See the original paper for more details.

9. Summary

This tutorial has shown the variety of useful applications to which the junction-tree propagation algorithm can be put. It has not given the most general or efficient versions of the algorithms, but has attempted to present the main points of each so that the more detailed descriptions in the original articles will be easier to follow. There are other problems, not discussed here, to which the junction-tree propagation algorithm can be applied or adapted, such as:
- Influence diagrams: discrete models, with random variables, decisions and utilities. Potentials are now doublets representing probabilities and utilities. The junction tree is generated with a restricted elimination ordering, generalising that for CG problems, to emulate solving the decision tree.
- Learning probabilities: nodes representing parametrisations of probabilities can be attached to networks, and Bayesian updating performed using the same framework.
- Time series: a network can represent some state at a given time, and copies can be chained together to form a time-window for dynamic modelling. The junction tree can be expanded and contracted to allow forward prediction or backward smoothing.
Doubtless new examples will appear in the future.
10. Suggested further reading

Probabilistic logic sampling for Bayesian networks is described by Henrion (1988). A variation of the method, likelihood-weighting sampling, in which rejection steps are replaced by a weighting scheme, is given by Shachter and Peot (1989). Drawing samples directly from the junction tree is described by Dawid (1992), which also shows how the most likely configuration can be found from the junction tree. The algorithm for finding the N most likely configurations is due to Nilsson (1994), who has also developed a more efficient algorithm requiring only one max-propagation on the junction tree. The $L^p$-propagation of Section 4 is not described anywhere but here.
Fast retraction is introduced in (Dawid , 1992) and developed in more detail in (Cowell and Dawid , 1992) .
Gaussian networks are described by Shachter and Kenley (1989), who use arc-reversal and barren-node reduction algorithms for their evaluation. (The equivalence of various evaluation schemes is given in (Shachter et al., 1994).) The treatment of Gaussian and conditional-Gaussian networks is based on the original paper by Lauritzen (1992). For pedagogical reasons this chapter specialized the conditional-Gaussian presentation of (Lauritzen, 1992) to the pure Gaussian case, to show that the latter is not so different from the pure discrete case. Evaluating influence diagrams by junction trees is treated in (Jensen et al., 1994). For an extensive review on updating probabilities see (Buntine, 1994). Dynamic junction trees for handling time series are described by Kjærulff (1993). See also (Smith et al., 1995) for an application using dynamic junction trees not derived from a DAG model.

References

Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, pp. 159-225.
Cowell, R. G. (1997). Sampling without replacement in junction trees. Research Report 15, Department of Actuarial Science and Statistics, City University, London.
Cowell, R. G. and Dawid, A. P. (1992). Fast retraction of evidence in a probabilistic expert system. Statistics and Computing, 2, pp. 37-40.
Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, pp. 25-36.
Henrion, M. (1988). Propagation of uncertainty by probabilistic logic sampling in Bayes' networks. In Uncertainty in Artificial Intelligence (ed. J. Lemmer and L. N. Kanal), pp. 149-64. North-Holland, Amsterdam.
Jensen, F., Jensen, F. V., and Dittmer, S. L. (1994). From influence diagrams to junction trees. Technical Report R-94-2013, Department of Mathematics and Computer Science, Aalborg University, Denmark.
Kjærulff, U. (1993). A computational scheme for reasoning in dynamic probabilistic networks. Research Report R-93-2018, Department of Mathematics and Computer Science, Aalborg University, Denmark.
Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87, pp. 1098-108.
Nilsson, D. (1994). An algorithm for finding the most probable configurations of discrete variables that are specified in probabilistic expert systems. M.Sc. Thesis, Department of Mathematical Statistics, University of Copenhagen.
Nilsson, D. (1997). An efficient algorithm for finding the M most probable configurations in a probabilistic expert system. Submitted to Statistics and Computing.
Shachter, R. D., Andersen, S. K., and Szolovits, P. (1994). Global conditioning for probabilistic inference in belief networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 514-522.
Shachter, R. and Kenley, C. (1989). Gaussian influence diagrams. Management Science, 35, pp. 527-50.
Shachter, R. and Peot, M. (1989). Simulation approaches to general probabilistic inference on belief networks. In Uncertainty in Artificial Intelligence 5 (ed. M. Henrion, R. D. Shachter, L. Kanal, and J. Lemmer), pp. 221-31. North-Holland, Amsterdam.
Smith, J. Q., French, S., and Raynard, D. (1995). An efficient graphical algorithm for updating the estimates of the dispersal of gaseous waste after an accidental release. In Probabilistic Reasoning and Bayesian Belief Networks (ed. A. Gammerman), pp. 125-44. Alfred Waller, Henley-on-Thames.
INFERENCE IN BAYESIAN NETWORKS USING NESTED JUNCTION TREES
UFFE KJÆRULFF
Department of Computer Science, Aalborg University,
Fredrik Bajers Vej 7E, DK-9220 Aalborg Ø, Denmark

Abstract. The paper presents a technique for reducing the computational costs of inference in large Bayesian networks, applicable to both the Hugin and the Shafer-Shenoy message passing architectures. The computations involved in forming a message to be sent from one clique of a junction tree to a neighbouring clique may themselves be structured as propagation in a smaller junction tree, induced by the potentials held in the sending clique; that is, junction trees may be nested inside the cliques of a junction tree. By exploiting in this way the factorization of clique potentials, both the space and the time costs of inference may be reduced, and an empirical evaluation on real-world networks shows that the reductions can be large.

1. Introduction

Inference in Bayesian networks was first formulated in terms of message passing for singly connected networks by Kim and Pearl (1983). In such networks the posterior distribution of a variable of interest can be computed by passing messages inward from the leaf nodes toward that variable, each node sending a message to its inward neighbour once it has received messages from all of its other neighbours; an inward pass followed by an outward pass makes every node contain the correct posterior distribution of its variable, up to a normalizing constant which is the probability of the evidence. Later, the message passing approach was extended to multiply connected networks through propagation in a junction tree (or join tree) of cliques, exploiting the independence structure of the network (Lauritzen and Spiegelhalter, 1988); the Hugin architecture (Jensen et al., 1990) and the Shafer-Shenoy architecture (Shafer and Shenoy, 1990) are the two conventional ways in which such junction tree propagation is performed.
The Hugin and the Shafer-Shenoy propagation methods will be reviewed briefly in the following; for more in-depth presentations, consult the above references. We shall assume that all variables of a Bayesian network are discrete.

2. Bayesian Networks and Junction Trees
A Bayesian network can be defined as a pair $(\mathcal{G}, p)$, where $\mathcal{G} = (V, E)$ is an
Xv with domain
Xv . Similarly , a subset A ~ V corresponds
to a set
of variables XA with domain XA = XvEAXv. Elements of XA are denoted XA = (xv ) , v E A .
A probability function "' v(xv, Xpa(v)) is specified for each variable Xv , where pa(v) denotesthe parents of v (i.e., the set of vertices of 9 from which there are directed links to v). If pa(v) is non-empty, "'v is a conditional probability distribution for Xv given Xpa(v); otherwise, "'v is a marginal probability distribution for Xv . The joint probability function pv = P for X V is Markov with respect to the acyclic , directed graph g . That is, Q is a map of the independence relations represented by p : each pair of vertices not connected by a directed edge represents an independence relation . Thus , 9 is often referred to as the independence graph ofp . The probability function p being Markov with respect to 9 implies that p factorizes according to g :
p = n ~V' vEV
For a more detailed account of Markov fields over directed graphs , consult
e.g. the work of Lauritzen et al. (1990. Exact inference in Bayesian networks involves the computation of marginal probabilities for subsets C ~ V , where each C induces a fully connected (i .e., complete ) subgraph of an undirected graph derived from Q. (Formally , a set A ~ V is said to induce a subgraph QA = (A , EnAxA ) ofQ = (V, E ) .) A fixed set C of such complete subsets can be used no matter which variables have been observed
and no matter
for which variables
we want posterior
marginals . This is the idea behind the junction tree approach to inference . The derived undirected graph is created through the processes of moral ization and triangulation . In moralization an undirected edge is added between each pair of disconnected vertices with common children and , when this has been completed , all directed edges are replaced by undirected ones. In the triangulation process we keep adding undirected edges (fill -ins ) to the moral graph until there are no cycles of length greater than three with -
out a chord (i .e., an edge connecting two non-consecutive vertices of the cycle ). The maximal complete subsets of the triangulated graph are called cliques. It can easily be proved that an undirected graph is triangulated if and only if the cliques can be arranged in a tree structure such that the intersection between any pair of cliques is contained in each of the cliques on the path between the two . A tree of cliques with this property is referred to as a junction tree. Now , inference can be performed by passing messages between neighbouring cliques in the junction tree .
3. Inference in Junction Trees
In the following we will briefly review the junction tree approach to inference. For a more in-depth treatment of the subject consult e.g. the book of Jensen (1996). In the Hugin architecture, a potential table $\phi_C$ is associated with each clique, $C$. The link between any pair of neighbouring cliques, say $C$ and $D$, shall be denoted a separator, and it is labelled by $S = C \cap D$. Further, with each separator, $S$, is associated one or two potential tables ('mailboxes'). The Hugin algorithm uses one mailbox per separator, referred to as $\phi_S$; the Shafer-Shenoy algorithm uses two, referred to as $\phi^{\text{in}}_S$ and $\phi^{\text{out}}_S$. In addition, we assign each function $\kappa_v$, $v \in V$, to a clique, $C$, such that $\{v\} \cup \mathrm{pa}(v) \subseteq C$. That is, with each clique, $C$, is associated a subset of the conditional probabilities specified for the Bayesian network. Let $\mathcal{K}_C$ denote this subset, and define $\psi_C = \prod_{\kappa \in \mathcal{K}_C} \kappa$. As mentioned above, inference in a junction tree is based on message passing. The scheduling of message passes is controlled by the following rule: a clique, $C$, is allowed to send a message to a neighbour, $D$, if $C$ has received messages from all of its neighbours, except possibly from $D$, and $C$ has not previously sent a message to $D$. Thus propagation of messages is initiated in leaf cliques and proceeds inwards until a 'root' clique, $R$, has received messages from all of its neighbours. (Note that any clique may be root.) The root clique is now fully informed in the sense that it has got information (via its neighbours) from all cliques of the junction tree, and thus $p_R \propto \phi_R$, with the proportionality constant being the probability of the evidence (i.e., observations), if any. An outward propagation of messages from $R$ will result in $p_C \propto \phi_C$ for all $C \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of cliques.
A message is passed from clique $C$ to clique $D$ via separator $S = C \cap D$ as follows: $C$ generates the message $\phi^*_S = \sum_{C \setminus S} \phi_C$, and $D$ absorbs it into its clique potential. Now, assume that we are performing the inward pass, and that clique $C$ is absorbing messages $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$ from neighbouring cliques $C_1, \ldots, C_n$ and sending a message $\phi^*_{S_0}$ to clique $C_0$:
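To fix ideas, here is a minimal Python sketch (my own, with binary variables and dictionaries as potential tables — the names are illustrative assumptions) of a single Hugin-style message pass from a clique $C = \{A,B\}$ to a neighbour $D = \{B,C\}$ over the separator $S = \{B\}$: $C$ marginalises onto $S$, and $D$ multiplies by the update factor, the new message divided by the old separator table.

```python
import itertools

phi_C = {(a, b): 0.1 + 0.2 * a + 0.3 * b for a, b in itertools.product((0, 1), repeat=2)}
phi_D = {(b, c): 0.2 + 0.1 * b + 0.4 * c for b, c in itertools.product((0, 1), repeat=2)}
phi_S = {b: 1.0 for b in (0, 1)}   # separator table, initially unit

# C generates the message: marginalise phi_C onto the separator {B}.
message = {b: sum(phi_C[(a, b)] for a in (0, 1)) for b in (0, 1)}

# D absorbs: multiply by the update factor message/phi_S, then store the new separator table.
phi_D = {(b, c): phi_D[(b, c)] * message[b] / phi_S[b] for b, c in phi_D}
phi_S = message

print(phi_D)
```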
1. Absorb messages
L\S< Pc C i=l
2. Generate message
\f' Si
f -
,.,h*
. -
\f' Si ' ~ -
1
, . . . , n
4. If algorithm = Hugin, then store clique potential : c/Jc f - c/Jc
5. Discard $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$. 6. Send $\phi^*_{S_0}$ to $C_0$. (Note that if $C$ is the root clique, we skip steps 2 and 6.) Thus, considering the inward pass, the only difference between the Hugin and the Shafer-Shenoy algorithms is that the Hugin algorithm stores the clique potential. Storing the clique potential increases the space cost, but (most often) reduces the time cost of the outward pass, as we shall see shortly. Next, assume that we are performing the outward pass, and that clique $C$ receives a message, $\phi^*_{S_0}$, from its 'parent clique' $C_0$ (i.e., the neighbouring clique on the path between $C$ and the root clique), and sends messages $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$ to its remaining neighbours, $C_1, \ldots, C_n$. (The *'s should not be confused with the similar potentials of the inward pass.) This is done as follows.
1. Absorb messagecPSo : if algorithm = Hugin then
~* -- 'fiG ~ 4>80 'fiG
NESTEDJUNCTIONTREES
55
.C L\Si ~ 1,..,i-nl,i+1,..,n< ~Si =C L\Si
else
2. Store
if algorithm = Hugin then
4>Sot - 4>80 else ~ out ' f ' So
3
.
If
4
.
Discard
5
.
Send
+
=
< Pc
c / > Sl
'
4
index
)
,
{
sets
=
and
~
time
c
Ui
Ui
the
}
U1
)
'
Thus
,
,
potential
:
4 > c
+
-
4 > c
,
if
very
the
expensive
generation
be
of
obtained
approach
due
.
to
the
these
This
is
generation
can
what
be
is
relaxed
,
exploited
in
.
Trees
U
.
be
can
trees
( g
( fam
v
clique
may
savings
network
=
store
trees
subgraph
( v
then
ciJSn
junction
Bayesian
fam
( U
,
Junction
complete
1 -
.
junction
Nested
a
.
and
nested
In
.
,
< PSo
potentials
space
.
Hugin
and
in
clique
both
the
~ * ' f ' So
algorithm
Inference
of
-
.
where
pa
.
,
( v
Urn
( v
)
.
~
each
=
)
( V
,
In
,
fam
( v
general
V
Ui
)
)
an
induces
, p
x
,
induce
~
E
)
each
fam
a
potential
( v
set
)
)
of
undirected
a
complete
of
~
the
v
,
moral
,
subgraph
E
V
,
graph
potentials
graph
v
,
~
1i
Ul
of
'
=
induces
.
.
.
( U
( Ui
'
,
,
Q
~
Ui
Ui
a
,
um
where
,
with
Ui
x
x
Ui
Vi
)
)
of
potential
~u = II ~Ui 1;
is Markov with respect to 1 (i.e., 1 is an independencegraph of ~u). Thus, if triangulating 1 does not result in a complete graph, we can build a junction tree corresponding to the triangulated graph, and then exploit the messagepassing principle for computing marginals of ~u . Assume that ~Ul' . . . , ~um are the potentials involved in computing a message(i .e., a marginal), say >80' in a clique C. That is, ~Ul' . . . , ~Umare the incoming messagesand the I'\,v'S associatedwith C. Then, instead of computing the
56
UFFEKJlERULFF
clique potential ~* \fiG -
n ~Ui 'l;
(cf. Step 1 of the above inward algorithm ), inducing a complete graph, and
marginalizingfrom so (cf. Step2 of the aboveinwardalgorithm ), we might be able to exploit conditional independencerelationships between variables of So given the remaining variables of C in computing
cPSo through messagepassingin a junction tree residinginside C (which, remember , is a clique of another , larger junction tree) . This , in essence, is the idea pursued in the present paper . In principle , this nesting of junc tion trees can extend to arbitrary depth . In practice , however, we seldomly encounter instances with depth greater than three to four . As a little sidetrack , note the following property of triangulations . Let 0 be a clique of a junction tree corresponding to a triangulation of a moral
graph 9 , let KG = { ~Vl' . . . '
the graph induced by
graph , unless the triangu -
lation of (] contains a superfluous fill -in : a fill -in between vertices u and v is
superfluous if {u, v} ~ Si for each i = 1, . . . ,n , becausethen C can be split into two smaller neighbouring cliques 0 ' = C \ {u} and C" = C \ { v} with Ci , i == 1, . . . , n , connected to either
to C ' if v E Ci , to C " if u E Ci , or , otherwise ,
0 ' or C " .
Therefore , assuming the triangulation to be minimal in the sense of not containing any superfluous fill -ins, the nested junction tree principle cannot be applied
when a clique , C , receives messages from all of its neighbours ,
or when a clique potential cPais involved in the computation (cf. the out ward pass of the Hugin algorithm ) . However , in the inward pass and in the outward pass of the Shafer-Shenoy algorithm , any non -root clique receives messages from all but one, say Co, of its neighbours , making it possible to exploit the junction tree algorithm for generating the message to Co. To illustrate the process of constructing nested junction trees, we shall consider the situation where clique 016 is going to send a message to clique 013 in the junction tree of a subnet , here called Munin1 , of the Munin net -
work (Andreassenet ai., 1989). Clique 016 and its neighbours are shown in Figure 1. The variables of 016 = { 22, 26, 83, 84, 94, 95, 97, 164, 168} (named corresponding to their node identifiers in the network ) have 4, 5, 5, 5, 5, 5, 5, 7, and 6 possible states , respectively .
The undirectedgraph inducedby the potentialscPs1 ' cPs2 , cPs3 , and Vl may be depicted as in Figure 2. At first sight this graph looks quite messy, and it might be hard to believe that its triangulated graph will be anything but complete . However , a closer examination reveals that the graph
NESTED JUNCTION TREES G19
G13
I/>s~ c/J~S Figure
1
.
Clique
respectively
a
message
is
22
must
be
26
,
83
,
Figure
2
SO
,
table
and
-
Thus
4 >
82
so
. e
.
,
95
,
,
625
,
clique
cPs1
)
-
to
'
clique
its
and
< ! > S3
from
probability
cliques
G19
potential
013
,
,
cPVl
026
=
,
P
{
and
G63
97122
,
,
26
)
,
.
cliques
are
{
83
,
84
,
97
,
164
,
168
}
and
induced
i
.
e
.
been
,
by
clique
potentials
containing
reduced
tables
of
cps1
a
size
< / JS2
nine
to
total
,
,
< / JS3
tree
000
(
'
and
< / JV1
variables
junction
381
'
)
with
a
including
a
'
with
5
-
a
clique
separator
.
be
shall
,
the
further
cPV1
try
5
to
-
)
continue
clique
broken
three
and
4 > S2
{ 22, 26, 83, 84, 94, 95, 168} { 83, 84, 97, If >4, 168} { 94, 95, 97} { 22, 26, 97}
.
(
has
we
,
}
'
the
that
168
clique
000
remaining
cPs3
,
tree
cannot
,
sent
graph
with
750
the
97
undirected
junction
it
got
i
,
encouraged
clique
'
94
9
2
size
and
and
original
8
of
has
generated
The
size
an
table
,
.
< ! > Sl
messages
and
84
the
of
-
messages
these
triangulated
,
two
on
81 = 82 = 83 = VI =
I'\-S3
receives
Based
already
{
(
016
.
57
down
potentials
induce
our
has
clique
associated
.
The
8
-
clique
associated
the
graph
break
in
it
,
with
shown
-
with
on
it
Figure
the
.
down
only
other
These
3
.
the
hand
potentials
.
In
potential
,
NESTEDJUNCTIONTREES
59
(Cb n So) \ Ca = { 22, 26, 94, 95} ; that is, 4 x 5 x 5 x 5 = 500 messages
must be sentvia separator{83, 84,97, 168} in order to generate>80. Sending a message from Cb to Ca involves inward propagation of messages in
the (Cc, Cd) junction tree, but , again, since neither Cc nor Cd contain all variables of the (Ca, Cb) separator, we need to send multiple messagefrom Cd to Cc (or vice versa). For example, letting Cc be the root clique of the inward pMS at this level, we need to send 5 messagesfrom Cd (i .e., one for each instantiation of the variables in (Cd n { 83, 84, 97, 168} ) \ Cc = { 97} ) for each of the 500 messages to be sent from Cb (actually from Cc) to Ca. Similarly , for each message sent from Cd to Cc, 20 messages must be sent
from Ce to Cf (or 25 messagesfrom Cf to Ce). Clearly, it becomesa very
time consumingjob to generate4>80' exploitingnestingto this extent.
525
872, 750,000
000
22 , 26 , 83 , 84 , 94 , 95 , 164 , 168
2t625
5t250
( 7x )
, 000
75 , 000
( 150x
. .
500
83
22
97
84
94
168
95
26
4
!100 ( 20x )
..
( 500x .
It 210t 000
)
~
750
)
17 , 000
( 5x )
/' ..I '
~
,."
, ,
Figure 5. The nested junction tree for clique C16 in Muninl . Only the connection to neighbour C13 is shown. The small figures on top of the cliques and separators indicate table sizes, assuming no nesting. The labels attached to the arrows indicate (1) the time cost of sending a single message, and (2) the number of messagesrequired to compute the separator marginal one nesting level up.
A proper balance between space and time costs will most often be of interest . We shall address that issue in the next section . Finally , however, let us briefly analyze the case where Cb is chosen as root instead of Ca (see Figure 6) . First , note that , since Cb contains
60
UFFEKJlERULFF
the
three
potentials
4>1 , . . . , 4>1 , from subgraph
.
Ca
The
via
time
S
and
=
cost
of
and
receives
message
{ 83 , 84 , 97 , 168 } , Cb
collapses
computing
clique
the
first
potentials to
a
,
complete
potential
, 4>Cb
=
525 , 000 7, 429 , 500
22 , 26 , 83 , 84 , 94 , 95 , 164 , 168
2, 625 , 000
5, 250 ( 7x ) .. 750 83 84 97 168
Figure
6.
The
( i .e . , the with
Cb as root
a single
4>81
one
*
4>83
the
( 750
+
cost
of
and
using
Cb 4
The
nesting
is
2 , 625
, 000
, using
43 % , respectively
order must
junction
~
C16 in Muninl is computed
the of
time
arrows
cost
is
tables
, with
via
of
4
x
375
marginalization
of
measure
, the
total
time
.
the
clique
cost
clique
of sending
the
separator
the
clique
Hugin
potentials
potentials
based
time
cost
Using
six
root
propagation
compute
, 000
remaining
a conservative
) . Thus
to
the
message
( I ) the
required
the
Cb being
inward
indicate
messages
computing
) . Each
( 750
costs
of
and
5 x
nested
Time
on of
has
the
a
larger
generating
of 4>80 '
375
, 000
) +
conventional 2 , 625
, 000
space
525
, 000
( i . e . , non =
13 , 125
trees and
7 x
time
costs
- nested
, 000
approach
=
7 , 427
, 500
) message
, respectively provides
savings
. gener
-
. Thus
, in
of
85 %
.
Costs the
its
+
junction
, of
evaluate
comnare -
a of
6 x
the
and to
to
number
( this
time
and
we
has
cost
+
case
In
4>1 ,
clique C16
, is
, 000
and
Space
up .
, 000
this
5.
level
, 000
root
375
space
ation
375
the
as x
*
for by
attached
( 2 ) the
time
525
tree
generated
labels
, and
, the
6 x
time
be
) . The
*
algorithm
junction
to
message
marginal
is
nested
message
applicability
~
snace -
tree approaches .
and
of time
the
costs
nested with
junction those
of
tree the
approach
conventional
,
S331 :IL NOI~~Nflr a3LS~N
' HddV'IVNOI~N'aANOO log
. -
r
L _
11-
-
1881
.
1
1
1 So I
. J -,
1
C
n
19
HOVO
62
UFFEKJ.lERULFF
space
required
pass
,
In
be
cost
The
clique
,
Co
of
<1>c
shall
assume
is
it
.
) .
For
( m
n
) IXcl
time
during
that
' l / Jc
,
.
is
is
the
outward
stored
in
the
clique
Note
potential
that
if
utilized
shall
,
,
,
,
information
this
however
< Pc
cost
,
may
refrain
from
.
c/ Jc
only
not
generated
we
compute
will
the
are
combination
always
C
+
simplicity
of
If
to
potentials
will
is
message
the
and
compute root
store
c/ Jc
clique
it
in
when
( i .e . ,
< Pc
, whereas
more
So
=I
than
0 ) ,
the
one
message
,
.
processing
8 .
Note
if
message
,
always
Using
that
have
to <1>so
n
times
algorithm
an
Note
marginalize when
V; c
, <1>80
' Jig ! , . . . , J~
So
=
or
0
generating
Kc
=
0
once
that the
-
( but
! , J~ not
=
,
to
.
both
. . . , Jign ) ,
and
the
The
)
if n
to
# 1
if
in
0
pass
,
# =
case
the
Hugin
algorithm which
+
Kc So
.
do
case
we
contributing
n
and
we
which
tables
equals
both
in :
4>c of
Ci
0
the that
inward
- Shenoy
to
So -
#
Shafer
Note
clique
algorithms
number '
an
clique on
.
leaf
Kc two
shown
a
based
by
unless the
4 > Si
was
a
is be
generated
being
contributes
,
will
preceded
whereas
table
+ 1 '
be
C
'
pass
there pass
must be
Jso
table
message
outward ,
inward )
between
one
the
with
Jc
only
from
4> 0
since
only
directly
to
the < Pc
pass
difference
unless
in ( =
algorithm
4> c ,
during algorithm
C
4Jc
outward
, the
computes it
,
- Shenoy
C
Hugin in
otherwise
anything
.
clique
the
processing ;
Shafer
do
v; c
in using
assume
the
=
does
1996
method
shall
4> c
we
extra
Outward
cost
not
little
contributing
algorithm to
conventional we
a
tables c/ Jc
algorithm - Shenoy
Figure
table
n
methods
contribute
5 . 1 .2 .
use
contributing
,
refined
Hugin
to
+
the
( Shenoy
Shafer
'
,
computing of
such
table
and
Anyway
m
of
reduced
The
in
with
domain
using
< Pso
,
time
.
I,
.
general
the
the
" ' UVm ' l / Jc
pass
the
IXvlu
recomputing
inward
on
,
0
1
0 , and
( i .e . ,
n Kc
if
tables
either =
0 .
5.2. NESTEDAPPROACH In describing the costs associated with the nested junction tree approach , we shall distinguish between message processing at level 1 (i .e., the outer most level) and at deeper levels.
5.2.1. Levell The processing of messages in a non-root clique C (i .e., receiving messages from neighbours C 1, . . . , Cn via separators 81, . . . , 8n , and sending to clique Co) may involve inference in a junction tree induced by VI U . . . U Vm U 81 U . . . U 8n (see Figure 9). Note that , in this setup , C may either be involved in inward message passing using either algorithm , or it may be involved in outward Shafer-Shenoy message passing . In the inward case, C1, . . . , Cn are
63
NESTEDJUNCTIONTREES I I
I
ct >So
I
r
I
I So I
t
-
L
I-
-
I-
,
~
I
C
c/>ST&
c!>Sl . .
81
Sn
ConvCost~ut (C) if algorithm
= Hugin
if So # 0
{conventional inward messageprocessing in C }
if
{ computing ~c = 4>0 ~ ""So}
Ct +- lx501 + IXcl else
Ct +- ( m + n + l ) IXcl { computing
c = so I17 = 14>Si I17 = 1
for i = 1 to n Ct +- Ct + IXcl
{ computing
else if ( So # 0 ) 1\ (n = 0 ) 1\ ( m > 0 ) Ct +- Ct + 21Xc I
{ C is a leaf clique
{ computing
with
Kc
# 0}
CPo = 'l/JcCPso}
for i = 1 to n if r > 1
{ r is the number
Ct f - Ct + rlXcl
of tables
contributing
to ~ c }
{ computing
4>0 = Il/Jc
Ct f - Ct + IXcl
{ computing
<1>Si = EC \ Si c/>c }
Cs f - Cs + IXsi I
{ storing
c/>Si in
Figure 8 . Space and time costs of receiving a message from the inward neighbour and sending to the outward neighbours . Note that r = n + 1 if m > 0 1\ So # 0 , r = n if m > 0 V So :1: 0 , and r = n - 1 if m = 0 1\ So = 0 .
the
outward
outward the
neighbours neighbour
inward
neighbour
assuming cannot The
done
message , the inward
and
Co
which
C
plus
a minimal be
tree . Thus performing
to
the
inference
of
message
propagation
inward
is going
remaining
triangulation
through
the
. In to
outward
, computing in
an
through processing towards
the
outward
send , and
case , Co
C1 , . . . , Cn
neighbours a marginal
induced
inference in a root
clique clique
. ( Recall
that
a root
clique
in
junction
is the
includes ,
tree . )
in
the C
level
equals
in the
level
2 junction the
cost
2 junction
of
64
UFFEKJlERULFF
cPSo
4>51
c/JSn .
.
.
NestedCostfn /out(C) CSf - c~oot
{space cost of inward propagation towards clique 'root '} { time cost of inward propagation towards clique 'root '} {storing ~So}
Ct -+- c~oot
Csf - Cs+ Ix801 Figure 9.
Spaceandtimecostsof receivingn messages andsending0 or 1, wherethe marginalis computed in a junctiontreeinducedby VI U. . . UVmU81U. . . U8n.
tree
( see
below
describe clique
is
5 .2 .2 . We
shall
inference
the
clique
2
( more
outward
operation Now
, since
at
cPSo . In
Section
propagation
of
situation
, say
separator
levell
6 , we
towards
shall a
root
10
processing
an
S * - marginal
that , and
the
message
purpose
generate tree
where
to
of
perform
a message
, we
-
that
going
at
1 , wanting
the tree
to
clique tree
C * can
j unction is
to
clique
junction
levell
S * . Note
defining
1 is
conventional
the
should
from
only
compute
-
per
-
a marginal
.
a
C * , at
of
junction a root
Figure in
costs . Since
>
the
marginals in
those
time deeper
towards
neighbours
message are
2 or
tree
passing
a clique
and
level
1 containing
) a set
the
via
or
l -
outward in
storing inward
space
at
a junction
typically
neighbour
of
conventional
the
message
contained
cost
performing
-
trees
level
Consider its
deeper
analyze
in
inward
from
or
j unction
at
space
of
.
now in
the
cost
calculated
ing
or
plus
the
Level
passing
form
)
how
E::: ~
relevant
be
be
C
receives
level
l
to
send
engaged
generated
1
which
a message in
potentials containing
messages
>
either
involved clique and
C
is to
inward in
that
C . might
share
a
NESTEDJUNCTIONTREES
65
ConvCost ~1(C) if algorithm
= Hugin
if m + n >
1
Ct t - (m + n) IXcl
{
computing
Cs t - IXcl
{
storing
{
comp
{
a
{
computing
{
storing
{
computing
the
c
.
=
rem
first
c
}
2:::~=1(ri - 1)n~~~rj 4>0'8}
.
number
of
1/ ; 0
1 / ; 0
~
}
absorptions
=
I1i
}
CPVi
}
}
I1i
Ti
CPo
' s
}
if (a > 0) V (m > 1) Ct-+- Ct+ (m + n) f1i rilXcl {computingf1i Ti I } if So =F0 Cs-+- Cs+ IXSoI {storing
{computingf1i ri marginals}
~
~
Figure 10. Spaceand time costsof receivingmessages from outward neighboursand sendingto the inward neighbourin a junction tree at nestinglevel greaterthan 1.
66
UFFEKJlERULFF
variables to
with
send
( assuming
That to
be
message
able
to
the
of of
This
way as
of
that
in
the
Denote to
by
send
that
to
chrononizes C
of
to
ICI
<
to
the
first
) 4>0
remaining
Ili
replacing messages
C
the
messages
Co .
for
Co
* ) \ co
neighbour
messages
will
be
from
the
outward
in
a junction
alternative
C ,
by
the
neighbours tree
will
to the
to
multiplied
called
, but
, or
, if , for
is
.
referred
variable
shall
that
outward
that
the
messages
are
messages
in
one
, and
such
, for
prop
refrain
-
from
each
>
neighbours
combination
of
C 1 syn
received
, for
either of
the
messages
.
( Note
-
from
, we
each
sends
part
have such
that
C2 , . . . , Cn
1 . Thus , C
the
Ci
scheduled
message
from
i
== 0 , generates
neighbour
batch
each
messages
I for
outward
So
that of
I ) I XSi
actual
algorithm
=
I1i
: l >~
r i
-
I
Using of
greater
. time
the
time
except once .
1
of
marginal
and
of
is
the
the
IIi
IXcl
>
root ri time .
not
have
time
need
combina
IX ( cns
that
-
* ) \ co 1
S * - marginal
=
=
( ri
( ii ) IIi
4>~ is
a
we
cor
-
, typically
>
+ IIi
S * - marginal
we
must an
is
(m
+
-
0
,
I Xc
m
>
1 ) IIi 4>~
TilXcl is
is
compute
case
the
re -
:
.
(i)
be
( i ) , it
( ii ) to
be
is
similar
computed
composed
the
pays 4>0 ' 8 ,
. For
corresponding equals
the
, a ) is
multiple
. Case
to
that
cases
the
going
by
absorptions
1 . In
going
. The
performed
. Note
two
only
S * - marginal
n ) IXcl
I multiplications
of
computing
compute
processed
1 ) Il ; ~ l rj
number or
to
are
distinguish
before (n
C
are
and
( i . e . , the
, since
, the
,
cost
messages
I divisions
algorithm
by
r1 . replacements
ri
I XSi
therefore
of
taken
that
Ci , where
computed
cost
step
of
such
1 , and
clique
first
messages
1/ Jc
which is
. The and
m
compute
1/ Jc
time
involves
of
cost
C
Ci
the
= l 4>Sj ' which
from
from
- Shenoy
to
,
r7 . combinations a
Shafer
)
that
at
combinations
than
( wrt
Ilj Ei
originating
the
number
=
message
a message
each
with
from
Hugin
one
placing
If
to
x ( cns
IS * I . )
Using ( the
of
their
(r i -
the
C
is going
* ) \ co I messages
multiple
from
) ; an
it
. Furthermore
considering
combinations
Co , or , if
responding
sent
which
inward
sending
marginals
of
size
messages
messages
IXsol
activity
of
its
clique
received
( 1996
to
.
all
all
IX ( cns
to
root
be
Co
configuration
are
worth
number
send
its
storage
tion
for
the
processes
extra
the
to
Jensen be
send
each
== 1 , . . . , n . Assuming
C2 , . . . , Cn
C 1,
by
paper
r i
C , i
is
messages
) might
present
Co
arbitrary
firing , 1994
clique
messages
needed
computing
( Xu
to
for
neighbours
of
with
have
sent
if
number
variable
agation
share
appropriate
messages
the
will
be
, C ' s outward
number
not
0) , C
S * - marginal
reason
product
#
does
must
generate
the
same
it
So
is , a
generate
to
S * which
S *larger
of
NESTED JUNCTION TREES
67
5.2.3. Level .2 or deeper- nested The processing of messagesin a non-root clique C at levell > 1, where the message, <1>80' to be sent is generated through inference in an induced junction tree at levell + 1, is shown in Figure 11. This situation resembles
8 .
(ro x)4>80 t So C
4>V1' . ' " 4>v'tJI .' 4>81' . ' " 4>sn
(rl x) rPs1 I / 81
8n .
.
\ . rPSn(TnX) '\
.
NestedCostt ; (C) CS-+- c~oot
{ space cost of inward prop . towards 'root '}
Ct -+- c~oot Cs-+- Cs+ IXso I
{ time cost of inward prop . towards 'root '} { storing 4>50}
Ct -+- CtIX(cns. )\ col lli Ti Cs-+- Cs+ E ~=2(ri - l )IXsi I
{inward prop. IX(cns. )\ col lli Ti times} {storing multiple messages for eachi > I } and
.
Figure 11. Space and time costs of receiving messages from outward neighbours sending to the inward neighbour in a junction tree at nesting level greater than 1.
~
the situation shown in Figure 9, the only difference being that C may receive multiple messagesfrom each outward neighbour, and that it may have to send multiple messagesto clique Co. Since C needs to perform IIi ri absorptions, with each absorption correspondingto an inward passin
UFFE KJlERULFF
68
the junction tree at levell + 1, and IX(cns*)\coI marginalizationsfor each combinationof messages , a total of IX(cns*)\col ni ri inward passesmust be performedin the levell + 1 tree. 5.3. SELECTINGCOSTFUNCTION Now, depending on the level, directionality of propagation, and algorithm used, we should be able to select which of the five cost functions given in Figures 7- 11 to use. In addition , we need a function for comparing two pairs of associated space and time costs to select the smaller of the two. To determine which of two costs, say c = (cs, Ct) and c' = (c~, c~), is the smaller, we compare the linear combinations Cs+ , Ct and C~ + 'Yc~, where the time factor , is chosenaccording to the importance of time cost. The algorithm Cost(C) for selecting the minimum cost is shown in Figure 12, where ' -<' refers to the cost comparison mentioned above.
Cost(C) if level = 1 if (direction
= inward
if ConvCostfn
) V ( algorithm
( C ) - < NestedCostfn
c = ConvCostfn
= Shafer - Shenoy ) / out ( C )
(C )
else c = NestedCostfn
/ out ( C )
else c = ConvCost
~ut ( C )
else if ConvCost
~ l ( C ) - < NestedCost
c = ConvCost
~ l (C )
~ 1( C )
else c = NestedCost
~ l (C )
Figure 12 .inclique Selecting functions and the minimum cost associated with message processing C.cost 5.4. SUMOF COSTS
Undera giventimefactor," theoverallminimumspaceandtimecosts of inwardor outwardpropagation of messages towards /froma givenroot
NESTEDJUNCTIONTREES clique, R, can
now be computed
69
as
CR= } : Cost (C) + C;emp , CEC
(1)
where
max IXsl +{CIct Hugill prop . SES > cmax not stored }IXclifoutward c;emp=
max CECIXc I
if Shafer-Shenoyprop.
0
otherwise.
During outward H ugin propagation we need auxiliary spacewhen generating messages ; thus, a space of size maxSESIXsl suffices. Further , for each clique, C, for which
Costs
All of the cost functions mentioned aboveare relative to a given root clique, R, and we are therefore only able to compute the cost of probability propagation (inward or outward) with R as root . However, we want to select the root clique such that the associatedcost is minimal , and, therefore, we must be able to compute the 'root cost' cC for each clique C E C. Assuming that root clique R has neighbours C1, . . . , On, Equation 1 can be re-expressedas n cR = Cost(R) + L cCi\ R + c~emp, i=l
(2)
where cCi\ R denotes the root cost in the subtree rooted at Ci and with the R-branch cut off. Note that , since cCi\ R = Cost(Ci ) +
L CC\Ci, CEneighbours (Ci)\ { R}
70
UFFEKJlERULFF
the root costscan be computedthrough inwardpropagationof costs. This is illustrated in Figure 13, whereeachclique C sendsthe cost message Cost(C) + l::~ 1CCi\C to its inward neighbourCo.
CCl \C
. . .
Cost(G) + E ~=l CCi\C . . .
.
.
CCYL \C
Figure 13.
Propagating costs of probability propagation .
Thus, what lacks to compute the root cost for Ci , i = 1, . . . , n, is cR\ Ci (i.e., the root cost for the subtree rooted at R and with the Ci-branch cut off ). However, this is nothing but the cost messagesent from R to Ci if we perform outward propagation of costs from R. That is, after a full propagation of costs (i.e., inward and outward) we can easily compute the root cost of any clique. Note that Cost() dependson the directionality (i.e., inward or outward). So, to compute the root costs for both inward and outward probability propagation we need to perform two full cost propagations. 7. Experiments To investigate the practical relevance of nested junction trees, the cost propagation schemedescribed above has been implemented and run on a variety of large real- world networks. The following networks were selected. The KK network (50 variables) is an early prototype model for growing barley. The link network (724 variabIes) is a version of the LQT pedigreeby ProfessorBrian Suarezextended for linkage analysis (Jensenand Kong, 1996). The Pathfindernetwork (109 variables) is a tool for diagnosing lymph node diseases(Heckerman et al., 1992). The Pignet network (441 variables) is a small subnet of a pedigree of breeding pigs. The Diabetesnetwork (413 variables) is a time-sliced network for determining optimal insulin dose adjustments (Andreassen et al., 1991). The Muninl - 4 networks (189, 1003, 1044, and 1041variables, respectively ) are different subnets of the Munin system (Andreassenet al., 1989).
NESTED JUNCTION TREES
71
The Water network (32 variables ) is a time -sliced model of the biological processes of a water treatment plant (Jensen et al., 1989) . The average space and time costs of performing a probability propaga tion is measured for each of these ten networks . Tables 1- 4 summarize the results obtained for inward Hugin propagation , inward Shafer-Shenoy prop agation , full H ugin propagation (i .e., inward and outward ) , and full ShaferShenoy propagation , respectively . All space/ time figures should be read as millions of floating point numbers / arithmetic operations . The first pair of space/ time columns lists the costs associated with conventional junction tree propagation . The remaining three pairs of space/ time columns show, respectively , the least possible space cost with its associated time cost , the costs corresponding to the highest average relative saving , and the least possible time cost with its associated space cost. The largest average relative savings were found by running the algorithm with various , -values for each network . The optimal values, , * , are shown in the rightmost columns .
TABLE
1
tional
(
.
Space
and
approach
,
minimum
space
and
(
iv
)
time
and
cost
minimum
costs
the
)
for
nested
,
(
iii
time
)
inward
propagation
trees
maximum
cost
Hugin
junction
average
relative
Space
Time
saving
15
. 1
50
. 3
0
. 7
28
. 1
83
. 3
0
. 3
0
. 7
0
. 1
2
. 2
Diabetes
. 8
. 0
207
Munin3
Munin4
As
728
9
3
. 7
12
.
. 3
64
. 6
28
. 3
anticipated
ated
with
costs
of
minimum
,
1119
. 9
because
cost
may
,
be
ii
(
)
i
)
the
conven
maximum
space
-
nesting
and
that
. 8
0
. 3
0
. 9
0
. 5
4
. 9
1128
0
. 6
3532
. 7
is
not
time
costs
,
(
1 '
. 0
2
. 7
26
.
9
. 5
9
. 9
Pathfinder
. 8
0
. 25
.
1
0
. 15
. 3
0
. 30
8
. 4
0
. 30
9
. 25
2
. 4
2
. 4
3
. 4
0
. 0
0
. 30
. 9
22
. 3
0
.
time
costs
general
than
,
since
network
very
limited
15
associ
the
-
time
nesting
Pathfinder
a
315
50
larger
to
1
29
. 9
maximum
The
0
. 8
12
the
much
although
possible
68
1
,
in
only
. 25
. 3
63
recommended
is
0
. 6
. 3
.
. 30
. 6
5
. 5
large
. 30
0
0
1
are
0
. 7
. 6
6
)
. 5
61
. 0
. 8
0
37
2
. 9
*
. 3
. 2
315
,
. 3
0
. 1
,
12
. 6
. 5
=
Thus
100
Time
0
34
. 4
=
8
1
but
unacceptably
nesting
68
. 7
85
.
. 5
. 0
1
costs
. 9
62
90
203
,
Space
40
61
. 2
.
. 5
. 2
95818
networks
,
. 7
0
. 2
it
11
. 1
. 5
space
6
. 3
1
propagation
cost
of
Time
. 1
29692
. 5
=
Space
0
all
minimum
space
time
19
. 7
for
conventional
,
. 04
0
. 7
. 2
8
0
. 1
3
18
Water
33
. 7
Munin2
differently
0
11
Munin1
ated
. 2
0
Time
link
0
=
Space
KK
Pignet
(
Nested
,
Pathfinder
with
.
Conventional
Network
using
approach
yields
the
associ
-
behaves
degree
.
72
UFFEKJlERULFF
TABLE 2. Space and time costs for inward Shafer-Shenoy propagation using (i) the conventional approach, and the nested junction trees approach with (ii ) maximum nesting (minimum space cost) , (iii ) maximum average relative saving of space and time costs, and (iv ) minimum time cost. Conventional
Nested 1 = 0
Network
Space
Time
Space
'Y=
Time
' Y*
'Y =
Space
Time
100
Space
Time
-
--y* .
KK
7 .9
52 . 2
6 .0
787 .9
6 .8
42 .7
7 .0
42 .4
0 . 15
link
4 .5
83 .3
2 .1
587 .4
3 .6
76 . 1
4 .0
73 .8
0 .05
Pathfinder
0 .1
0 .7
0 .1
1 .3
0 .1
0 .7
0 .1
0 .7
0 . 10
Pignet
0.3
2.2
0.2
4.0
0.2
2.4
0.3
2.1
0.15
Diabetes
1 .2
33 .9
0 .5
97 .7
728 .8
79 .2
Munin2
0 .7
9 .9
0 .2
Muninl
51 .0
0 .7
36 .2
1.2
33 .3
0 .05
.7
85 .3
379 .3
88 . 0
368 .6
0 . 10
50 .7
0 .4
11 .1
0 .7
9 .1
0 .05
31260
Munin3
0 .7
12 .2
0 .2
324 .0
0 .5
12 .0
0 .7
10 .5
0 .05
M unin4
3 .0
64 .3
1.1
235 . 7
2 .1
52 .6
2 .4
51 .4
0 .05
Water
6 .1
29 .2
5 .3
115 .8
5 .9
27 . 1
6 .1
26 .9
0 .15
However , as the , = , * columns show, a moderate increase in the space costs tremendously reduces the time costs. (The example in Figure 5 demonstrates the dramatic effect on the time cost as the degree of nesting is varied .) In fact , for , = , * , the time costs of nested computation are either roughly identical or smaller than those of conventional computation , while space costs are still significantly reduced for most of the networks . Interestingly , for all networks the minimum time costs (, = 100) are less than the time costs of conventional propagation , and , of course, the associated space costs are also less than in the conventional case, since the saving on the time side is due to nesting which inevitably reduces the space cost . Comparing Tables 3 and 4 with , = , * , we note , somewhat surprisingly , that the time costs of a full H ugin propagation are consistently smaller than those obtained using the Shafer-Shenoy algorithm , while the space costs are either comparable or smaller for the H ugin algorithm . Note , however , that the , * 's are significantly smaller in the Shafer-Shenoy case, indicating an attempt to keep the space costs under control .
ACKNOWLEDGEMENTS I wish to thank Steffen L . Lauritzen for suggesting the cost propagation scheme, Claus S. Jensenfor providing the link and Pignet networks, David
73
JUNCTION TREES NESTED
TABLE3. Space andtimecostsfora fullHuginpropagation using(i) theconventional approach , andthenested junctiontreesapproach with(ii) maximum nesting (minimum space cost), (iii) maximum average relativesavingof space andtimecosts , and(iv) minimum timecost. Conventional Nested , =0 1= , . l' = 100 Network Space Time Space Time SpaceTime SpaceTime , . KK 15.8 93.3 1.4 1162 .2 2.7 104 .1 9.0 80.5 0.20 link 28.4 164 .1 0.5 29773 .2 7.2 164 .6 12.6 142 .6 0.20 Pathfinder 0.2 1.2 0.1 1.6 0.2 1.1 0.2 1.1 0.15 Pignet 0.9 4.5 0.1 64.1 0.4 4.3 0.6 4.1 0.20 Diabetes 11.0 59.5 0.6 116 .5 1.0 61.0 5.3 55.5 0.15 Munin1 213.3 1362 .3 25.5 96451 .8 74.0 949.6 74.4 948.8 0.15 Munin2 3.2 18.7 0.6 212.6 1.4 19.0 2.4 17.3 0.20 Munin3 3.8 23.6 0.5 97.3 1.4 21.6 2.5 20.8 0.15 Munin4 18.4 119 .4 5.0 1183 .2 5.5 122 .1 13.0 105 .2 0.20 Water 9.0 46.3 1.0 3550 .7 3.2 44.0 4.4 40.3 0.15
TABLE 4. Spaceand time costsfor a full Shafer-Shenoypropagationusing (i) the conventional approach, and the nestedjunction trees approachwith (ii) maximumnesting (minimum spacecost) , (iii ) maximumaveragerelativesavingof spaceand time costs, and (iv) minimum time cost. Conventional
Network KK link Pathfinder Pignet Diabetes Munin1 Munin2 Munin3 Munin4 Water
Nested , =0 "Y= , * , = 100 Space Time Space Time Space Time Space Time
,*
9.0 6.8 0.1
153.6 263.1 2.1
7.1 4.5 0.1
889.3 767.1 2.6
7.9 6.0 0.1
114.9 222.5 2.1
8.1 6.4 0.2
114.4 220.8 2.0
0.05 0.05 0.05
0.5 1.8 117.0 1.1 1.2 4.9 6.7
7.0 80.9 2411.9 31.6 44.8 209.9 60.6
0.3 1.1 98.4 0.6 0.7 3.0 5.9
8.9 97.9 32943.8 72.4 356.6 381.3 147.3
0.4 1.3 105.8 0.8 1.0 3.1 6.5
7.1 82.0 1309.1 31.2 44.1 199.4 55.5
0.5 1.8 110.9 1.1 1.2 5.7 6.6
6.8 79.0 1263.4 29.2 42.5 154.9 55.3
0.05 0.05 0.05 0.05 0.05 0.01 0.15
74
UFFEKJlERULFF
References Andreassen, S., Hovorka, R., Benn, J., Olesen, K . G. and Carson, E. R. (1991) A modelbased approach to insulin adjustment , in M . Stefanelli , A . Hasman , M . Fieschi and
J. Talmon (eds), Proceedings of the Third Conference on Artificial Intelligence in Medicine , Springer -Verlag , pp . 239- 248. Andreassen , S., Jensen , F . V ., Andersen , S. K ., Falck , B ., Kjrerulff , U ., Woldbye , M .,
S0rensen, A . R., Rosenfalck, A . and Jensen, F . (1989) MUNIN - an expert EMG assistant, in J. E. Desmedt (ed.) , Computer-Aided Electromyography and Expert Systems, Elsevier Science Publishers B. V . (North -Holland ) , Amsterdam , Chapter 21. Frydenberg, M . ( 1989) The chain graph Markov property , Scandinavian Journal of Statistics , 17 , pp . 333 - 353 .
Jensen , F . V . ( 1996) An Introduction
to Bayesian Networks . UCL Press , London .
Jensen, F . V ., Kjrerulff , U., Olesen, K . G. and Pedersen, J. (1989) Et forprojekt til et ekspertsystem for drift af spildevandsrensning (an expert system for control of waste water treatment - a pilot project ) , Technical report , Judex Datasystemer A / S, Aalborg , Denmark . In Danish .
Jensen, C. S. and Kong , A . (1996) Blocking Gibbs sampling for linkage analysis in large pedigrees with many loops , Research Report R -96-2048 , Department of Computer Science , Aalborg University , Denmark , Fredrik Bajers Vej 7, DK -9220 Aalborg 0 .
Jensen, F. V ., Lauritzen , S. L . and Olesen, K . G. (1990) Bayesian updating in causal probabilistic
networks by local computations , Computational
Statistics
Quarterly , 4 ,
pp . 269 - 282 .
Heckerman , D ., Horvitz , E . and Nathwani , B . ( 1992) Toward normative expert systems : Part I . The Pathfinder project , Methods of Information in Medicine , 31 , pp . 90- 105.
Kim , J. H . and Pearl, J. (1983) A computational model for causal and diagnostic reasoning in inference systems . In Proceedings of the Eighth International on Artificial Intelligence , pp . 190- 193.
Joint Conference
Lauritzen , S. L ., Dawid , A . P., Larsen, B. N . and Leimer, H.-G. (1990) Independence properties of directed Markov fields , Networks , 20 , pp . 491- 505. Lauritzen , S. L . and Spiegelhalter , D . J . ( 1988) Local computations with probabilities on graphical structures and their application to expert systems . J oumal of the Royal Statistical Society , Series B , 50 , pp . 157- 224.
Shafer, G. and Shenoy, P. P. (1990) Probability propagation , Annals of Mathematics and Artificial -
Intelli -.qence,. 2 ., -pp- . 327- 352 .
Shenoy, P. P. (1996) Binary Join Trees, in D . Geiger and P. Shenoy (eds.) , Proceedingsof the Twelfth Conference on Uncertainty Publishers
, San Francisco
, California
in Artificial
,- -pp- . 492 - 499 .
Intelligence , Morgan Kaufmann
Xu , H . (1994) Computing marginals from the marginal representation in Markov trees, in Proceedings of the Fifth International
Conference on Information
Processing and
Management of Uncertainty in Knowledge-Based Systems (IPMU ) , Cite Interna tionale
Universitaire
, Paris
, France
, pp . 275 - 280 .
BUCKET
ELIMINATION
PROBABILISTIC
: A
UNIFYING
FRAMEWORK
FOR
INFERENCE
R . DECHTER Department
of Information
and
Computer
Science
University of California , Irvine dechter @ics.uci .edu
Abstract . Probabilistic inference algorithms for belief updating , finding the most probable explanation , the maximum a posteriori hypothesis , and the maxi mum expected utility are reformulated within the bucket elimination frame work . This emphasizes the principles common to many of the algorithms appearing in the probabilistic inference literature and clarifies the relation ship of such algorithms to nonserial dynamic programming algorithms . A general method for combining conditioning and bucket elimination is also presented . For all the algorithms , bounds on complexity are given as a function of the problem 's structure .
1. Overview Bucketeliminationis a unifyingalgorithmicframework that generalizes dynamicprogramming to accommodate algorithms for manycomplex problem solvingandreasoning activities , includingdirectionalresolution for propo sitionalsatisfiability(Davisand Putnam, 1960 ), adaptiveconsistency for constraintsatisfaction(Dechterand Pearl, 1987 ), Fourierand Gaussian eliminationfor linearequalitiesand inequalities , and dynamicprogram mingfor combinatorial optimization(BerteleandBrioschi , 1972 ) . Here, after presenting theframework , wedemonstrate that a numberof algorithms for probabilisticinference canalsobe expressed asbucket -eliminationalgorithms. The main virtuesof the bucket -eliminationframeworkare simplicity and generality . By simplicity , we meanthat a completespecification of 75
76
R. DECHTER
bucket -elimination algorithms is feasible without introducing extensive ter -
minology (e.g., graph conceptssuch as triangulation and arc-reversal) , thus making the algorithms accessible to researchers in diverse areas. More im portant , the uniformity of the algorithms facilitates understanding , which encourages cross-fertilization and technology transfer between disciplines . Indeed , all bucket -elimination algorithms are similar enough for any im provement to a single algorithm to be applicable to all others expressed in this framework . For example , expressing probabilistic inference algorithms a.g bucket -elimination methods clarifies the former 's relationship to dynamic programming and to constraint satisfaction such that the knowledge accumulated in those areas may be utilized in the probabilistic framework . The generality of bucket elimination can be illustrated with an algorithm in the area of deterministic reasoning . Consider the following algo-
rithm for deciding satisfiability . Given a set of clauses (a clause is a disjunction of propositional variables or their negations) and an ordering of the propositional variables , d = Ql , ..., Qn, algorithm directional resolution
(DR) (Dechter and Rish, 1994) , is the core of the well-known Davis-Putnam algorithm for satisfiability (Davis and Putnam, 1960). The algorithm is described using buckets partitioning the given set of clauses such that all the clauses containing Q i that do not contain any symbol higher in the ordering are placed in the bucket of Q i , denoted bucketi .
The algorithm (seeFigure 1) processesthe buckets in the reverseorder of d. When processing bucketi , it resolves over Q i all possible pairs of clauses in the bucket and inserts the resolvents into the appropriate
lower buckets .
It was shown that if the empty clause is not generated in this process then the theory is satisfiable and a satisfying truth assignment can be generated in time linear in the size of the resulting theory . The complexity of the
algorithm is exponentially bounded (time and space) in a graph parameter called induced width (also called tree-width) of the interaction graph of the theory , where a node is associated with a proposition and an arc connects
any two nodes appearing in the same clause (Dechter and Rish, 1994) . The belief-network algorithms we present in this paper have much in common with the resolution procedure above. They all possess the prop erty of com piling a theory into one from which answers can be extracted easily and their complexity is dependent on the same induced width graph parameter . The algorithms are variations on known algorithms and , for the most part , are not new , in the sense that
the basic ideas have existed for
some time (Cannings et al., 1978; Pearl, 1988; Lauritzen and Spiegelhalter, 1988 ; Tatman
and
Shachter
Favro , 1990 ; Bacchus
and
, 1990 ; Jensen van
et al . , 1990 ; R .D . Shachter
Run , 1995 ; Shachter
, 1986 ; Shachter
and
, 1988 ;
Shimony and Charniack, 1991; Shenoy, 1992) . What we are presenting here is a syntactic and uniform exposition emphasizing these algorithms ' form
BUCKETELIMINATION
77
Algorithm directional resolution Input : A set of clausescp, an ordering d == Q! , ..., Qn. Output : A decision of whether
Algorithm directional resolution
as a straightforward elimination algorithm . The main virtue of this presentation , beyond uniformity , is that it allows ideas and techniques to flow across the boundaries between areas of research. In particular , having noted that elimination algorithms and clustering algorithms are very similar in the context of constraint processing (Dechter and Pearl , 1989) , we find that this similarity carries over to all other tasks . We also show that the idea of conditioning , which is as universal as that of elimination , can be incorporated and exploited naturally and uniformly within the elimination framework . Conditioning is a generic name for algorithms that search the space of partial value assignments , or partial conditionings . Conditioning means splitting a problem into subproblems based on a certain condition . Al gorithms such as backtracking and branch and bound may be viewed as conditioning algorithms . The complexity of conditioning algorithms is exponential in the conditioning set, however , their space complexity is only linear . Our resulting hybrid of conditioning with elimination which trade off time for space (see also (Dechter , 1996b; R . D . Shachter and Solovitz , 1991)) , are applicable to all algorithms expressed within this framework . The work we present here also fits into the framework developed by Arnborg and Proskourowski (Arnborg , 1985; Arnborg and Proskourowski , 1989) . They present table -based reductions for various NP -hard graph prob lems such as the independent -set problem , network reliability , vertex cover , graph k-colorability , and Hamilton circuits . Here and elsewhere (Dechter and van Beek, 1995; Dechter , 1997) we extend the approach to a different set of problems .
78
R. DECHTER Following
preliminaries
algorithm
for
we
extend
(
4
and
to
)
.
2
-
9
.
tree
present
and
of
We
(
section
8
Then
poly
,
-
a
-
pos
-
utility
tree
for
.
.
expected
' s
)
)
maximum
schemes
section
3
explana
maximum
Pearl
describe
elimination
probable
the
the
to
then
most
finding
finding
-
(
the
algorithms
elimination
bucket
performance
finding
taBks
for
the
algorithms
combining
the
Conclusions
are
given
in
.
provide
of
nodes
a
signify
,
the
and
.
A
of
Xi
,
of
parent
Xi
in
)
E
E
E
,
we
G
,
of
while
(
Xi
its
)
pai
Xi
Let
=
acyclic
=
,
graph
between
having
i
Ipai
Xl
.
.
.
,
,
and
)
the
}
.
=
)
(
.
Xi
)
=
{
IXi
V
,
,
(
Xi
)
,
of
chi
.
.
i
:
#
For
,
{
of
the
arcs
.
.
.
,
Xn
set
(
,
are
is
a
set
edges
,
.
the
If
set
pointing
Xi
)
,
,
to
comprises
,
Pi
has
.
}
of
Xi
arise
Xi
it
,
variables
Ch
if
Xl
the
can
family
graph
variable
denoted
confusion
of
=
.
linked
conditional
directed
is
each
acyclic
directions
}
the
Xi
is
=
j
comprises
The
graph
a
V
,
domains
by
of
the
we
abbreviate
includes
no
Xi
directed
ignored
:
-
over
the
quantified
where
Xi
no
by
,
V
to
nodes
the
}
E
Whenever
directed
E
Xj
given
between
notion
un
graph
from
are
points
child
,
.
Dn
and
Xi
X
Ch
A
G
Xj
value
the
beliefs
acyclic
influences
influences
pa
to
takes
on
partial
directed
and
cycles
(
Xi
,
Xj
.
)
and
.
{
D1
,
Xi
of
identical
X
,
(
set
.
are
,
Xi
denoted
graph
)
domains
,
and
undirected
,
these
pair
(
points
variables
an
Xj
{
Xi
Xi
by
child
a
=
a
causal
relies
that
the
that
is
say
nodes
variables
=
about
by
direct
network
graph
and
Xi
of
reasoning
defined
that
of
belief
directed
elements
is
variables
strength
A
It
existance
the
probabilities
for
.
random
arcs
variables
formalism
uncertainty
representing
The
X
.
with
networks
P
the
the
clustering
conditions
In
of
to
)
relates
we
Preliminaries
Belief
pa
5
7
,
its
taBks
it
section
)
analyze
the
extend
method
der
{
and
Section
join
section
(
,
(
6
conditioning
(
)
2
and
to
hypothesis
section
section
updating
algorithm
section
teriori
(
belief
the
tion
(
.
product
.
,
Xn
A
P
its
The
.
}
be
a
belief
=
=
{
Pi
parents
belief
set
of
random
network
network
}
,
,
namely
is
where
Pi
a
variables
pair
(
G
denotes
over
P
)
where
probability
is
a
directed
relationships
probability
a
multivalued
G
probabilistic
conditional
represents
,
matrices
distribution
Pi
: . . . :
over
form
P(Xl, ...., Xn) ==Ili =lP (Xilxpai ) where an assignment (X I = Xl , ..., Xn = xn ) is abbreviated to x = (Xl , ..., xn) and where Xs denotes the projection of a tuple X over a subset of variables S . An evidence set e is an instantiated subset of variables . A = a denotes a partial assignment to a subset of variables A from their respective domains . We use upper case letter for variables and nodes in a graph and lower case letters for values in variable 's domains .
79
BUCKETELIMINATION
(a)
Figure
2.
Example
belief
2 .1
(b)
network
P (g , f , d , c , b, a ) = P ( glf ) P ( flc , b) P ( dlb , a ) P ( bla ) P ( cla )
Consider
the
belief
network
P ( g , f , d , c , b , a ) == P ( glf Its
acyclic The
graph following
namely
, given
each
proposition
some
observed
rest
of the
are
some
function
evidence
an
is known
that
multiply
, and
small
cycle
elimination relationship We partial
with
conclude tuple
( us , xp ) to Definition subset
an
tasks
variables
are
or for
existing this
singly
the
cycle
. In above
methods
will
be
with
some
of
tuple
S , where
variables Us
, and
functions XES
Xp by )
, the
for
sparse
called -
networks
sections be
,
bucket
presented
-
and
.
a variable a value
functions
( Pearl
Spiegelhal
conventions
Given
permit
, also
and
will
discussed
that
algorithm
approach
subsequent tasks
a util -
networks
well
notational
appended
all
propagation
- cutset
work
hypothesis also
, they
, 1988 ; Lauritzen
the
( map ) ,
variables
- connected this
the
section
the
. Nevertheless
clusters of
decision
NP - hard
extending
of
, 4 . given
( meu ) .
methods
small
a subset
of
to
hypothesis
finally
a subset
( Pearl
each
( mpe ) , or , given
problem
for to
, 1986 ) . These
2 . 2 ( elimination of
are
networks
, S a subset denote
the
of
Msignment
aposteriori to
updating
probability
explanation
, and
{ B ,C } .
: 1 . belief
posterior
probability
assignment
to
of
algorithm
- cutsets
networks
probable
probability
approaches
algorithms
case , pa ( F ) =
the
maximum
tree - clustering
ter , 1988 ; Shachter
belief
their
these
- connected
conditioning
the
utility
main
over
, finding
propagation two
this
most
assignment
expected
2a . In
a maximum
3 . Finding
, finding
a polynomial
with
the
by
, b ) P ( dlb , a ) P ( bla ) P ( cla ) .
, computing
, finding
maximizes
the
1988 ) . The
defined
observations
variables
maximizes
Figure
, 2 . Finding
that
to
queries
variables
variables
It
in
a set
or , given
ity
is given
) P ( flc
defined
not
xp
. Let in
u be
S . We
a
use
of Xp .
a function ( minxh
h defined ),
( maxxh
over ),
80
R. DECHTER
(meanxh), and (Ex h) are definedover U = S - {X } as follows. For every U = u, (minxh )(u) = minxh(u, x), (maxxh) (u) = maxxh(u, x), (Ex h)(u) = Ex h(u, x), and (meanxh)(u) =" Ex~ , where' ,XI is Ithe I cardinality of X 's domain. Given a set of functions h1, ..., hj defined over the subsets81, ..., 8j , the productfunction (lljhj ) and EJ hj are defined over U == UjSj . For every U = u, (lljhj )(u) = lljhj (usj)' and (Ej hj ) (u) = Ej hj (uSj) . 3 . An Elimination
Algorithm
for Belief
Assessment
Belief updating is the primary inference task over belief networks . The task is to maintain the probability of singleton propositions once new evidence arrives . Following Pearl 's propagation algorithm for singly -connected net works (Pearl , 1988) , researchers have investigated various approaches to belief updating . We will now present a step by step derivation of a general variable - elimination algorithm for belief updating . This process is typical for any derivation of elimination algorithms . Let X I = XI be an atomic proposition . The problem is to assess and update the belief in Xl given some evidence e. Namely , we wish to compute P (X I = Xl ie) = Q' . P (X I = Xl , e) , where Q' is a normalization constant . We will develop the algorithm using example 2.1 (Figure 2) . Assume we have the evidence 9 = 1. Consider the variables in the order d1 = A , C , B , F , D , G . By definition we need to compute
P(gif )P(flb,c)P(dla,b)P(cla)P(bla)P(a)
L
P (a , 9 == 1) ==
c,b,j ,d ,g = l
We can now apply some simple symbolic manipulation , migrating each conditional probability table to the left of summation variables which it does not reference, we get == P ( a ) L
P ( cla
) L
C
Carrying pute defined
the the
P ( bla ) L b
computation
rightmost by : AG ( f ) =
from
summation Lg
P ( flb
, c ) LP
f
= 1 P ( glf
right which ) and
( dlb
, a ) L
d
to
left
( from
generates place
it
P ( glf
)
( 1)
9= 1
as
to
A ) , we
a
function
G
over
far
to
the
left
first
com
-
f , AG ( f ) as
possible
,
yielding
P(a) L P(cla)L P(bla ) L P(flb,c)Aa(f) L P(dlb,a) C b f d
(2)
BUCKETELIMINATION bucketa bucketD bucketF bucketB bucketc bucketA
= = = = = =
81
P(glf), 9 = 1 P(dlb,a) P(flb, c) P(bla) P(cla) P(a)
The answer to the query P (alg == 1) can be computed by evaluating the last product and then normalizing . The bucket -elimination algorithm mimics the above algebraic manipu lation using a simple organizational devise we call buckets, as follows . First , the conditional probability tables (CPT s, for short ) are partitioned into buckets , relative to the order used d1 == A , C , B , F , D , G , as follows (going from last variable to first varaible ) : in the bucket of G we place all functions mentioning G . From the remaining CPTs we place all those mentioning D in the bucket of D , and so on . The partitioning rule can be alternatively stated as follows . In the bucket of variable X i we put all functions that mention Xi but do not mention any variable having a higher index . The resulting initial partitioning for our example is given in Figure 3. Note that observed variables are also placed in their corresponding bucket . This initialization step corresponds to deriving the expression in Eq . ( 1) . Now we process the buckets from top to bottom , implementing the
82
R. DECHTER
Bucket G Bucket D Bucket F Bucket B Bucket C
P(a )
Bucket A Figure 4.
Bucket elimination along ordering d 1 = A , C, B , F, D , G.
right to left computation of Eq. (1) . Bucketa is processedfirst . Processing a bucket amounts to eliminating the variable in the bucket from subsequent computation
. To eliminate
G , we sum over all values of g . Since , in this case
we have an observed value 9 = 1 the summation is over a singleton value .
Namely, AG(f ) = L9 =1P(glf ), is computedand placedin bucketF (this correspondsto deriving Eq. (2) from Eq. (1)). New functions are placed in lower buckets using the same placement rule . BucketD is processed next . We sum-out D getting AD(b, a) = Ed P (dlb, a) ,
that is computed and placed in bucketB, (which correspondsto deriving Eq. (3) from Eq. (2)). The next variable is F . BucketF contains two functions P (fJb, c) and Aa (f ), and thus, following Eq. (4) we generate the function
AF(b, c) := Ll P(flb , c) . Aa(f ) whichis placedin bucketB(this corresponds to deriving Eq. (4) from Eq. (3)) . In processingthe next bucketB, the function AB(a, c) == Lb (P (bla) . An (b, a) . AF(b, c)) is computed and placed in bucketc (deriving Eq. (5) from Eq. (4)). In processing the next bucketc, Ac (a) = LCECP (cla) . AB(a, c) is computed (which correspondsto deriving Eq. (6) from Eq. (5)). Finally , the belief in a can be computed in bucketA, P (alg == 1) == P (a) . AC(a) . Figure 4 summarizes the flow of computation of the bucket elimination algorithm for our example . Note that since throughout this process we recorded two -dimensional functions at the most ,
the complexity the algorithm using ordering d1 is (roughly) time and space quadratic in the domain sizes. What will occur if we use a different variable ordering ? For example , lets apply the algorithm using d2 = A , F , D , C , B , G . Applying algebraic manipulation from right to left along d2 yields the following sequence of deri vations
:
P(a, 9 = 1) = P (a) Ll Ld Lc P(cla) Lb P(bla) P(dla, b)P(flb , c) Lg =1P (glf )=
BUCKETELIMINATION
bucket
B
bucket
C
bucket
D
bucket
F
bucket
A
83
-
-
= P(a)
(a) Figure 5.
P(a) Ll P(a) Ll P(a) Ll P(a) Ll
(b)
The buckets output
when processing along d2 = A , F , D , C , B , G
AG(f ) Ld Lc P(cla) Lb P(bla) P(dla, b)P(flb , c)== Aa(f ) Ld Lc P(Cla)AB(a, d, c, f ) == A9(f ) Ld Ac(a, d, f ) == Aa(f )AD(a, f ) =
P (a)AF(a) The bucket elimination
process for ordering
d2 is summarized
in Figure
5a. Each bucket contains the initial CPTs denoted by P 's, and the functions generated throughout the process, denoted by AS. We summarize with a general derivation of the bucket elimination algo-
rithm , called elim-bel. Consider an ordering.ofthe variablesX = (Xl , ..., Xn). Using the notation Xi = (Xl ' ..., Xi) and xi = (Xi, Xi+l , ..., Xj), where Fi is the family of variable Xi , we want to compute :
P(Xl, e) = ~ P(Xn,e) = ~ ~ IIiP(xi, elXpai )= X= X2n
_(nX2
l ) Xn
Seperating X n from the rest of the variables we get :
P(Xn,elxpan )llxiEchnP (xi,elXpai )= }=: IIXiEX - FnP(Xi, elXpai ) .E xn _ (n -l) X=X2
where
An(XUn ) = L P(xn, eIXpan )I1XiEChnP (Xi, e!Xpai ) xn
(7)
84
R. DECHTER
Figure 6.
Algorithm
elim - bel
Where Un denoted the variables appearing with X n in a probability component , excluding Xn . The process continues recursively with Xn - l . Thus , the computation performed in the bucket of Xn is captured by Eq . (7) . Given ordering Xl , ..., Xn , where the queried variable appears first , the C PTs are partitioned using the rule described earlier . To process each bucket , all the bucket 's functions , denoted AI , ..., Aj and defined over subsets SI , ..., Sj are multiplied , and then the bucket 's variable is eliminated by summation . The computed function is Ap : Up - t R , Ap == }::;x ni = IAi , where Up = UiSi - Xp . This function is placed in the bucket of it : largest index variable in Up. The procedure continues recursively with the bucket of the next variable going from last variable to first variable . Once all the buckets are processed, the answer is available in the first bucket . Algorithm elim -bel is described in Figure 6.
Theorem 3.1 Algorithm elim-bel compute the posterior belief P (xlle ) for any given ordering of the variables. 0 Both the peeling algorithm for genetic trees (Cannings et al., 1978), and Zhang and Poole's recent algorithm (Zhang and Poole, 1996) are variations of elim-bel.
BUCKETELIMINATION
(a )
85
(c)
(b)
Figure 7.
Two ordering of the moral graph of our example problem
3.1. COMPLEXITY We see that although elim -bel can be applied using any ordering , its complexity varies considerably . Using ordering d1 we recorded functions on pairs of variables only , while using d2 we had to record functions on four variables
(see Bucketc in Figure 5a) . The arity of the function recorded in a bucket equals the number of variables appearing in ing the bucket 's variable . Since recording a space exponential in r we conclude that the exponential in the size of the largest bucket
that processed bucket , exclud function of arity r is time and complexity of the algorithm is which depends on the order of
.
processIng
.
Fortunately , for any variable ordering bucket sizes can be easily read in advance
from
an ordered
associated
with
the elimination
process . Consider
the moral graph of a given belief network . This graph has a node for each propositional variable , and any two variables appearing in the same CPT are connected in the graph . The moral graph of the network in Figure 2a is given in Figure 2b . Let us take this moral graph and impose an ordering on its nodes. Figures 7a and 7b depict the ordered moral graph using the two orderings d1 = A , C , B , F , D , G and d2 = A , F , D , C , B , G . The ordering is pictured from bottom up . The width of each variable in the ordered graph is the number of its earlier neighbors in the ordering . Thus the width of G in the ordered graph along d1 is 1 and the width of F is 2. Notice now that using ordering d1, the
number
of variables
in the
initial
buckets
of G and
2 respectively . Indeed , in the initial partitioning
F , are also
1 , and
the number of variables
mentioned in a bucket (excluding the bucket's variable) is always identical to the width of that node in the corresponding ordered moral graph . During processing we wish to maintain the correspondance that any
86
R. DECHTER
two nodes in the graph are connected if there is function (new or old ) deefined on both . Since, during processing , a function is recorded on all the variables apearing in a bucket , we should connect the corresponding nodes in the graph , namely we should connect all the earlier neighbors of a processed variable . If we perform this graph operation recursively from last node to first , (for each node connecting its earliest neighbors ) we get the the induced graph . The width of each node in this induced graph is identical to the bucket 's sizes generated during the elimination process (see Figure 5b) . Example 3 .2 The induced moral graph of Figure 2b, relative to ordering d1 == A , C , B , F , D , G is depicted in Figure 7a. In this case the ordered graph and its induced ordered graph are identical since all earlier neighbors of each node are already connected. The maximum induced width is 2. Indeed, in this case, the maximum arity of functions recorded by the elimination algorithms is 2. For d2 == A , F , D , C , B , G the induced graph is depicted in Figure 7c. The width of C is initially 1 (see Figure 7b) while its induced width is 3. The maximum induced width over all variables for d2 is 4, and so is the recorded function 's dimensionality . A formal definition of all the above graph concepts is given next . Definition 3 .3 An ordered graph is a pair (G , d) where G is an undirected graph and d = Xl , ..., Xn is an ordering of the nodes. The width of a node in an ordered graph is the number of the node 's neighbors that precede it in the ordering . The width of an ordering d, denoted w (d) , is the maximum width over all nodes. The induced width of an ordered graph , w* (d) , is the width of the induced ordered graph obtained as follows : nodes are processed from last to first ,. when node X is processed, all its preceding neighbors are connected. The induced width of a graph , w * , is the minimal induced width over all its orderings . The tree-width of a graph is the minimal induced width plus one (Arnborg , 1985) . The established connection between buckets ' sizes and induced width motivates finding an ordering with a smallest induced width . While it is known that finding an ordering with the smallest induced width is hard (Arnborg , 1985) , usefull greedy heuristics as well as approximation algorithms are available (Dechter , 1992; Becker and Geiger , 1996) . In summary , the complexity of algorithm elim -bel is dominated by the time and space needed to process a bucket . Recording a function on all the bucket 's variables is time and space exponential in the number of variables mentioned in the bucket . As we have seen the induced width bounds the arity of the functions recorded ; variables appearing in a bucket coincide with the earlier neighbors of the corresponding node in the ordered induced moral graph . In conclusion :
BUCKETELIMINATION
87
Theorem 3 .4 Given an ordering d the complexity of elim -bel is (time and space) exponential in the induced width w* (d) of the network 's ordered moral graph . 0
3.2. HANDLINGOBSERVATIONS Evidence should be handled in a special way during the processing of buck ets. Continuing with our example using elimination on order d1, suppose we wish to compute the belief in A = a having observed b = 1. This observation is relevant only when processing bucket B . When the algorithm arrives at that bucket , the bucket contains the three functions P (bla) , AD (b, a) , and AF (b, c) , as well as the observation b == 1 (see Figure 4) . The processing rule dictates computing AB(a, c) == P (b = 1Ia) AD(b == 1, a) AF(b == 1, c) . Namely , we will generate and record a two -dimensioned function . It would be more effective however , to apply the assignment b == 1 to each function in a bucket separately and then put the resulting func tions into lower buckets . In other words , we can generate P (b == 11a) and AD (b == 1, a) , each of which will be placed in the bucket of A , and AF (b = 1, c) , which will be placed in the bucket of C . By so doing , we avoid increasing the dimensionality of the recorded functions . Processing buckets containing observations in this manner automatically exploits the cutset conditioning effect (Pearl , 1988) . Therefore , the algorithm has a special rule for processing buckets with observations : the observed value is assigned to each function in the bucket , and each resulting function is moved individ ually to a lower bucket . Note that , if the bucket of B had been at the top of our ordering , as in d2, the virtue of conditioning on B could have been exploited earlier . When processing bucketB it contains P (bla) , P (dlb, a) , P (flc , b) , and b == 1 (see Figure 5a) . The special rule for processing buckets holding observations will place P (b == lla ) in bucket A, P (dlb == 1, a) in bucketD , and P (flc , b == 1) in bucketF . In subsequent processing , only one-dimensional functions will be recorded . We see that the presence of observations reduces complexity . Since the buckets of observed variables are processed in linear time , and the recorded functions do not create functions on new subsets of variables , the corresponding new arcs should not be added when computing the induced graph . Namely , earlier neighbors of observed variables should not be connected . To capture this refinement we use the notion of adjusted induced graph which is defined recursively as follows . Given an ordering and given a set of observed nodes, the adjusted induced graph is generated by processing from top to bottom , connecting the earlier neighbors of unobserved nodes only . The adjusted induced width is the width of the adjusted induced graph .
88
R. DECHTER
Theorem
3 .5 Given a belief network having n variables , algorithm elim -
bel when using ordering d and evidencee, is (time and space) exponential in the adjusted inducedwidth w* (d, e) of the network'8 orderedmoral graph. 0
3 .3 .
FOCUSING
ON
RELEVANT
SUBNETWORKS
We will now present an improvement to elim -bel whose essenceis restricting the computation to relevant portions of the belief network . Such restrictions are already available in the literature in the context of existing algorithms
(Geiger et al., 1990; Shachter, 1990) . Since summation over all values of a probability function is 1, the recorded functions of some buckets will degenerate to the constant 1. If we could recognize such cases in advance, we could avoid needless compu tation by skipping some buckets . If we use a topological ordering of the
belief network's acyclic graph (where parents precede their child nodes) , and assuming that the queried variable starts the ordering ! , we can recognize skipable buckets dynamically , during the elimination process. Proposition 3 .6 Given a belief network and a topological ordering ~Y 1, ..., Xn , algorithm elim -bel can skip a bucket if at the time of processing, the bucket contains no evidence variable , no query variable and no newly computed function
. 0
Proof : If topological ordering is used, each bucket (that does not contain the queried variable) contains initially at most one function describing its probability conditioned on all its parents . Clearly if there is no evidence, summation will yield the constant 1. 0 Example 3 .7 Consider again the belief network whose acyclic graph is given in Figure 2a and the ordering d1 = A , C , B , F , D , G , and assume we want to update the belief in variable A given evidence on F . Clearly the buckets of G and D can be skipped and processing should start with bucket F . Once the bucket of F is processed, all the rest of the buckets are not skipable. Alternatively , the relevant portion of the network can be precomputed by using a recursive marking procedure applied to the ordered moral graph .
(seealso (Zhang and Poole, 1996)). Definition 3 .8 Given an acyclic graph and a topological ordering that starts with the queried variable , and given evidence e, the marking process works as follows . An evidence node is marked, a neighbor of the query variable is marked, and then any earlier neighbor of a marked node is marked . 1otherwise , the queried variable can be moved to the top of the ordering
BUCKETELIMINATION Algorithm .
.
89
elim -bel
.
2 . Backward
: For p -(- n downto
1 , do
for all the matrices AI , A2, ..., Aj in bucketp, do . (bucket with observed variable) if Xp == xp appears in bucketp, then substitute Xp = xp in eachmatrix Ai and put eachin appropriate bucket.
. else, if bucketpis NOT skipable , t~en
Up-(- Ui=1Si - {Xp} Ap== LXp lli =l Ai. Add Apto the largest -index variable in Up' .
.
.
Figure 8.
Improved algorithm elim-bel
The marked belief subnetwork, obtained by deleting all unmarkednodes, can be processed now by elim-bel to answer the belief-updating - query. - It is eaBY to see that Theorem ponential graph .
3 .9 The complexity of algorithm elim - bel given evidence e is exin the adjusted induced width of the marked ordered moral sub -
Proof : Deleting the unm larked nodes from the belief network results in a belief subnetwork whose distribution is identical to the marginal distribu ~~ tion over the marked variables . o .
4. An Elimination
Algorithm
for mpe
In this section we focus on the task of finding the most probable explana tion . This task appears in applications such as diagnosis and abduction . For example , it can suggest the disease from which a patient suffers given data on clinical findings . Researchers have investigated various approaches to finding the mpe in a belief network . (See, e.g., (Pearl , 1988; Cooper , 1984; Peng and Reggia , 1986; Peng and Reggia, 1989)) . Recent proposals include best first -search algorithms (Shimony and Charniack , 1991) and algorithms based on linear programming (Santos, 1991) . The problem is to find XOsuch that P (xO) == maxx IIiP (xi , elxpai) where x == (Xl ' ..., xn ) and e is a set of observations . Namely , computing for a given ordering Xl , ..., Xn ,
M==llXn !axP(x)==IJlax ) Xn -l max Xnlli=lP(Xi,elXpai
(8)
This can be accomplished as before by performing the rnxirnization operation along the ordering from right to left , while migrating to the left , at
90
R. DECHTER
each step , all components that do not mention the maximizing variable . We get ,
M == m~x P (xn , e) == Jllax max IIiP (xi , elXpai) == X = Xn
X (n - l )
Xn
ill_ax IlXiEX - FnP(Xi, elXpai) . max P (Xn, elxpan)IlXiEchnP(Xi, elXpai) =
X =
Xn
-
l
Xn
ill_ax IIXiEX - FnP(Xi, elXpai) . hn(xun)
X =
Xn
-
l
where
hn(xun) == ~ ~x P(Xn, eIXpan )I1XiEChnP (Xi, elXpai ) Where Un are the variables appearing in components defined over X n. Clearly , the algebraic manipulation of the above expressions is the same as the algebraic manipulation for belief assessment where summation is replaced by maximization . Consequently , the bucket -elimination procedure elim -mpe is identical to elim -bel except for this change. Given ordering Xl , ..., X n, the conditional probability tables are partitioned as before . To process each bucket , we multiply all the bucket 's matrices , which in this case
are denoted hi , ..., hj and defined over subsets51, ..., 5j , and then eliminate the bucket 's variable by maximi .zation . The computed function in this case
is hp : Up-t R, hp = maxxpni =lhi , whereUp= UiSi - Xp. The function obtained by processing a bucket is placed in the bucket of its largest -index
variable in Up. In addition , a function x~(u) = argmaxxphp (u) , which relates an optimizing value of Xp with each tuple of Up, is recorded and placed in the bucket
of X p .
The procedure continues recursively , processing the bucket of the next variable while going from last variable to first variable . Once all buckets are processed, the mpe value can be extracted in the first bucket . When this backwards phase terminates the algorithm initiates a forwards phase to compute an mpe tuple . Forward phase: Once all the variables are processed, an mpe tuple is computed by assigning values along the ordering from X I to Xn , consult ing the information recorded in each bucket . Specifically , once the partial
assignment x == (Xl , ..., Xi- I) is selected, the value of Xi appended to this tuple is xi (X) , where XOis the function recorded in the backward phase. The algorithm is presented in Figure 9. Observed variables are handled as in elim - bel .
Example 4 .1 Consider again the belief network of Figure 2. Given the or dering d == A , C , B , F , D , G and the evidence g == 1, process variables from last to the first after partitioning the conditional probability matrices into
buckets, such that bucketa = { P (glf ), 9 = 1} , bucketD == { P (dlb, a)} ,
BUCKETELIMINATION
Figure 9.
bucketF bucket
A
Ilf
) ,
is
placed
==
{ P
( f
==
{
( a ) }
and
maxd
( dl
( dlb
matrices
..
and
place
the
function .
P
( a )
tension
of
==
.
is
buckets
,
we
in
F
,
to
determined
the ,
1 ,
bucket be
( bl
B
.
h D hc
mpe
.
( a )
==
==
the
( f
=
.
)
==
also ,
now
maxi
h F
p
P
given
( flb
tuple
( c /a )
( g
== ( f
( b , a )
DO
( b , a )
by
.
hc
we
hB
==
( f
it
( a A
going
) ,
record
place .
) ==
two
, c )
bucket ,
P
and
hD
,
and
in
,
contains
B
( b , c )
maxc
mpe
( c /a ) }
argmaxhG
eliminate
is
with
)
Record
To
( b , a )
ha ( f
next
value
along
.
( b , c ) B
{ P
computing
processed
hF
a )
get GO
by
bucket
compute ,
==
==
function
Compute
P
bucketc
bucketD
in
Finally
, c ) ,
in and
M
==
forward
.
process
which
) .
maxb
C
( a ) ,
( f
,
we
partial
can
compile
tuples
be
viewed
as
information to
a
compilation
regarding
variables
higher
in
the the
( or most
o .rdering
learning
)
probable ( see
ex
also
-
section
.2 ) .
Similarly bounded are
in
hG
) } 9
The
result of
( bla
assign
Process
function
A
backward ,
and
, .
the
eliminate
. hG
.
bucket
( a , c )
bucket
the
The
7
, c )
G
{ P
bucketF
well
resulting
To in
through
phase
putting
h B
it
maxa
as
The
==
process in
and
Algorithm elim-mpe
bucketB
To
, a ) .
( flb the
bucketc place
P
,
result
bucketc
b , a )
argmaxDP
.
the
in P
P
place
/b , c ) }
91
to
the
case
the
induced
exponentially bounded
by
of in
the
belief
updating
dimension width
, of
w
the
the
* ( d , e )
complexity recorded
of
the
of matrices
ordered
elim
- mpe
, and moral
graph
is
those .
In
92
R. DECHTER
summary : Theorem 4 .2 Algorithm elim -mpe is complete for the mpe task. Its complexity (time and space) is O (n . exp (w* (d, e) )) , where n is the number of variables and w * (d, e) is the e-adjusted induced width of the ordered moral graph . 0
5. An Elimination
Algorithm
for MAP
We next present an elimination algorithm for the map task . By its defini tion , the task is a mixture of the previous two , and thus in the algorithm some of the variables are eliminated by summation , others by maximization . Given a belief network , a subset of hypothesized variables A = { AI , ..., Ak } , and some evidence e, the problem is to find an assignment to the hypoth esized variables that maximizes their probability given the evidence. For- . mally , we wish to compute maXa;k P (x , e) = maXa;k EXk +l IIi = lP (Xi , elxpai) where x == (aI , ..., ak, Xk+ l , ..., xn) . In the algebraic manipulation of this expression , we push the maximization to the left of the summation . This means that in the elimination algorithm , the maximized variables should initiate the ordering (and therefore will be processed last ) . Algorithm elim map in Figure 10 considers only orderings in which the hypothesized vari ables start the ordering . The algorithm has a backward phase and a forward phase, but the forward phase is relative to the hypothesized variables only . Maximization and summation may be somewhat interleaved to allow more effective orderings ; however , we do not incorporate this option here. Note that the relevant graph for this task can be restricted by marking in a very similar manner to belief updating case. In this case the initial mark ing includes all the hypothesized variables , while otherwise , the marking procedure is applied recursively to the summation variables only . Theorem 5 .1 Algorithm elim -map is complete for the map task. Its complexity is O (n . exp (w* (d, e) ) , where n is the number of variables in the relevant marked graph and w* (d, e) is the e-adjusted induced width of its marked moral graph . 0 6 . An Elimination
Algorithm
for MEU
The last and somewhat more complicated task we address is that of find ing the maximum expected utility . Given a belief network , evidence e, a real-valued utility function u (x ) additively decomposable relative to func tions 11, ..., Ij defined over Q = { Ql , ..., Qj } , Qi <; X , such that u (x ) = LQjEQ fj (xQj ) ' and a subset of decision variables D = { D1 , ...Dk } that are assumed to be root nodes, the meu task is to find a set of decisions
BUCKETELIMINATION
Figure 10.
93
Algorithm elim-map
dO== (dOl, ..., dOk) that maximizes the expected utility . We assumethat the variables not appearing in D are indexed Xk+l , ..., Xn . Formally, we want to compute E ==
and
max L1,..Xn d1 ,..,dXk k+
lli =l P (Xi, elXpai' d1, ..., dk)U(X) ,
do == argmaxDE
As in the previous tasks , we will begin by identifying the computation associated with Xn from which we will extract the computation in each bucket . We denote an assignment to the decision variables by d == (d1, ..., dk) and xi == (Xk, ..., xi ) . Algebraic manipulation yields
E=m ;xXk IIi=!P(X,elXpai i 'd)Q }:J"EQ -}:n+ -ll}:Xn fj (xQj)
We can now separate the components in the utility functions into those mentioning X n, denoted by the index set tn , and those not mentioning X n, labeled with indexes in = { I , ..., n} - tn. Accordingly we get
E=max :+l)}= :n~lP(Xi ,elxpai 'd).(}= :nfj(xQj )+JEtn }= dXk .1 .:fj (xQj)) _(}= n -l Xn JE
94
R. DECHTER
E == m; x[_(nL L IIi : lP (Xi, elXpai ' d} jEln L fj (xQj} Xk+l l) Xn L L IIi=lP (Xi, e\Xpai ' d) L fj (xQj)] _(n- l) Xn jEtn Xk+l By migrating to the left of Xn all of the elements that are not a function of Xn , we get
max L-l llXiEX -Fn P(Xi,elxpai ' d).(L fj(xQj ))LXnllXiEFnP (Xi,elxpai ' d) d [Xk . l -n JE n +l (9) +-n L-l IIxiEx -FnP (xi,elXpai ' d).LXnIIxiEFnP (xi,elXpai ' d)jEtn L fj (xQj )] Xk +l
An(xunld)==L llXiEFiP (Xi, elxpai ' d) Xn
We define en over W n as
8n(xWnld ) ==L llXiEFnP (Xi, elXpai ' d) L Ij (XQj)) Xn
jEtn
1111111111111
We denote by Un the subset of variables that appear with Xn in a proba bilistic component , excluding Xn itself , and by W n the union of variables that appear in probabilistic and utility components with Xn , excluding Xn itself . We define An over Un as (x is a tuple over Un U Xn )
After substituting Eqs. (10) and (11) into Eq. (9), we get """""
"""""
en (x WnId) ]
E == m: x -Ln.., llXiEX- FnP(Xi, elxpai'd).'\n(xUnld)[~ Ij (xQj)+ An (XUn Id) - l JEln Xk + l
(12) The functions (In and An compute the effect of eliminating Xn . The result
(Eq. (12)) is an expression, which does not include Xn , where the product has one more matrix An and the utility
components have one more
element Tn== ~ . Applyingsuchalgebraic manipulation to therestof the variables in order yields the elimination algorithm elim -meu in Figure 11. We assume here that decision variables are processed last by elim -meu . Each bucket contains utility components f)i and probability components , Ai . When
there is no evidence , An is a constant
and we can incorporate
the marking modification we presented for elim -bel . Otherwise , during processing, the algorithm generates the Ai of a bucket by multiplying all its
BUCKETELIMINATION
Figure 11.
Algorithm
95
elim - meu
probability components and summing over Xi . The () of bucket Xi is computed as the average utility of the bucket ; if the bucket is marked , the average utility of the bucket is normalized by its A. The resulting () and A are placed into the appropriate buckets . The maximization over the decision variables can now be accomplished using maximization as the elimination operator . We do not include this step explicitly , since, given our simplifying assumption that all decisions are root nodes, this step is straightforward . Clearly , maximization and summation can be interleaved to some degree, thus allowing more efficient orderings . As before , the algorithm 's performance can be bounded as a function of the structure of its augmented graph. The augmented graph is the moral graph augmented with arcs connecting any two variables appearing in the same utility component Ii , for some i . Theorem 6 .1 Algorithm elim -meu computes the meu of a belief network augmented with utility components (i . e., an influence diagram ) in 0 (n .
96
R. DECHTER
Z3 Z2 ~
VI V2 U3 Xl (a)
(b)
Figure 12.
(a) A poly-tree and (b) a legal processing ordering
exp (w * (d , e) ) , where w * (d , e) is the induced width along d of the augmented moral graph . 0 Tatman and Schachter (Tatman and Shachter , 1990 ) have published an algorithm that is a variation of elim - meu , and Kjaerulff 's algorithm (Kjreaerulff , 1993 ) can be viewed as a variation of elim - meu tailored to dynamic probabilistic networks .
7. Relation of Bucket Elimination
to Other Methods
7.1. POLY-TREE ALGORITHM When the belief network is a poly -tree , both belief assessment, the mpe task and map task can be accomplished efficiently using Pearl 's poly -tree
algorithm (Pearl, 1988) . As well, when the augmentedgraph is a tree, the meu can be computed efficiently . A poly -tree is a directed acyclic graph whose underlying undirected graph has no cycles. We claim that if a bucket elimination algorithm process variables in a
topological ordering (parents precedetheir child nodes), then the algorithm coincides (with someminor modifications) with the poly-tree algorithm . We will demonstrate the main idea using bucket elimination for the m pe task . The arguments are applicable for the rest of the tasks . Example
7 .1 Consider the ordering Xl , U3, U2, VI , YI , ZI , Z2, Z3 of the poly -
tree in Figure 12a, and assumethat the last four variablesare observed(here we denote an observed value by using primed lowercase letter and leave other
variables in lowercase ) . Processingthe bucketsfrom last to first , after the first four buckets have been processed as observation buckets, we get
bucket(U3) = P (U3) , P (xllul , U2, U3) , P (ZI3Iu3) bucket(U2) = P (U2), P (z/2Iu2)
BUCKETELIMINATION
97
bucket(U1) = P(Ul), P(Z/IIU1) bucket(,X 1)- := P(y/llx1) . -. . When processing bucket(U3) by elim-mpe, we get hU3(Ul , u2, U3), which is placed in bucket(U2) . The final resulting bucketsare bucket(U3) == P (U3), P (xllul , u2, U3) , P (ZI3Iu3) bucket(U2) == P (U2), P (ZI2Iu2), hu3(Xl , U2, UI) bucket(UI) == P (UI) , P (z' lluI ), hu2(Xl , UI) bucket(XI ) == P (Y' llxI ), hUl(XI) We can now choose a value Xl that maximizes
the product
in X I 's bucket ,
then choose a value Ul that maximizes the product in VI 's bucket given the selected value of X I , and so on. I t is easy to see that if elim -mpe uses a topological ordering of the
poly-tree, it is time and spaceO(exp(IFI )), where IFI is the cardinality of the maximum family size. For instance , in Example 7.1, elim -mpe records
the intermediate function hU3(Xl , U2, Ul ) requiring O(k3) space, where k bounds
the
domain
size for
each
variable
. Note , however
, that
Pearl ' s al -
gorithm (which is also time exponential in the family size) is better , as it records functions on single variables only . In order to restrict space needs, we modify elim -mpe in two ways . First , we restrict processing to a subset of the topological orderings in which sibling nodes and their parent appear consecutively as much as possible. Second, whenever the algorithm reaches a set of consecutive buckets from the same family , all such buckets are combined and processed aE one superbucket. With this change, elim -mpe is similar to Pearl 's propagation algorithm on poly -trees .2 Processing a super-bucket amounts to eliminating all the super-bucket 's variables without recording intermediate results . Example 7 .2 Consider Example 7.1. Here, instead of processing each b1tcket OfUi separately, we compute by a br'ute-force algorithm the function hu1,U2,U3(Xl ) in the super-bucket of VI , V2, U3 and place the function in the bucket of XI We get the unary function hCI1 ,U2,(T3(XI ) == maXttl,U211t3 P (U3)P (XII 'UI, U2, 'U3)P (Z/3Iu3)P ('lt2)P (Z/2Iu2)P ('ltl ) P (Z/1Iul ) .
The details for obtaining an ordering such that all families in a poly -tree can be processed a5 super-buckets can be worked out , but are beyond the scope
of this
Proposition
paper . In summary
,
7 .3 There exist an ordering of a poly-tree, such that bucket-
elimination algorithms (elim-bel, elim-mpe, etc.) with the super-bucketmodification have the same time and space complexity as Pearl 's poly -tree algorithm for the corresponding tasks. The modified algorithm 's time complexity is exponential in the family size, and it requires only linear space. 0 2Actually , Pearl 's algorithm rooted
tree
in
order
to
be identical
should be restricted with
ours .
to message passing relative
to one
98
R. DECHTER
F
Figure 13 .
7 .2 .
@
Clique - tree associated with the induced graph of Figure 7a
JOIN-TREECLUSTERING
Join -tree clustering (Lauritzen and Spiegelhalter , 1988) and bucket elimi nation are closely related and their worst -case complexity (time and space) is essentially the same. The sizes of the cliques in tree-clustering is identical to the induced -width plus one of the corresponding ordered graph . In fact , elimination may be viewed as a directional (i .e., goal- or query -oriented ) version of join - tree clustering . The close relationship between join -tree clustering and bucket elimination can be used to attribute meaning to the intermediate functions computed by elimination . Given an elimination ordering , we can generate the ordered moral induced graph whose maximal cliques (namely , a maximal fully -connected subgraph ) can be enumerated as follows . Each variable and its earlier neighbors are a clique , and each clique is connected to a parent clique with whom it shares the largest subset of variables (Dechter and Pearl , 1989) . For example , the induced graph in Figure 7a yields the clique-tree in Figure 13, If this ordering is used by tree-clustering ,the same tree may be generated . The functions recorded by bucket elimination can be given the following meaning (details and proofs of these claims are beyond the scope of this paper ) . The function hp(u) recorded in bucketp by elim -mpe and defined over UiSi - { Xp } , is the maximum probability extension of u , to variables appearing later in the ordering and which are also mentioned in the clique subtree rooted at a clique containing Up' For instance , hF (b, c) recorded by elim -mpe using d1 (see Example 3.1) equals maxf ,g P (b, c, f , g) , since F and G appear in the clique-tree rooted at (FC B ) . For belief assessment, the function Ap = l:::xp ni =lAi , defined over Up = UiSi - Xp , denotes the probability of all the evidence e+P observed in the clique subtree rooted at a clique containing Up, conjoined with u . Namely , Ap(U) = P (e+P, u) .
BUCKETELIMINATION
99 P(d=llb,a)P(g=Olf =O) P(d=lib,s)P(g=Olf=1) P(d=lib,s)P(g=Olf=O)
...
...
...
...
P(d=Ilb,a)P(g=Olf=1) Figure 14. probability tree
8 . Combining
Elimination
and Conditioning
A serious drawback of elimination algorithms is that they require considerable memory for recording the intermediate functions . Conditioning , on the other hand , requires only linear space. By combining conditioning and elimination , we may be able to reduce the amount of memory needed yet still have performance guarantee . Conditioning can be viewed as an algorithm for processing the algebraic expressions defined for the task , from left to right . In this case, partial results cannot be assembled; rather , partial value assignments (conditioning on subset of variables ) unfold a tree of subproblems , each associated with an assignment to some variables . Say, for example , that we want to compute the expression for m pe in the network of Figure 2: M
=
max P (glf ) P ( flb , c ) P ( dla , b) P ( cla ) P ( bla ) P ( a ) a,c,b,f ,d,g
= max P ( a ) maxP a c
( cla ) maxP b
( bla ) maxP f
( flb , c ) maxP d
( dlb , a ) maxP 9
(glf ) . ( 13 )
We can compute along
the
traversed algorithms
the
ordering either such
expression
from
breadth
first - first
as best - first
by traversing variable
to
or depth - first search
the tree
last
variable
and will
and branch
in Figure .
result
and bound
The
14 , going
tree
in known .
can
be
search
100
R. DECHTER
Algorithm
elim
Input d ;
: A a
subset :
:
1 .
For
2 .
of
The
Initialize
PI
.
P
P
=
=
probable
We
C
output
max
of
{ p , PI p
and to
will
on
and
}
V C
mpe
a
== X
-
C
. Clearly
m :
x
P
responding
value
denote
by
v
== m ' i(x
m ~ x
every
the
partial
a
of
the
is
computed 0
When graph
( n
Given
. exp
on
be
the
variables ordered
P
with
conditioned
( c , v , e ) = = maXc
c , we
elimina
variables
V
and
by
, vlliP
compute
= lP
( Xi
maxv
while
, C
c an
-
~
X
,
assignment
, c , v , elXpai
P
kept
. The
the
a
( v , c , e )
)
and
a
cor
-
without variables
of
( w * ( d , eUc ordered
) +
O ICI
moral
in such
e
U
that
( n ) ) ,
U
all
variables
of
conditioned
15
.
complexity
the c )
the
Figure
space of
the
nodes
connecting
earlier
adjusted
, C
( w * ( d , cUe
where
constitute its
in
and
variables . exp
graph
c
for retaining
is
ordered .
In
this
neighbors
.
conditioning is
tuple
rest
w * ( d , e
and
variables
enumerated
presented
time the
width
observed
- mpe
the
conditioned
the
is
over
)
be
, and
c ,
generated
set
the will
algorithm
induced
conditioned
- cond
, e , CIXpai
treating
variables
both is
( Xi
computation
assignment
for
elim
the can
. ) .
- mpe
to
probability
graph
8 . 1
O
be
by
and
algorithm
c .
observations
conditioning of
assignment
basic
value
induced
is
an
- cond
combining
argmaxvIIi
maximum
evidence
plexity
as
probability
elim
conditioned
will
adjusted
Theorem of
the
particular
graph
=
This
exponentially
both
,
.
subset
algorithm .
the
bounded
ity
) (c )
probability
Given
of
of a
tuple
elimination
computing
case
variables
tuple
variables
maximum
moral
idea be
combinations
of
cUe
Algorithm
C
( Xv
using
with maximum
tuple
15 .
Let
maximizing
observed
the
c , do
maximizing
.
. We
( x , e )
, for
as
of e .
,
Therefore
by
ordering
.
- mpe the
the
task
=
elim
( update
demonstrate
the
} ; an
; observations
assignment
Figure
tion
, . . . , Pn
o .
The
Return
{ PI
variables
assignment
f ~
BN
conditioned
most
every
.
- mpe
network
C
Output
- cond
belief
the
that
a
induced
was
adjusted
cycle
- cutset
induced
,
the
space
) ) , while
width
its
width
-
com
w * ( d , cUe
relative
of
complex time
the equals
to
e
graph 1 . In
and
,
) ,
the this
BUCKETELIMINATION case
elim
- cond
- mpe
1988 ; Dechter Clearly take
ables
. There that
can
an
by
the
one
super
some
, during
We
mon
bucket
the
of
elim - mpe
possible
The
for
are
and to
belief
algorithms be
more
performance
reduced
suffer
of from
: exponential
using
which
avoiding
into
recording
Dechter
, 1996 ) .
in
aB similar
probabilis
-
to
-
observing
appeared
bucket the
both
com -
in the
past
et al . , 1997 ) .
shown
bucket the
and
, that
- tree
toward
by
the
standard
- elimination
and
usual and
difficulty exponential
Rish
, 1994 ; Dechter . We
have
portions - width
tree - clustering with in
and
that
which
the
the allows of the
complexity bound
.
algorithms dynamic
worst
constraint
, 1997 ) . Space
shown
, and
- based
induced
time
resolution
algo -
( always
highlights
relevant graph
associated
plague
the
.
clustering
the
accompanied
than
then orderings
framework
join
on
algorithms
elim - meu , for
derived
ex -
effort
if
network some and
pro -
expressing
, algorithms
example
proposed and
of
conscience
elim - map
also
conditioning
framework
explicitly
the
dynamic
way
without
, 1988 ) for
procedures
refined
this
- connected
to
not
were
uniform
, for
( Pearl
of
generalizes
and
. In
- elimination
space
deficiencies ( Dechter
also
network
applies
- assessment
are
performance
elimination
buckets
algorithms
, which
a singly
elegance
which
gramming
given
bucket
bounds
to
have
were
enhancements
The
decides
method
and
viewed
( Bistarelli
of the
same
. Such
likely
, thus
frameworks
a concise
algorithms
common
be
had
in
' s algorithms
network
is
by
elim -
method
and
consecutive
many
can
reasoning
) . The
simplicity
paper
framework
. We
elim - bel
trees
focusing
of
vari and
- mpe . One
a bucket
; EI - Fattah
, unifying
presented
to Pearl
on
conditioned
Conclusion
properties
tree - propagation
features
,
effectively
recorded
conditioning
that
recently
designer
and
reduces
the
, 1996 ) ) . Another a set
algorithms
probabilistic
the
more
elim - cond
to process
, 1996b
this
- elimination
topological
part
rithm
more
have
for
the
in
Rish
by
( Dechter
various
and
, we
algorithms
( Pearl
conditioning
functions
, collects
processes
to
between
of
and
. In addition
between
the
assignments
arity
reaBoning
, 1992 ) and
Using
be implemented
hybrids
, whether
throughout
Summary
ploit
it
algorithms
gramming
algorithms
k
mentioned
features
10 .
loop - cutset
procedure
( Dechter
results
deterministic
( Shenoy
can
the
approach
that
war
elimination
- mpe
basic
processing
- bucket
- bucket
had and
known
partial
on
( see
super
Related
the
bound
intermediate
9 .
the
of possible
refine
upper
conditioning
uses
tic
of shared
is a variety
dynamically or
elim - cond
advantage
imposes
to
, 1990 ) .
, algorithm
if we
ination
reduces
101
pro -
case . Such - satisfaction
complexity
conditioning
can can
be
102
R. DECHTER
implemented
naturally
quirement
ing
and
still
Finally
for
,
be
.
no
attempt
These
was
.
ditional
In
can
made
and
particular
reducing
.
the
space
Combining
the
,
1996
re
-
condition
virtues
paper
exploit
of
be
addressed
Poole
to
optimize
-
forward
as
,
and
)
the
the
be
run
-
bucket
-
structure
recently
can
algorithms
vs
within
presented
1997
the
compilation
exploiting
,
;
this
to
improvements
matrices
Boutilier
elimination
thus
combining
in
nor
should
,
probability
;
a5
,
issues
framework
,
features
viewed
computation
sources
elimination
.
distributed
press
of
topological
can
search
top
exploiting
elimination
backward
time
of
in
(
incorporated
re
the
Santos
con
et
on
-
elimination
top
al
of
.
-
,
in
bucket
-
.
In
summary
eral
,
tasks
,
which
what
we
applicable
the
importantly
buckets
,
Dechter
11
,
.
A
This
as
have
1997
)
exposition
bucket
done
,
-
by
via
approximation
areas
of
associated
-
,
research
with
algorithms
combining
sev
reasoning
several
elimination
either
or
across
deterministic
between
benefit
the
be
uniform
and
ideas
organizational
shown
a
.
the
to
use
be
conditioning
with
algorithms
of
improved
as
elimi
is
-
shown
in
.
Acknowledgment
preliminary
like
version
to
thank
of
Irina
different
grant
IRI
F49620
-
-
of
9157636
96
-
America
-
0224
,
and
paper
and
this
,
1
this
Rish
versions
of
is
probabilistic
of
all
can
we
both
the
allow
.
nation
here
transfer
,
should
uniformly
provide
to
facilitates
More
(
on
while
paper
Air
appeared
Nir
.
Force
This
work
Electrical
,
1996a
.
I
on
by
Research
ACM
-
grant
20775
and
Institute
would
comments
supported
Scientific
Research
)
useful
partially
grants
Power
Dechter
their
was
of
MICRO
(
for
Office
Rockwell
in
Freidman
AFOSR
95
grant
NSF
-
043
,
RP8014
-
Amada
06
.
References
S
.
Arnborg
and
to
S
. A
.
A
partial
k
Arnborg
-
.
.
Proskourowski
trees
.
of
A
.
S
.
.
and
.
D
In
.
Bertele
and
,
Boutilier
Artificial
van
.
Run
.
.
BIT
.
A
.
.
.
Brioschi
N
,
Journal
of
95
UAI
-
96
onserial
A
,
1985
Rossi
Cassis
ssociation
-
11
)
,
pages
France
-
csps
,
for
81
89
.
of
problems
24
-
In
1995
,
restricted
1989
.
graphs
with
bounded
finding
,
and
close
1996
to
optimal
Academic
jnmction
Press
constraint
Computing
Practice
.
.
based
Principles
.
Programming
Semiring
hard
-
on
in
,
algorithm
.
:
.
ordering
,
Dynamic
F
np
23
problems
23
)
fast
(
and
the
-
variable
AI
.
Montanari
: 2
-
sufficiently
in
F
U
25
Dynamic
CP
for
,
combinatorial
,
(
Geiger
algorithms
Mathematics
for
survey
Uncertainty
timization
C
.
a
time
Applied
Programming
Bistarelli
1997
P
Constraints
Becker
trees
U
and
Linear
and
algorithms
-
Bacchus
Discrete
Efficient
decomposability
F
.
.
,
satisfaction
Machinery
(
J
A
C
M
)
,
to
1972
.
and
op
-
appear
,
.
.
Context
Intelligence
-
specific
independence
(
UAI
-
96
)
in
,
pages
115
-
bayesian
123
,
networks
1996
.
.
In
Uncertainty
in
BUCKETELIMINATION
103
C. Cannings, E.A . Thompson, and H .H. Skolnick. Probability functions on complex pedigrees. Advances in Applied Probability , 10:26- 61, 1978. G.F . Cooper. Nestor: A computer-based medical diagnosis aid that integrates causal and probabilistic knowledge. Technical report , Computer Science department , Stanford University , Palo-Alto , California , 1984. M . Davis and H . Putnam . A computing procedure for quantification theory. Journal of the Association of Computing Machinery , 7(3) , 1960. R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. A rtificial Intelligence , 34:1- 38, 1987. R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelligence , pages
353
R . Dechter In
- 366
and
Principles
1994
.
of
.
Directional
resolution
I ( nowledge
: The
Representation
davis
and
- putnam
procedure
Reasoning
, revisited
( I ( R - 94 ) , pages
.
134 - 145
,
.
R . Dechter
and
I . Rish
Constraint
. To
Practice R . Dechter
and
P . van
of
Constraint
Beek
.
to
think
- 96 ) , 1996
Local
and
. Constraint
for
Artificial .
for
sat
consistency
- 95 ) , pages
constraint
240 - 257
processing
Intelligence
networks
algorithms
relational
( CP
schemes .
? hybrid
.
In
Principles
of
.
global
programming
decomposition
R . Dechter
or
( CP
. Enhancement
cutset
1992
guess
Programming
R . Dechter
, 41 : 273
Encyclopedia
of
. , 1995
In
Principles
: Backjumping
- 312
, 1990
Artificial
and
. , learning
and
.
Intelligence
, pages
276
- 285
,
algo
-
.
R . Dechter
.
rithms
.
R . Dechter
Bucket In
A rtificial
Ijcai
: A
reasoning
and :
R . Dechter
results
on
- 96 ) , pages
D . Geiger
, 20 : 507
F . V . Jensen
, S .L
networks U . Kjreaerulff
by . A
structures
An
J . Pearl
, 1990
of
the
inference
tradeoffs
. In
211
- 219
Uncertainty
, 1996 in
.
Artificial
. generating
Fifteenth
evaluation
of
circuits
, 1996
approximations
in
International
Joint
automated
Conference
on
.
structural In
parameters
Uncertainty
in
for
probabilistic
Artificial
Intelligence
. .
Identifying
independence
in
bayesian
networks
.
Net
-
. , and
computation
computational in
of
probabilistic
- 96 ) , pages
. .
Lauritzen local
Uncertainty
S . L . Lauritzen
, and
- 534
for ( UAI
- space
, 1996
scheme
benchmark
244 - 251
, T . Verma
works
time
220 - 227
general
, 1997
framework
Intelligence
- 97 : Proceedings
Intelligence
Y . El - Fattah
unifying
for
- 96 ) , pages
- buckets
. In
A
Artificial
parameters
( UAI . Mini
reasoning
: in
. Topological
R . Dechter
( UAI
elimination
Uncertainty
Intelligence
In
, 1989
I . Rish
Artificial
K . G . Olesen .
scheme
for
Intelligence
Bayesian
and
D . J . Spiegelhalter
. Local
their
to
expert
updating
Statistics
reasoning ( UAI
and
application
.
Computational
in
dynamic
- 93 ) , pages
computation systems
with .
in
causal
Quarterly
Journal
probabilistic
, 4 , 1990
probabilistic
121 - 149
, 1993
the
.
.
probabilities of
. networks
on Royal
graphical Statistical
Society , Series B , 50( 2) :157- 224, 1988. J . Pearl . Probabilistic Reasoning in Intelligent Systems . Morgan Kaufmann , 1988. Y . Peng and l .A . Reggia . Plausability of diagnostic hypothesis . In National Conference on Artificial Intelligence (AAAI86 ) , pages 140- 145, 1986. Y . Peng and l .A . Reggia . A connectionist model for diagnostic problem solving , 1989. D . Poole . Probabilistic partial evaluation : Exploiting structure in probabilistic inference . In Ijcai - 97 : Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence , 1997. S.K . Anderson R . D . Shachter and P. Solovitz . Global conditioning for probabilistic inference in belief networks . In Uncertainty in Artificial Intelligence ( UAI - 91) , pages 514- 522 , 1991. B . D 'Ambrosio R .D . Shachter and B .A . Del Favro . Symbolic probabilistic inference in belief networks . A utomated Reasoning , pages 126- 131, 1990. E . Santos , S.E . Shimony , and E . Williams . Hybrid algorithms for approximate belief updating in bayes nets . International Journal of Approximate Reasoning , in press .
104
R. DECHTER
E. Santos. On the generation of alternative explanations with implications for belief revision. In Uncertainty in Artificial Intelligence ( UAI -91) , pages 339- 347, 1991. R.D . Shachter. Evaluating influence diagrams. Operations Research, 34, 1986. R.D . Shachter. Probabilistic inference and influence diagrams. Operations Research, 36, 1988. R. D . Shachter. An ordered examination of influence diagrams. Networks, 20:535- 563, 1990. P.P. Shenoy. Valuation -based systems for bayesian decision analysis. Operations Research, 40:463- 484, 1992. S.E. Shimony and E. Chamiack . A new algorithm for finding map assignments to belief networks. In P. Bonissone, M. Henrion , L . Kanal , and J. Lemmer ed., Uncertainty in Artificial Intelligence , volume 6, pages 185- 193, 1991. J.A . Tatman and R.D . Shachter. Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man, and Cybernetics, 1990. N .L . Zhang and D. Poole. Exploiting causal independencein bayesian network inference. Journal of Artificial Intelligence Research(JAIR ) , 1996.
AN INTRODUCTION TO V.t\.RIATION AL METHODS FOR GRAPHICAL MODELS
MICHAEL
I . JORDAN
Massachusetts Institute of Technology Cambridge , MA ZO UBIN
G HAHRAMANI
University of Toronto Toronto
, Ontario
TOMMI
S. JAAKKOLA
University Santa
of California
Cruz , CA
AND LAWRENCE
K . SAUL
AT & T Labs
- Research
Florham
Park
, NJ
Abstract . This paper presents a tutorial introduction to the use of varia tional methods for inference and learning in graphical models . We present a number of examples of graphical models , including the QMR -DT database , the sigmoid belief network , the Boltzmann machine , and several variants of hidden Markov models , in which it is infeasible to run exact inference algorithms . We then introduce variational methods , showing how upper and lower bounds can be found for local probabilities , and discussing methods for extending these bounds to bounds on global probabilities of interest . Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.
105
MICHAELI. JORDAN ET AL.
106 1. The
Introduction problem
of probabilistic
computing the nodes nodes
( the
( the
probability
" hidden
" evidence
set of hidden wish
inference
a conditional
to calculate
" or " unobserved
" or " observed
nodes
and
letting
P ( HIE
General
exact
inference
( Cowell
tematic
We often
volume
of the
as inferred
of the parameters quantity
evaluation
of the likelihood
tor
and
maximize
complexity have recourse tion
tree
of P ( HIE are many
of which
we discuss
in this
calculation
to approximation
construction
clique
architectural
Even
in particular
the exact
resentation
of the joint
model
another
; put
the
of distributions implied
or clusters
of nodes
algorithms
probability
particular
pendencies
to consider
probability that
exact
produce
provide
, there
are other
the time
and
or space
it is necessary
the context
complexity
of the junc -
of the exact
is man -
procedures
associated
with
" nearly " conditionally
consideration the
rep -
a graphical
have the same complexity
is consistent
. Note
numerical with
regard within
conditional
be situations independent
are
cliques .
algorithms
no use of the
in
see , there
lead to large
under
to
is exponential
approximation
may
, al make
algorithms
problems
distribution
by the graph . There are
the
necessarily
make
-
) . Moreover
tree . As we will
distribution
way , the algorithms
less of the family
that
).
.
. Within time
of P ( HIE
) generally
paper , in which
the complexity
can be reason that
, the
as
the numera
generally
quantities
is unacceptable
in the junction
assumptions
in cases in which
ageable , there
in fact
learning
procedures
, for example
the size of the maximal natural
cases in which
mod -
E , P ( E ) is an
compute
of P ( HIE
) as a subroutine and
exact
simply
, they
joint
by Eq . ( 1) , the
to the calculation
( and related
to inference
in the
in graphical
model , for fixed
divide
sys -
, P ( E ) . Viewed
. As is suggested
do not
this
take
edges in the graph .
evidence
of the calculation
likelihood
to perform
algorithms
probabilities
related
solution of the
the
nodes , we
present
of missing
marginal
algorithms
gorithms
cases , several
of other
H represent
developed
of the observed
of Eq . ( 1 ) and
there
the values
( 1)
independencies
is closely
inference
use of the calculation
of
of some of
set of evidence
been
as the likelihood
as a by - product
Although
have
of the graphical
known
denominator
a satisfactory
.
the pattern
the likelihood that
) = % r2
to calculate
important
, although
the
; Jensen , 1996 ) ; these
the probability
a function
Indeed
" nodes ) , given
E represent
conditional
from
also wish
els , in particular
is the problem
" nodes ) . Thus , letting
algorithms
, this
advantage
distribution
models
over the values
):
P ( HIE
calculation
in graphical distribution
in which
inde nodes
, situations
in
AN INTRODUCTION TO VARIATIONAL METHODS
107
which node probabilities are well determined by a subset of the neighbors of the node , or situations in which small subsets of configurations of variables contain most of the probability mass. In such cases the exactitude achieved by an exact algorithm may not be worth the computational cost . A variety of approximation procedures have been developed that attempt to identify and exploit such situations . Examples include the pruning algorithms of Kjrerulff (1994) , the "bounded conditioning " method of Horvitz , Suermondt , and Cooper (1989) , search-based methods (e.g., Henrion , 1991), and the "localized partial evaluation " method of Draper and Hanks (1994) . A virtue of all of these methods is that they are closely tied to the exact meth ods and thus are able to take full advantage of conditional independencies . This virtue can also be a vice , however, given the exponential growth in complexity of the exact algorithms . A
related
approach
cations
of graphical
MacKay
, &
as an iterative
graphical
Another
approach
Carlo
for
design
and
in
in
appli -
( McEliece
Pearl ' s algorithm
inference
, for
in non - singly - connected
( see MacKay
problem
Kjrerulff
can be slow this
ational
of the
to converge chapter
approach
generally
provide
underlying averaging
It
yields bounds
exploit solution
to diagnose
can come
with into
of values
phenomena
their
yet
. an -
. Vari -
procedures basic
that
intuition
can be probabilis
connectivity
play , rendering
can lead
convergence
provide
. The
there
nodes
neighbors
relatively
. Taking
to simple , accurate
are
advan approxi
. to emphasize
outlined
complementary to any given
are
features problem
that
the
by no means
various
approaches
mutually
exclusive
of the graphical may well
involve
model
to inference ; indeed
formalism
an algorithm
.
algorithms
algorithms
graphs
of their
the
, which
dense
their
of convergence
inference
complex
, in graphs
include
are that
of interest
is that
settings
guarantees
approximation
on probabilities
that
& Luby , 1993 ; Fung
algorithms
methods
deterministic
methods
averaging
is important we have
design
; in particular
procedures
it can be hard
involves algorithms
, 1994 ; Jensen , Kong , &
approach
of approximate
to particular
of these
and
theoretical Carlo
Carlo
, and Neal , 1993 ) and applied ( Dagum
of these
variational
phenomena
insensitive
and Monte
algorithms
of Monte
, & Spiegelhalter
the
variational
simple
mation
models
we discuss
to
methodology
tically
, this volume
in graphical
of implementation
disadvantages
In
. A variety
, 1995 ; Pearl , 1988 ) . Advantages
simplicity
other
of approximation
methods
Favero , 1994 ; Gilks , Thomas
that
, Kim
arisen decoding
( Pearl , 1988 ) has been used successfully
method
to the
use of Monte
to the inference
tage
has
to error - control
particular
models
approximate
have been developed
The
inference
inference
.
making
&
approximate
Cheng , 1996 ) . In
singly - connected graphs
to
model
that
they
. The best combines
-
108
MICHAELI. JORDANET AL.
aspects of the different methods . In this vein , we will present variational methods in a way that emphasizes their links to exact methods . Indeed , as we will see, exact methods often appear as subroutines within an overall variational approximation (cf . Jaakkola & Jordan , 1996; Saul & Jordan , 1996) . It should be acknowledged at the outset that there is as much "art " as there is "science" in our current understanding of how variational meth ods can be applied to probabilistic inference . Variational transformations form a large , open-ended class of approximations , and although there is a general mathematical picture of how these transformations can be exploited to yield bounds on probabilities in graphical models , there is not as yet a systematic algebra that allows particular variational transformations to be matched optimally to particular graphical models . We will provide illustrative examples of general families of graphical models to which varia tional methods have been applied successfully, and we will provide a general mathematical framework which encompasses all of these particular examples, but we are not as yet able to provide assurance that the framework will transfer easily to other examples . We begin in Section 2 with a brief overview of exact inference in graph ical models , basing the discussion on the junction tree algorithm . Section 3 presents several examples of graphical models , both to provide motivation for variational methodology and to provide examples that we return to and develop in detail as we proceed through the chapter . The core material on variational approximation is presented in Section 4. Sections 5 and 6 fill in some of the details , focusing on sequential methods and block methods , respectively . In these latter two sections, we also return to the examples and work out variational approximations in each case. Finally , Section 7 presents conclusions and directions for future research.
2. Exact inference In this section we provide a brief overview of exact inference for graphical models , as represented by the junction tree algorithm (for relationships between the junction tree algorithm and other exact inference algorithms , see Shachter , Andersen , and Szolovits , 1994; see also Dechter , this volume , and Shenoy, 1992, for recent developments in exact inference ) . Our intention here is not to provide a complete description of the junction tree algorithm , but rather to introduce the "moralization " and "triangulation " steps of the algorithm . An understanding of these steps, which create data structures that determine the run time of the inference algorithm , will suffice for our
AN INTRODUCTION TO VARIATIONALMETHODS
109
Figure 1. A directed graph is parameterized by associating a local conditional probability with each node. The joint probability is the product of the local probabilities .
purposes.l For a comprehensiveintroduction to the junction tree algorithm see Cowell (this volume) and Jensen (1996). Graphical models come in two basic flavors- directed graphical models and undirected graphical models. A directed graphical model is specified numerically by associating local conditional probabilities with each of the nodes in an acyclic directed graph. These conditional probabilities specify the probability of node Si given the values of its parents, i .e., P (8iI87r(i)), where 1[ (i ) representsthe set of indices of the parents of node 8i and 87r(i) representsthe correspondingset of parent nodes (seeFig. 1).2 To obtain the joint probability distribution for all of the N nodesin the graph, i .e., P (8 ) == P (8l , 82, . . . , SN), we take the product over the local node probabilities :
N P(S) = n P(SiIS7r (i)) i=l Inference involves the calculation of conditional
(2) probabilities
under this
joint distribution . An undirected graphical model (also known as a "Markov random field " ) is specified numerically by associating "potentials " with the cliques of the graph .3 A potential is a function on the set of configurations of a clique lOur presentation will take the point of view that moralization and triangulation , when combined with a local message-passing algorithm , are sufficient for exact inference . It is also possible to show that , under certain conditions , these steps are necessary for exact inference . See Jensen and Jensen ( 1994) . 2Here and elsewhere we identify the ith node with the random variable Si associated with the node . 3We define a clique to be a subset of nodes which are fully connected and maximal ; i .e., no additional node can be added to the subset so that the subset remains fully connected .
110
MICHAELI. JORDANET AL.
S4 (/
- .' ,
I
\
S
51(". -- - '
3
,= 3(C3 ) .~) S6 -
/
"
S2 (
-
-
,
--'.'
.--Ss -)
Figure 2. An undirected graph is parameterized by associating a potential with each clique in the graph. The cliques in this example are C1 = { 81, 82, 83} , C2 = { 83, 84, 85} , and C3 = { 84, 85, 86} . A potential assigns a positive real number to each configuration of the corresponding clique. The joint probability is the normalized product of the clique potentials .
(that is, a setting of values for all of the nodes in the clique ) that Msociates a positive real number with each configuration . Thus , for every subset of nodes Ci that forms a clique , we have an associated potential
(3) where M is the total number of cliques and where the normalization factor Z is obtained by summing the numerator over all configurations :
Z= L II c/Ji(Ci)} . {5} { i=l
(4)
In keeping with statistical mechanical terminology we will refer to this sum as a "partition function ." The junction tree algorithm compiles directed graphical models into undirected graphical models ; subsequent inferential calculation is carried out in the undirected formalism . The step that converts the directed graph into an undirected graph is called "moralization ." (If the initial graph is already undirected , then we simply skip the moralization step). To under stand moralization , we note that in both the directed and the undirected cases, the joint probability distribution is obtained as a product of local
AN INTRODUCTION
TO VARIATIONAL
A
DAD
B
C
METHODS
B
(a)
111
C
(b)
Figure 3. (a) The simplest non-triangulated graph. The graph has a 4-cycle without a chord. (b) Adding a chord between nodes B and D renders the graph triangulated .
functions
. In
the
directed
case , these
functions
are the
node
conditional
probabilities P (Si IS7r (i)). In fact, this probability nearly qualifies as a potential function ; it is certainly a real-valued function on the configurations
of the set of variables { Si, S7r(i)} . The problem is that these variables do not always appear together within a clique . That is, the parents of a common child are not necessarily linked . To be able to utilize node conditional probabilities as potential functions , we "marry " the parents of all of the nodes with undirected edges. Moreover we drop the arrows on the other edges in the graph . The result is a "moral graph ," which can be used to represent the probability distribution on the original directed graph within the
undirected
formalism
.4
The second phase of the junction tree algorithm is somewhat more complex . This phase, known as "triangulation ," takes a moral graph as input and produces as output an undirected graph in which additional edges have (possibly ) been added . This latter graph has a special property that allows recursive calculation of probabilities to take place . In particular , in a triangulated graph , it is possible to build up a joint distribution by proceeding sequentially through the graph , conditioning blocks of interconnected nodes only on predecessor blocks in the sequence. The simplest graph in which this is not possible is the "4-cycle ," the cycle of four nodes shown in Fig . 3(a). If we try to write the joint probability sequentially as, for
example, P (A)P (BIA )P (CIB )P (DIC ), we seethat we have a problem. In particular , A depends on D , and we are unable to write the joint probability as a sequence
of conditionals
.
A graph is not triangulated if there are 4-cycles which do not have a chord, where a chord is an edge between non-neighboring nodes. Thus the 4Note in particular
that Fig . 2 is the moralization
of Fig . 1.
112
MICHAELI. JORDAN ET AL.
graph
in
chord
as
Fig
.
in
sequentially More
in
A
any
,
in cliques
The
the
local
data
In models
models
.
all
graph need
If
a
that
to
junction
node
lie
appears
on
based
cliques in
the
path that
on
achieving
assign
common
,
a
consequence be
the
possible
the
) . ,
local
In
a
same
junction .,
consistency
implies
will by
cliques
,
will
able
.
will
either to
,
of
be
achieve
of
in
per
particular
the
For
for
potential
efficient
is
inference
specific exact
able
lower
,
it
for Kjrerulff
for
display the
in
the
the
size moral
1990
these
" obvious
of
"
cliques
graph
triangulation ,
graphical
inference
to
bound
cliques
e .g . ,
;
investigate
algorithms see
tree to
complexity
represent
clique
junction as
.
we
the
,
cliques
costs
specific
algorithms
time
the to
the
the so
The
of
required in
considering
consider
.
size
paper
we be
on potentials
cliques the
computational
cases
we
performed clique
.
( for
in Thus
a we
discussion
) .
Examples
In
this
section
inference
is
system
in
remaining is
to
these
or
triangulation
.
,
is as
:
cliques
have
are
small
this
the
of
triangulated not
is
the
nodes
obtain
of
,
will
) . it
important
property
values
of
to
triangulation
of
, D
a
probability
known
can
they
on
of
consider
In
all the
( That
that
number
remainder and
. that
neighboring
number
the
critical
the
in has
nodes
( CIB
structure
inference
depends
the in
therefore
) P
adding
joint
property
appears
rescaling
between
discrete
it
by the
triangulated
data
intersection
and
calculation
exponential
, DIA
been a
triangulated write
intersection
calculations
consistency this
( B
has
cliques
running
marginalizing
forming
) P
property
the
be can
.
probabilistic
involve
,
This
can we
probabilistic
to
of
( A
into
tree .
for
consistency
P
running
between
because
=
graph
the
it
graph
graph
the
probability
,
)
a the
has
two
;
latter
, D
once
algorithm
bal
the , C
of
consistency
tree
triangulated
In
, B
cliques the
marginal
3
( A
tree
two
local
glo
not
) .
cliques
general
is
P
junction
between a
is
3 ( b
generally the
.
.
as
arrange tree
3 ( a )
Fig
fit
we
generally which
data
and
examples
infeasible a
examples to
present
fixed involve
subsequently
of
.
graphical
Our
graphical
first model
estimation used
example is
problems for
models
prediction
in
which
involves
used
to in
which or
a
answer
diagnostic
queries a
diagnosis
exact
graphical
.
The
model .
3.1. THE QMR-DT DATABASE The QMR-DT database is a large-scale probabilistic database that is intended to be used as a diagnostic aid in the domain of internal medicine.5
5The acronym "refers tothe "Decision Theoretic "version ofthe "Quick Medical Reference ."QMR " -DT
113
AN INTRODUCTION TO VARIATIONALMETHODS
diseases
symptoms Figure
. 4
.
The
evidence
structure
nodes
We
provide
see
The
per
-
of
(
Fig
.
4
,
use
of
, 6
the
.
a
.
The
the
a
of
"
bipartite
i
f
d
)
=
probabilities
diseases
OR
of
archival
data
,
"
model
P
{
fild
.
.
)
That
,
The
and
obtained
,
the
;
represent
di
the
for
further
which
the
up
-
represent
disease
nodes
and
are
binary
P
(
[
fid
Il
P
by
diseases
(
d
)
P
expert
nodes
.
bipartite
are
Making
form
nodes
findings
{
dj
]
~
,
of
we
the
obtain
:
)
,
P
(
were
probabilities
probability
with
All
variables
the
to
)
fild
,
refer
findings
.
symptom
P
(
of
diseases
random
and
)
we
vector
of
unobserved
from
conditional
represent
nodes
henceforth
the
vector
diseases
conditional
were
is
the
;
in
of
600
the
=
prior
and
over
,
model
layer
symptoms
over
(
nodes
here
lower
implied
probability
shaded
database
approximately
"
marginalizing
joint
DT
the
denotes
f
The
.
findings
d
.
graphical
and
observed
components
P
from
-
independencies
following
The
QMR
are
symbol
conditional
and
the
database
set
model
. "
.
There
the
as
thus
the
graph
)
in
symptoms
f
)
graphical
findings
diseases
is
symbol
DT
"
is
nodes
observed
1991
database
evidence
binary
(
represent
see
-
~
of
ale
DT
symptom
The
the
et
nodes
symptoms
QMR
to
overview
,
QMR
the
referred
brief
Shwe
layer
4000
are
a
details
of
and
dj
)
.
obtained
of
by
the
Shwe
findings
assessments
the
ith
a
symptom
5
)
(
6
)
et
given
under
that
,
(
ale
the
"
noisy
-
is
6In particular , the pattern of missing edgesin the graph implies that (a) the diseases are marginally independent, and (b) given the dise~ es, the symptoms are conditionally independent .
114
MICHAEL
I . JORDAN
ETAL.
absent, P (fi = Old), is expressedas follows:
P(fi = Old ) = (1- qiO ) II (1 - qij)dj
(7)
jE1r (i )
where the qij are parameters
obtained
from the expert
Msessments . Con -
sidering casesin which only one diseMe is present, that is, { dj = I } and { dk = 0; k ~ j } , we see that qij can be interpreted M the probability that the ith finding is present if only the jth disease is present . Considering the case in which
all diseases
are absent , we see that
the qiO parameter
can be
interpreted as the probability that the ith finding is present even though no disease
is present
.
We will find it useful to rewrite the noisy -OR model in an exponential
form :
P (fi = Old) = e- EjE1r (i) Bijdj- Bio
(8)
where (}ij := - In ( 1 - qij ) are the transformed parameters . Note also that the probability of a positive finding is given as follows :
P (fi = lid ) = I - e- EjE7r(i) Oijdj- lJio
(9)
These forms express the noisy - OR model as a generalized
If we now form the joint probability
distribution
linear model .
by taking products of
the local probabilities P (fild ) as in Eq. (6), we seethat negative findings are benign with respect to the inference problem . In particular ~ . a nroduct -
of exponential factors that are linear in the diseases(cf. Eq. (8)) yields a joint probability that is also the exponential of an expression linear in the diseases. That is, each negative finding can be incorporated into the joint probability in a linear number of operations . Products of the probabilities of positive findings , on the other hand , yield cross products terms that are problematic for exact inference . These cross product terms couple the diseases (they are responsible for the "explaining away" phenomena that arise for the noisy -OR model ; see Pearl , 1988) . Unfortunately , these coupling terms can lead to an exponential growth in inferential complexity . Considering a set of standard diagnos-
tic cases (the "CPC cases" ; see Shwe, et al. 1991), Jaakkola and Jordan (1997c) found that the median size of the maximal clique of the moralized QMR -DT graph is 151.5 nodes. Thus even without considering the trian gulation step , we see that diagnostic calculation under the QMR - DT model is generally
infeasible .7
7Jaakkola and Jordan (1997c) also calculated the median of the pairwise cutset size. This
value was found
for the QMR - DT .
to be 106 .5 , which
also rules out exact
cutset
methods
for inference
115
ANINTRODUCTION TOVARIATIONAL METHODS Input
Hidden Output Figure
5
output
nodes
3
. 2
.
.
The
NEURAL
Neural
graphical the
networks
are
each
logistic
j
( z
model
interpreting
the
,
)
=
Fig
/
(
.
5
)
.
e
-
of
one
the
of
us
one
Z
)
.
,
We
The
input
nodes
nonlinear
consider
those
and
treat
such
a
Si
.
from
neural
example
as
node
the
,
the
network
each
that
For
"
functions
with
probability
values
activation
obtained
variable
the
"
activation
as
can
as
two
a
such
binary
node
its
with
Let
and
+
.
MODELS
a
activation
write
1
network
endowed
associating
takes
we
1
neural .
GRAPHICAL
zero
by
variable
function
( see
between
a
nodes
graphs
node
function
graphical
of
evidence
layered
bounded
binary
of
AS
are
at
that
structure
set
NETWORKS
function
a
layered comprise
and
associated
using
the
logistic
:
1
P
where
j
( ) ij
and
is
"
i
,
=
a
is
neural
on
,
way
,
=
1
+
"
parameter
~
e
"
bias
"
to
requires
that
6
'
(
parent
with
(
1992
include
,
and
.
problem
)
.
The
i
.
This
advantages
to
treat
perform
unsupervised
Realizing
be
nodes
node
ability
to
learning
inference
)
the
data
10
10
between
Neal
manner
supervised
the
- J
edges
missing
as
- ' S 1. J
associated
by
this
handle
footing
( i )
with
introduced
in
,
same
6
L . . . jE7r
these
solved
in
bene
an
-
efficient
.
In
ered
fact
,
it
neural
parents
is
all
of
these
ticular
,
nodes
,
cally
during
are
of
of
are
inference
a
do
N
in
of
exact
hidden
least
in
the
0
( 2N
the
a
)
general
( see
is
nodes
hidden
,
the
-
are
.
in
par
additional
)
.
-
-
layers
layer
6
evidence
probabilisti
hidden
particular
Fig
clear
become
preceding
-
as
neural
layer
output
ignoring
lay
has
moralized
this
layer
in
in
the
in
general
generally
Thus
penultimate
units
at
.
nodes
network
in
network
inference
the
infeasible
neural
the
ancestors
is
is
a
layer
all
neural
their
in
preceding
for
units
inference
node
the
necessary
as
there
exact
A
between
hidden
,
if
in
links
training
the
that
.
nodes
links
thus
see
models
has
dependent
Thus
to
the
graph
That
easy
network
network
complexity
the
network
the
however
)
network
calculations
learning
( i )
associated
( ) iO
belief
treating
1IS7r
parameters
and
sigmoid
diagnostic
fits
the
node
the
of
are
( Si
.
,
the
time
growth
116
MICHAELI. JORDAN ET AL.
Hidden
OD J
j
. /
..;". /
..
Output Figure 6. Moralization of a neural network . The output nodes are evidence nodes during training . This creates probabilistic dependencies between the hidden nodes which are captured by the edges added by the moralization .
Figure 7. A Boltzmann machine . An edge between nodes Si and Sj is associated with a factor exp ( (Jij Si Sj ) that contributes multiplicatively to the potential of one of the cliques containing the edge. Each node also contributes a factor exp ((}iOSi) to one and only one potential .
in clique
size due to triangulation
or even hundreds of hidden neural network using exact
. Given
that
neural
networks
with
units are commonplace , we see that inference is not generally feasible .
dozens
training
a
3.3. BOLTZMANNMACHINES A Boltzmann nodes
and
ular , the
machine
is an undirected
a restricted clique
potentials
graphical
set of potential
functions
are formed
by taking
model
with
binary
( see Fig . 7 ) . products
In
- valued partic
-
of " Boltzmann
factors " - exponentials of terms that are at most quadratic in the Si ( Hin ton & Sejnowski , 1986 ) . Thus each clique potential is a product of factors exp { 8ijSiSj
} and factors
exp { (JiOSi } , where
Si E { O, 1} .8
8It is also possible to consider more general Boltzmann machines with multivalued nodes , and potentials that are exponentials of arbitrary functions on the cliques . Such models are essentially equivalent to the general undirected graphical model of Eq . (3)
ANINTRODUCTION TOVARIATIONAL METHODS
117
A given pair of nodes 8i and 8j can appear in multiple , overlapping cliques. For each such pair we ~ sume that the expressionexp{ OijSiSj} appears as a factor in one and only one clique potential . Similarly , the factors
exp{(}io8i } are ~ sumed to appear in one and only one clique potential . Taking the product over all such clique potentials (cf. Eq. (3)), we have:
P (S) =
e~ i<j BijSiSj+ ~ i BiOSi Z '
( 11)
where we have set (}ij = 0 for nodes Si and Sj that are not neighbors in the graph - this convention allows us to sum indiscriminately over all pairs
Si and Sj and still respect the clique boundaries. We refer to the negative of the exponent in Eq . (11) as the energy. With this definition the joint probability in Eq . (11) has the general form of a Boltzmann distribution . Saul and Jordan (1994) pointed out that exact inference for certain special cases of Boltzmann machine - such as trees, chains , and pairs of coupled chains- is tractable and they proposed a decimation algorithm for this purpose . For more general Boltzmann machines , however, decimation is not immune to the exponential time complexity that plagues other exact methods . Indeed , despite the fact that the Boltzmann machine is a special class of undirected graphical model , it is a special class only by virtue of its parameterization , not by virtue of its conditional independence struc ture . Thus , exact algorithms such as decimation and the junction tree algorithm , which are based solely on the graphical structure of the Boltzmann machine , are no more efficient
for Boltzmann
machines
than they are for
general graphical models . In particular , when we triangulate generic Boltz mann machines , including the layered Boltzmann machines and grid -like Boltzmann machines , we obtain intractably large cliques . Sampling algorithms have traditionally
been used to attempt to cope
with the intractability of the Boltzmann machine (Hinton & Sejnowski, 1986). The sampling algorithms are overly slow, however, and more recent work has considered the faster "mean field" approximation (Peterson & Anderson , 1987) . We will describe the mean field approximation for Boltz mann machines later in the paper - it is a special form of the variational approximation approach that provides lower bounds on marginal probabili ties . We will also discuss a more general variational algorithm that provides upper and lower bounds on probabilities (marginals and conditionals ) for
Boltzmann machines (Jaakkola & Jordan, 1997a). (although the latter can represent zero probabilities while the former cannot) .
118
MICHAEL I. JORDAN ET AL.
XT
Xj
7t ,
~ I
B
-
-
~ (,'
B
.
.
.
B
~
('-) B . YT
Figure 8. A HMM represented as a graphical model . The left -to -right spatial dimension represents time . The output nodes Yi are evidence nodes during the training process and the state
3 .4 .
nodes
HIDDEN
Xi
are hidden .
MARKOV
MODELS
In this section , we briefly review hidden Markov models . The hidden Markov model (HMM ) is an example of a graphical model in which exact inference is tractable ; our purpose in discussing HMMs here is to lay the groundwork for the discussion of intractable variations on HMMs in the following sections . See Smyth , Heckerman , and Jordan (1997) for a fuller discussion of the HMM as a graphical model . An HMM is a graphical model in the form of a chain (see Fig . 8) . Consider
a sequence of multinomial
conditional probability
" state " nodes Xi and assume that the
of node Xi , given its immediate predecessor Xi - I ,
is independent of all other precedingvariables. (The index i can be thought of as a time index). The chain is assumedto be homogeneous ; that is, the matrix of transition probabilities , A = P (Xi !Xi - I ), is invariant acrosstime . We also require a probability
distribution
7r = P (XI ) for the initial state
Xl . The HMM
model also involves a set of "output " nodes
i and an emis -
sion probability law B = P ( i !Xi ), again assumedtime-invariant. An HMM is trained by treating the output nodes as evidence nodes and the state nodes as hidden nodes. An expectation -maximization (EM ) algorithm (Baum , et al ., 1970; Dempster , Laird , & Rubin , 1977) is generally used to update the parameters A , B , 7r; this algorithm involves a simple iter -
ative procedure having two alternating steps: (1) run an inferencealgorithm to calculate the conditional probabilities P (Xil { i } ) and P (Xi , Xi - ll { i } ); (2) update the parameters via weighted maximum likelihood where the weights are given by the conditional probabilities calculated in step (1) . It
is easy to see that
exact
II
alization and triangulation
inference
is tractable
for HMMs
. The
mor -
steps are vacuous for the HMM ; thus the time
ANINTRODUCTION TOVARIATIONAL METHODS ~J J)
~v<J)
119
~3J) . . .
. . .
. . .
YJ
Y2
Yj
Figure 9. A factorial HMM with threechains. The transition matricesare A (1), A (2), and A (3) associatedwith the horizontal edges , and the output probabilitiesare determined by matricesB (1), B (2), and B (3) associatedwith the vertical edges .
complexity can be read off from Fig. 8 directly . We see that the maximal clique is of size N2 , where N is the dimensionality of a state node. Inference therefore scalesas O(N2T ), where T is the length of the time series.
3.5. FACTORIALHIDDENMARKOVMODELS In many problem domains it is natural to make additional structural assumptions about the state space and the transition probabilities that are not available within the simple HMM framework. A number of structured variations on HMMs have been considered in recent years (see Smyth, et al., 1997); generically these variations can be viewed as "dynamic belief networks" (Dean & Kanazawa, 1989; Kanazawa, Koller , & Russell, 1995). Here we consider a particular simple variation on the HMM theme known as the "factorial hidden Markov model" (Ghahramani & Jordan, 1997; Williams & Hinton , 1991). The graphical model for a factorial HMM (FHMM ) is shown in Fig. 9. The system is composedof a set of M chains indexed by m . Let the state node for the mth chain at time i be representedby Xi(m) and let the transition matrix for the mth chain be representedby A (m). We can view the effective state space for the FHMM as the Cartesian product of the state spacesassociatedwith the individual chains. The overall transition probability for the system by taking the product acrossthe intra -chain transition
120
MICHAELI. JORDANET AL. . . .
. . .
YJ
~
~
Figure 10 . A triangulation of an FHMM with two component chains . The morali step links states at a single time step . The triangulation step links states diago between neighboring time steps . probabilities : M P(XiIXi -l)=m n=lA(m )(X(m )IXi i (~{), (12 ) where the symbol Xi stands for the M -tuple (Xi (l ), Xi (2), . . . , Xi (M)) . Ghahramani and Jordan utilized a linear -Gaussian distribution for the emission probabilities of the FHMM . In particular , they assumed: P (~ IXi ) = N (L B (m)Xi (m), E ), m
(13)
where the B (m) and ~ are matrices of parameters . The FHMM is a natural model for systems in which the hidden state is realized via the joint configuration of an uncoupled set of dynamical systems . Moreover , an FHMM is able to represent a large effective state space with a much smaller number of parameters than a single unstructured Cartesian product HMM . For example , if we have 5 chains and in each chain the nodes have 10 states , the effective state space is of size 100,000, while the transition probabilities are represented compactly with only 500 parameters . A single unstructured HMM would require 1010parameters for the transition matrix in this case. The fact that the output is a function of the states of all of the chains implies that the states become stochastically coupled when the outputs are observed . Let us investigate the implications of this fact for the time complexity of exact inference in the FHMM . Fig . 10 shows a triangulation for the case of two chains (in fact this is an optimal triangulation ) . The cliques for the hidden states are of size N3 ; thus the time complexity of
ANINTRODUCTION TOVARIATIONAL METHODS
121
. . .
Figure 11. A triangulation of the state nodes of a three -chain FHMM with three com ponent chains . (The observation nodes have been omitted in the interest of simplicity ) .
Figure 12.
This graph is not a triangulation
.
.
.
.
.
.
.
.
.
of a three -chain FHMM .
exact inference is 0 (N3T ) , where N is the number of states in each chain
(we assumethat each chain has the same number of states for simplicity ). Fig . 11 shows the case of a triangulation
of three chains ; here the triangula -
tion (again optimal ) creates cliques of size N4 . (Note in particular that the graph in Fig . 12, with cliques of size three , is not a triangulation ; there are 4-cycles without a chord ) . In the general case, it is not difficult to see that cliques of size NM + l are created , where M is the number
of chains ; thus
the complexity of exact inference for the FHMM scales as O (NM + IT ) . For a single unstructured Cartesian product HMM having the same number of
states as the FHMM - i .e., NM states- the complexity scalesas O(N2MT ), thus exact inference
for the FHMM
is somewhat
less costly , but the expo -
111111111111
nential growth in complexity in either case shows that exact inference is infeasible for general FHMMs .
122
MICHAELI. JORDANET AL.
Uj
U2
U3 .
Yj
~
.
.
Y3
Figure 13. A hidden Markov decision tree . The shaded nodes { Ui } and { i } represent a time series in which each element is an (input , output ) pair . Linking the inputs and outputs are a sequence of decision nodes which correspond to branches in a decision tree . These decisions are linked horizontally to represent Markovian temporal dependence .
3 .6 .
HIGHER
- ORDER
HIDDEN
MARKOV
MODELS
A related variation on HMMs considers a higher -order Markov model in which each state depends on the previous K states instead of the single previous state . In this case it is again readily shown that the time complex ity is exponential in K . We will not discuss the higher - order HMM further in this chapter ; for a variational algorithm for the higher - order HMM see Saul and Jordan (1996) .
3.7. HIDDENMARKOVDECISIONTREES Finally , we consider a model in which a decision tree is endowed with Marko vian dynamics (Jordan , et al ., 1997) . A decision tree can be viewed as a graphical model by modeling the decisions in the tree as multinomial ran dom variables , one for each level of the decision tree . Referring to Fig . 13, and focusing on a particular time slice, the shaded node at the top of the diagram represents the input vector . The unshaded nodes below the input nodes are the decision nodes. Each of the decision nodes are conditioned on the input and on the entire sequence of preceding decisions (the vertical arrows in the diagram ) . In terms of a traditional decision tree diagram , this dependence provides an indication of the path followed by the data point as it drops through the decision tree . The node at the bottom of the diagram is the output variable . If we now make the decisions in the decision tree conditional not only
AN INTRODUCTION
TO VARIATIONAL
METHODS
123
on the current data point , but also on the decisions at the previous moment in time , we obtain a hidden Markov decision tree (HMDT ) . In Fig . 13, the horizontal edges represent this Markovian temporal dependence. Note in particular that the dependency is assumed to he level-specifIc- the proba bility of a decision depends only on the previous decision at the same level of the decision
tree .
Given a sequence of input vectors Ui and a corresponding sequence of output vectors i , the inference problem is to compute the conditional probability distribution over the hidden states . This problem is intractable for general HMDTs - as can be seen by noting that the HMDT includes the FHMM as a special case. 4 . Basics of variational
methodology
Variational methods are used as approximation methods in a wide variety of settings , include finite element analysis (Bathe , 1996), quantum mechanics (Sakurai , 1985) , statistical mechanics (Parisi , 1988) , and statistics (Rustagi , 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem , where the simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem . This decoupling is achieved via an expansion of the problem to include additional parameters , known as variational parameters , that must be fit to the problem
at hand .
The terminology comes from the roots of the techniques in the calculus of variations . We will not start systematically from the calculus of varia tions ; instead , we will jump off from an intermediate point that emphasizes the important role of convexity in variational approximation . This point of
view turns out to be particularly well suited to the development of varia tional methods for graphical models . 4 .1 .
EXAMPLES
Let us begin by considering a simple example . In particular , let us express the logarithm function variationally : In (x ) = min { "\x - In ''\ - I } . A
(14)
In this expression A is the variational parameter , and we are required to perform the minimization for each value of x . The expression is readily verified by taking the derivative with respect to A, solving and substituting . The situation is perhaps best appreciated geometrically , as we show in Fig . 14. Note that the expression in braces in Eq . (14) is linear in x with slope A. Clearly , given the concavity of the logarithm , for each line having
124
MICHAELI. JORDAN ET AL. 5
. .
.'
3 .
,,
.
"
,
I
.
.
.
.
,
,
.
,
.
,
.
.
.
.
.
.
.
.
.
'
'
'
'
'
'
,..'
,
.
.
"
,...,..,.. ..,.,.,.-' -~'.~." .,,,-' ,
.'
.
,
.
.
.
.-
,
.
.
.
-
-
.
'
. -'
-
.
~
-
_
.
-
-
-
~
. , ' ','.'~ . , ' : : '. '. ', '~'. ~, ' . . " , . ' - . ~ - - - . '
-1
-3
-5 0
0 .5
1
1 .5
2
2 .5
3
X
Figure 14.
Variational
transformation
of the logarithm
function . The linear functions
(Ax - In A - I ) form a family of upper bounds for the logarithm , each of which is exact for a particular
value of x .
slope .,\ there is a value of the intercept such that the line touches the logarithm at a single point . Indeed , - In A - I in Eq . (14) is precisely this intercept . Moreover , if we range across "\ , the family of such lines forms an upper envelope of the logarithm function . That is, for any given x , we have:
In(x) ~ AX - In A - I ,
(15)
for all A. Thus the variational transformation provides a family of upper bounds on the logarithm . The minimum over these bounds is the exact value of the logarithm . The pragmatic converted
justification
a nonlinear
have obtained
function
a free parameter
for such a transformation into
a linear
A that
function
. The
is that cost
we have
is that
we
must be set , once fOl' each x . For
any value of A we obtain an upper bound on the logarithm ; if we set A well we can obtain a good bound . Indeed we can recover the exact value of logarithm for the optimal choice of A. Let us now consider a second example that is more directly relevant to graphical models . For binary -valued nodes it is common to represent the probability that the node takes one of its values via a monotonic nonlinear ity that is a simple function - e .g., a linear function - of the values of the parents of the node . An example is the logistic regression model :
1 f (x) = 1+ e-X'
(16)
AN
which
we
the
have
values
seen
of
The
the
TO
previously
in
parents
logistic
bound
of
function
will
the
INTRODUCTION
not
a
is
work
.
VARIATIONAL
Eq
node
( 10
) .
Here
x
is
the
weighted
sum
of
.
neither
However
.
125
METHODS
convex
,
the
=
-
nor
logistic
concave
,
function
is
so
log
a
simple
linear
concave
.
That
is
,
function
g
is
a
concave
second
function
derivative
of ) .
functions
and
thereby
particular
,
can
we
( x
Thus
)
x
( as
we
can
( 1
can
+
e -
X )
readily
bound
bound
write
In
the
( 17
be the
verified
log
logistic
by
logistic
calculating
the
function
function
by
with
the
)
linear
exponential
.
In
:
g
( x
)
=
min
{ Ax
-
H
( A ) }
,
( 18
)
A
where
H
( We
( A )
will
is
the
binary
explain
suffices
to
logistic
think
function
the
entropy
how
minimum
the
of
it
) .
We
now
the
exponential
and
is
a
plotted
variational
in
obtain
Fig
an
as
.
15
.
)
for
of
i ty
in
a
graphical
with
( x
)
=
( 1
a
exponentials the
e -
on
CONVEX
Can
we
find variational
~
This
is
H
( A ) ]
-
In
.
,
Such
a
.
is
function
joint
Eq
For
.
to
x
it
the
log
noting
that
examples
any
of
of x
A
we
:
( 20
)
are the
a
a
.
If
variational as
obtained
by
,
-
the
of
form
the
local repre
functions
simple
including
is
the
transformations
more that
have
instead
we
parameters in
Eq
.
( 20
taking
-
) -
we
see
products
particularly
so
systematically been
utilized
-
form
of
given
that
.
transformations
)
probabil
over probabilities
of
in
in
joint
product
conditional
not
significant
DUALITY
variational
)
are
value
values
obtain
take
computationally in
;
for all
variationally
probability
tractable
linear
function
products
by
A ) .
( 20
,
obtain
product
,
-
now
for
sides
that
to
( 2 ) ) .
( l
.
particular
we
term
In
for
( A ) .
required
Eq
A ) ;
( 19
for
in
are
-
:
again
H
( I
below
.
function
eAx
A -
both
logistic
bounds
. we
logistic
the
-
once
representation
are
4 .2 .
the
X ) .
of
transformation
( cf
each
.
exponents
)
regression
network
bound
( x
models
representing
that
of
+
our
note
In
intercept
the
logistic
A
commute
[ eAX
better the
model
logistic
1 /
augment i .e .
of
we
the
provide
probabilities
sented
f
A
graphical
conditional
min
-
arises
exponential
for
, of
advantages
context
the
=
==
appropriate
function
( x
Finally
bound
( A )
function
transformation
upper
choices The
the
take
f
Good
, H
entropy
simply
f
This
function
binary
? in
Indeed the
, literature
many
126
MICHAELI. JORDAN ET AL. 1. 8
,
,
,
,
,
,
,
,
.
,
,
,
,
,
,
,
'
,
, .
,
,
,
.
,
,
,
,
.
,
1 4
,
'
,
'
,
,
"
.
,
,
,
.
,
,
,
,
,
,
,
.
' ,
.
,
,
,
"
,
,
,
.
.
-
" ,
,
,
,
'
"
.
,
,
'
,
"
.
.
'
' )
"
'
. . ..
r
,
. ,. ,," ' -, 1"
. '
,
. - - _. - ' _
" " "
,
,
"
,
,
, ,
..
,' ,
,
,
"
,
, ,
0 6
"
" , ,
"
,, , ~~.::,;~.:~.$"~;'=----'-------_._,
,
,
,
, , '
'
"
.
,
,
, ,
,
.
,
,
.
1 0
,
,
,
"
s -
..' - , ' .
' )
----------~~~~~~~~~~~-I'.-'.-I,I~~~~" 0 .2
----' ----- -------------
- 0 ..2 -3
-2
-1
0
1
2
3
x
Figure 15.
Variational transformation of the logistic function .
on graphical models are examplesof the general principle of convexduality. It is a general fact of convex analysis (Rockafellar, 1972) that a concave function f (x ) can be representedvia a conjugateor dual function as follows:
j (x) = min{ A ATx - j *(A)} ,
(21)
where we now allow x and A to be vectors . The conjugate function j * (A) can be obtained from the following dual expression :
j *(A) = min{ ATx - j (x )} . x
(22)
This relationship is easily understood geometrically , as shown in Fig . 16. Here we plot f (x ) and the linear function AX for a particular value of A. The short vertical segments represent values AX- f (x ) . It is clear from the figure that we need to shift the linear function AX vertically by an amount which is the minimum of the values AX - f (x ) in order to obtain an upper bounding line with slope A that touches f (x ) at a single point . This observation both justifies the form of the conjugate function , as a minimum over differences Ax - f (x ) , and explains why the conjugate function appears as the intercept
in Eq. (21). I t is an easy exercise to verify that the conj ugate function for the logarithm is f * (A) = In A + 1, and the conjugate function for the log logistic function is the binary entropy H (A) . Although we have focused on upper bounds in this section , the frame work of convex duality applies equally well to lower bounds ; in particular
AN INTRODUCTION TO VARIATIONALMETHODS
127
f (x)
x Figure
16
tions
-
.
The
conjugate
represented
function
as
dashed
f
lines
-
*
(
A
)
is
between
obtained
by
AX
and
f
(
x
minimizing
)
across
the
devia
-
.
for convex f (x ) we have : j
(
x
)
=
maX
{
ATx
-
j
*
(
A
)
}
(23)
,
. , \
where f
*
(
A
)
=
max
{
AT
x
-
f
(
x
)
(24)
}
x
is
the
conjugate
We
not
function
have
restricted
on
to
transforming
linear
the
of
the
in
x2
function
we
.
focused
write
bounds
bounds
.
argument
(
can
linear
Jaakkola
of
&
the
Jordan
in
More
this
section
general
of
1997a
but
bounds
function
,
,
)
.
interest
For
convex
can
duality
be
rather
example
obtained
by
than
,
if
f
is
the
(
x
)
is
value
concave
:
f (x) = min{AX2 - l *(A)} , .,\
(
25
)
where J*('\) is the conjugatefunction of J(x) == f (x2). Thus the transformation yields a quadratic bound on f (x). It is also worth noting that suchtransformationscanbe combinedwith the logarithmictransformation utilized earlier to obtain Gaussianrepresentations for the upper bounds. This can be useful in obtaining variational approximationsfor posterior distributions (Jaakkola& Jordan, 1997b). To summarize , the generalmethodologysuggestedby convexduality is the following. We wish to obtain upper or lower boundson a function of interest. If the function is already convexor concavethen we simply
128
MICHAELI. JORDANET AL.
calculate
the
then
we
convex
the
a
then
also
,
such
the
function
is
as
For
or
function
approach
in
to
logarithm
,
whose
be
concave
the
of
conjugate
this
the
convex
renders
transformations
the
.
not
that
consider
calculate
back
transform
If
transformation
may
transform
properties
the
argument
the
transformed
useful
we
inverse
has
,
function
need
to
useful
to
algebraic
.
. 3
.
APPROXIMATIONS
FOR
CONDITIONAL
discussion
thus
ability
far
distributions
has
at
proximations
interest
,
in
interest
AND
our
Let
in
us
lower
on
P
(
and
(
Si
SiIS7r
and
I
S7r
i
)
(
At
are
the
(
'
(
the
.
That
)
,
providing
,
assume
product
of
)
(
S
(
HIE
ap
-
of
)
that
is
P
(
that
E
our
)
we
conditional
forms
that
,
(
Si
I S7r
(
respectively
n
P
(
upper
bound
SiIS7r
(
i
a
)
,
-
Af
)
and
where
Af
appropriate
the
upper
i
,
parameterizations
first
an
have
probabil
pU
bounds
Consider
is
=
Suppose
local
have
variational
.
)
the
lower
bounds
P
of
we
and
bounds
upper
-
these
probability
.
each
that
upper
lower
do
prob
probabilities
P
marginal
local
How
global
concreteness
for
different
and
the
for
bound
is
.
the
?
graphs
upper
)
model
the
distribution
and
directed
generally
upper
that
i
graphical
conditional
problems
an
At
a
for
for
problem
learning
focus
bound
ities
the
inference
interest
of
approximations
approximations
for
the
on
nodes
into
particular
in
for
PROBABILITIES
focused
the
translate
pL
JOINT
PROBABILITIES
The
is
We
We
.
invertible
.
.
and
find
an
concave
function
space
function
for
or
of
4
conjugate
look
,
bounds
we
.
have
Given
:
)
'"
: : : ;
n
pU
(
SiIS7r
(
i
)
'
AY
(26)
)
' "
for
Eq
any
.
fixed
ing
For
settings
(
26
,
sums
example
)
of
must
thus
values
hold
upper
of
for
any
bounds
over
the
,
on
and
H
(
E
)
parameters
S
whenever
on
be
a
the
disjoint
Af
some
probabilities
form
E
P
variational
of
marginal
variational
letting
the
subset
other
can
right
-
partition
{LH }P(H,E)
hand
be
~ L n pU(SiIS7r (i)' Af), {H} i
Moreover
,
is
obtained
side
of
.
subset
of
S
,
by
the
we
equation
have
held
tak
-
.
:
(27)
where , as we will see in the examples to discussed below , we choose the vari ational forms pU (Si IS7r(i ), Af ) so that the summation over H can be carried
AN INTRODUCTION
TO VARIATIONAL
METHODS
129
out efficiently (this is the key step in developing a variational method). In either Eq. (26) or Eq. (27), given that these upper bounds hold for any
settingsof valuesthe variationalparametersAf , they hold in particular for optimizing settings of the parameters . That is, we can treat the right -hand
side of Eq. (26) or the right -hand side Eq. (27) as a function to be minimized
with respectto Af . In the latter case,this optimizationprocesswill induce interdependencies betweenthe parametersAf . Theseinterdependencies are desirable ; indeed they are critical for obtaining a good variational bound on the marginal probability of interest . In particular , the best global bounds are obtained when the probabilistic dependencies in the distribution are reflected in dependencies in the approximation . To clarify the nature of variational bounds , note that there is an im portant distinction to be made between joint probabilities (Eq . (26) ) and
marginal probabilities (Eq. (27)). In Eq. (26), if we allow the variational parameters to be set optimally
for each value of the argument S , then it
is possible (in principle ) to find optimizing settings of the variational parameters that recover the exact value of the joint probability . (Here we assume that the local probabilities P (SiIS -rr(i)) can be represented exactly via a variational transformation , as in the examples discussed in Section
4.1). In Eq. (27), on the other hand, we are not generally able to recover exact values of the marginal by optimizing over variational parameters that depend only on the argument E . Consider , for example , the case of a node Si E E that has parents in H . As we range across { H } there will be summands on the right -hand side of Eq . (27) that will involve evaluating the
local probability P (SiIS-rr(i)) for different values of the parents S-rr(i). If the
variationalparameterAf dependsonly on E, we cannotin generalexpect to obtain an exact representation for P (SiIS-rr(i)) in each summand. Thus, some of the summands in Eq . (27) are necessarily bounds and not exact values
.
This observation provides a bit of insight into reasons why a variational bound might be expected to be tight in some circumstances and loose in others . In particular , if P (SiISrr(i)) is nearly constant as we range across Srr(i )' or if we are operating at a point where the variational representation
is fairly insensitiveto the setting of Af (for examplethe right-hand sideof the logarithm in Fig . 14) , then the bounds may be expected to be tight . On the other hand , if these conditions are not present one might expect that the bound would be loose. However the situation is complicated by the
interdependencies betweenthe Af that areinducedduringthe optimization process . We will
return
to these
issues in the discussion
.
Although we have discussed upper bounds , similar comments apply to lower bounds , and to marginal probabilities obtained from lower bounds on the joint
distribution
.
130
MICHAELI. JORDAN ET AL.
The two
conditional
per
and
and
distributions lower
lower
bounds
, then
fewer
est
and
Finally
, it
simply
as
strict
a
ability
to
conditional bound
on
substitute (H
tion
)
and
to
P
( HIE
In
the
been
as
important
forward
to
ease
and
on
case
a
utility
.
have
simplify
more
readily
, the
in
development
require
9Note necessarily
way
for
a
-
to
inter
this
is
prob
obtained , we
into
might
the
obtain
a
parameters
. We
variational
form
an
-
provide do
marginal
calculate
we
practical
for
approxima
-
choices
effective
a
in
of
node
properties
.
Also
,
Eq
. ( 27 are
variational
)
are
simple
depend functions
which
the
transfor
certain
approximations
-
architectures others
functions not
,
model
probability
variational
than
currently
-
. The
section
conditional
transformation
issues
this
in
well
; in some
par cases
understood can
in
some
and cases
.
P ( H , E ) in E
general jointly
a. s a marginal exhaust
the
probability set
of
nodes
-
.
architecture
in
to
con
straight
probability
others
of
architectures
new
node
-
All
examples
necessarily
regime
than
variational in
for
of
the
related
not
parameter
algebraic
and
to is
.
provide
degree
outlined
choice
and
certain
it
frame
examples
interest a
that
methods
variational - out
generalized ,
readily
. These
H
be
general
worked
. To
certain more
creativity
that
of
the
bounds
treat
the of
particular ,
others of
assume
the
under
substantial
that
of
that
One
example
approximation
the
useful
marginal
complex
the
be
methods
) .
to
illustrate
however
, including
particular
that
and
also
variational
number
can ,
applying
themselves
mations
ticular
that
and
In
lend
a
variational
of
topology
operated
denomi
involves
parameterized
methodology
details
functions
the
numerator
thus
form
architectures
emphasize
develop
graph
on
upper
Generally
, the
bound
fitting
the
will in
histories to
the
we
variational
architectural
the
a
, for
by
variational
applied
involve of
serve is
this
sections
has
examples
also It
it
into
utilize
.
-
) .
examples
crete
parameters
then
have
involves
as
used
is
. Thus )
of up
numerator
parameters
(E
must
can
than are
variational
P
ratio
obtain
the
methods
that
likelihood
the
To
.
rather
distribution
the
following
as
these
the
these
, E
work
is
substitute
) .9
bounds
= = HUE
m ~ thods
approximation
probability
lower
S
variational
sampling
variational
, and
can
as
lower
evaluation
, is
(E
denominator
, because
which
that
hand
) j P
the
and
approximations
( much
obtain
upper
in
, E
, we
and
finished
case
other
(H
distribution
obtain
noting
tractable
P
numerator
function
the
=
conditional
essentially
a
worth
) , on )
the
the
simply
is
bounds
to
, in
is
( HIE ( HIE
can
is
. Indeed
sums
we
labor
P ; i .e . , P the
both
, if
our
sums
no
on
, however
nator
on
bounds
speaking
P
distribution
marginal
; that S .
is , we
do
not
-
AL METHODS ANINTRODUCTION TOVARIATION 4.4. SEQUENTIAL Let
us now
AND BLOCK
consider
in
131
METHODS
somewhat
more
detail
how
variational
methods
can be applied to probabilistic inference problems . The basic idea is that suggested above - we wish to simplify the joint probability distribution by transforming the local probability functions . By an appropriate choice of variational transformation , we can simplify the form of the joint probability distribution and thereby simplify the inference problem . We can transform some or all of the nodes. The cost of performing such transformations is that we obtain bounds or approximations to the probabilities rather than exact
results
.
The option of transforming only some of the nodes is important implies a role for the exact methods as subroutines within a variational
; it ap -
proximation . In particular , partial transformations of the graph may leave
someof the original graphical structure intact and/ or introduce new graphical structure to which exact methods can be fruitfully applied . In general , we wish to use variational approximations in a limited way, transforming the graph into a simplified graph to which exact methods can be applied . This will in general yield tighter bounds than an algorithm that transforms the entire graph without regard for computationally tractable substructure . The majority of variational algorithms proposed in the literature to date can be divided into two main classes: sequential and block. In the sequential approach , nodes are transformed in an order that is determined during the inference process. This approach has the advantage of flexibility and generality , allowing the particular pattern of evidence to determine the best choices of nodes to transform . In some cases, however, particularly when there are obvious substructures in a graph which are amenable to exact methods , it can be advantageous to designate in advance the nodes to be transformed . We will see that this block approach is particularly natural
in the setting
5 . The sequential
of parameter
estimation
.
approach
The sequential approach introduces variational transformations for the nodes in a particular order . The goal is to transform the network until the result ing transformed
network
is amenable
to exact methods . As we will see in
the examples below , certain variational transformations can be understood graphically as a sparsification in which edges are removed from the graph . A series of edge removals eventually renders the graph sufficiently sparse that an exact method becomes applicable . Alternatively , we can variation ally transform all of the nodes of the graph and then reinstate the exact node probabilities sequentially while making sure that the resulting graph stays computationally tractable . The first example in the following section
132
MICHAELI. JORDAN ET AL.
illustrates
the latter
approach Many example run
approach
time
of the exact
methods
provide
, one can run
a greedy
triangulation
of the junction
is sufficiently
small , in terms
the exact Ideally
made
the
the resulting be as small
of the
is , an ordering
would
as possible
order
clique
) . Thus
have been . In
the
Jaakkola
and
Jordan
is a bipartite
findings
are based
QMR - DT graph
i .e ., symptoms of the joint
impact
on inference
present
no difficulties
associated
Repeating
. Moreover
nodes
and
would
be
so that
( in particu graph
, particularly
-
would
given
that
1 -
to
approach
-
of a positive
there
for the findings
therefore
they
, the negative form that
been made
have
findings
probabilities updates
and focus findings
representation
:
concave ; thus , as in the
to express
the
on
.
= lid ) = I - e - EjE1r (i ) 6ijdj - 6iO
e - x is log
no
of the prob the
are positive
, we have the following
are not
be marginal
on the disease
have already
varia -
nodes that
assume
for
QMR - DT
probabilities
the exponential
findings
when
finding
can be used
can simply
and
of
QMR - DT
of sequential
symptom
us therefore
findings inference
context
the
( Eq . ( 8 ) for the negative that
observed
given
Eq . ( 9 ) for convenience
, we are able
we return
the conditional
of negative
.
in the
, as we have discussed -
orderings
. As we have seen , the
) . Note
time . Let
the negative
node
presented
by omission
for inference
of performing
probability
function
function
inference
be chosen
at each step
an application
network
are not
distribution
P ( fi The
the
run time
to the
triangulated
variational
findings that
in linear
with
the problem for the
section
in which
in Eq . ( 8 ) , the effects be handled
bound
transformations
would
problem
best
on the noisy - OR model
and Eq . ( 9 ) for the positive
can
allotted
as possible
( 1997c ) present
to the
network
ability
time . For
estimated
to transform
used to choose
following
QMR - DT NETWORK
out
time
run
to upper
variational
of the resulting
is perhaps
5 .1. THE
-
their
. If this
of the nodes
and show how a sequential in this network .
methods
bound
algorithm
in which
is a difficult
approach
example
network inference
ized
the former
is unlikely to produce the simplest graph at each step ; partial orders must be considered . In the literature to date
sequential
a specific
that
overall
be as simple
the maximal
procedures
The
findings
illustrates
.
, that
a single ordering that is , different heuristic
of the
choice
graph
lar , such that
tional
example
algorithm
can stop introducing
procedure
optimally
tests
tree inference
proced ure , the system run
and the second
.
variational
upper
( 28 ) case of the bound
logistic
in terms
of
134
MICHAELI. JORDANET AL.
The sequential methodology utilized by Jaakkola and Jordan for infer ence in the QMR -DT network actually proceeds in the opposite direction . They first transform all of the nodes in the graph . They then make use of a simple heuristic to choose the ordering of nodes to reinstate , basing the choice on the effect of reinstating each node individually starting from the completely transformed state . (Despite the suboptimality of this heuristic , they found that it yielded an approximation that was orders of magnitude more accurate than that of an algorithm that used a random ordering ). The algorithm then proceeds as follows : (1) Pick a node to reinstate , and consider the effect of reintroducing the links associated with the node into the current graph . (2) If the resulting graph is still amenable to exact methods , reinstate the node and iterate . Otherwise stop and run an exact method . Finally , (3) we must also choose the parameters Ai so as to make the approximation as tight as possible . It is not difficult to verify that products of the expression in Eq . (32) yield an overall bound that is a convex function of the Ai parameters (Jaakkola & Jordan , 1997c) . Thus standard optimization algori thms can be used to find good choices for the Ai. Jaakkola and Jordan (1997c) presented results for approximate inference on the " CPC cases" that were mentioned earlier . These are difficult cases which have up to 100 positive findings . Their study was restricted to upper bounds because it was found that the simple lower bounds that they tried were not sufficiently tight . They used the upper bounds to determine vari ational parameters that were subsequently used to form an approximation to the conditional posterior probability . They found that the variational approach yielded reasonably accurate approximations to the conditional posterior probabilities for the CPC cases, and did so within less than a minute of computer time .
5.2. THEBOLTZMANN MACHINE Let us now consider a rather different example . As we have discussed, the Boltzmann machine is a special subset of the class of undirected graph ical models in which the potential functions are composed of products of quadratic and linear "Boltzmann factors ." Jaakkola and Jordan (1997a) in troduced a sequential variational algorithm for approximate inference in the Boltzmann machine . Their method , which we discuss in this section , yields both upper and lower bounds on marginal and conditional probabilities of interest . Recall the form of the joint probability machine :
distribution
for the Boltzmann
P(S)=eLi <j8ij Si Sj +Li8iOSi Z .
(33)
AN INTRODUCTION TO VARIATIONALMETHODS
135
To obtain marginal probabilities such as P (E ) under this joint distribu tion , we must calculate sums over exponentials of quadratic energy func tions . Moreover , to obtain conditional probabilities such as P (HIE ) := P (H , E )/ P (E ), we take ratios of such sums, where the numerator requires fewer sums than the denominator . The most general such sum is the par tition function itself , which is a sum over all configurations { S } . Let us therefore focus on upper and lower bounds for the partition function as the general case; this allows us to calculate bounds on any other marginals or conditionals of interest . Our approach is to perform the sums one sum at a time , introducing variational transformations to ensure that the resulting expression stays computationally tractable . In fact , at every step of the process that we describe , the transformed potentials involve no more than quadratic Boltz mann factors . (Exact methods can be viewed as creating increasingly higher order terms when the marginalizing sums are performed ) . Thus the trans formed Boltzmann machine remains a Boltzmann machine . Let us first consider lower bounds . We write the partition function as follows :
~ eEj
InSiE L{O ,l} -
-
>
L .(}jkSjSk + j:l:i L (}joSj+ In SiE L{O,l}eLj~i (}ijSiSj +(}iOSi {J.
(}jkSjSk+ L
j#i
(35)
(JjoSj + Af L (}ijSj+ (JiO+ H{Af), (36) j#i
where the sum in the first term on the right -hand side is a sum over all pairs j < k such that neither j nor k is equal to i , where H (.) is as before the binary entropy function , and where Af is the variational parameter associated with node Si . In the first line we have simply pulled outside of the sum all of those terms not involving Si , and in the second line we
136
MICHAEL10JORDANET ALo
~ I;~ =::=:~:I
sl. (a)
(b)
Figure 18. The transformation of the Boltzmann machine under the approximate marginalization over node Si for the case of lower bounds . ( a) The Boltzmann machine before the transformation . (b ) The Boltzmann machine after the transformation , where Si has become delinked . All of the pairwise parameters , ()jk , for j and k not equal to i , have remained unaltered . As suggested by the wavy lines , the linear coefficients have changed for those nodes that were neighbors of Si .
have perfoI 'med the sum over the two values of Si . Finally , to lower bound the expression in Eq . (35) we need only lower bound the term In (1 + eX) on the right -hand side. But we have already -found variational bounds for a related expression in treating the logistic function ; recall Eq . (18) . The upper bound in that case translates into the lower bound in the current case:
In(l + e- X) ~ AX+ H(A).
(37)
This is the bound that we have utilized in Eq . (36) . Let us consider the graphical consequences of the bound in Eq . (36) (see Fig . 18) . Note that for all nodes in the graph other than node Si and its neighbors , the Boltzmann factors are unaltered (see the first two terms in the bound ) . Thus the graph is unaltered for such nodes. From the term in parentheses we see that the neighbors of node Si have been endowed with new linear terms ; importantly , however, these nodes have not become linked (as they would have become if we had done the exact marginalization ) . Neighbors that were linked previously remain linked with the same (}jk parameter . Node Si is absent from the transformed partition function and thus absent from the graph , but it has left its trace via the new linear Boltzmann factors associated with its neighbors . We can summarize the effects of the transformation by noting that the transformed graph is a new Boltzmann machine with one fewer node and the following parameters :
AN INTRODUCTION TOVARIATIONAL METHODS -
j ,k
()jk ()jO+ Af()ij
j
#
137
. 'l
=1= i .
Note finally that we also have a constant term Af (JiO+ H (Af ) to keep track of. This term will have an interesting interpretation when we return to the Boltzmann machine later in the context of block methods. Upper bounds are obtained in a similar way. We again break the partition function into a sum over a particular node Si and a sum over the configurations of the remaining nodes S\ Si. Moreover, the first three lines of the ensuing derivation leading to Eq. (35) are identical. To complete the derivation we now find an upper bound on In( 1 + eX). Jaakkola and Jordan (1997a) proposed using quadratic bounds for this purpose. In particular , they noted that : In(l + eX) = In(ex/ 2 + e- x/2) + x / 2
(38)
and that In(eX/2 + e- x/2) is a concavefunction of x2 (as can be verified by taking the secondderivative with respect to x2). This implies that In(l + eX) must have a quadratic upper bound of the following form: In(l + eX) S Ax2 + x/ 2 - g* (A).
(39)
where g* (A) is an appropriately defined conjugate function . Using these upper bounds in Eq. (35) we obtain:
InSiE L{O
(}jkSjSk + L OjoSj
+
j :f=i
Afj# LifJijSj +fJio
+_ 21 jL# i (JijSj+ (JiG - g*(Af), (40)
-
where Af is the variational parameter associated with node Si . The graphical consequences of this transformation are somewhat dif ferent than those of the lower bounds (see Fig . 19) . Considering the first two terms in the bound , we see that it is still the case that the graph is unaltered for all nodes in the graph other than node Si and its neighbors , and moreover neighbors of Si that were previously linked remain linked . The quadratic term , however, gives rise to new links between the previ ously unlinked neighbors of node Si and alters the parameters between previously linked neighbors . Each of these nodes also acquires a new linear term . Expanding Eq . (40) and collecting terms , we see that the approxi mate marginalization has yielded a Boltzmann machine with the following par ameters :
138
MICHAELI. JORDANET AL.
41 ;~~:=~~ 1
.
(a) Figure
19
.
The
transformation
marginalization
over
chine
before
where
the
Si
have
edges
,
The
with
a
+
2AfOjiOik
OjO
=
=
OjO
+
Oij
is
,
vantage
Jaakkola
does
given
as
by
reveal
6 . The block
the
/
AfO
2
,
a
ma
the
transformation
the
neighbors
-
,
of
Si
values
linear
algorithm
)
.
coefficients
that
= 1 =
i
j
= 1 =
i
g
upper
In
*
(
As
.
All
.
when
at
are
nodes
this
point
un
an
,
of
This
seeming
is
a
tighter
a
-
exact
transformation
.
bound
the
nodes
neighbors
readily
that
delinks
;
-
given
transformations
bound
as
.
transforma
,
links
the
)
bound
these
revealed
upper
Af
particular
upper
the
k
simply
is
between
structure
fact
-
combine
tree
The
,
additional
the
.
links
)
approximate
Boltzmann
parameter
new
Aforo
and
to
subroutine
1997a
of
[ j
+
lower
no
as
the
The
new
have
.
,
such
tractable
by
Jordan
Oio
natural
introducing
mitigated
&
a
the
)
all
have
also
+
by
particular
(
called
not
is
Si
2AfOiOOij
of
In
,
linked
of
+
more
.
,
formerly
a
after
suggest
consequences
structure
hand
edges
introduces
methods
tractable
other
node
2
is
somewhat
(
.
consequences
is
.
j
/
term
under
bounds
machine
dashed
computational
it
algorithm
the
(
Ojk
upper
Boltzmann
unaltered
=
exact
til
The
were
transformation
,
)
machine
of
neighbors
=
bound
Boltzmann
case
the
the
Ojk
have
delinked
b
As
are
constant
also
lower
,
parameters
the
(
that
lines
graphical
tions
those
wavy
the
the
.
and
the
and
Finally
for
delinked
linked
by
other
Si
.
become
become
of
node
transformation
has
suggested
(b)
on
delinked
disad
-
bound
.
approach
An alternative approach to variational inference is to designate in advance a set of nodes that are to be transformed . We can in principle view this "block approach " as an off-line application of the sequential approach . In the case of lower bounds , however, there are advantages to be gained by
AN INTRODUCTION
TO VARIATIONAL
METHODS
139
developing a methodology that is specific to block transformation . In this section , we show that a natural global measure of approximation accuracy can
be obtained
for
lower
bounds
via
a block
version
of the
variational
formalism . The method meshes readily with exact methods in cases in which tractable substructure can be identified in the graph . This approach was first presented by Saul and Jordan (1996) , as a refined version of mean field theory
for Markov
random
fields , and has been developed
further
in a
number of recent studies (e.g., Ghahramani & Jordan, 1997; Ghahramani & Hinton , 1996; Jordan, et al., 1997). In the block approach , we begin by identifying a substructure in the graph of interest that we know is amenable to exact inference methods (or , more generally , to efficient approximate inference methods ) . For example , we might pick out a tree or a set of chains in the original graph . We wish to use this simplified structure to approximate the probability distribution on the original graph . To do so, we consider a family of probability distri butions that are obtained from the simplified graph via the introduction of variational
parameters . We choose a particular
approximating
distribution
from the simplifying family by making a particular choice for the variational parameters . As in the sequential approach a new choice of variational parameters
must
be made
each time
new evidence
is available
More formally , let P (S) represent the joint distribution
.
on the graphical
model of interest , where as before S represents all of the nodes of the graph and Hand E are disjoint subsets of S representing the hidden nodes and
the evidence nodes, respectively . We wish to approximate the conditional probability P (HIE ). We introduce an approximating family of conditional probability distributions , Q (HIE , A) , where A are variational parameters . The graph representing Q is not generally the same as the graph repre senting P ; generally it is a sub-graph . From the family of approximat ing distributions
Q , we choose a particular
distribution
by minimizing
the
Kullback -Leibler (KL ) divergence, D (QIIP), with respect to the variational parameters :
A* == argminA D (Q(HIEix ) II P (HIE )), where for any probability is defined
as follows
distributions
(41)
Q (S) and P (S) the KL divergence
:
Q (S)
D(QIIP) = L Q(S) InP( S) " { S} The minimizing
values of the variational
(42)
parameters , A* , define a partic -
ular distribution , Q(HIE , A*), that we treat as the best approximation of P (HIE ) in the family Q (HIE , A).
140
MICHAELI. JORDANET AL.
One
simple
justification
approximation
for
accuracy
ability
of
the
mations
Q
evidence
( HIE
inequality
is
as
P
, ,\ ) .
( E
Indeed
follows
using
that
it
) ,
( i .e . , we
the
the
( E
)
==
In
=
the
L
In
~
a
between
seen
KL
to
be
L
lower
obtain
We
can
also
,
as ,
the
the
numbers
log ,
. ,
for
vector
the
"
in
Eq
.
( HIE
) .
KL
indeed Eq
convex .
( 23
( E
of )
using
prob
-
approxi
-
Jensen
' s
in
the
values
) .
~~..:~
2
( HIE
)
]
sides
Thus
the
,
right
.
( 43
of by
this
the side
according
)
equation
is
positivity
- hand A
of
of to
by
,
( H
for
simplicity
( HIE
) .
can
Eq Eq
. .
the
( 43
( 41
the
)
) ,
viewed
} : =
P
( H
f In
eln
P
( x
P
( E
( H
,E
case
)
H in as
H
.
to
Theat be
to
the
variables
be
define
appeal
viewed
" >" "
also
for
the
be
of
,
In
In
can
configurations
Finally
=
,
, >" ) of
)
an with
parameter
, E
of
making
approach
configuration
)
~ .:~
[ ~
block
Q
set
( E
In
divergence the
expression
P
:~
is we
DIVERGENCE
variational
( 23
.
hand
choosing
Consider
P
following
)
( QIIP
{ H
in
family
ofP
of
the
)
)
by
KL
each
the
( HIE
) ,
THE
of
In
In
is
measure on
.
- valued
on
Q
1991
distribution
one
, E
right
linking
probability
" x that
,
choice
1997
( H
D ,
bound
The
defined
variable
verified
the
logarithm
P
and
Moreover
thereby ,
H
numbers
vector
) .
the
theory
left
AND
( Jaakkola
real
the
( E lower
nodes
over
P
justify
approach valued
a
bound
in
Q
Q
Thomas
DUALITY
duality
)
}
divergence &
tightest
CONVEX
vex
of
on
the
6 .1 .
KL
( Cover
bound
~
the
the
divergence
as
lower
}
{ H
easily
best
likelihood
bound
{ H
difference
divergence
the
:
InP
The
KL
yields
InP
Eq a
.
of
discrete
as
a
.
-
( 23
) .
this
-
vector
Treat
vector
( E
con
sequential
this More of
real
vector ) .
It
as can
be
) :
)
( 44
)
}
, E
) .
Moreover
,
by
direct
substitution
) :
f * (Q) = mill
(HIE ,;\)InP(H,E)-lnP (E) {2:::H}Q
(45)
AN INTRODUCTION TO VARIATIONAL METHODS
141
and minimizing with respect to In P (H , E ), the conjugate function f * (Q) is seento be the negative entropy function ~ {H} Q(HIE ) In Q(HIE ). Thus, using Eq. (23), we can lower bound the log likelihood as follows: In P (E ) ~ :2:::: Q(HIE ) In P (H , E ) - Q(HIE ) In Q(HIE ) {H}
(46)
This is identical to Eq. (43). Moreover, we seethat we could in principle recover the exact log likelihood if Q were allowed to range over all probability distributions Q(HIE ). By ranging over a parameterized family Q(HIE , A), we obtain the tightest lower bound that is available within the family. 6.2. PARAMETER ESTIMATION VIA VARIATIONAL METHODS Neal and Hinton (this volume) have pointed out that the lower bound in Eq. (46) has a useful role to play in the context of maximum likelihood parameter estimation. In particular , they make a link between this lower bound and parameter estimation via the EM algorithm . Let us augment our notation to include parameters0 in the specification of the joint probability distribution P (SIO). As before, we designatea subset of the nodes E as the observedevidence. The marginal probability P (EIO), thought of as a function of (), is known as the likelihood. The EM algorithm is a method for maximum likelihood parameter estimation that hillclimbs in the log likelihood . It does so by making use of the convexity relationship between In P (H , EIO) and In P (EIO) described in the previous section. In Section 6 we showedthat the function
(Q,8) = L Q(HIE) InP(H, EIO) - Q(HIE) InQ(HIE) {H}
(47)
is a lower bound on the log likelihood for any probability distribution Q(HIE ). Moreover, we showed that the difference between InP (EI (}) and the bound (Q, O) is the KL divergence between Q(HIE ) and P (HIE ). Supposenow that we allow Q(HIE ) to range over all possible probability distributions on H and minimize the KL divergence. It is a standard result (cf. Cover & Thomas, 1991) that the KL divergenceis minimized by choosing Q(HIE ) == P (HIE , 0), and that the minimal value is zero. This is verified by substituting P (HIE , ()) into the right -hand side of Eq. (47) and recovering In P (E I0) . This suggeststhe following algorithm . Starting from an initial parameter vector 0(0), we iterate the following two steps, known as the "E (expectation) step" and the "M (maximization) step." First , we maximize the bound (Q, O) with respect to probability distributions Q. Second, we fix
142
MICHAELI. JORDAN ET AL.
Q and maximize the bound I:,(Q, 0) with respect to the parameters O. More formally , we have: (E step) :
Q(k+ l )
=
argmaxQ
(Q , (}(k))
(48)
(M step) : fJ(k+l ) =
argmax(}
(Q(k+l ), fJ)
(49)
which is coordinate ascent in (Q , ()) . This can be related to the traditional presentation of the EM algorithm (Dempster , Laird , & Rubin , 1977) by noting that for fixed Q , the right -
hand side of Eq. (47) is a function of fJonly through the In P (H , ElfJ) term . Thus ma:ximizing (Q , ()) with respect to () in the M step is equivalent to ma:ximizing the following function :
L P(HIE,(}(k)) InP(H, ElfJ).
(50)
{H }
Maximization
of this function , known as the "complete log likelihood " in
the EM literature Let
us now
, defines the M step in the traditional return
to
the
situation
in
which
presentation
we are
unable
of EM . to
com -
pute the full conditional distribution P (HIE , fJ). In such casesvariational methodology suggests that we consider a family of approximating distribu tions . Although we are no longer able to perform a true EM iteration given
that we cannot avail ourselvesof P (HIE , fJ), we can still perform coordinate ascent in the lower bound imizing
the KL divergence
(Q , fJ) . Indeed , the variational strategy of min with respect to the variational
parameters
that
define the approximating family is exactly a restricted form of coordinate ascent in the first argument of (Q, (J) . We then follow this step by an "M step " that increases the lower bound with respect to the parameters
(J.
This point of view , which can be viewed as a computationally tractable approximation to the EM algorithm , has been exploited in a number of recent architectures , including the sigmoid belief network , factorial hidden Markov
model
and
cuss in the following
hidden
Markov
decision
tree
architectures
sections , as well as the " Helmholtz
that
we
dis -
machine " of Dayan ,
et ale (1995) and Hinton , et ale (1995). 6 .3 .
EXAMPLES
We now return to the problem of picking a tractable variational parame terization for a given graphical model . We wish to pick a simplified graph which is both rich enough to provide distributions that are close to the true distribution , and simple enough so that an exact algorithm can be uti lized efficiently for calculations under the approximate distribution . Similar consideration ~ hold for the variational parameterization : the variational parameterization must be representationally rich so that good approximations
AN
are
available
KL
divergence
stuck
and
that
some
6 .3 . 1 .
Mean
. In
. It
field
Section and
bounds
machine
bounds
Recall written
also
the
as follows
Consider in
lJij8iSj Hand that Si
0 for
now
the
Sj we
for
the
E E
also
context
to
the
two
become
sums
8i
and
node
sum The
mann
, we
are
" mean machines
<j OijSiSj
+ Li
(JiOSi
Sj
reasonably
examples
ac -
.
that now
yielded
revisit and
the
discuss
machine
can
be
and
not
E E
. In
range
contributions
Zc
over
.lize . If
norma
a linear associated
summary
, we
, lJ )
Si
is given
on
E
contribution with
can
nodes
express
the
( 52 )
in
with
H the
and
the
evidence
updated nodes
} : = 8ijSjo jEE
machine
:
( 53 )
as follows
the
( Peterson form
we
terms
nodes
< j (Jij Si Sj + Ei
" approximation
P ( HIE contribution
.
:
associated
8io +
} : = { eEi {H }
graph
LIt . O~ t.OS t. < j Oij SiSj + """"' Zc '
eLi
function
the
E E , the
becomes
vanish
to
8j
when
, linear
in
distribution
and
vanishes
( 51 )
neighbors
conditional
8i
, 0 ) as follows
, (} ) =
is a particular
are
contribution
a Boltzmann field
'
that
Si . Finally
restricted
partition
have
found
. Boltzmann
, which
P ( HIE
Zc =
In
been
approach
the
nodes
constants
io include
updated
has
. We
block
for
of the
. For
8io =
The
the
Z
quadratic
with
distribution
the
it
algorithm
approaches
e
a constant
E E , the
parameters
of these
yield
machine
of
probability
representation
P ( HIE
where
getting
all
such
variational
Boltzmann
the
relate
nodes
associate
conditional
not
machine
joint
machine
reduces
and
to realize
of cases can
the
:
=
a Boltzmann
minimizes
parameters
several
143
that
possible
we discuss
P ( SlfJ ) = Oij
good
, in a number
Li
where
a procedure
necessarily
a sequential
within
. We
that
METHODS
approximations
section
discussed
lower
Boltzmann
of finding
is not
Boltzmann
5 .2 we
so that
variational this
VARIATIONAL
enough
; however
simple
solutions
TO
hope
minimum
relatively
lower
simple
simultaneously
curate
upper
yet
has
in a local
desiderata
In
INTRODUCTION
of variational
:
(JiOSi } 0
subset
( 54 )
H .
& Anderson
, 1987 ) for
approximation
Boltz
in which
a
144
MICHAELI. JORDAN ET AL.
0
f =:~~'
.
(b)
(a)
Figure 20. (a) A node Si in a Boltzmann machine with its Markov blanket . (b) The approximating mean field distribution Q is based on a graph with no edges. The mean field equations yield a deterministic relationship , represented in the figure with the dotted lines, between the variational parameters Jl,i and J.l.j for nodes j in the Markov blanket of node i .
completely factorized distribution is used to approximate P (HIE , fJ). That is, we consider the simplest possible approximating distribution ; one that is obtained by dropping all of the edges in the Boltzmann graph (see
Fig. 20). For this choice of Q(HIE , J..L), (where we now use J..L to represent the variational parameters ) , we have little choice as to the variational parameterization - to represent as large an approximating family as possible we endow each degree of freedom Si with its own variational parameter J..Li. Thus Q can be written as follows :
Q(HIE,J..L) = II J..Lfi(1- J..Li)l- Si,
(55)
iEH
wherethe product is takenoverthe hiddennodesH . Formingthe KL divergencebetweenthe fully factorizedQ distribution and the P distribution in Eq. (52), we obtain: D (QIIP) = L [JLiIn JLi+ (1 - JLi) In(l - JLi)] t -
L ijJ .LiJ -jL.". OioJLi+ In Zc, .J.(J.L t<
(56)
where the sums range across nodes in H . In deriving this result we have used the fact that , under the Q distribution , Si and Sj are independent random variables with mean values J1 ,i and J1 ,j . We now take derivatives of the KL divergencewith respectto Jli- noting that Zc is independent of J1 ,i- and set the derivative to zero to obtain the following equations:
JLi =a LJ 8ijJLj +(JiO,
(57)
145
AN INTRODUCTION TO VARIATIONAL METHODS
where a (z) = 1/ (1 + e- Z) is the logistic function and we define (}ij equal to (Jji for j < i . Eq . (57) defines a set of coupled equations known as the "mean field equations ." These equations are solved iteratively for a fixed point solution . Note that each variational parameter J1 ,i updates its value based on a sum across the variational parameters in its Markov blanket (cf . Fig . 20b) . This can be viewed as a variational form of a local message passing algorithm . The mean field approximation for Boltzmann machines can provide a reasonably good approximation to conditional distributions in dense Boltz mann machines , and is the basis of a useful approach to combinatorial opti mization known as "deterministic annealing ." There are also cases, however, in which it is known to break down . These cases include sparse Boltzmann machines and Boltzmann machines with "frustrated " interactions ; these are networks
whose potential functions embody constraints between neighboring nodes that cannot be simultaneously satisfied (see also Galland, 1993). In the case of sparse networks, exact algorithms can provide help; indeed, this observation led to the use of exact algorithms as subroutines within the "structured mean field" approach pursued by Saul and Jordan (1996).

Let us now consider how to make use of the mean field approximation for parameter estimation in Boltzmann machines. Writing out the lower bound of Eq. (47) for this case, we have:

ln P(E|θ) ≥ Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - ln Z - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ].

(58)

Taking the derivative with respect to θ_{ij} yields a gradient which has a simple "Hebbian" term μ_i μ_j as well as a contribution from the derivative of ln Z with respect to θ_{ij}. It is not hard to show that this latter derivative is ⟨S_i S_j⟩, where the brackets signify an average with respect to the (unconditional) Boltzmann distribution P(S|θ). Thus we have the following gradient algorithm for performing an approximate M step:
Δθ_{ij} ∝ ( μ_i μ_j - ⟨S_i S_j⟩ ).
(59)
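To make these two steps concrete, here is a minimal sketch (not taken from the text) of the mean field fixed-point iteration of Eq. (57) together with the approximate M step of Eq. (59). The function names, the learning rate, and the use of NumPy are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(theta, theta0, mu, n_sweeps=50):
    """Iterate mu_i = sigma(sum_j theta_ij mu_j + theta_i0), Eq. (57), to a fixed
    point.  `theta` is a symmetric coupling matrix with zero diagonal; `theta0`
    holds the linear terms (with any evidence already absorbed)."""
    mu = np.array(mu, dtype=float)
    for _ in range(n_sweeps):
        for i in range(len(mu)):
            mu[i] = sigmoid(theta[i] @ mu + theta0[i])
    return mu

def approximate_m_step(theta, mu_clamped, s_corr, lr=0.1):
    """One gradient step of Eq. (59): delta theta_ij proportional to
    (mu_i mu_j - <S_i S_j>).  `s_corr` stands in for the unconditional
    correlations <S_i S_j>, which are themselves intractable and would in
    practice be approximated (e.g. by a second, unconditional mean field
    pass), as the text goes on to discuss."""
    grad = np.outer(mu_clamped, mu_clamped) - s_corr
    np.fill_diagonal(grad, 0.0)
    return theta + lr * grad
```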
Unfortunately, however, given our assumption that calculations under the Boltzmann distribution are intractable for the graph under consideration, it is intractable to compute the unconditional average. We can once again appeal to mean field theory and compute an approximation to ⟨S_i S_j⟩, where we now use a factorized distribution on all of the nodes; however, the M step is now a difference of gradients of two different bounds and is therefore no longer guaranteed to increase the likelihood bound. There is a more serious problem, moreover, which is particularly salient in unsupervised learning problems. If the
data set of interest is a heterogeneous collection of sub-populations, such as in unsupervised classification problems, the unconditional distribution will generally be required to have multiple modes. Unfortunately the factorized mean field approximation is unimodal and is a poor approximation for a multi-modal distribution. One approach to this problem is to utilize multi-modal Q distributions within the mean-field framework; for example, Jaakkola and Jordan (this volume) discuss the use of mixture models as approximating distributions. These issues find a more satisfactory treatment in the context of directed graphs, as we see in the following section. In particular, the gradient for a directed graph (cf. Eq. (68)) does not require averages under the unconditional distribution. Finally, let us consider the relationship between the mean field approximation and the lower bounds that we obtained via a sequential algorithm in Section 5.2. In fact, if we run the latter algorithm until all nodes are eliminated from the graph, we obtain a bound that is identical to the mean field bound (Jaakkola, 1997). To see this, note that for a Boltzmann machine in which all of the nodes have been eliminated there are no quadratic and linear terms; only the constant terms remain. Recall from Section 5.2 that the constant that arises when node i is removed is μ_i θ'_{i0} + H(μ_i), where θ'_{i0} refers to the value of θ_{i0} after it has been updated to absorb the linear terms from previously eliminated nodes j < i. (Recall that the latter update is given by θ_{i0} := θ_{i0} + μ_j θ_{ij} for the removal of a particular node j.) Collecting together such updates for j < i, and summing across all nodes i, we find that the resulting constant term is given as follows:
Σ_i { θ'_{i0} μ_i + H(μ_i) } = Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ].
(60)
This differs from the lower bound in Eq. (58) only by the term ln Z, which disappears when we maximize with respect to μ_i.

6.3.2. Neural networks

As discussed in Section 3, the "sigmoid belief network" is essentially a (directed) neural network with graphical model semantics. We utilize the logistic function as the node probability function:
P(S_i = 1 | S_{π(i)}) = 1 / ( 1 + e^{ -Σ_{j∈π(i)} θ_{ij} S_j - θ_{i0} } ),
(61)
where we assume that θ_{ij} = 0 unless j is a parent of i (in particular, θ_{ij} ≠ 0 implies θ_{ji} = 0). Noting that the probabilities for both the S_i = 0 case
and the S_i = 1 case can be written in a single expression as follows:

P(S_i | S_{π(i)}) = e^{ (Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}) S_i } / ( 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ),
(62)
we obtain the following representation for the joint distribution:

P(S|θ) = ∏_i e^{ (Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}) S_i } / ( 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ).
(63)
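As a small illustration (an addition, not the chapter's code), the following sketch evaluates ln P(S|θ) for a given binary configuration by summing the logarithms of the factors in Eq. (63); the data-structure choices (parent lists, weight lists) are assumptions made for the example.

```python
import numpy as np

def log_joint_sigmoid_bn(S, parents, theta, theta0):
    """ln P(S | theta) for a sigmoid belief network, via Eq. (63).
    S: 0/1 vector over all nodes; parents[i]: list of parents pi(i);
    theta[i][k]: weight from the k-th parent of node i; theta0[i]: bias."""
    logp = 0.0
    for i, (pa, w, b) in enumerate(zip(parents, theta, theta0)):
        z_i = sum(w_k * S[j] for w_k, j in zip(w, pa)) + b  # sum_{j in pi(i)} theta_ij S_j + theta_i0
        logp += z_i * S[i] - np.logaddexp(0.0, z_i)         # log of the factor: z_i S_i - ln(1 + e^{z_i})
    return logp
```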
We wish to calculate conditional probabilities under this joint distribution. As we have seen (cf. Fig. 6), inference for general sigmoid belief networks is intractable, and thus it is sensible to consider variational approximations. Saul, Jaakkola, and Jordan (1996) and Saul and Jordan (this volume) have explored the viability of the simple completely factorized distribution. Thus once again we set:

Q(H|E, μ) = ∏_{i∈H} μ_i^{S_i} (1 - μ_i)^{1-S_i},
(64)
and attempt to find the best such approximation by varying the parameters μ_i. The computation of the KL divergence D(Q‖P) proceeds much as it does in the case of the mean field Boltzmann machine. The entropy term ⟨ln Q⟩ is the same as before. The energy term ⟨ln P⟩ is found by taking the logarithm of Eq. (63) and averaging with respect to Q. Putting these results together, we obtain:

ln P(E|θ) ≥ Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - Σ_i ⟨ ln[ 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ] ⟩ - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ],
(65)
where ⟨·⟩ denotes an average with respect to the Q distribution. Note that, despite the fact that Q is factorized, we are unable to calculate the average of ln[1 + e^{z_i}], where z_i denotes Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}. This is an important term which arises directly from the directed nature of the sigmoid belief network (it arises from the denominator of the sigmoid, a factor which is necessary to define the sigmoid as a local conditional probability). To
deal with this term, Saul et al. (1996) introduced an additional variational transformation, due to Seung (1995), that can be viewed as a refined form of Jensen's inequality. In particular, for each node i a new variational parameter ξ_i is introduced, and the intractable term is upper bounded as follows:

⟨ ln[ 1 + e^{z_i} ] ⟩ ≤ ξ_i ⟨z_i⟩ + ln ⟨ e^{ -ξ_i z_i } + e^{ (1 - ξ_i) z_i } ⟩.

(66)

For ξ_i = 0 this reduces to the standard bound obtained from Jensen's inequality; for other values of ξ_i the bound can be tighter (Saul et al. also show that it can be made quite tight in the limiting case in which node i has a sufficiently large number of parents, so that a central limit theorem argument can be invoked). The expectation on the right-hand side of Eq. (66) is tractable under the factorized Q distribution, because e^{a z_i} factorizes across the parents of node i. Note also that the intractable term appears with a negative sign in Eq. (65); thus an upper bound on this term yields a lower bound on the log likelihood.

Substituting Eq. (66) into Eq. (65) and differentiating with respect to μ_i, we once again obtain a set of consistency equations for the variational parameters:

μ_i = σ( Σ_j θ_{ij} μ_j + θ_{i0} + Σ_j θ_{ji} ( μ_j - ξ_j ) - Σ_j K_{ij} ),

(67)

where σ(·) is the logistic function and K_{ij} denotes the contribution obtained by differentiating the expectation in Eq. (66) for child j with respect to μ_i. The equation for node i thus involves contributions from the parents of node i, from its children, and from the co-parents of its children; that is, the update for μ_i depends on the variational parameters in the Markov blanket of node i (see Fig. 21). As in the case of the Boltzmann machine, these equations are solved iteratively and can be interpreted as a local message passing algorithm. Consistency equations for the parameters ξ_i are obtained in the same way, by differentiating the bound with respect to ξ_i; Saul et al. (1996) and Saul and Jordan (this volume) provide the update equations and discuss their implementation. Yet another variational approach to sigmoid belief networks, which utilizes a slightly different transformation and provides upper bounds on likelihoods as well as lower bounds, has been developed by Jaakkola and Jordan (1996).
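As a concrete illustration (added here, not taken from the chapter), the sketch below evaluates the right-hand side of the bound in Eq. (66) for a single node under a factorized Bernoulli Q; it relies only on the fact that ⟨e^{a z_i}⟩ factorizes over the parents of node i. The function and variable names are assumptions of the example.

```python
import numpy as np

def expected_exp(a, theta_pa, theta0, mu_pa):
    """<exp(a * z_i)> under a factorized Bernoulli Q, where
    z_i = sum_j theta_ij S_j + theta_i0 and mu_pa are the parent means."""
    val = np.exp(a * theta0)
    for th, mu in zip(theta_pa, mu_pa):
        val *= (1.0 - mu) + mu * np.exp(a * th)
    return val

def refined_jensen_bound(xi, theta_pa, theta0, mu_pa):
    """Upper bound of Eq. (66) on <ln(1 + exp(z_i))> for one node,
    with variational parameter xi in [0, 1]."""
    mean_z = np.dot(theta_pa, mu_pa) + theta0
    return xi * mean_z + np.log(expected_exp(-xi, theta_pa, theta0, mu_pa)
                                + expected_exp(1.0 - xi, theta_pa, theta0, mu_pa))
```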
Figure 21. (a) A node S_i in a sigmoid belief network with its Markov blanket. (b) The mean field equations yield a deterministic relationship, represented in the figure with the dotted lines, between the variational parameters μ_i and μ_j for nodes j in the Markov blanket of node i.

Saul and Jordan (this volume) and Saul et al. (1996) have tested this variational framework empirically for layered sigmoid belief networks (Neal, 1992), including on a problem of handwritten digit recognition. They report performance that is competitive with other supervised learning systems, including comparisons with networks trained by Gibbs sampling, and they note two further advantages of the variational approach: the ability to deal with missing data (missing pixels), with only a slight degradation in performance, and the interesting appearance of weight-decay-like regularization terms in the variational equations. Frey, Hinton, and Dayan (1996) report further comparative empirical results for related belief network architectures.

Parameter estimation can also be treated within the variational framework. Differentiating the lower bound with respect to θ_{ij} yields a gradient of the following delta-rule-like form:

Δθ_{ij} ∝ ( μ_i - ⟨σ(z_i)⟩ ) μ_j.

(68)

Note that, in contrast to the Boltzmann machine update of Eq. (59), there is no need to compute averages under the unconditional distribution P(S|θ); the gradient depends only on quantities obtained from the (approximate) conditional distribution.

6.3.3. Factorial hidden Markov models

The factorial hidden Markov model (FHMM) is a multiple-chain hidden Markov architecture (see Section 3.5). Using the notation developed there, the joint probability distribution for the FHMM is given by:
Figure 22. (a) The FHMM . (b) A variational approximation for the FHMM can be obtained by picking out a tractable substructure in the FHMM graph. Parameterizing this graph leads to a family of tractable approximating distributions .
P({X_t^{(m)}}, {Y_t} | θ) = ∏_{m=1}^{M} [ π^{(m)}(X_1^{(m)}) ∏_{t=2}^{T} A^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) ] ∏_{t=1}^{T} P(Y_t | {X_t^{(m)}}_{m=1}^{M}).

(69)

Computation under this probability distribution is generally infeasible, because, as we saw earlier, the clique size becomes unmanageably large when the FHMM chain structure is moralized and triangulated. Thus it is necessary to consider approximations. For the FHMM there is a natural substructure on which to base a variational algorithm. In particular, the chains that compose the FHMM are individually tractable. Therefore, rather than removing all of the edges, as in the naive mean field approximation discussed in the previous two sections, it would seem more reasonable to remove only as many edges as are necessary to decouple the chains. In particular, we remove the edges that link the state nodes to the output nodes (see Fig. 22(b)). Without these edges the moralization process no longer links the state nodes and no longer creates large cliques. In fact, the moralization process on the delinked graph in Fig. 22(b) is vacuous, as is the triangulation. Thus the cliques on the delinked graph are of size N^2, where N is the number of states for a single chain. Inference in the approximate graph runs in time O(M T N^2), where M is the number of chains and T is the length of the time series. Let us now consider how to express a variational approximation using the delinked graph of Fig. 22(b) as an approximation. The idea is to introduce one free parameter into the approximating probability distribution, Q, for each edge that we have dropped. These free parameters, which we denote as λ_t^{(m)}, essentially serve as surrogates for the effect of the observation at time t on state component m. When we optimize the divergence D(Q‖P) with respect to these parameters they become interdependent; this (deterministic) interdependence can be viewed as an approximation to the probabilistic dependence that is captured in an exact algorithm via the moralization process.
Referring to Fig. 22(b), we write the approximating Q distribution in the following factorized form:
Q({X_t^{(m)}} | {Y_t}, θ, λ) = ∏_{m=1}^{M} π̃^{(m)}(X_1^{(m)}) ∏_{t=2}^{T} Ã^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}),
(70)
where λ is the vector of variational parameters λ_t^{(m)}. We define the transition matrix Ã^{(m)} to be the product of the exact transition matrix A^{(m)} and the variational parameter λ_t^{(m)}:

Ã^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) = A^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) λ_t^{(m)},
(71)
and similarly for the initial state probabilities π̃^{(m)}:

π̃^{(m)}(X_1^{(m)}) = π^{(m)}(X_1^{(m)}) λ_1^{(m)}.
(72)
This family of distributions respects the conditional independence statements of the approximate graph in Fig. 22, and provides additional degrees of freedom via the variational parameters. Ghahramani and Jordan (1997) present the equations that result from minimizing the KL divergence between the approximating probability distribution (Eq. (70)) and the true probability distribution (Eq. (69)). The result can be summarized as follows. As in the other architectures that we have discussed, the equation for a variational parameter (λ_t^{(m)}) is a function of terms that are in the Markov blanket of the corresponding delinked node (i.e., Y_t). In particular, the update for λ_t^{(m)} depends on the parameters λ_t^{(n)}, for n ≠ m, thus linking the variational parameters at time t. Moreover, the update for λ_t^{(m)} depends on the expected value of the states X_t^{(m)}, where the expectation is taken under the distribution Q. Given that the chains are decoupled under Q, expectations are found by running one of the exact algorithms (for example, the forward-backward algorithm for HMMs), separately for each chain. These expectations of course depend on the current values of the parameters λ_t^{(m)} (cf. Eq. (70)), and it is this dependence that effectively couples the chains. To summarize, fitting the variational parameters for a FHMM is an iterative, two-phase procedure. In the first phase, an exact algorithm is run as a subroutine to calculate expectations for the hidden states. This is done independently for each of the M chains, making reference to the current values of the parameters λ_t^{(m)}. In the second phase, the parameters λ_t^{(m)} are updated based on the expectations computed in the first phase. The procedure then returns to the first phase and iterates.
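The two-phase procedure can be written schematically as follows; this is a sketch under stated assumptions (the chain-wise forward-backward routine and the λ update of Ghahramani and Jordan (1997) are passed in as callables, not implemented here).

```python
def fit_fhmm_variational(lam, forward_backward, update_lambda, n_iters=20):
    """Schematic two-phase loop for the structured approximation of Eq. (70).
    lam[m] holds the parameters lambda_t^(m) for chain m.
    forward_backward(m, lam): expected states for chain m under Q (Phase 1).
    update_lambda(m, expectations): refit lambda for chain m (Phase 2)."""
    M = len(lam)
    for _ in range(n_iters):
        # Phase 1: exact inference within each decoupled chain.
        expectations = [forward_backward(m, lam) for m in range(M)]
        # Phase 2: update the variational parameters given those expectations.
        for m in range(M):
            lam[m] = update_lambda(m, expectations)
    return lam
```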
Figure 24. The "forest of trees approximation" for the HMDT. Parameterizing this graph leads to an approximating family of Q distributions.
We can also consider a "forest of trees approximation " in which the horizontal links are eliminated (see Fig . 24) . Given that the decision tree is a fully connected graph , this is essentially a naive mean field approximation on a hypergraph . Finally , it is also possible to develop a variational algorithm for the HMDT that is analogous to the Viterbi algorithm for HMMs . In particular , we utilize an approximation Q that assigns probability one to a single path in the state space. The KL divergence for this Q distribution is particularly easy to evaluate , given that the entropy contribution to the KL divergence (i .e., the Q In Q term ) is zero. Moreover , the evaluation of the energy (i .e., the Q In P term ) reduces to substituting the states along the chosen path into the P distribution . The resulting algorithm involves a subroutine in which a standard Viterbi algorithm is run on a single chain , with the other chains held fixed . This subroutine is run on each chain in turn . Jordan , et ale (1997) found that performance of the HMDT on the Bach chorales was essentially the same as that of the FHMM . The advantage of the HMDT was its greater interpretability ; most of the runs resulted in a coarse-to -fine ordering of the temporal scales of the Markov processes from the top to the bottom of the tree .
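The Viterbi-like variational algorithm just described is a coordinate ascent over single-chain paths; the sketch below (an illustrative assumption, with the single-chain Viterbi subroutine left abstract) shows its overall structure.

```python
def viterbi_style_fit(paths, viterbi_single_chain, n_sweeps=10):
    """Coordinate ascent for the single-path Q distribution described above.
    paths[m]: current state path for chain m.
    viterbi_single_chain(m, paths): re-optimizes chain m's path with the
    other chains held fixed (assumed, not defined here)."""
    M = len(paths)
    for _ in range(n_sweeps):
        for m in range(M):
            paths[m] = viterbi_single_chain(m, paths)
    return paths
```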
7. Discussion

We have described a variety of applications of variational methods to problems of inference and learning in graphical models. We hope to have convinced the reader that variational methods can provide a powerful and elegant tool for graphical models, and that the algorithms that result are simple and intuitively appealing. It is important to emphasize, however, that research on variational methods for graphical models is of quite recent origin, and there are many open problems and unresolved issues. In this
section we discuss a number of these issues. We also broaden the scope of the presentation and discuss a number of related strands of research.

7.1. RELATED RESEARCH
The methods that we have discussed all involve deterministic , iterative approximation algorithms . It is of interest to discuss related approximation schemes that are either non-deterministic or non-iterative . 7.1.1. Recognition models and the Helmholtz machine All of the algorithms that we have presented have at their core a nonlinear optimization problem . In particular , after having introduced the variational parameters , whether sequentially or as a block , we are left with a bound such as that in Eq . (27) that must be optimized . Optimization of this bound is generally achieved via a fixed -point iteration or a gradient -based algorithm . This iterative optimization process induces interdependencies between the variational parameters which give us a "best" approximation to the marginal or conditional probability of interest . Consider in particular a problem in which a directed graphical model is used for unsupervised learning . A common approach in unsupervised learning is to consider graphical models that are oriented in the "generative " direction ; that is, they point from hidden variables to observables. In this case the "predictive " calculation of P (E [H ) is elementary . The calculation of P (HIE ) , on the other hand , is a "diagnostic " calculation that proceeds backwards in the graph . Diagnostic calculations are generally non-trivial and require the full power of an inference algorithm . An alternative approach to solving iteratively for an approximation to the diagnostic calculation is to learn both a generative model and a "recognition " model that approximates the diagnostic distribution P (HIE ) . Thus we associate different parameters with the generative model and the recognition model and rely on the parameter estimation process to bring these parameterizations into register . This is the basic idea behind the "Helmholtz machine " (Dayan , et al ., 1995; Hinton , et al ., 1995) . The key advantage of the recognition -model approach is that the calculation of P (HIE ) is reduced to an elementary feedforward calculation that can be performed quickly . There are some disadvantages to the approach as well . In particular , the lack of an iterative algorithm makes the Helmholtz machine unable to deal naturally with missing data , and with phenomena such as "explaining away," in which the couplings between hidden variables change as a function of the conditioning variables . Moreover , although in some cases there is a clear natural parameterization for the recognition model that is induced
from the generative model (in particular for linear models such as factor analysis ) , in general it is difficult to insure that the models are matched appropriately .l0 Some of these problems might be addressed by combining the recognition -model approach with the iterative variational approach ; essentially treating the recognition -model as a "cache" for storing good initializations for the variational parameters . 7.1.2. Sampling methods In this section we make a few remarks on the relationships between vari ational methods and stochastic methods , in particular the Gibbs sampler . In the setting of graphical models , both classes of methods rely on extensive message-passing . In Gibbs sampling , the message-passing is par ticularly simple : each node learns the current instantiation of its Markov blanket . With enough samples the node can estimate the distribution over its Markov blanket and (roughly speaking ) determine its own statistics . The advantage of this scheme is that in the limit of very many samples, it is guaranteed to converge to the correct statistics . The disadvantage is that very many samples may be required . The message-passing in variational methods is quite different . Its pur pose is to couple the variational parameters of one node to those of its Markov blanket . The messages do not come in the form of samples, but rather in the form of approximate statistics (as summarized by the varia tional parameters ) . For example , in a network of binary nodes, while the Gibbs sampler is circulating messages of binary vectors that correspond to the instantiations of Markov blankets , the variational methods are cir culating real-valued numbers that correspond to the statistics of Markov blankets . This may be one reason why variational methods often converge faster than Gibbs sampling . Of course, the disadvantage of these schemes is that they do not necessarily converge to the correct statistics . On the other hand , they can provide bounds on marginal probabilities that are quite difficult to estimate by sampling . Indeed , sampling -based methods - while well -suited to estimating the statistics of individual hidden nodes- are ill equipped to compute marginal probabilities such as P (E ) = EH P (H , E ) . An interesting direction for future research is to consider combinations of sampling methods and variational methods . Some initial work in this direction has been done by Hinton , Sallans , and Ghahramani (this volume ), who discuss brief Gibbs sampling from the point of view of variational approximation . laThe particular recognition model utilized in the Helmholtz machine is a layered graph , which makes weak conditional independence assumptions and thus makes it possible , in principle , to capture fairly general dependencies .
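To make the Gibbs-sampling comparison of Section 7.1.2 concrete, here is a minimal sketch (an added illustration, not the chapter's code) of the Gibbs update for one binary node in a Boltzmann-style network: the "message" the node receives is simply the current instantiation of its Markov blanket.

```python
import numpy as np

def gibbs_update(i, S, theta, theta0, rng):
    """Resample binary node i given the current instantiation of its Markov
    blanket.  theta is symmetric with zero diagonal, so the local field
    depends only on the blanket; use rng = np.random.default_rng()."""
    field = theta[i] @ S + theta0[i]
    p_one = 1.0 / (1.0 + np.exp(-field))
    S[i] = 1 if rng.random() < p_one else 0
    return S
```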
7.1.3. Bayesian methods

Variational inference can also be applied to the general problem of Bayesian parameter estimation. In this setting we associate probability distributions with the parameters θ of a graphical model, treating the parameters on the same footing as the hidden variables and thereby treating Bayesian inference for the parameters as just another problem of probabilistic inference. This inference problem is generally intractable, and variational approximations can be useful.

One approach, known as "ensemble learning," was originally introduced as a way of fitting an "ensemble" of neural networks to data, where each member of the ensemble can be thought of as a particular setting of the parameters (Hinton & van Camp, 1993). More recently the approach has been applied to mixture of experts architectures (Waterhouse, et al., 1996) and to hidden Markov models (MacKay, 1997a). The idea is to fit a tractable approximating distribution Q(θ|E) to the posterior P(θ|E), in particular by assuming an appropriate factorization for Q and minimizing the KL divergence:

KL(Q‖P) = ∫ Q(θ|E) ln [ Q(θ|E) / P(θ|E) ] dθ.

(73)

Following the same line of argument as in Section 6, this minimization is equivalent to maximizing a lower bound on the marginal likelihood:

ln P(E) = ln ∫ P(E|θ) P(θ) dθ ≥ ∫ Q(θ|E) ln [ P(E|θ) P(θ) / Q(θ|E) ] dθ.

(74)

The marginal likelihood is the key quantity in Bayesian model selection and model averaging (and in "Type II maximum likelihood" parameter estimation), so that the variational approach provides a generally useful lower bound for Bayesian methods. (See also Heckerman, this volume, for a general discussion of Bayesian approaches to learning with graphical models.)

In related work, Jaakkola and Jordan (1997b) have developed variational methods for Bayesian logistic regression, in which a Gaussian prior is placed on the parameters and the variational transformation yields an analytically tractable approximation to the posterior.
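When the integral in the lower bound has no closed form, it can still be estimated by drawing samples from Q; the sketch below is an added illustration of this idea, and the four callables (a sampler for Q and the three log-densities) are assumptions of the example.

```python
def ensemble_lower_bound(sample_q, log_q, log_prior, log_lik, n_samples=1000):
    """Monte Carlo estimate of the bound in Eq. (74):
    E_Q[ ln P(E|theta) + ln P(theta) - ln Q(theta) ]."""
    total = 0.0
    for _ in range(n_samples):
        theta = sample_q()
        total += log_lik(theta) + log_prior(theta) - log_q(theta)
    return total / n_samples
```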
7.1.4. Perspective and prospectives

Perhaps the key issue that faces developers of variational methods is the issue of approximation accuracy. At the current state of development of variational methods for graphical models, we have little theoretical insight
into conditions under which variational methods can be expected to be accurate and conditions under which they might be expected to be inaccurate . Moreover , there is little understanding of how to match variational transformations to architectures . One can develop an intuition for when variational methods work by examining their properties in certain well -studied cases. For mean field meth ods, a good starting point is to understand the examples in the statistical mechanics literature where this approximation gives not only good , but indeed exact , results . These are densely connected graphs with uniformly weak (but non -negative ) couplings between neighboring nodes (Parisi , 1988) . The mean field equations for these networks have a unique solution that determines the statistics of individual nodes in the limit of very large graphs . In more general graphical models , of course, the conditions for a mean field approximation may not be so favorable . Typically , this can be diagnosed by the presence of multiple solutions to the mean field equations . Roughly speaking , one can interpret each solution as corresponding to a mode of the posterior distribution ; thus , multiple solutions indicate a mul timodal posterior distribution . The simplest mean field approximations , in particular those that utilize a completely factorized approximating distri bution , are poorly designed for such situations . However , they can succeed rather well in applications where the joint distribution P (H , E ) is multimodal , but the posterior distribution P (HIE ) is not . It is worth emphasizing this distinction between joint and posterior distributions . This is what allows simple variational methods - which make rather strong assumptions of conditional independence - to be used in the learning of nontrivial graphical models . A second key issue has to do with broadening the scope of variational methods . In this paper we have presented a restricted set of variational techniques , those based on convexity transformations . For these techniques to be applicable the appropriate convexity properties need to be identified . While it is relatively easy to characterize small classes of models where these properties lead to simple approximation algorithms , such as the case in which the local conditional probabilities are log-concave generalized lin ear models , it is not generally easy to develop variational algorithms for other kinds of graphical models . A broader characterization of variational approximations is needed and a more systematic algebra is needed to match the approximations to models . Other open problems include : (1) the problem of combining variational methods with sampling methods and with search based methods , (2) the problem of making more optimal choices of node ordering in the case of sequential methods , (3) the development of upper bounds within the block framework , (4) the combination of multiple variational approximations for
the same model, (5) the development of variational methods for architectures that combine discrete and continuous random variables, and (6) further combinations of variational methods with exact and sampling-based methods, for example on pruned versions of the underlying graph. Similar open problems exist on the theoretical side; in all of these cases a large part of the difficulty lies in the fact that the accuracy of a variational approximation is contingent on the actual values of the conditional probabilities in the model rather than on the properties of the underlying graph alone, and solid theoretical foundations that account for this fact remain to be developed.
Acknowledgments
We wish to thank Peter Dayan, Brendan Frey, David Heckerman, and Uffe Kjærulff for (as always) helpful comments on the manuscript.
References
Bathe, K. J. (1996). Finite Element Procedures. Englewood Cliffs, NJ: Prentice-Hall.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164-171.
Cover, T., & Thomas, J. (1991). Elements of Information Theory. New York: John Wiley.
Cowell, R. (in press). Introduction to inference for Bayesian networks. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141-153.
Dayan, P., Hinton, G. E., Neal, R., & Zemel, R. S. (1995). The Helmholtz Machine. Neural Computation, 7, 889-904.
Dean, T., & Kanazawa, K. (1989). A model for reasoning about causality and persistence. Computational Intelligence, 5, 142-150.
Dechter, R. (in press). Bucket elimination: A unifying framework for probabilistic inference. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1-38.
Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, B57, 45-97.
Frey, B., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm learn good density estimators? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Fung, R., & Favero, B. D. (1994). Backward simulation in Bayesian networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Galland, C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355-379.
Ghahramani, Z., & Hinton, G. E. (1996). Switching state-space models. Technical Report CRG-TR-96-3, Department of Computer Science, University of Toronto.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial Hidden Markov models. Machine Learning, 29, 245-273.
Gilks, W., Thomas, A., & Spiegelhalter, D. (1994). A language and a program for complex Bayesian modelling. The Statistician, 43, 169-178.
Heckerman , D. (in press). A tutorial on learningwith Bayesiannetworks. In M. I. Jordan (Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Henrion, M. (1991). Search -basedmethodsto bound diagnosticprobabilities in very large belief nets. Uncertaintyand Artificial Intelligence: Proceedings of the Seventh Conference . San Mateo, CA: MorganKaufmann. Hinton, G. E., & Sejnowski , T. (1986). Learningandrelearningin Boltzmannmachines . In D. E. Rumelhart & J. L. McClelland, (Eds.), Parallel distributedprocessing : Volume 1, Cambridge, MA: MIT Press. Hinton, G.E. & van Camp, D. (1993). Keepingneural networkssimple by minimizing the descriptionlength of the weights. In Proceedings of the 6th Annual Workshopon ComputationalLearning Theory, pp 5-13. New York, NY: ACM Press. Hinton, G. E., Dayan, P., Frey, B., and Neal, R. M. (1995). The wake-sleepalgorithm for unsupervisedneural networks. Science , 268:1158- 1161. Hinton, G. E., Sallans, B., & Ghahramani, Z. (in press). A hierarchicalcommunity of experts. In M. I . Jordan (Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Horvitz, E. J., Suermondt, H. J., & Cooper, G.F. (1989). Boundedconditioning: Flexible inferencefor decisionsunderscarceresources . Conferenceon Uncertaintyin Artificial Intelligence: Proceedingsof the Fifth Conference . Mountain View, CA: Association for UAI . Jaakkola, T. S., & Jordan, M. I . (1996). Computingupperandlowerboundson likelihoods in intractable networks. Uncertaintyand Artificial Intelligence: Proceedingsof the Twelth Conference . SanMateo, CA: MorganKaufmann. Jaakkola, T . S. (1997). Variational methodsfor inferenceand estimation in graphical models . Unpublisheddoctoral dissertation, Massachusetts Institute of Technology . Jaakkola, T. S., & Jordan, M. I . (1997a ) . Recursivealgorithmsfor approximatingprobabilities in graphical models. In M. C. Mozer, M. I. Jordan, & T . Petsche(Eds.), Advancesin Neural Information ProcessingSystems9. Cambridge, MA: MIT Press. Jaakkola, T. S., & Jordan, M. I . (1997b). Bayesianlogistic regression : a variational approach. In D. Madigan & P. Smyth (Eds.), Proceedingsof the 1997Conferenceon Artificial Intelligenceand Statistics, Ft . Lauderdale , FL. Jaakkola, T . S., & Jordan. M. I . (1997c ) . Variationalmethodsandthe QMR-DT database . Submitted to: Journal of Artificial IntelligenceResearch . Jaakkola, T . S., & Jordan. M. I . (in press). Improvingthe meanfield approximationvia the useof mixture distributions. In M. I . Jordan(Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Jensen, C. S., Kong, A., & Kjrerulff, U. (1995). Blocking-Gibbs samplingin very large probabilistic expert systems. International Journal of Human-ComputerStudies, 42, 647- 666. Jensen,F. V., & Jensen,F. (1994). Optimal junction trees. Uncertaintyand Artificial Intelligence: Proceedings of the Tenth Conference . SanMateo, CA: MorganKaufmann. Jensen, F. V. (1996). An Introductionto BayesianNetworks. London: UCL Press. Jordan, M. I . (1994). A statistical approachto decisiontree modeling. In M. Warmuth (Ed.), Proceedings of the SeventhAnnual ACM Conferenceon ComputationalLearning Theory. New York: ACM Press. Jordan, M. I ., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decisiontrees. In M. C. Mozer, M. I . Jordan, & T . Petsche(Eds.), Advancesin Neural Information ProcessingSystems9. Cambridge, MA: MIT Press. Kanazawa, K ., Koller, D., & Russell, S. (1995). Stocha .gtic simulation algorithms for dynamic probabilistic networks. 
Uncertaintyand Artificial Intelligence: Proceedings of the EleventhConference . SanMateo, CA: MorganKaufmann. Kj ~rulff, U. (1990). Triangulationof graphs- algorithmsgiving small total state space. ResearchReport R-90-09, Departmentof Mathematicsand ComputerScience , Aalborg University, Denmark. Kj ~rulff, U. (1994). Reduction of computational complexity in Bayesiannetworks
through removal of weak dependences. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
MacKay, D. J. C. (1997a). Ensemble learning for hidden Markov models. Unpublished manuscript, Department of Physics, University of Cambridge.
MacKay, D. J. C. (1997b). Comparison of approximate methods for handling hyperparameters. Submitted to Neural Computation.
MacKay, D. J. C. (in press). Introduction to Monte Carlo methods. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
McEliece, R. J., MacKay, D. J. C., & Cheng, J.-F. (in press). Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communication.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
Neal, R., & Hinton, G. E. (in press). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Parisi, G. (1988). Statistical Field Theory. Redwood City, CA: Addison-Wesley.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995-1019.
Rockafellar, R. (1972). Convex Analysis. Princeton, NJ: Princeton University Press.
Rustagi, J. (1976). Variational Methods in Statistics. New York: Academic Press.
Sakurai, J. (1985). Modern Quantum Mechanics. Redwood City, CA: Addison-Wesley.
Saul, L. K., & Jordan, M. I. (1994). Learning in Boltzmann trees. Neural Computation, 6, 1173-1183.
Saul, L. K., Jaakkola, T. S., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Saul, L. K., & Jordan, M. I. (in press). A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Seung, S. (1995). Annealed theories of learning. In Neural Networks: The Statistical Mechanics Perspective. Singapore: World Scientific.
Shachter, R. D., Andersen, S. K., & Szolovits, P. (1994). Global conditioning for probabilistic inference in belief networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Shenoy, P. P. (1992). Valuation-based systems for Bayesian decision analysis. Operations Research, 40, 463-484.
Shwe, M. A., Middleton, B., Heckerman, D. E., Henrion, M., Horvitz, E. J., Lehmann, H. P., & Cooper, G. F. (1991). Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Meth. Inform. Med., 30, 241-255.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227-269.
Waterhouse, S., MacKay, D. J. C., & Robinson, T. (1996). Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate
temporally distorted strings. In D. S. Touretzky, J. Elman, T. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

9. Appendix

In this section, we calculate the conjugate functions for the logarithm function and the log logistic function. For f(x) = ln x, we have:

f*(λ) = min_x { λx - ln x }.
(75)
Taking the derivative with respect to x and setting to zero yields x = λ^{-1}. Substituting back into Eq. (75) yields:
f*(λ) = ln λ + 1,
(76)
which justifies the representation of the logarithm given in Eq. (14). For the log logistic function g(x) = -ln(1 + e^{-x}), we have:
g*(λ) = min_x { λx + ln(1 + e^{-x}) }.
(77)
Taking the derivative with respect to x and setting to zero yields:

λ = e^{-x} / ( 1 + e^{-x} ),

(78)
from which we obtain:

x = ln( (1 - λ) / λ ),

(79)

and
ln(1 + e^{-x}) = -ln(1 - λ).
(80)
Plugging these expressions back into Eq. (77) yields:

g*(λ) = -λ ln λ - (1 - λ) ln(1 - λ),
(81)
which is the binary entropy function H(λ). This justifies the representation of the logistic function given in Eq. (19).
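The conjugate pair can also be checked numerically; the short script below (an added illustration, not part of the chapter) minimizes λx + ln(1 + e^{-x}) over a grid of x values and compares the result with the binary entropy H(λ) of Eq. (81).

```python
import numpy as np

def g_star_numeric(lam, xs=np.linspace(-30.0, 30.0, 200001)):
    """Numerically evaluate min_x { lam * x + ln(1 + exp(-x)) } from Eq. (77)."""
    return np.min(lam * xs + np.logaddexp(0.0, -xs))

def binary_entropy(lam):
    """H(lam) = -lam ln lam - (1 - lam) ln(1 - lam), the conjugate claimed in Eq. (81)."""
    return -lam * np.log(lam) - (1.0 - lam) * np.log(1.0 - lam)

for lam in (0.1, 0.3, 0.5, 0.9):
    print(lam, g_star_numeric(lam), binary_entropy(lam))
```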
IMPROVING THE MEAN FIELD APPROXIMATION VIA THE USE OF MIXTURE DISTRIBUTIONS
TOMMI
S . JAAKKOLA
University Santa
of Cruz
,
California CA
AND
MICHAEL I . JORDAN MassachusettsInstitute of Technology Cambridge, MA
Abstract . Mean field methods provide computationally efficient approxi mations to posterior probability distributions for graphical models . Simple mean field methods make a completely factorized approximation to the posterior , which is unlikely to be accurate when the posterior is multi modal . Indeed , if the posterior
is multi - modal , only one of the modes can
be captured . To improve the mean field approximation in such cases, we employ mixture models as posterior approximations , where each mixture component
is a factorized
distribution
. We describe efficient
methods
for
optimizing the parameters in these models .
1.
Introduction
Graphical models provide a convenient formalism in which to express and manipulate conditional independence statements . Inference algorithms for graphical models exploit these independence statements , using them to compute conditional probabilities while avoiding brute force marginaliza tion over the joint probability table . Many inference algorithms , in particu lar the algorithms that construct a junction tree , make explicit their usage of conditional independence by constructing a data structure that captures the essential Markov properties underlying the graph . That is, the algorithm groups interacting variables into clusters , such that the hypergraph of clusters has Markov properties that allow simple local algorithms to be 163
164
TOMMI S. JAAKKOLA AND MICHAEL I . JORDAN
employed for inference. In the best case, in which the original graph is sparse and without long cycles, the clusters are small and inference is efficient. In the worst case, such as the caseof a densegraph, the clusters are large and inference is inefficient (complexity scalesexponentially in the size of the largest cluster). Architectures for which the complexity is prohibitive include the QMR database (Shwe, et al., 1991), the layered sigmoid belief network (Neal, 1992), and the factorial hidden Markov model (Ghahramani & Jordan, 1996). Mean field theory (Parisi, 1988) provides an alternative perspective on the inference problem. The intuition behind mean field methods is that in dense graphs each node is subject to influences from many other nodes; thus, to the extent that each influence is weak and the total influence is roughly additive (on an appropriate scale), each node should roughly be characterized by its mean value. In particular this characterization is valid in situations in which the law of large numbers can be applied. The mean value is unknown, but it is related to the mean values of the other nodes. Mean field theory consistsof finding consistencyrelations betweenthe mean values for each of the nodes and solving the resulting equations (generally by iteration ). For graphical models these equations generally have simple graphical interpretations ; for example, Peterson and Anderson (1987) and Saul, Jaakkola, and Jordan (1996) found, in the casesof Markov networks and Bayesian networks, respectively, that the mean value of a given node is obtained additively (on a logit scale) from the mean values of the nodes in the Markov blanket of the node. Exact methods and mean field methods might be said to be complementary in the sensethat exact methods are best suited for sparsegraphs and mean field methods are best suited for densegraphs. Both classesof methods have significant limitations , however, and the gap between their respective domains of application is large. In particular , the naive mean field methods that we have referred to above are based on the approximation that each node fluctuates independently about its mean value. If there are a strong interactions in the network, i .e. if higher-order moments are important , then mean field methods will generally fail . One way in which such higher-order interactions may manifest themselvesis in the presence of multiple modes in the distribution ; naive mean field theory, by assuming independent fluctuations , effectively assumesa unimodal distribution and will generally fail if there are multiple modes. One approach to narrowing the gap between exact methods and mean field methods is the "structured mean field" methodology proposedby Saul and Jordan (1996). This approach involves deleting links in the graph and identifying a core graphical substructure that can be treated efficiently via exact methods (e.g., a forest of trees). Probabilistic dependenciesnot
IMPROVING THE MEAN FIELD APPROXIMATION
165
accounted for by the probability distribution on this core substructure are treated via a naive mean field approximation. For architectures with an obvious chain-like or tree-like substructure, such as the factorial hidden Markov model, this method is natural and successful. For architectures without such a readily identified substructure, however, such as the QMR databMe and the layered sigmoid belief network, it is not clear how to develop a useful structured mean field approximation. The current paper extends the basic mean field methodology in a different direction . Rather than basing the approximation on the assumption of a unimodal approximating distribution ,1 we build in multimodali ty by allowing the approximating distribution to take the form of a mixture distribution . The components of the mixture are assumedto be simple factorized distributions , for reasonsof computational efficiency. Thus, within a mode we assumeindependent fluctuations , and multiple modes are used to capture higher-order interactions. In the following sections we describe our mixture -based approach that extends the basic mean field method. Section 2 describesthe basicsof mean field approximation , providing enough detail so as to make the paper selfcontained. Section 3 then describeshow to extend the mean field approximation by allowing mixture distributions in the posterior approximation . The following sections develop the machinery whereby this approximation can be carried out in practice. 2.
Mean
field
approximation
Assume we have a probability model P (S, S*), where S* is the set of instantiated or observed variables (the "evidence" ) and the variables S are hidden or unobserved. We wish to find a tractable approximation to the posterior probability P (SIS*). In its simplest form mean field theory assumesthat nodes fluctuate independently around their mean values. We make this assumption explicit by expressingthe approximation in terms of a factorized distribution Qmj (SIS*): Qmf (SIS*) =
II7.Qi(SiIS *),
(1)
where the parameters Qi (SiIS * ) depend on each other and on the evidence S* (i .e., we must refit the mean field approximation for each configuration of the evidence ) . lBy "unimodal distributions" we meandistributionsthat are log-concave . Meanfield distributions that are products of exponentialfamily distributions are unimodal in this sense .
166
TOMMI S. JAAKKOLAAND MICHAELI. JORDAN
The mean field approximationcan be developedfrom severalpoints of view, but a particularly usefulperspectivefor applicationsto graphical modelsis the "variational" point of view. Within this approachwe use the KL divergenceas a measureof the goodnessof the approximationand choosevaluesof the parametersQi(SiIS*) that minimizethe KL divergence :
KL( Qmj (SIS *) IIP(Sls*) ) = ~ Qmf (SIS *)log~
.
(2)
Why do we useKL(QmfIIP) rather than KL (PIIQmf)? A pragmaticanswer is that the use of KL (QmfIIP) implies that averagesare taken using the tractable Qmf distribution rather than the intractableP distribution (cf. Eq. (2)); only in the former casedo we haveany hopeof finding the best approximationin practice. A moresatisfyinganswerarisesfrom considering the likelihood of the evidence , P(S*). Clearly calculationof the likelihood permits calculationof the conditionalprobability P(SIS*). UsingJensen 's inequality we can lower bound the likelihood (cf. for exampleSaul, et al., 1996):
logP(S*)
-
-
>
log LsP(S,s*) P(S,S*) log L Q(SIS *)(J(SIS ;)S P(S,S*) LSQ(SIS *)log (J(SIS ;)'
for arbitrary Q(SIS*), in particular for the meanfield distribution Qmj (SIS*). I t is easily verified that the difference between the left hand side and the right hand side of this inequality is the KL divergence KL (QIIP ); thus, minimizing this divergence is equivalent to maximizing a lower bound on the likelihood . For graphical models, minimization of KL (QmfIIP ) with respect to Qi (SiIS*) leads directly to an equation expressing Qi (SiIS*) in terms of the Qj (Sj IS*) associatedwith the nodes in the Markov blanket of node i (Peterson & Anderson, 1987; Saul, et al., 1996). These equations are coupled nonlinear equations and are generally solved iteratively . 3. Mixture
approximations
In this section we describe how to extend the mean field framework by utilizing mixture distributions rather than factorized distributions as the approximating distributions . For notational simplicity , we will drop the
167
IMPROVING THE MEAN FIELD APPROXIMATION
dependenceof Q(SIS*) on S* in this section and in the remainder of the paper, but it must be borne in mind that the approximation is optimized separately for each configuration of the evidence S*. We also utilize the notation F (Q) to denote the lower bound on the right hand side of Eq. (3):
:F(Q)=:LsQ(S)log :~i~ .~):l Q(~ S When
Q
( S
)
takes
the
approximation
a
corresponding
I ~_
the
as
I
OUf distribution
factorized
is
as
to
Qmf
mean
bound
proposal
form
" factorized F
utilize
( S
field
( Qmj
"
)
in
(3)
Eq
.
( 1 ) ,
we
approximation
will
refer
and
will
to
the
denote
) .
a
mixture
=
1: : :
distribution
as
the
approximating
:
Qmix
( S
)
amQmj
( Slm
) ,
( 4 )
m
where
each
butions
.
and of
In
F
( Qmix
F
( Qmf
sider
the
The
sum the
of
mixing
to
one
KL
divergence
the
remainder
)
-
1m
)
component
the
distributions
proportions
,
are
Qmf
am
additional
,
which
( 81m are
parameters
to
)
are
factorized
constrained be
fit
distri
to via
the
be
-
positive
minimization
. of
mixture
corresponding
this
section
bound to
we
the
and mixture
focus
the
on
the
relationship
factorized components
between
mean .
To
field this
bounds end
,
con
-
:
P(8,8 *) :F(Qm1 ..x)-- ~ L...SQmix (8)logQ ( mix 8) }:,Sam [Qmj (Slm )log ~ .~~.!-~ m Qmix (8:2)] }:,Sam [Qmf (8Im )log ~( S 8~)+Qmf (Slm )log 2Qmix ~l(SI ~))-] m Qmj (,Slm (S }:mam :F(Qmflm )+m }:,SamQmf (Slm )log 2Qmix ~f(SI ~)1 (S (5) }:mG :m :F(Qmflm )+I(m;S) -
-
where I (m ; S) is the mutual information between the mixture index m and the variables S in the mixture distribution2 . The first term in Eq. (5) 2Recall that the mutual
l (x ;y) = ~ LIz
,y
information . between any two random variables x and y is
P (x , y) logP (ylx )fP (y ).
168
TOMMI
is a convex
S. JAAKKOLA
combination
AND
of factorized
MICHAEL
mean
field
I . JORDAN
bounds
and
therefore
it
yields no gain over factorized mean field as far ~ the KL divergence is concerned
. It
is the
second
term , i .e ., the
mutual
information
, that
char -
acterizes the gain of using the mixture approximation .3 As a non-negative quantity , I (m ; S) incre ~ es the likelihood bound and therefore improves the approximation . We note , however, that since I (m ; S) :::; log M , where M is the number of mixture components , the KL divergence between the mixture approximation and the true posterior can decrease at most logarithmically in the number
of mixture
4 . Computing
components
the mutual
.
information
To be able to use the mixture bound of Eq. (5) we need to be able to compute the mutual information term . The difficulty with the expression M it stands is the need to average the log of the mixture distribution over all configurations of the hidden variables . Given that there are an exponential number of such configurations , this computation is in general intractable . We will make use of a refined form of Jensen's inequality to make headway. Let
us first
rewrite
the mutual
"""
information
as
Qmf (8lm )
I(m;S) = L..., O :'mQmj (Slm )logQ . (8) m
,
S
(6)
m1 .X
=EO :'mQmj (8Im )[-lOg ~ ] (7) = ~ :'mR {SIm """O:'mQmj(8Im)[- logOO :'m{8fm {Slm R))Qmj Qmix (8)) ] (8) = L O :'mQmj (Slm) logR(8Im) - LGmlogam m ,S
"""
m
( I )[ R(Slm ) Qmix (8)] (9)
+~ QmQmj 8m - log_._~~~---Qmj{8Im )
where we have introduced additional smoothing distributions R(Slm ) whose role is to improve
the convexity
bounds that we describe below .4 With
some
foresight we have also introduced am into the expression ; as will become 3Of course , this is only the potential gain to be realized from the mixture mation . We cannot rule out the possibility that it will be harder to search for approximating distribution in the mixture approximation than in the factorized particular the mixture approximation may introduce additional local minima KL divergence .
approxi the best case. In into the
4In fact , if we use Jensen's inequality directly in Eq. (6), we obtain a vacuous bound of zero , as is easily verified .
IMPROVING THE MEAN FIELD APPROXIMATION
169
apparent in the following section the forced appearance of am turns out to be convenient in updating the mixture coefficients . To avoid the need to average over the logarithm of the mixture distri bution , we make use of the following convexity bound :
- log(x) ~ - .,\ x + log("\) + 1,
(10)
replacing the - log(.) terms in Eq. (9) with linear functions, separately for eachm . These substitutions imply a lower bound on the mutual information given by:
I (m; S) ~ } : amQmf (Slm)logR(Slm) - } : amlogam
-
m,S m - } : Am}: R(Slm)Qmix (S) + } : amlogAm + 1 (11) m S m IA(m; B).
(12)
The tightness of the convexity bound is controlled by the smoothing functions R (Stm) , which can be viewed as "flattening" the logarithm function such that Jensen's inequality is tightened. In particular , if we set R(Slm ) cx Qm/ (Slm )/ Qmix(S) and maximize over the new variational parameters Amwe recoverthe mutual information exactly. Sucha choicewould not reduce the complexity of our calculations, however. To obtain an efficient algorithm we will instead assumethat the distributions R have simple factorized forms, and optimize the ensuing lower bound with respect to the parameters in these distributions . It will turn out that this assumption will permit all of the terms in our bound to be evaluated efficiently. We note finally that this bound is never vacuous. To seewhy, simply set R(Slm ) equal to a constant for all { S, m} in which casemax,\ I ,\(m; S) = 0; optimization of R can only improve the bound. 5 . Finding
the mixture
parameters
In order to find a reasonable mixture approximation we need to be able to optimize the bound in Eq . (5) with respect to the mixture model . Clearly a necessary condition for the feasibility of this optimization process is that the component factorized mean field distributions can be fit to the model tractably . That is, we must assume that the following mean field equations :
88Qj :F(Qm /))=constant (Sj
(13)
170
TOMMI S. JAAKKOLA AND MICHAEL I. JORDAN
can be solved efficiently (not necessarily in closed form) for any of the marginals Qj (Sj ) in the factorized mean field distribution Qmf .5 Examples of such mean field equations are provided by Petersonand Anderson (1987) and Saul, et all (1996). Let us first consider the updates for the factorized mean field components of the mixture model. We needto take derivatives of the lower bound: F (Qmix) ~ L amF (Qmflm ) + IA(m; S) == FA (Qmix) m
(14)
with respect to the parameters of the component factorized distributions . The important property of I A(m; S) is that it is linear in any of the marginals Qj (Sjlk ), where k indicates the mixture component (seeAppendix A). As a result, we have:
8I).(m;S) = 8Qj(Sjlk) where
the
ally
constant
is
dependent
field
on
distribution
)
8
:
F ; \
can
be
the
us
I
; \
can
now
the
( m
;
)
.
write
ak8
for
the
turn
to
same
F ;\
+
8I
; \
( m
8Qj
specific
;
the
S
( Sjlk
is
gener
)
equations
=
-
mean
:
0
(
16
)
)
marginal
by
( and
Qj
( Sjlk
iteratively
)
.
We
can
optimizing
thus
each
of
mixture
-
Ek
coefficients
ak
log
therefore
CXk
ak
,
true
these
.
We
note
coefficients
for
F
; \
first
that
appear
( Qmix
)
( see
Eq
,
linearly
.
(
14
)
)
,
apart
in
and
we
:
:
F ; \
( Qmix
)
=
L
cxk
( -
E
( k
)
k
where
that
)
)
component
.
the
is
)-
( Sjlk
( Sjlk
particular
assumption
( Qmflk
any
Qj
the
components
model
term
The
: 8Qj
mixture
entropy
B
=
our
(15)
marginal
in
from
)
fitting in
Let
)
the
marginals
follows
efficiently
best
marginals
from
It
( Sjlk
solved
the
.
of
other
( Qmix
8Qj
find
independent
the
constant
-
E
( k
)
-
is
E
the
( k
collection
)
= =
of
F
( Qmflk
terms
)
linear
+
)
-
Lk aklogak +1 in
CXk
(17 )
:
L Qmj(Slk) logR(Slk) s
+
LAmL R(Slm}Qmf (Slk} + logAk m S
(18)
5In the ordinary mean field equations the constant is zero, but for many probability models of interest this slight difference posesno additional difficulties .
IMPROVING THE MEAN FIELD APPROXIMATION
171
Now Eq. (17) has the form of a free energy, and the mixture coefficientsthat maximize the lower bound : FA(Qmix) therefore come from the Boltzmann distri bution : e- E(k) ak=}=:kl e-E(k'). ( This with
fact
respect Finally
Am and ter
is easily
verified
updates
parameters
Lagrange
multipliers
to optimize
Eq . ( 17 )
to am ) .
, we must the
using
(19)
optimize
parameters
the
bound
of the smoothing
in Appendix Am , we readily
A . Differentiating obtain
with
respect
distribution the
bound
to the
parameters
. We discuss with
respect
the lat to the
:
;\m= EkakEsRam (Slm )Qmf (Slk )'
( 20 )
where a simplified form for the expressionin the denominator is presented in the Appendix . 6. Discussion We have presenteda general methodology for using mixture distributions in approximate probabilistic calculations. We have shown that the bound on the likelihood resulting from mixture -basedapproximations is composedof two terms, one a convexcombination of simple factorized mean field bounds and the other the mutual information between the mixture labels and the hidden variables S. It is the latter term which representsthe improvement available from the use of mixture models. The basic mixture approximation utilizes parameters am (the mixing proportions ), Qi (Silm ) (the parametersof the factorized component distri butions). These parameters are fit via Eq. (19) and Eq. (16), respectively. We also introduced variational parameters Am and Ri (Si 1m) to provide tight approximations to the mutual information . Updates for these parameters were presented in Eq. (20) and Eq. (29). Bishop, et al. (1998) have presentedempirical results using the mixture approximation . In experiments with random sigmoid belief networks, they showed that the mixture approximation provides a better approximation than simple factorized mean field bounds, and moreover that the approximation improves monotonically with the number of components in the mixture . An interesting direction for further research would involve the study of hybrids that combine the use of the mixture approximation with exact methods. Much as in the caseof the structured mean field approximation of
172
TOMMI S. JAAKKOLAAND MICHAELI. JORDAN
Sauland Jordan(1996), it shouldbe possibleto utilize approximationsthat identify core substructures that are treated exactly , while the interactions between
these substructures
are treated
via the mixture
approach . This
III
would provide a flexible model for dealing with a variety of forms of high order interactions , while remaining tractable computationally .
II
References
Bishop, C., Lawrence, N ., Jaakkola, T . S., & Jordan, M . I . Approximating posterior in belief networks using mixtures . In M . I . Jordan , M . J . Kearns , & S. A . III
distributions
Solla, Advances in Neural Information ProcessingSystems 10, MIT Press, Cambridge MA (1998) . Ghahramani , Z . & Jordan , M . I . Factorial
Hidden Markov models . In D . S. Touretzky ,
II
M . C. Mozer, & M . E. Hasselmo (Eds.) , Advances in Neural Information Processing Systems 8, MIT Press, Cambridge MA (1996) . Neal , R . Connectionist
learning
of belief networks , Artificial
Intelligence , 56: 71- 113
(1992) . Parisi , G. Statistical Field Theory. Addison-Wesley: Redwood City (1988) . Peterson , C . & Anderson , J . R . A mean field theory learning algorithm
for neural net -
works. Complex Systems 1:995- 1019 (1987) . Saul , L . K ., Jaakkola , T . S., & Jordan , M . I . Mean field theory for sigmoid belief networks . Journal of Artificial Intelligence Research, 4:61- 76 ( 1996) . Saul , L . K . & Jordan , M . I . Exploiting tractable substructures in intractable networks .
In D . S. Touretzky , M . C. Mozer, & M . E. Hasselmo (Eds.) , Advances in Neural Information Processing Systems 8, MIT Press, Cambridge MA (1996). Shwe
, M . A . , Middleton
, B . , Heckerman
P., & Cooper , G . F . Probabilistic
, D . E . , Henrion
, M . , Horvitz
diagnosis using a reformulation
, E . J . , Lehmann
, H .
of the INTERNIST
-
l / QMR knowledge base. Meth. Inform . Med. 30: 241-255 (1991).
A . Optimization of the smoothing distribution Let usfirst adoptthe followingnotation: 7rR ,Q(m, m') :=
Ls R(Slm )Qmf (Slm ')
(21)
II L Ri(Silm )Qi(Silm ') i Si H(Q IIRim)
H(m)
LSQmj (81m )logR(8Im ) L L Qi(8ilm )logRi(Silm ) i Si - Lam m logam
(22) (23)
(24) (25)
IMPROVINGTHE MEANFIELDAPPROXIMATION
173
wherewe haveevaluatedthe sumsoverS by makinguseof the assumption that R(Slm) factorizes.We denotethe marginalsof thesefactorizedforms by Ri(Siim). Eq. (11) now readsas: IA(m; S) = L amH(Q II Rim ) + H (m) m - ~ Am[~ Q(m') 7rR ,Q(m, m/)] + ~ amlogAm+ ~26)
Tooptimize thebound IA(m;S) with respect toR,weneed thefollowing derivatives : ~a-Rj (Q(IISjlk Ri~_ )) (27 ) ) l = 8k ,mQj Rj((Sjlk Sjlk ,Q n ~."Si ) --{}{}7rR R SJI .,Ikm )') = 8k,mQj (Sj Im')1 .#J. L .;Ri(SiIm)Qi(SiIm') .(28 J.((m Denoting the optimal
the product
in Eq . ( 28 ) by 7r~ ,Q ( m , m ' ) , we can now characterize
R via the consistency
equations
I ,X((m ; 81} = Q k 2j a Rj Sjlk Rj ( Sjlk }) - Ak [ ~ Note
that
the
second
term
does
:
Q (m , ) ~ R. ,Q ( k , m , ) Qj ( Sjlm ' ) ] = 0 ( 29 ) not
depend
on the
smoothing
Rj ( Sjlk ) and therefore these consistency equations are easily any specific marginal . The best fitting R is found by iteratively ing each of its not necessarily
marginal solved for optimiz -
marginals . We note finally that the iterative solutions normalized . This is of no consequence , however , since
information bound of Eq . ( 11 ) - once maximized invariant to the normalization of R .
over the A parameters
are the -
is
INTRODUCTION
TO
MONTE
CARLO
METHODS
D .J .C . MACKAY
Department of Physics , Cambridge University . Cavendish Laboratory , Madingley Road, Cambridge , CB3 OHE. United Kingdom .
ABSTRACT
This chapter describes a sequence of Monte Carlo methods : impor tance sampling , rejection sampling , the Metropolis method , and Gibbs sampling . For each method , we discuss whether the method is expected to be useful for high- dimensional problems such as arise in inference with graphical models . After the methods have been described , the terminology of Markov chain Monte Carlo methods is presented . The chapter
concludes
with
a discussion
of advanced methods , includ -
ing methods for reducing random walk behaviour . For details of Monte Carlo methods , theorems and proofs and a full
list of references, the reader is directed to Neal (1993) , Gilks, Richardson and Spiegelhalter (1996) , and Tanner (1996). 1. The problems The aims of Monte
to be solved Carlo methods
are to solve one or both of the following
problems .
Problem 1: to generatesamples{x(r)}~=l from a givenprobability distribution p (x ) .l Problem 2 : to estimate expectations of functions under this distribution , for example
==/>(x))=JdNx P(x)>(x).
(1)
1Please note that I will use the word "sample" in the following sense: a sample from a distribution P (x ) is a single realization x whose probability distribution is P (x ) . This contrasts with the alternative usage in statistics , where "sample" refers to a collection of realizations { x } . 175
176 The
D.J.C. MACKAY probability
might
distribution
be
a
arising
in
model
from
data
' s
that
P
distribution
modelling
-
parameters
x
is
given
an
N
sometimes
-
will
solved
{
x
,
(
r
}
=
l
can
to
which
example
,
call
first
target
.
density
conditional
,
distribution
posterior
real
also
the
a
the
data
with
the
the
will
or
observed
the
solve
give
we
physics
spaces
on
we
~
,
vector
discrete
then
)
)
some
concentrate
it
ples
x
for
dimensional
consider
We
(
statistical
probability
We
will
of
generally
components
a
assume
Xn
,
but
we
will
.
problem
(
second
sampling
)
problem
by
,
because
using
if
the
we
have
random
sam
-
estimator
< i >
.
=
i
L
< I > (
x
(
r
)
)
.
(
2
)
r
Clearly
.if. .
tion
of
< i >
the
< 1 >
will
vectors
is
< 1 >
.
{
Also
decrease
,
as
x
(
as
~
,
is
one
of
The
the
accuracy
of
of
x
to
We
,
it
may
be
find
given
1
.
,
We
will
can
can
< I >
If
The
we
can
(
at
x
)
is
goes
as
first
is
to
that
a
that
we
)
j
>
(
x
)
x
)
~
.
-
< 1
2
dozen
expecta
-
variance
of
.
(
Carlo
methods
(
t
the
the
< / > ,
Monte
So
then
,
estimate
he
regardless
(
sam
of
3
)
.
equation
space
independent
high
pled
the
2
.
)
)
is
To
be
dimensionality
samples
{
P
x
(
(
x
obtain
typically
)
*
,
(
x
)
r
)
}
suffice
(
x
)
(
why
-
a
wish
to
multiplicative
draw
samples
constant
,
;
that
P
" is
(
x
,
)
,
we
that
)
can
=
=
P
*
we
samples
do
we
a
x
dif
from
?
which
such
other
samples
.
HARD
within
cause
independent
easy
from
to
can
Obtaining
not
density
least
*
dimensionality
.
often
P
P
x
of
as
,
function
difficult
(
(
increases
of
of
methods
the
evaluate
general
variance
Carlo
P
in
the
P
FROM
,
a
is
P
R
.
that
evaluate
2
dNx
few
SAMPLING
evaluated
from
samples
dimensionality
Carlo
assume
be
generated
Monte
he
as
P
IS
are
of
J
however
Monte
WHY
=
the
t
that
distribution
.
1
satisfactorily
later
for
=
of
< I >
will
=
0 '
2
variance
estimate
ficulties
1
the
~
number
properties
of
,
}
important
independent
precise
)
where
0 '
This
r
the
not
from
not
know
(
x
)
/
Z
.
(
easily
P
solve
(
the
x
)
?
normalizing
z ==JdNXp*(x).
problem
There
1
are
two
?
Why
difficulties
is
4
)
it
.
constant
(5)
178
D.J.C. MACKAY
N ? Let us concentrate on the initial
cost of evaluating Z . To compute
Z (equation (7)) we have to visit every point in the space. In figure Ib there are 50 uniformly spaced points in one dimension . If our system had N dimensions , N = 1000 say, then the corresponding number of points would be 501000, an unimaginable number of evaluations of P * . Even if each component xn only took two discrete values, the number of evaluations of P * would be 21000, a number that is still horribly huge, equal to the fourth power of the number of particles
in the universe .
One system with 21000states is a collection of 1000 spins , for example ,
a 30 X 30 fragment of an Ising model (or 'Boltzmann machine' or 'Markov field') (Yeomans 1992) whose probability distribution is proportional to P* (x ) -= exp [- ,sE(x )]
(9)
where Xn E { :i:1} and
E(x)==- [~LJmnXmXn +LHnxn ]. (10 ) m
, n
n
The energy function E (x ) is readily evaluated for any x . But if we wish to evaluate this function at all states x , the computer 21000 function evaluations .
time required
would be
The Ising model is a simple model which has been around for a long time ,
but the task of generating samplesfrom the distribution P (x ) = P* (x )jZ is still an active research area as evidenced by the work of Propp and Wilson
(1996). 1 .2 .
UNIFORM
SAMPLING
Having agreed that we cannot visit every location x in the state space, we
might consider trying to solve the second problem (estimating the expec-
tation of a function <J>(x)) by drawingrandomsamples{x(r)}~=l uniformly from the state spaceand evaluating P* (x ) at those points. Then we could introduce ZR , defined by
R ZR =rL=lp*(x(r)), and estimate < I>==JdN x< />(x)P(x)by p * ( ( r ) < i>==rLR < / > ( x ( r ) ) --;=l R.
(11)
(12)
MONTECARLOMETHODS Is
anything
and
wrong
with
P * ( x ) . Let
and
us
concentrate
is
often
this
on
that
the
set
Shannon
T
,
in
whose
- Gibbs
? Well ( x )
nature
concentrated
typical
strategy
assume
of
a
given
the
depends
benign
region
is
of
, it
a
P * (x ) .
small
volume
entropy
is
by
179
on
the
, smoothly
A
high
- dimensional
of
the
state
ITI
~
probability
2H
functions
<j >( x )
varying
distribution
space
(X ) ,
function
known
where
distribution
P
H
as ( X
)
its
is
the
(x ) ,
1
H
If
almost
is
a
all
the
benign
by
sampling the
the
is
the
2N
of
of
a
set
,
what
Ising H
is
H
model
max
=
N
sampling
?
the
once
is
. So
set
of
bits
so be
a
not
is
melts
temperature
set
of
is
of
of
for particles
the
study
, if
unlikely
1 .3 .
1000
in
the Ising
the to
to
Under
these for
phase Ising
model of
, <1> .
a is
at
disordered
But
N
required
high are
which
phase
roughly
samples
to
uniform
interesting
temperature to
an
tends
estimating more
critical
of
entropy conditions
Considerably
the
number
the
/ 2
.
bits
.
simply
)
to
the At
this
For
this
hit
the
order
is
about
universe
.
models
useful
~
1015
2N
- N / 2 =
, This
Thus
of
distribution be
chance
not
( 15
the
sampling
size
( x ) is
/ 2
roughly
uniform
modest P
is
2N
.
And
actually
square is
in
utterly
most
uniform
of
high
the
)
number
useless
for
- dimensional
, uniform
sampling
.
OVERVIEW
Having bution
==
of
problems is
N
a
required
distribution
and
.
as
Rmin
which
Let
space
has
samples
probability
technique
an
?
state
( 14
the
1 .
ordered
the
of
we hit
required the
sample
number
if to
- H .
,
interest
an
entropy
once
2N
order
such
distribution
typical
of
each
of
likely
are
size
uniform
of are
samples total
. SO
. So
estimate we
<jJ ( x )
principally
set
that
distribution
great
from
the
probability
~
satisfactory
of
temperatures model
2H
and
be
typical
good
many
set
will
)
order
uniform
Rmin
well are
intermediate
a
a
The
temperatures
to ,
may
temperatures
Ising
high
.
typical (x )
the
. The
size
set
the
large
, how
has
typical thus
in
giving
again
( 13
<jJ ( x ) P on
of
.
in
fdNx
takes
model
typical
At
located
~
sufficiently
times
Ising
in
tends
is =
Rmin
So
( x ) log2
chance R
of
the the
P
4> ( x )
samples
falling
typical
of
that
number of
, and
/ 2N
the
a
~
value
stand of
case
states
2H
hit
only
set
take
=
mass
the
values
number
typical
us
,
the
will
make
)
probability
function
determined
( X
established P
(x )
that
== P * ( x ) / Z
drawing is
difficult
samples even
from if
a
P * ( x ) is
high easy
- dimensional to
evaluate
distri , we
will
180
D.J.C. MACKAY
Q * (x ),' I I I " ~ ,
x
Figure 2. Functions involved in importance sampling. We wish to estimate the expectation of (x ) under P (x ) cx : P *(x ) . We can generate samples from the simpler distribution Q(x ) cx : Q*(x ). We can evaluate Q* and P * at any point .
now study a sequenceof Monte Carlo methods: importance sampling , rejection sampling , the Metropolis method , and Gibbs sampling . 2. 1mportance sam pIing Importance sampling is not a method for generating samples from P (x ) (problem 1) ; it is just a method for estimating the expectation of a func tion (x ) (problem 2) . It can be viewed as a generalization of the uniform sampling method . For illustrative purposes, let us imagine that the target distribution is a one- dimensional density P (x ) . It is assumed that we are able to evaluate this density , at least to within a multiplicative constant ; thus we can evaluate a function P * (x ) such that
P(x) = P* (x)/ Z.
(16)
But P (x ) is too complicated a function for us to be able to sample from it directly . We now assume that we have a simpler density Q (x ) which we can evaluate to within a multiplicative constant (that is, we can evaluate Q * (x ) , where Q (x ) = Q * (x ) / ZQ) , and from which we can generate samples. An example of the functions P * , Q * and is shown in figure 2. We call Q the sampler density . In importance sampling , we generate R samples { x (r)} ~=l from Q (x ) . If these points were samples from P (x ) then we could estimate ~ by equation (2) . But when we generate samples from Q , values of x where Q (x ) is greater than P (x ) will be over- represented in this estimator , and points
MONTECARLOMETHODS - 6 .2
- 6 .2
- 6 .4
- 6 .4
- 6 .6
- 6 .6
-6 .8
- 6 .8
-7
-7
- 7 .2
181
- 7 .2 10
1 00
1 000
1 0000
1 00000
1 000000
10
(a)
1 00
1 000
1 0000
1 00000
1 000000
(b)
Figure 9. Importance sampling in action: a) using a Gaussian sampler density; b) using a Cauchy sampler density. Horizontal axis showsnumber of samples on a log scale. Vertical " axis shows the estimate cI>. The horizontal line indicates the true value of
where Q (x) is less than P (x) will be under- represented. To take into account the fact that we have sampled from the wrong distribution , we introduce 'weights' - p * (x(r))
Wr= (J;{; (;)) which we use to adjust the ' importance
(17)
' of each point in our estimator
thus :
= l : r Wr(x(r)) . -
(18)
l : r Wr
If Q (x) is non- zero for all x where P (x) is non- zero, it can be proved that , . .
the estimator converges to <1>, the mean value of 4>(x ) , as R increases. A practical difficulty estimate
how
reliable
the
with importance sampling is that it is hard to "
estimator
is . The
"
variance
of is hard
to
estimate, becausethe empirical variancesof the quantities wr and wr4>(x(r)) are not necessarily a good guide to the true variances of the numerator and
denominator in equation (18). If the proposal density Q(x) is small in a region where 14 >(x) P* (x) I is large then it is quite possible, even after many points x (r ) have been generated , that none of them will have fallen in that region . This leads to an estimate of that is drastically wrong , and no indication in the empirical variance that the true variance of the estimator is large . , . .
MONTECARLOMETHODS
183
will fall outside Rp and will have weight zero.] Then we know that most samples from Q will have a value of Q that lies in the range
(217ra2 )N /2exp -/2-:N (-2 N:f:-V :").
(2)2
Thus the weights Wr = P*/ Q will typically have values in the range
(27r0 '2)N/2exp (~2:i:~2) .
(23)
So if we draw a hundred samples, what will the typical range of weights be? We can roughly estimate the ratio of the largest weight to the median weight by doubling the standard deviation in equation (23) . The largest weight and the median weight will typically be in the ratio :
wmax r ~ =exp (J2N ).
In
N
==
1000
samples
is
Thus very
likely
to
an
importance
likely
be
utterly
conclusion
,
In from
two
the
dimensions
set
the , still
of
P ,
Rejection
We
assume
again
a
function have
that
we
( within
a
multiplicative
by
a few
, we
clearly
this
may
, even
of
a
we
obtain
in
order
long
are
a typical exp
samples
time
unless in
to
set
.
vary
Q
by
, although
suffers
that
the
.
will
weights often
samples likely
weight
problem
huge
obtain
hundred
median
dimensions
to
take if
the
with
high
need
one
- dimensional
samples in
after
than
a high
samples
points of
from
. We
one
- dimensional for
us
is
lie a
good
typical large
in
set
,
factors
similar
to
,
each
( . . ; N) .
be
proposal
factor further
density
to
a simpler
ZQ
able
before
that
we
== P * ( x ) / Z
sample
density
, as
assume
P (x ) to
from
it
Q ( x ) which ) , and
know
which the
we we
value
of
that
is
directly can can
too . We
evaluate generate
a constant
c
that for
A
greater
for
those
factors
a
assume
such
with
by
times
weight
sampling
complicated
samples
and
probabilities
largest
sampling
P . Second
differ
3 .
1019
dominated
associated
because other
, the
estimate
importance . First
to
weights
roughly
sampling
approximation the
be
difficulties
typical
therefore
(24)
schematic We
proposal
picture
generate density
of
two
the
random
Q ( x ) . We
all
x , cQ * ( x ) >
two
functions numbers
then
evaluate
is . The
P * (x ) .
( 25 )
shown
in
first
, x , is
cQ * ( x ) and
figure
4a .
generated
generate
from a uniformly
the
MONTECARLOMETHODS
185
-
-4
Figure
c
5
such
.
A
that
As
a
tions
(
x
case
P
)
~
P
(
other
.
no
[
aQ
c
if
the
/
(
(
N
/
and
,
N
the
rejection
the
acceptance
volume
=
=
1000
under
c
grows
)
,
,
is
a
2
)
N
/
2
(
27r0
'
Q ~
)
N
/
2
=
create
(
a
x
1
.
scaled
up
by
a
01
,
/
we
.
The
=
the
our
the
. ]
N
c
c
?
0 ' ~
=
=
exp
(
The
Q
are
technique
x
)
,
useful
for
)
at
single
is
and
similar
of
the
10
density
P
)
~
20
,
000
.
What
:
curve
P
(
x
)
implies
is
1
/
20
,
.
Q
x
(
)
x
.
)
In
that
large
has
complex
this
property
to
the
In
the
general
,
.
method
only
and
)
since
that
000
26
will
immediate
the
for
generating
sampling
(
is
(
is
this
c
origin
one
-
samples
dimensional
from
high
.
rejection
to
is
value
.
normalized
study
a
the
there
)
under
N
whilst
Q
x
,
set
answer
volume
and
(
to
than
case
is
Q
need
two
larger
the
what
of
we
log
case
,
,
from
these
percent
not
So
with
samples
that
one
is
dimensionality
therefore
(
is
density
find
P
For
P
x
exp
of
that
c
aQ
obtain
-
one
method
)
factor
distribu
assume
this
all
P
=
of
fact
1
for
value
practical
sampling
Q
)
from
to
us
,
(
ratio
the
Metropolis
Importance
density
x
Gaussian
Let
say
bound
'
distributions
The
(
samples
if
?
-
27r0
the
,
not
1000
this
be
sampling
problems
to
x
Q
dimensional
ape
-
(
with
dimensional
.
(
exponentially
Rejection
4
is
will
4
sampling
P
=
=
-
because
upper
fur
N
value
bounds
N
~
be
cQ
rate
3
generating
ap
to
-
and
rate
acceptance
of
Imagine
in
-
cQ
rate
2
Gaussian
is
close
upper
for
1
rejection
than
C
With
.
using
larger
so
pair
)
deviation
cQ
2
0
broader
a
is
)
-1
slightly
5
are
be
that
~
a
figure
aQ
dimensionality
27ra
and
standard
must
such
)
consider
deviations
ap
x
-2
.
zero
whose
standard
)
,
mean
deviation
the
(
x
study
with
standard
1
Gaussian
cQ
-3
work
well
problems
.
if
the
it
proposal
is
difficult
-
186
D.J.C. MACKAY
r"\Q(x;X(l))
. . . . . .
. . . I . .
.
.
.
.
.
.
.
.
.
I
I
.
.
.
.
.
.
.
,
.
,
.
.
.
.
.
'
. "
.
.
"
.
.
'
.
-
-
, -
, -
-
X(l )
x
---.
Figure 6. Metropolis method in one dimension. The proposal distribution Q(x ' ; x ) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.
The
Metropolis
which
algorithm
depends
the
on
simplest
the
case
current
is
x
not
of
a
Q
( x
.
for
,
x
whether
'
to
we
( t ) )
in
new
a
~
1
then
the
Otherwise
If
the
step
set
x
is
( t +
l
)
sampling
= =
,
samples
to
,
{
be
label
current
is
a
point
of
( t +
then
( x
' ;
a
in
in
( x
on
density
)
.
An
shows
.
T
' )
It
example
the
density
)
for
any
Q
( x
x
' ;
x
.
A
( t ) )
tentative
.
To
decide
quantity
' )
l
)
= =
x
' .
,
a
Markov
( t ) )
the
no
.
factor
.
.
.
the
.
( x
we
rejection
the
list
current
It
R
to
( t )
label
of
state
is
important
to
T
are
;
to
x
unity
' )
/
=
1
note
Q
( x
as
a
and
x
to
( t ) )
. T
that
.
compute
If
the
Gaussian
the
.
.
able
' ;
.
independent
correlated
be
,
that
t
produce
such
points
superscript
need
is
)
then
in
on
the
samples
we
Q
,
:
influence
not
The
and
1
and
chain
(
.
= =
does
.
rejected
causes
density
latter
is
sampling
time
,
a
step
have
distribution
( x
probability
the
r
)
.
rejection
iterations
P
)
28
If
another
a
( t
rejection
and
Here
P
/
x
.
probability
( x
( x
with
from
symmetrical
,
Q
X
accepted
distribution
P
)
superscript
acceptance
simple
( t ) ;
accepted
points
the
of
ratios
density
( t
is
.
states
target
the
probability
the
of
the
compute
( x
x
from
sequence
*
the
( x
Q
might
.
P
Q
( t ) )
fixed
P
figure
( 2 )
x
centred
any
to
density
discarded
used
simulation
from
posal
list
have
samples
the
*
collected
the
I
' )
difference
are
we
onto
:
Metropolis
To
on
that
written
samples
the
}
( X
be
this
x
compute
is
set
the
points
)
independent
to
a
( r
*
state
we
Note
rejected
x
Notation
are
,
( t ) .
can
' ;
Gaussian
similar
evaluate
we
state
new
accepted
x
)
a
density
( x
( 27 p
new
the
x
proposal
,
proposal
Q
a5
;
and
a
-
If
' ;
6
)
can
P a
( l
the
state
of
density
all
figure
we
from
( x
at
x
that
the
The
such
look
states
assume
use
.
Q
to
shown
generated
accept
( t )
density
x
different
is
x
distribution
' ;
is
two
before
state
simple
( x
density
( t ) )
As
new
Q
makes
state
proposal
for
proposal
x
a
The
necessary
' ;
current
be
( t )
instead
the
centred
Metropolis
pro
-
Q (
MONTECARLOMETHODS
187
Figure 7. Metropolis method in two dimensions, showing a traditional proposal density that has a sufficiently small step size ~ that the acceptance frequency will be about 0.5.
method simply involves comparing the value of the target density at the two points . The general algorithm for asymmetric Q , given above, is often called the Metropolis - Hastings algorithm . It can be shown that for any positive Q (that is, any Q such that Q (x ' ; x ) > 0 for all x , x') , as t - + 00, the probability distribution of x (t) tends to P (x ) == P * (x ) / Z . [This statement should not be seen as implying that Q has to assign positive probability to every point x ' - we will discuss examples later where Q (x' ; x ) == 0 for some x , x' ; notice also that we have said nothing about how rapidly the convergence to P (x ) takes place.] The Metropolis method is an example of a ' Markov chain Monte Carlo ' method (abbreviated MCMC ) . In contrast to rejection sampling where the accepted points { x (r)} are independent samples from the desired distribution , Markov chain Monte Carlo methods involve a Markov process in which a sequence of states { x (t)} is generated , each sample x (t) having a probability distribution that depends on the previous value, x (t - 1). Since successive samples are correlated with each other , the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from P . Just as it was difficult to estimate the variance of an importance sampling estimator , so it is difficult to assess whether a Markov chain Monte Carlo method has 'converged ' , and to quantify how long one has to wait to obtain samples that are effectively independent samples from P .
188
D.J.C. MACKAY
4.1. DEMONSTRATION OF THE METROPOLIS METHOD The Metropolis method is widely used for high- dimensional problems . Many implementations of the Metropolis method employ a proposal distribution with a length scale f that is short relative to the length scale L of the prob able region (figure 7) . A reason for choosing a small length scale is that for most high - dimensional problems , a large random step from a typical point (that is, a sample from P (x )) is very likely to end in a state which hM very low probability ; such steps are unlikely to be accepted. If f is large , movement around the state space will only occur when a transition to a state which has very low probability is actually accepted, or when a large random step chances to land in another probable state . So the rate of progress will be slow , unless small steps are used. The disadvantage of small steps, on the other hand , is that the Metropo lis method will explore the probability distribution by a random walk , and random walks take a long time to get anywhere . Consider a one- dimensional random walk , for example , on each step of which the state moves randomly to the left or to the right with equal probability . After T steps of size f , the state is only likely to have moved a distance about . ; Tf . Recall that the first aim of Monte Carlo sampling is to generate a number of inde pendent samples from the given distribution (a dozen, say) . If the largest length scale of the state space is L , then we have to simulate a random walk Metropolis method for a time T ~ (L / f ) 2 before we can expect to get a sam pIe that is roughly independent of the initial condition - and that 's assuming that every step is accepted: if only a fraction / of the steps are accepted on average, then this time is increased by a factor 1/ / . Rule of thumb : lower bound on number of iterations ora Metropo lis method . If the largest length scale of the space of probable states is L , a Metropolis method whose proposal distribution generates a random walk with step size f must be run for at least T ~ (L / f )2 iterations to obtain an independent sample . This rule of thumb only gives a lower bound ; the situation may be much worse, if , for example , the probability distribution consists of several islands of high probability separated by regions of low probability . To illustrate how slow the exploration of a state space by random walk is, figure 8 shows a simulation of a Metropolis algorithm for generating samples from the distribution :
1 21 x E { a, 1, 2 . . . , 20} P(x) = { 0 otherwise
(29)
MONTECARLOMETHODS
189
~~~III ~ . . - III.-~
(a)
I111111111111111111111 1111 !!1
(b) Metropolis
=1!!llliiilll -
Figure 8. Metropolis method for a toy problem . (a) The state sequence for t = 1 . . . 600 . Horizontal direction = states from 0 to 20; vertical direction = time from 1 to 600 ; the cross bars mark time intervals of duration 50. (b ) Histogram of occupancy of the states after 100, 400 and 1200 iterations . (c) For comparison , histograms resulting when successive points are drawn independently from the target distribution .
D.J.C. MACKAY
190 The proposal distribution is
X '
Q(x';x) = { ~ Because
the
when
0
and
state
end
state
end
x
states
is
in
4 .2 .
METROPOLIS
The
rule
{ O , 1 , 2 , . . . 20 steps
of
thumb
is of
{ un
,
}
in
the
such
that
each
variable
dom
walk
the
directions
with
step
least
need
T T
~
Now comes
( L
how ,
fall
~
from
but sharply
/ /
big if
f
the
) 2
;
iterations
to
.
random of
same
a
lower
step
the sizes
,
case
of
equal
and us
,
to all
umin
be
to will
an
section
a
largest
this
is
and
assumption
, a
by
where
independent
ran
-
effectively
controlled ,
-
adjusted
executing
generate be
loss dis
deviations
f
,
-
spherical
Without
the
Under others
distri
separable
that
taken
previous
obtain
standard
1 . the
time
the
has
to
target
. a
the
applies
is
to
assume
close
a
is
it
on
also
that
that
Let
The
bound
method
distribution in
abolish
the
distribution
of
as
other
iterations
to
using
giving
umax
is
target
just
to
the
we
sample
the
needed ,
here
we
f ) 2 . can
is .
Let
.
required
independent
try
simplest
and
.
about
amax
( amax
,
independently sizes
lengthscale
}
.
both
exploration
direction
probability
evolves
,
target
deviations
acceptance
samples
largest
n
standard
xn
each
{ xn
,
an
visit
hundred
distribution
the
axes
to
an
into
!
proposal
that
the
it
the
in
assume
with
these
independent
will
can
take
effectively
Metropolis
Consider
deviation
we
step
with
systematic
above
a
first
are
four to
A
above reach
DIMENSIONS
walk
and
Thus
states
to
it
is
end thumb
encounter
about
hundred
discussed
.
iterations
only
evolution
the of
The
400
around
HIGH
random
,
different
of
a
Gaussian
aligned
smallest
at
a
standard
generality
tribution
we of
.
four
.
important
get
of
problems
that
Gaussian
could
IN
that
dimensional
bution
}
instead
iterations
is
its of
iterations
does
.
methods
and
rule
100
occur
.
one
first
for it
10
long
the
iteration
21
the
~
How
will
=
=
steps
about
that
'
example
indeed
Carlo
rejections
reach
T
simulating
shows Monte
10
.
540th by
(30)
x
Xo
present
And
METHOD
of
higher
the
example
twenty
number
on
or
to
is
predicts .
generated
space
about
thumb
1
1
,
-
time
iteration
space
place
only
simple
state
of state
behaviour
toy
178th
=
take
a
the
:!:
state
it
distance
in
X
uniform '
the
take
the
rule
takes
This walk
the
is x
does
confirmed on
The
are
Since
)
to in
long
typically
whole
state
samples
?
( x
state
started
How
will
occurs ?
.
20
This
the
end
of
= it
.
traverse
in
8a
that
end
the
was
figure
predicts
P
takes
simulation in
=
distribution
proposal
The shown x
target
the
=
otherwise
too It
be big seems
? -
The
bigger bigger
plausible
it than that
is
,
amin the
the
smaller -
optimal
then
this the must
number
T
acceptance be
be
-
rate similar
to
MONTECARLOMETHODS
X2
191
X2
P(x) ;:.::.::-. -
(a)
Xl
(b) X2
X2
P (x2IXI)
(c)
;:
Xl
(d)
.............. t))
tX X ( + 2 ) (t+ l)X (t) Xl
Xl
Figure 9. Gibbs sampling. (a) The joint density P (x ) from which samples are required. (b) Starting from a state x (t), Xl is sampled from the conditional density P (xllx ~t)). (c) A sample is then made from the conditional density P (x2IxI ) . (d) A couple of iterations of Gibbs sampling.
umin. Strictly , this may not be true ; in special cases where the second small est an is significantly greater than amin, the optimal f may be closer to that second smallest an . But our rough conclusion is this : where simple spherical proposal distributions are used, we will need at least T ~ (am.ax/ umin)2 iterations to obtain an independent sample , where umax and amln are the longest and shortest lengthscales of the target distribution . This is good news and bad news. It is good news because, unlike the cases of rejection sampling and importance sampling , there is no cat &Strophic dependence on the dimensionality N . But it is bad news in that all the same, this quadratic dependence on the lengthscale ratio may force us to make very lengthy simulations . Fortunately , there are methods for suppressing random walks in Monte Carlo simulations , which we will discuss later .
192
D.J.C. MACKAY
5. Gibbs sampling We introduced importance sampling , rejection sampling and the Metropo lis method using one- dimensional examples . Gibbs sampling , also known as the heat bath method, is a method for sampling from distributions over at least two dimensions . It can be viewed as a Metropolis method in which the proposal distri bu tion Q is defined in terms of the conditional distri butions of the joint distribution P (x ) . It is assumed that whilst P (x ) is too complex to draw samples from directly , its conditional distributions P (Xi I{ Xj } jli ) are tractable to work with . For many graphical models (but not all ) these one- dimensional conditional distributions are straightforward to sample from . Conditional distributions that are not of standard form may still be sampled from by adaptive rejection sampling if the conditional distribution satisfies certain convexity properties (Gilks and Wild 1992) . Gibbs sampling is illustrated for a cage with two variables (Xl , X2) = x in figure 9. On each iteration , we start from the current state x (t), and Xl is sampled from the conditional density P (xllx2 ) ' with X2 fixed to x ~t). A sample x2 is then made from the conditional density P (x2IxI ) , using the new value of Xl . This brings us to the new state X(t+l ), and completes the iteration . In the general case of a system with I( variables , a single iteration involves sampling one parameter at a time :
X(t+l) 1 X(t+l) 2 X3 (t+l)
r.....I
(t),X3 (t),...XK (t)} P(XIIX2 I (t+l),X3 (t),...XK (t)} P(X2IXl IXl (t+l),X2 (t+l)'...XK (t)},etc P(X3 .
erty that every proposal is always accepted . Because Gibbs sampling is a Metropolis method , the probability distribution of x (t) tends to P (x ) as t - + 00, as long as P (x ) does not have pathological properties .
5.1. GIBBSSAMPLING IN HIGHDIMENSIONS Gibbs sampling suffers from the same defect as simple Metropolis algorithms - the state space is explored by a random walk , unless a fortuitous parameterization has been chosen which makes the probability distribution P (x ) separable . If , say, two variables x 1 and X2 are strongly correlated , having marginal densities of width L and conditional densities of width f , then it will take at least about (L / f ) 2 iterations to generate an indepen dent sample from the target density . However Gibbs sampling involves no adjustable parameters , so it is an attractive strategy when one wants to get
MONTECARLOMETHODS
193
a model running quickly . An excellent software package, BUGS, is available which makes it easy to set up almost arbitrary probabilistic models and simulate them by Gibbs sampling (Thomas , Spiegelhalter and Gilks 1992) . 6
.
Terminology
We
for
now
spend
method
A
p
a
and
( O )
( x
)
Mar
few
a
transition
is
given
.
construct
The
A
( t +
l
)
( X
.
distribution
7r
chain
is
must
such
( x
)
is
ergodic
the
MetropoliR
initial
( x
' ;
state
probability
x
)
distribution
.
at
the
( t
+
l
)
th
iteration
of
the
T
( x
' ;
x
) p
( t ) ( x
)
.
( 34
)
:
( x
)
is
the
invariant
distribution
of
the
often
( t )
)
convenient
all
density
-
t
= =
f
dNx
7r
( x
)
as
( x
' )
)
.
of
( x
that
t
-
T
which
( x
T
,
construct
of
P
' )
distribution
ergodic
to
B
desired
( X
be
( x
invariant
t
00
by
' ;
x
is
,
,
for
)
7r
( x
)
any
or
( x
' ;
x
)
if
.
p
mixing
T
( O )
( x
)
.
concatenating
( 35
)
( 36
)
( 37
)
simple
satisfy
=
JdNX
B
These
base
( x
' ;
x
)
P
( x
transitions
)
,
need
not
be
individually
.
Many
erty
the
that
an
also
transitions
the
an
JdNX
P
P
for
= =
distribution
p
base
which
.
The
It
by
T
' )
chain
7r
2
specified
of
on
by
the
desired
chain
theory
methods
.
probability
p
1
based
distribution
chain
We
be
Carlo
the
are
can
probability
kov
Monte
sketching
sampling
chain
and
chain
moments
Gibbs
Markov
The
Markov
useful
transition
probabilities
satisfy
the
detailed
balance
prop
-
:
T
( x
' ;
x
)
P
( x
)
=
T
( x
;
x
' )
P
( x
' )
,
for
all
x
and
x
' .
( 38
)
This equation says that if we pick a state from the target density P and make a transition under T to another state , it is just as likely that we will pick x and go from x to x ' as it is that we will pick x ' and go from x ' to x . Markov chains that satisfy detailed balance are also called reversible Markov chains . The reason why the detailed balance property is of interest is that detailed balance implies invariance of the distribution P (x ) under
194
D.J.C. MACKAY
the Markov chain T (the proof of this is left as an exercise for the reader ) . Proving that detailed balance holds is often a key step when proving that a Markov chain Monte Carlo simulation will converge to the desired distri bution . The Metropolis method and Gibbs sampling method both satisfy detailed balance , for example . Detailed balance is not an essential condi tion , however , and we will see later that irreversible Markov chains can be useful in practice .
7. Practicalities Can we predict how long a Markov chain Monte Carlo simulation will take to equilibrate ? By considering the random walks involved in a Markov chain Monte Carlo simulation we can obtain simple lower bounds on the time required for convergence. But predicting this time more precisely is a difficult problem , and most of the theoretical results are of little practical use. Can we diagnose or detect convergence in a running simulation ? This is also a difficult problem . There are a few practical tools available , but none of them is perfect (Cowles and Carlin 1996) . Can we speed up the convergence time and time between inde pendent samples of a Markov chain Monte Carlo method ? Here, there is good news.
7.1. SPEEDINGUP MONTECARLOMETHODS 7 .1 .1 . The method
Reducing
Monte
applicable
information For
random
hybrid
to many
to reduce systems
walk Carlo
behaviour
state
random
Metropolis
reviewed
continuous
, the
in
method
walk
in
spaces
which
behaviour
probability
P
(x )
methods
Neal
( 1993 makes
) is use
a
Metropolis of
gradient
. can
be
written
in
the
form
e - E (X ) P
(x )
==
( 39
)
Z
where not only E (x ), but also its gradient with respect to x can be readily evaluated . It seems wasteful to use a simple random - walk Metropolis method when this gradient is available - the gradient indicates which di rection one should go in to find states with higher probability ! In the hybrid Monte Carlo method , the state space x is augmented by momentum variables p , and there is an alternation of two types of proposal . The first proposal randomizes the momentum variable , leaving the state x unchanged . The second proposal changes both x and p using simulated Hamiltonian dynamics as defined by the Hamiltonian
H (x , p) = E (x ) + K (p) ,
(40)
MONTECARLOMETHODS g = gradE ( x ) E = findE ( x )
.
# set gradient # set objective
, .
,
for 1 = 1: L p = randn ( size (x) ) H = p' * p / 2 + E ;
195
using initial x function too
# loop L times # initial momentumis Normal(O, l ) # evaluate H(x ,p)
xnew = x gnew = g ; for tau = 1 : Tau
# make Tau ' leapfrog ' steps
p = p - epsilon * gnew / 2 ; # make half - step in xnew = xnew + epsilon * p ; # make step in x
p
gnew = gradE ( xnew ) ; # find new gradient p = p - epsilon * gnew/ 2 ; # makehalf - step in p endfor # find new value of H Enew = findE ( xnew ) ; Hnew = p ' * p / 2 + Enew ; dH = Hnew - H ; # Decide whether to accept if ( dH < 0 ) accept elseif ( rand ( ) < exp ( - dH) ) accept else accept endif ( accept ) g = gnew ; endif endfor
= 1 ; = 1 ; = 0 ;
if
Figure 10.
where als
used
to
This
density desired
E = Enew ;
Octave source code for the hybrid Monte Carlo method .
K ( p ) is a ' kinetic
are
PH ( x , p ) =
the
x = xnew ;
create
~ZH
' such
exp [ - H ( x , p ) ] =
is separable distribution
energy
( asymptotically
, so it is clear exp [ - E ( x ) ] jZ
as [
p ) =
) samples
pTp / 2 . These from
the
joint
two
propos
ZI H exp [ - E ( x ) ] exp [ - K ( p ) ] . that
the
marginal
. So , simply
distribution
discarding
-
density
the
( 41 ) of x is mom
en -
196
D.J.C. MACKAY
1
1
' " ", I'
(a)
" 1' 1' ,
0.5
' "
, ,
1' ' "
,
,. I'
"
1'
(b) 0.5
1' ..
", '
:
.. ,
' "
' " I'
1'
I ' ' "
0
I '
0 - 0 .5
-0.5
.
-1 -1
- 1 .5 -1
- 0 .5
0
0 .5
1
- 1 .5
1
(c)
0.5
-1
- 0 .5
0
0 .5
1
; ' ;' ;'
.
(d)
1
;' ;' ; ' ;' "
0 .5
:
of " "
.
/
: /
0
"
;'
"
",,; ,,;, ; ,
" "
/
"
... "
- 0 .5
.
-1 -1
- 0 .5
0
0 .5
1
Figure 11. (a,b) Hybrid Monte Carlo used to generate samples from a bivariate Gaussian with correlation p = 0.998. (c,d) Random- walk Metropolis method for comparison. (a) Starting from the state indicated by the arrow, the continuous line represents two successivetrajectories generated by the Hamiltonian dynamics. The squares show the endpoints of these two trajectories . Each trajectory consists of Tau = 19 'leapfrog' steps with epsilon = 0.055. After each trajectory , the momentum is randomized. Here, both trajectories are accepted; the errors in the Hamiltonian were + 0.016 and - 0.06 respectively . (b) The second figure shows how a sequenceof four trajectories converges from an initial condition , indicated by the arrow, that is not close to the typical set of the target distribution . The trajectory parameters Tau and epsilon were randomized for each trajectory using uniform distributions with means 19 and 0.055 respectively. The first trajectory takes us to a new state, (- 1.5, - 0.5) , similar in energy to the first state. The second trajectory happens to end in a state nearer the bottom of the energy landscape. Here, since the potential energy E is smaller, the kinetic energy ]( = p2/ 2 is necessarily larger than it was at the start . When the momentum is randomized for the third trajectory , its magnitude becomes much smaller. After the fourth trajectory has been simulated, the state appears to have become typical of the target density. (c) A random- walk Metropolis method using a Gaussian proposal density with radius such that the acceptance rate was 58% in this simulation . The number of proposals was 38 so the total amount of computer time used was similar to that in (a). The distance moved is small because of random walk behaviour. (d) A random- walk Metropolis method given a similar amount of computer time to (b).
MONTECARLOMETHODS
197
turn variables , we will obtain a sequence of samples { x (t)} which asymptotically come from P (x ) e The first proposal draws a new momentum from the Gaussian density exp [- K (p )]/ ZKe During the second, dynamical proposal , the momentum variable determines where the state x goes, and the gradient of E (x ) determines how the momentum p changes, in accordance with the equations
x = p
(42)
i> = - ~~~ax ~~~l .
(43)
Becauseof the persistentmotion of x in the direction of the momentum p, during eachdynamicalproposal, the state of the systemtendsto move a distancethat goeslinearly with the computertime, rather than as the square root. If the simulation of the Hamiltoniandynamicsis numericallyperfect then the proposalsareacceptedeverytime, because the total energyH (x , p) is a constantof the motion and so a in equation(27) is equalto one. If the simulationis imperfect, becauseof finite stepsizesfor example, then some of the dynamicalproposalswill be rejected. The rejectionrule makesuseof the changein H (x , p), which is zeroif the simulationis perfect. The occasional rejectionsensurethat asymptotically , we obtain samples(x (t), p(t)) from the requiredjoint densityPH(X, p). The sourcecode in figure 10 describesa hybrid Monte Carlo method whichusesthe 'leapfrog' algorithmto simulatethe dynamicson the function findE (x) , whosegradient is found by the function gradE(x) . Figure 11 showsthis algorithm generatingsamplesfrom a bivariateGaussianwhose energyfunction is E (x) == ~XTAx with A=
- 250 249.25 .75 - 249 250.75 .25 ] .
(44)
7.1.2. Overrelaxation The method of 'overrelaxation ' is a similar method for reducing random walk behaviour in Gibbs sampling . Overrelaxation was originally introduced for systems in which all the conditional distributions are Gaussian . (There are joint distributions that are not Gaussian whose conditional distributions are all Gaussian , for example , P (x , y) = exp (- x2y2)jZ .) In ordinary Gibbs sampling , one draws the new value x~t+l ) of the cur rent variable Xi from its conditional distribution , ignoring the old value x ~t). This leads to lengthy random walks in cages where the variables are strongly correlated , as illustrated in the left hand panel of figure 12.
198
D.J.C. MACKAY Gibbs sampling
Overrelaxation
1
(a)
1
-0.5 -1
-1
(b)
-1
Figure 12. Overrelaxation contrasted with Gibbs sampling for a bivariate Gaussian with correlation p = 0.998. (a) The state sequencefor 40 iterations , each iteration involving one update of both variables. The overrelaxation method had Q ' = - 0.98. (This excessively large value is chosen to make it easy to see how the overrelaxation method reduces random walk behaviour.) The dotted line shows the contour xT}::- lX = 1. (b) Detail of (a), showing the two steps making up each iteration . (After Neal ( 1995).)
In from
Adler a
's
( 1981
Gaussian
tribution
.
current
value
If
)
overrelaxation
that the of
is
is
v
r-..J Normal
to
the
,
=
J. L +
( O , 1 ) and
Adler
a:
is
a
Xi
method
J. L)
parameter
+
instead
side of
's
a: ( x ~t ) -
one
opposite
distribution
x ~t ) , then
x ~t + l )
where
biased
conditional Xi
method
( 1 -
is
of
the
conditional
Normal
( J. L, 0- 2 )
sets
Xi
a2
) 1 / 2uv
between
Xt(t+l)
samples
dis and
-
the
to
-
,
1
( 45
and
)
1 , commonly
set to a negative value . The transition matrix T (x ' ; x ) defined by this procedure does not satisfy detailed balance . The individual transitions for the individual coordinates just described do satisfy detailed balance, but when we form a chain by applying them in a fixed sequence, the overall chain is not reversible . If , say, two variables are positively correlated , then they will (on a short timescale )
MONTECARLOMETHODS
199
evolve in a directed manner instead of by random walk , as shown in figure 12. This may significantly reduce the time required to obtain effectively independent samples. This method is still a valid sampling strategy - it converges to the target density P (x ) - because it is made up of transitions that satisfy detailed balance . The overrelaxation method has been generalized by Neal (1995, and this volume ) whose 'ordered overrelaxation ' method is applicable to any system where Gibbs sampling is used. For practical purposes this method may speed up a simulation by a factor of ten or twenty . 7.1.3. Simulated annealing A third technique for speeding convergence is simulated annealing . In simulated annealing , a 'temperature ' parameter is introduced which , when large , allows the system to make transitions which would be improbable at temperature 1. The temperature may be initially set to a large value and reduced gradually to 1. It is hoped that this procedure reduces the chance of the simulation 's becoming stuck in an unrepresentative probability island . We asssume that we wish to sample from a distribution of the form
p(x) = .~:..:=.z~~~.~~.
(46)
where E (x ) can be evaluated . In the simplest simulated annealing method , we instead sample from the distribution ~ 1~
PT(X) ==ztT>e- T
(47)
and decreage T gradually to 1. Often the energy function can be separated into two terms ,
E (x ) == Eo(x ) + El (X) ,
(48)
of which the first term is ' nice' (for example , a separable function of x ) and the second is ' nasty ' . In these cases, a better simulated annealing method might make use of the distribution ~ l_i~
PT(X) = ~ e-EO (X)- T
(49)
with T gradually decreasing to 1. In this way, the distribution at high temperatures reverts to a well- behaved distribution defined by Eo. Simulated annealing is often used as an optimization method , where the
aim is to find an x that minimizes E (x ), in which case the temperature is decreased
to zero
rather
than
to
1 . As a Monte
Carlo
method
, simulated
200
D.J.C. MACKAY
annealing as described above doesn't sample exactly from the right distribution ; the closely related 'simulated tempering' methods (Marinari and Parisi 1992) correct the biasesintroduced by the annealing processby making the temperature itself a random variable that is updated in Metropolis fashion during the simulation. 7.2. CAN THE NORMALIZING CONSTANTBE EVALUATED? If the target density P (x ) is given in the form of an unnormalized density P* (x ) with P (x ) == 1y ;p * (x ) , the value of Z may well be of interest. Monte Carlo methods do not readily yield an estimate of this quantity , and it is an area of active researchto find ways of evaluating it . Techniquesfor evaluating Z include: 1 .
Importance
2 .
' Thermodynamic
sampling
( reviewed integration
by '
during
Neal
( 1993 simulated
) ) . annealing
,
the
' accep
-
tance ratio ' method , and ' umbrella sampling ' (reviewed by Neal (1993) ) . 3. ' Reversible jump Markov chain Monte Carlo ' (Green 1995) . Perhaps the best way of dealing with Z , however, is to find a solution to one's task that does not require that Z be evaluated . In Bayesian data mod elling one can avoid the need to evaluate Z - which would be important for model comparison - by not having more than one model . Instead of using several models (differing in complexity , for example ) and evaluating their relative posterior probabilities , one can make a single hierarchical model having , for example , various continuous hyperparameters which play a role similar to that played by the distinct models (Neal 1996) . 7.3. THE METROPOLIS METHOD FOR BIG MODELS Our original description of the Metropolis method involved a joint updating of all the variables using a proposal density Q (x ' ; x ) . For big problems it may be more efficient to use several proposal distributions Q (b) (x ' ; x ) , each of which updates only some of the components of x . Each proposal is indi vidually accepted or rejected , and the proposal distributions are repeatedly run through in sequence. In the Metropolis method , the proposal density Q (x /; x ) typically has a number of parameters that control , for example , its 'width ' . These parameters are usually set by trial and error with the rule of thumb being that one aims for a rejection frequency of about 0.5. It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation . Such a modification of the proposal density would violate the detailed balance condition which guarantees that the Markov chain hag the correct invariant distribution .
MONTECARLOMETHODS
201
7.4. GIBBSSAMPLING IN BIGMODELS Our description of Gibbs sampling involved sampling one parameter at a time , as described in equations (31- 33) . For big problems it may be more efficient to sample groups of variables jointly , that is to use several proposal distributions :
x~t+1.)..X~t+1) fiJP(X1...XaIX~t~1...X~) (t+ l ) (t+ l ) Xa+ l . . . Xb
(50)
( I (t+l.). .Xa (t+l ),Xb (t) (t) etc.. (51) rv PXa+l . . .XbXl +l . . .XK)'
7.5. HOWMANYSAMPLES ARENEEDED ? At A the start of this chapter, we observed that the variance of an estimator depends only on the number
of independent
samples R and the value of
0-2:=J dNXP(x) j>(x) - <1 2,
(52)
We have now discussed a variety of methods for generating samples from P (x ) . How many independent samples R should we aim for ? In many problems , we really only need about twelve independent samples from P (x ) . Imagine that x is an unknown vector such aB the amount of corrosion present in each of 10,000 underground pipelines around Sicily , and >(x ) is the total cost of repairing those pipelines . The distribution P (x ) describes the probability of a state x given the tests that have been carried out on some pipelines and the assumptions about the physics of corrosion . The quantity is the expected cost of the repairs . The quantity 0' 2 is the variance of the cost - 0' measures by how much we should expect the actual cost to differ from the expectation <1>. N ow , how accurately would a manager like to know ? I would suggest there is little point in knowing to a precision finer than about 0-/ 3. After all , the true cost is likely to differ by :f:0' from
7.6. ALLOCATION OFRESOURCES Assuming we have decided how many independent samples R are required , an important question is how one should make use of one's limited computer resources to obtain these samples. A typical Markov chain Monte Carlo experiment involves an initial period in which control parameters of the simulation such as step sizes may be adjusted . This is followed by a ' burn in ' period during which we hope the
202
D.J.C. MACKAY ( 1)
(
2
:-
)
-
: -
: -
::::J
: -
= -
= J
: -
=-
= J
: -
=-
=J
--
:-
:-
:-
:-
:-
:-
:-
:-
:-
_J
( 3 ) : : : ::: ~ :J :J :::) :) :) -) :) :) :) :)
Figure 13. Three possible Markov Chain Monte Carlo strategies for obtaining twelve samples using a fixed amount of computer time . Computer time is represented by horizontal lines; samples by white circles. (1) A single run consisting of one long 'burn in ' period followed by a sampling period . (2) Four medium- length runs with different initial conditions and a medium- length burn in period. (3) Twelve short runs.
simulation 'converges' to the desired distribution . Finally , as the simulation continues , we record the state vector occasionally so as to create a list of states { X(r )} ~ l that we hope are roughly independent samples from P (x ) . There are several possible strategies (figure 13) . 1. Make one long run , obtaining all 2. Make a few medium length runs taining some samples from each. 3. Make R short runs , each starting dition , with the only state that is simulation .
R samples from it . with different initial conditions , obfrom a different random initial conrecorded being the final state of each
The first strategy has the best chance of attaining 'convergence' . The last strategy may have the advantage that the correlations between the recorded samples are smaller . The middle path appears to be popular with Markov chain Monte Carlo experts because it avoids the inefficiency of discarding burn - in iterations in many runs , while still allowing one to detect problems with lack of convergence that would not be apparent from a single run .
MONTECARLOMETHODS
203
7.7. PHILOSOPHY One curious defect of these Monte Carlo methods - which are widely used by Bayesian statisticians - is that they are all non- Bayesian . They involve computer experiments from which estimators of quantities of interest are derived . These estimators depend on the sampling distributions that were used to generate the samples. In contrMt , an alternative Bayesian approach to the problem would use the results of our computer experiments to infer the properties of the target function P (x ) and generate predictive distribu tions for quantities of interest such M <1>. This approach would give answers which would depend only on the computed values of P * (x (r )) at the points { x (r )} ; the answers would not depend on how those points were chosen. It remains an open problem to create a Bayesian version of Monte Carlo methods . 8 . Summary - Monte Carlo methods are a powerful tool that allow one to implement any probability distribution that can be expressed in the form P (x ) = lzP * (X) . - Monte Carlo methods can answer virtually any query related to P (x ) by putting the query in the form
J cf >(x)P(x) ~ ~ Lr (x(r)).
(53)
- In high - dimensional problems the only satisfactory methods are those based on Markov chain Monte Carlo : the Metropolis method and Gibbs sampling . - Simple Metropolis algorithms , although widely used, perform poorly because they explore the space by a slow random walk . More sophisti cated Metropolis algorithms such as hybrid Monte Carlo make use of proposal densities that give faster movement through the state space. The efficiency of Gibbs sampling is also troubled by random walks . The method of ordered overrelaxation is a general purpose technique for suppressing them .
ACKNOWLEDGEMENTS This presentation of Monte Carlo methods owes a great deal to Wally Gilks and David Spiegelhalter . I thank Radford Neal for teaching me about Monte Carlo methods and for giving helpful comments on the manuscript .
204
D.J.C. MACKAY
References

Adler, S. L.: 1981, Over-relaxation method for the Monte-Carlo evaluation of the partition function for multiquadratic actions, Physical Review D - Particles and Fields 23(12), 2901-2904.
Cowles, M. K. and Carlin, B. P.: 1996, Markov-chain Monte-Carlo convergence diagnostics - a comparative review, Journal of the American Statistical Association 91(434), 883-904.
Gilks, W. and Wild, P.: 1992, Adaptive rejection sampling for Gibbs sampling, Applied Statistics 41, 337-348.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J.: 1996, Markov Chain Monte Carlo in Practice, Chapman and Hall.
Green, P. J.: 1995, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82, 711-732.
Marinari, E. and Parisi, G.: 1992, Simulated tempering - a new Monte-Carlo scheme, Europhysics Letters 19(6), 451-458.
Neal, R. M.: 1993, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.
Neal, R. M.: 1995, Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation, Technical Report 9508, Dept. of Statistics, University of Toronto.
Neal, R. M.: 1996, Bayesian Learning for Neural Networks, number 118 in Lecture Notes in Statistics, Springer, New York.
Propp, J. G. and Wilson, D. B.: 1996, Exact sampling with coupled Markov chains and applications to statistical mechanics, Random Structures and Algorithms 9(1-2), 223-252.
Tanner, M. A.: 1996, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics, 3rd edn, Springer-Verlag.
Thomas, A., Spiegelhalter, D. J. and Gilks, W. R.: 1992, BUGS: A program to perform Bayesian inference using Gibbs sampling, in J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (eds), Bayesian Statistics 4, Clarendon Press, Oxford, pp. 837-842.
Yeomans, J.: 1992, Statistical Mechanics of Phase Transitions, Clarendon Press, Oxford.

For a full bibliography and a more thorough review of Monte Carlo methods, the reader is encouraged to consult Neal (1993), Gilks et al. (1996), and Tanner (1996).
SUPPRESSING RANDOM WALKS IN MARKOV CHAIN MONTE CARLO USING ORDERED OVERRELAXATION

RADFORD M. NEAL
Dept. of Statistics and Dept. of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~radford/
Abstract. Markov chain Monte Carlo methods such as Gibbs sampling and simple forms of the Metropolis algorithm typically move about the distribution being sampled via a random walk. For the complex, high-dimensional distributions commonly encountered in Bayesian inference and statistical physics, the distance moved in each iteration of these algorithms will usually be small, because it is difficult or impossible to transform the problem to eliminate dependencies between variables. The inefficiency inherent in taking such small steps is greatly exacerbated when the algorithm operates via a random walk, as in such a case moving to a point n steps away will typically take around n² iterations. Such random walks can sometimes be suppressed using "overrelaxed" variants of Gibbs sampling (a.k.a. the heatbath algorithm), but such methods have hitherto been largely restricted to problems where all the full conditional distributions are Gaussian. I present an overrelaxed Markov chain Monte Carlo algorithm based on order statistics that is more widely applicable. In particular, the algorithm can be applied whenever the full conditional distributions are such that their cumulative distribution functions and inverse cumulative distribution functions can be efficiently computed. The method is demonstrated on an inference problem for a simple hierarchical Bayesian model.
1. Introduction

Markov chain Monte Carlo methods are used to estimate the expectations of various functions of a state, x = (x_1, ..., x_N), with respect to a distribution given by some density function, π(x). Typically, the dimensionality, N, is large, and the density π(x) is of a complex form, in which the components of x are highly dependent. The estimates are based on a (dependent) sample of states obtained by simulating an ergodic Markov chain that has π(x) as its equilibrium distribution. Starting with the work of Metropolis et al. (1953), Markov chain Monte Carlo methods have been widely used to solve problems in statistical physics and, more recently, Bayesian statistical inference. It is often the only approach known that is computationally feasible. Various Markov chain Monte Carlo methods and their applications are reviewed by Toussaint (1989), Neal (1993), and Smith and Roberts (1993).

For the difficult problems that are their primary domain, Markov chain Monte Carlo methods are limited in their efficiency by strong dependencies between components of the state, which force the Markov chain to move about the distribution in small steps. In the widely-used Gibbs sampling method (known to physicists as the heatbath method), the Markov chain operates by successively replacing each component of the state, x_i, by a value randomly chosen from its conditional distribution given the current values of the other components, π(x_i | {x_j}_{j≠i}). When dependencies between variables are strong, these conditional distributions will be much narrower than the corresponding marginal distributions, π(x_i), and many iterations of the Markov chain will be necessary for the state to visit the full range of the distribution defined by π(x). Similar behaviour is typical when the Metropolis algorithm is used to update each component of the state in turn, and also when the Metropolis algorithm is used with a simple proposal distribution that changes all components of the state simultaneously.

This inefficiency due to dependencies between components is to a certain extent unavoidable. We might hope to eliminate the problem by transforming to a parameterization in which the components of the state are no longer dependent. If this can easily be done, it is certainly the preferred solution. Typically, however, finding and applying such a transformation is difficult or impossible. Even for a distribution as simple as a multivariate Gaussian, eliminating dependencies will not be easy if the state has millions of components, as it might for a problem in statistical physics or image processing.

However, in the Markov chain Monte Carlo methods that are most commonly used, this inherent inefficiency is greatly exacerbated by the random walk nature of the algorithm. Not only is the distribution explored by taking small steps, the direction of these steps is randomized in each iteration, with the result that on average it takes about n² steps to move to a point n steps away. This can greatly increase both the number of iterations required before equilibrium is approached, and the number of subsequent iterations that are needed to gather a sample of states from which accurate estimates for the quantities of interest can be obtained. In the physics literature, this problem has been addressed in two ways
- by "overrelaxation" methods, introduced by Adler (1981), which are the main subject of this paper, and by dynamical methods, such as "hybrid Monte Carlo", which I briefly describe next.

The hybrid Monte Carlo method, due to Duane, Kennedy, Pendleton, and Roweth (1987), can be seen as an elaborate form of the Metropolis algorithm (in an extended state space) in which candidate states are found by simulating a trajectory defined by Hamiltonian dynamics. These trajectories will proceed in a consistent direction, until such time as they reach a region of low probability. By using states proposed by this deterministic process, random walk effects can be largely eliminated. In Bayesian inference problems for complex models based on neural networks, I have found (Neal 1995) that the hybrid Monte Carlo method can be hundreds or thousands of times faster than simple versions of the Metropolis algorithm.

Hybrid Monte Carlo can be applied to a wide variety of problems where the state variables are continuous, and derivatives of the probability density can be efficiently computed. The method does, however, require that careful choices be made both for the length of the trajectories and for the stepsize used in the discretization of the dynamics. Using too large a stepsize will cause the dynamics to become unstable, resulting in an extremely high rejection rate. This need to carefully select the stepsize in the hybrid Monte Carlo method is similar to the need to carefully select the width of the proposal distribution in simple forms of the Metropolis algorithm. (For example, if a candidate state is drawn from a Gaussian distribution centred at the current state, one must somehow decide what the standard deviation of this distribution should be.) Gibbs sampling does not require that the user set such parameters. A Markov chain Monte Carlo method that shared this advantage while also suppressing random walk behaviour would therefore be of interest.

Markov chain methods based on "overrelaxation" show promise in this regard. The original overrelaxation method of Adler (1981) is similar to Gibbs sampling, except that the new value chosen for a component of the state is negatively correlated with the old value. In many circumstances, successive overrelaxation improves sampling efficiency by suppressing random walk behaviour. Like Gibbs sampling, Adler's overrelaxation method does not require that the user select a suitable value for a stepsize parameter. It is therefore significantly easier to use than hybrid Monte Carlo (although one does still need to set a parameter that plays a role analogous to the trajectory length in hybrid Monte Carlo). Overrelaxation methods also do not suffer from the growth in computation time with system size that results from the use of a global acceptance test in hybrid Monte Carlo.
(On the other hand, although overrelaxation has been found to greatly improve sampling in a number of problems, there are distributions for which overrelaxation is ineffective.)

Unfortunately, Adler's overrelaxation method is applicable only to problems in which all of the full conditional distributions are Gaussian. Several more general overrelaxation methods have been proposed in the literature, but as we will see, these methods employ occasional "rejections" in order to ensure that the correct distribution is left invariant, and such rejections can undermine the ability of overrelaxation to suppress random walks.

In this paper, I present a new form of overrelaxation based on order statistics, which I call "ordered overrelaxation". In principle, the method is applicable to any distribution for which Gibbs sampling would produce an ergodic Markov chain. It can be implemented efficiently for problems in which the cumulative distribution functions and the inverse cumulative distribution functions of the full conditional distributions can be efficiently computed; I also discuss several other implementation strategies, which might further widen the range of problems for which ordered overrelaxation is useful in practice.

The paper is organized as follows. In Section 2, I review Adler's Gaussian overrelaxation method and how it can suppress random walks. Previous proposals for more general overrelaxation methods, which employ rejections, are discussed in Section 3. In Section 4, I introduce the ordered overrelaxation method itself, and show that it leaves the desired distribution invariant. Strategies for implementing ordered overrelaxation efficiently are presented in Section 5. In Section 6, I demonstrate how the method performs on a Bayesian inference problem for a simple hierarchical model. I conclude by discussing the range of problems to which ordered overrelaxation might be applied, and possibilities for future work.

2. Overrelaxation

Overrelaxation was introduced by Adler (1981) in the context of problems in statistical physics whose conditional distributions are Gaussian. The method is related to the "successive overrelaxation" method that has long been used to speed the convergence of iterative methods for solving systems of linear equations and for function minimization (Young 1971). Adler's method was later studied by Whitmer (1984) and by Barone and Frigessi (1990), and a more general overrelaxation method based on Gaussian proposals was introduced by Green and Han (1992); this work is discussed further in Section 3.
2.1. ADLER'S GAUSSIAN OVERRELAXATION METHOD

Adler's (1981) overrelaxation method is applicable when all of the full conditional densities, π(x_i | {x_j}_{j≠i}), are Gaussian. (Note that this class of distributions includes not only the multivariate Gaussians, but also some other distributions, such as that with density π(x_1, x_2) ∝ exp(−(1 + x_1²)(1 + x_2²)), whose log density is "multiquadratic" in Adler's terminology.)

In Adler's method, as in Gibbs sampling, the components of the state, x = (x_1, ..., x_N), are updated in turn, in some fixed ordering. The old value, x_i, of the component being updated is replaced by a new value, x_i', chosen from a distribution that depends on the old value and on the mean, μ_i, and variance, σ_i², of the conditional distribution for x_i given the current values of the other components, x_j for j ≠ i:

    x_i' = μ_i + α(x_i − μ_i) + σ_i (1 − α²)^{1/2} n                (1)
where n is a Gaussian random variate with mean zero and variance one. The parameter α controls the degree of overrelaxation (or underrelaxation); for the method to be valid, we must have −1 ≤ α ≤ +1. Overrelaxation to the other side of the mean occurs when α is negative. When α is zero, the method is equivalent to Gibbs sampling. (In the literature, the method is often parameterized in terms of ω = 1 − α. I have not followed this convention, as it appears to me to make all the equations harder to understand.)

One can easily confirm that Adler's method leaves the desired distribution invariant - that is, if x_i has the desired distribution (Gaussian with mean μ_i and variance σ_i²), then x_i' also has this distribution. Furthermore, it is clear that overrelaxed updates with −1 < α < +1 produce an ergodic chain. When α = −1 the method is not ergodic, though updates with α = −1 can form part of an ergodic scheme in which other updates are performed as well, as in the "hybrid overrelaxation" method discussed by Wolff (1992).

2.2. HOW OVERRELAXATION CAN SUPPRESS RANDOM WALKS

The effect of overrelaxation is illustrated in Figure 1, in which both Gibbs sampling and the overrelaxation method are shown sampling from a bivariate Gaussian distribution with high correlation. Gibbs sampling undertakes a random walk, and in the 40 iterations shown (each consisting of an update of both variables) succeeds in moving only a small way along the long axis of the distribution. In the same number of iterations, Adler's Gaussian overrelaxation method with α = −0.98 covers a greater portion of the distribution, since it tends to move consistently in one direction (subject to some random variation, and to "reflection" from the end of the distribution). The manner in which overrelaxation avoids doing a random walk when sampling from this distribution is illustrated in the close-up view in Figure 1.
Figure 1. Gibbs sampling and Adler's overrelaxation method applied to a bivariate Gaussian with correlation 0.998 (whose one-standard-deviation contour is plotted). The top left shows the progress of 40 Gibbs sampling iterations (each consisting of one update for each variable). The top right shows 40 overrelaxed iterations, with α = −0.98. The close-up on the right shows how successive overrelaxed updates operate to avoid a random walk.
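The comparison in Figure 1 is easy to reproduce. The sketch below is a minimal illustration (assuming NumPy; it is not the code used to produce the figure) of Gibbs sampling versus Adler's overrelaxed update of equation (1) on a bivariate Gaussian with correlation 0.998.

```python
import numpy as np

rho = 0.998          # correlation of the bivariate Gaussian
alpha = -0.98        # overrelaxation parameter (alpha = 0 gives Gibbs sampling)
rng = np.random.default_rng(1)

def update(x, alpha):
    """One iteration: overrelax each coordinate given the other (equation 1)."""
    for i in (0, 1):
        mu_i = rho * x[1 - i]                   # conditional mean
        sigma_i = np.sqrt(1.0 - rho ** 2)       # conditional std. dev.
        x[i] = mu_i + alpha * (x[i] - mu_i) \
               + sigma_i * np.sqrt(1.0 - alpha ** 2) * rng.standard_normal()
    return x

gibbs, over = np.zeros(2), np.zeros(2)
gibbs_path, over_path = [], []
for _ in range(40):
    gibbs_path.append(update(gibbs, alpha=0.0).copy())
    over_path.append(update(over, alpha=alpha).copy())

# The overrelaxed chain moves much further along the long axis of the
# distribution than the Gibbs chain does in the same number of iterations.
print("Gibbs range of x1:      ", np.ptp([p[0] for p in gibbs_path]))
print("Overrelaxed range of x1:", np.ptp([p[0] for p in over_path]))
```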
In the close-up view, the motion of the state during overrelaxation can be visualized as follows. When a single variable is updated, the state moves to a point on the other side of the conditional mean, which with high probability is close to the elliptic contour of the bivariate density on which the state currently lies. The combined effect of successive overrelaxed updates is therefore to move the state in a consistent direction along such a contour, with only a small amount of randomness introduced by each update. This consistent motion continues until the state reaches one end of the distribution, at which time the direction of motion reverses. When α is close to −1, the state stays close to a single contour of the density, and these occasional reversals provide most of the randomization of the direction of motion - quite different from Gibbs sampling, in which the direction of each move is chosen anew in every update.

As the correlation of the bivariate Gaussian approaches ±1, the optimal value of α approaches −1, and the benefit from using the optimal α rather than α = 0 (Gibbs sampling) becomes arbitrarily large. This comes about because the number, n, of typical steps required to move from one end of the distribution to the other is proportional to the square root of the ratio of eigenvalues of the correlation matrix, which goes to infinity as the correlation goes to ±1. The gain from moving to a nearly independent point in n steps, rather than the n² steps required with a random walk, therefore also goes to infinity.
2.3. THE BENEFIT FROM OVERRELAXATION

Figure 2 shows the benefit of overrelaxation in sampling from the bivariate Gaussian with ρ = 0.998, in terms of reduced autocorrelations for two functions of the state, x_1 and x_1². Here, a value of α = −0.89 was used, which is close to optimal in terms of speed of convergence for this distribution. (The value of α = −0.98 in Figure 1 was chosen to make the suppression of random walks visually clearer, but it is in fact somewhat too extreme for this value of ρ.)

The asymptotic efficiency of a Markov chain sampling method in estimating the expectation of a function of state is given by its "autocorrelation time" - the sum of the autocorrelations for that function of state at all lags, positive and negative (Hastings 1970). I obtained numerical estimates of the autocorrelation times for the quantities plotted in Figure 2 (using a series of 10,000 points, with truncation at the lags past which the estimated autocorrelations appeared to be approximately zero). These estimates show that the efficiency of estimation of E[x_1] is a factor of about 22 better when using overrelaxation with α = −0.89 than when using Gibbs sampling. For estimation of E[x_1²], the benefit from overrelaxation is a factor of about 16.

In comparison with Gibbs sampling, using overrelaxation will reduce by these factors the variance of an estimate that is based on a run of given length, or alternatively, it will reduce by the same factors the length of run that is required to reduce the variance to some desired level.

Overrelaxation is not always beneficial, however. Some research has been done into when overrelaxation produces an improvement, but the results so far do not provide a complete answer, and in some cases appear to have been mis-interpreted. Work in the physics literature has concentrated on systems of physical interest, and has primarily been concerned with scaling behaviour in the vicinity of a critical point. Two recent papers have addressed the question in the context of more general statistical applications. Barone and Frigessi (1990) look at overrelaxation applied to multivariate Gaussian distributions, finding the rate of convergence in a number of interesting cases.
Plot of x_1 during Gibbs sampling run
Plot of x_1 during overrelaxed run with α = −0.89

Figure 2. Sampling from a bivariate Gaussian with ρ = 0.998 using Gibbs sampling and Adler's overrelaxation method with α = −0.89. The plots show the values of the first coordinate and of its square during 2000 iterations of the samplers (each iteration consisting of one update for each coordinate).
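The autocorrelation-time comparison described above can be approximated with a short computation. The sketch below (an illustration under the same assumptions as the previous one; the run length of 10,000 matches the text, but the truncation rule here is a crude automatic stand-in for the by-eye truncation described above) estimates autocorrelation times for x_1 under Gibbs sampling and under overrelaxation with α = −0.89.

```python
import numpy as np

def autocorr_time(series, max_lag=1000):
    """Estimate 1 + 2 * sum of autocorrelations, truncating at the first negative estimate."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = x.var()
    tau = 1.0
    for lag in range(1, max_lag):
        rho_lag = np.mean(x[:-lag] * x[lag:]) / var
        if rho_lag < 0.0:            # crude truncation once estimates hit noise level
            break
        tau += 2.0 * rho_lag
    return tau

rho, rng = 0.998, np.random.default_rng(2)

def run(alpha, n=10000):
    x, path = np.zeros(2), np.empty(n)
    for t in range(n):
        for i in (0, 1):
            mu = rho * x[1 - i]
            sd = np.sqrt(1 - rho ** 2)
            x[i] = mu + alpha * (x[i] - mu) + sd * np.sqrt(1 - alpha ** 2) * rng.standard_normal()
        path[t] = x[0]
    return path

print("tau for Gibbs sampling:", autocorr_time(run(alpha=0.0)))
print("tau for alpha = -0.89: ", autocorr_time(run(alpha=-0.89)))
```

The ratio of the two estimates gives a rough idea of the factor-of-twenty-odd advantage reported in the text, though with a run of this length the estimates are themselves noisy.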
In interpreting these results, however, one should keep in mind that if a method converges geometrically, at rate ρ, the time required to reach some given level of accuracy is approximately proportional to −1/log(ρ). Hence, for rates near one, the relative advantage of one method over another is better judged by the ratio (1 − ρ₂)/(1 − ρ₁) than by ρ₁/ρ₂, which will always be close to one.

Green and Han (1992) look at overrelaxation methods for distributions with Gaussian conditional distributions, employing proposals of the form of equation (1). They confirm that such updates leave the desired distribution invariant, and they judge performance by the asymptotic variance of estimates at equilibrium. In particular, they note that when proposals move the state approximately along a contour of the joint probability density, the resulting negative correlations can drive the asymptotic variance of an estimate for the expectation of a linear function of state towards zero. This is not always the relevant criterion in practice, however. We generally require accurate estimates for the expectations of several functions of state, and for some of these it is unrealistic to hope that negative autocorrelations will produce an estimation efficiency greater than would be obtained from a sample of independent states. As remarked above, the benefits of overrelaxation for problems of interest need not depend on such locally-antithetic effects; they can come rather from the faster movement about the distribution that results when random walks are suppressed - that is, from the Markov chain moving more rapidly to a nearly independent point. Overrelaxed chains might also be used only during an initial period, to speed convergence to equilibrium, with some other method used during the subsequent generation of the sample. In practice, however, we are usually concerned both with the speed of convergence and with the variance of the estimates obtained once equilibrium is reached.

3. Previous proposals for more general overrelaxation methods

Adler's overrelaxation method can be applied only to distributions for which all of the full conditional distributions are Gaussian. Although such distributions
exist, in both statistical physics and statistical inference, most problems to which Markov chain Monte Carlo methods are currently applied do not satisfy this constraint. A number of proposals have been made for more general overrelaxation methods, which I will review here before presenting the "ordered overrelaxation" method in the next section.

Brown and Woch (1987) make a rather direct proposal: To perform an overrelaxed update of a variable whose conditional distribution is not Gaussian, transform to a new parameterization in which the conditional distribution of this variable is Gaussian, do the update by Adler's method, and then transform back. This may sometimes be an effective strategy, but for many problems the required computations will be costly or infeasible.

A second proposal by Brown and Woch (1987), also made by Creutz (1987), is based on the Metropolis algorithm. To update component i, we first find a point, x̂_i, which is near the centre of the conditional distribution, π(x_i | {x_j}_{j≠i}). We might, for example, choose x̂_i to be an approximation to the mode, though other choices are also valid, as long as they do not depend on the current x_i. We then take x_i' = x̂_i − (x_i − x̂_i) as a candidate for the next state, which, in the usual Metropolis fashion, we accept with probability min[1, π(x_i' | {x_j}_{j≠i}) / π(x_i | {x_j}_{j≠i})]. If x_i' is not accepted, the new state is the same as the old state.

If the conditional distribution is Gaussian, and x̂_i is chosen to be the exact mode, the state proposed with this method will always be accepted, since the Gaussian distribution is symmetrical. The result is then identical to Adler's method with α = −1. Such a method can be combined with other updates to produce an ergodic chain. Alternatively, ergodicity can be ensured by adding some amount of random noise to the proposed states.

Green and Han (1992) propose a somewhat similar, but more general, method. To update component i, they find a Gaussian approximation to the conditional distribution, π(x_i | {x_j}_{j≠i}), that does not depend on the current x_i. They then find a candidate state x_i' by overrelaxing from the current state according to equation (1), using the μ_i and σ_i that characterize this Gaussian approximation, along with some judiciously chosen α. This candidate state is then accepted or rejected using Hastings' (1970) generalization of the Metropolis algorithm, which allows for non-symmetric proposal distributions.

Fodor and Jansen (1994) propose a method that is applicable when the conditional distribution is unimodal, in which the candidate state is the point on the other side of the mode whose probability density is the same as that of the current state. This candidate state is accepted or rejected based on the derivative of the mapping from current state to candidate state. Ergodicity may again be ensured by mixing in other transitions, such as standard Metropolis updates.
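To make the second of these proposals concrete, here is a minimal sketch (assuming NumPy; the choice of conditional density and of the approximate centre are purely illustrative and are not taken from the text) of an overrelaxed Metropolis update that reflects the current value through an approximation to the mode of the conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def reflect_update(x_i, log_cond, x_hat):
    """Overrelaxed Metropolis update of one component, in the style of
    Brown and Woch (1987) and Creutz (1987).

    x_hat must not depend on the current x_i; here it is an approximation to
    the mode of the conditional density, whose log (up to a constant) is log_cond."""
    candidate = x_hat - (x_i - x_hat)                  # reflect through x_hat
    log_accept = log_cond(candidate) - log_cond(x_i)
    if np.log(rng.uniform()) < min(0.0, log_accept):
        return candidate                               # accepted
    return x_i                                         # rejected: state unchanged

# Illustration: a skewed (hence non-Gaussian) conditional, log pi(x) = -x - exp(-x),
# whose mode is at x = 0.
log_cond = lambda x: -x - np.exp(-x)
x = 2.5
for _ in range(5):
    x = reflect_update(x, log_cond, x_hat=0.0)
    print(round(float(x), 3))
```

Because the density is not symmetric about x_hat, some proposals are rejected, which is exactly the behaviour criticized in the next paragraph of the text.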
The generalizations of overrelaxation described above all employ accept-reject decisions in order to ensure detailed balance, and they all suffer from a potentially serious flaw: the rejection rate is determined by characteristics of the conditional distributions, and if it is too high, there is no obvious way of reducing it. Moreover, even a quite small rejection rate may be too high. This point seems not to have been appreciated in the literature, but it should be apparent from the discussion in Section 2. When sampling from a bivariate Gaussian in which the correlation between the two variables is high, it is easy to see that the effect of a rejection during an overrelaxed update of one of the two variables is to reverse the direction of motion along the long axis of the distribution. Effective suppression of random walks therefore requires that the interval between rejections be at least comparable to the time required to move from one end of the distribution to the other, which can be arbitrarily long, depending on the degree of correlation between the variables.

4. Overrelaxation based on order statistics

In this section, I present a new form of overrelaxation based on order statistics, which can be applied (in theory) to any distribution over states with real-valued components, and in which changes are never rejected, thereby preserving the suppression of random walks even in distributions where dependencies are strong.

4.1. THE ORDERED OVERRELAXATION METHOD

As before, we aim to sample from a distribution over x = (x_1, ..., x_N) with density π(x), and we will proceed by updating the components, x_i, repeatedly, in turn, based on their full conditional distributions, whose densities are π(x_i | {x_j}_{j≠i}). In the new method, the old value, x_i, for component i is replaced by a new value, x_i', obtained as follows:

1) Generate K random values, independently, from the conditional distribution π(x_i | {x_j}_{j≠i}).

2) Arrange these K generated values plus the old value, x_i, in non-decreasing order, labeling them as follows:

    x_i^(0) ≤ x_i^(1) ≤ ... ≤ x_i^(r) = x_i ≤ ... ≤ x_i^(K)                (2)

with r being the index of the old value in this ordering. (If several of the values are equal to the old x_i, break the tie randomly.)

3) Let the new value for component i be x_i' = x_i^(K−r).

Here, K is a parameter of the method, which plays a role analogous to that of α in Adler's Gaussian overrelaxation method. When K is one, the method is equivalent to Gibbs sampling; the behaviour as K → ∞ is analogous to Gaussian overrelaxation with α = −1.
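The following sketch is a direct, time-proportional-to-K implementation of steps (1)-(3) for a single component (assuming NumPy; the Gaussian conditional at the end is only an example - the point of the method is that any means of sampling from the conditional will do).

```python
import numpy as np

rng = np.random.default_rng(4)

def ordered_overrelax(x_i, sample_conditional, K=20):
    """One ordered overrelaxation update of a single component.

    sample_conditional(n) must return n independent draws from the current
    full conditional distribution of this component."""
    values = np.sort(np.append(sample_conditional(K), x_i))  # step (2): K draws plus the old value
    r = int(np.searchsorted(values, x_i))                    # index of the old value in the ordering
    # (ties, which have probability zero for continuous conditionals, are not handled here)
    return values[K - r]                                     # step (3): mirrored order statistic

# Example: conditional distribution N(mu, sigma^2).  With K large the new value
# tends to land on the opposite side of mu, much like Adler's method with alpha near -1.
mu, sigma = 1.0, 0.5
draw = lambda n: rng.normal(mu, sigma, size=n)
print(ordered_overrelax(2.0, draw, K=20))
```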
As presented above, each step of this "ordered overrelaxation" method would appear to require computation time proportional to K. As discussed below, the method will provide a practical improvement in sampling efficiency only if an equivalent effect can be obtained using much less time. Strategies for accomplishing this are discussed in Section 5. First, however, I will show that the method is valid - that the update described above leaves the distribution π(x) invariant - and that its behaviour is similar to that of Adler's method for Gaussian distributions.

4.2. VALIDITY OF ORDERED OVERRELAXATION
To show that ordered overrelaxation leaves π(x) invariant, it suffices to show that each update for a component, i, satisfies "detailed balance" - i.e., that the probability density for such an update replacing x_i by x_i' is the same as the probability density for x_i' being replaced by x_i, assuming that the starting state is distributed according to π(x). It is well known that the detailed balance condition (also known as "reversibility") implies invariance of π(x), and that invariance for each component update implies invariance for transitions in which each component is updated in turn. (Note, however, that the resulting sequential update procedure, considered as a whole, need not satisfy detailed balance; indeed, if random walks are to be suppressed as we wish, it must not.)

To see that detailed balance holds, consider the probability density that component i has a given value, x_i, to start, that x_i is in the end replaced by some given different value, x_i', and that along the way, a particular set of K − 1 other values (along with x_i') are generated in step (1) of the update procedure. Assuming there are no tied values, this probability density is

    π(x_i | {x_j}_{j≠i}) · K! · π(x_i' | {x_j}_{j≠i}) · ∏_{t≠r,s} π(x_i^(t) | {x_j}_{j≠i}) · I[s = K − r]                (3)

where r is the index of the old value, x_i, in the ordering found in step (2), and s is the index of the new value, x_i'. The final factor is zero or one, depending on whether the transition in question would actually occur with the particular set of K − 1 other values being considered. The probability density for the reverse transition, from x_i' to x_i, with the same set of K − 1 other values being involved, is readily seen to be identical to the above. Integrating over all possible sets of other values, we conclude that the probability density for a transition from x_i to x_i', involving any set of other values, is the same as the probability density for the reverse transition from x_i' to x_i. Allowing for the possibility of ties yields the same result, after a more detailed accounting.
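The invariance claim is also easy to check empirically. The sketch below (illustration only, assuming NumPy and a fixed Gaussian conditional, so that the stationary distribution of repeated updates of a single component is simply that conditional) applies many ordered overrelaxation updates to one value and compares the long-run sample moments with those of the conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, K = 1.0, 0.5, 11

def update(x):
    values = np.sort(np.append(rng.normal(mu, sigma, size=K), x))
    r = int(np.searchsorted(values, x))
    return values[K - r]

x, draws = rng.normal(mu, sigma), []
for _ in range(50000):
    x = update(x)
    draws.append(x)
draws = np.array(draws)

# If the update leaves the conditional distribution invariant, the long-run
# mean and variance of the chain should match mu and sigma^2.
print("sample mean, variance:", draws.mean().round(3), draws.var().round(3))
print("target mean, variance:", mu, sigma ** 2)
```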
4.3. BEHAVIOUR OF ORDERED OVERRELAXATION

In analysing ordered overrelaxation, it can be helpful to view it from a perspective in which the overrelaxation is done with respect to a uniform distribution. Let F(x) be the cumulative distribution function for the conditional distribution π(x_i | {x_j}_{j≠i}) (here assumed to be continuous), and let F^{-1}(x) be the inverse of F(x). Ordered overrelaxation for x_i is equivalent to the following procedure: First transform the current value to u_i = F(x_i), then perform ordered overrelaxation for u_i, whose distribution is uniform over [0, 1], yielding a new state u_i', and finally transform back to x_i' = F^{-1}(u_i').

Overrelaxation for a uniform distribution, starting from u, may be analysed as follows. When K independent uniform variates are generated in step (1) of the procedure, the number of them that are less than u will be binomially distributed with mean Ku and variance Ku(1 − u). This number is the index, r, of u = u^(r) found in step (2) of the procedure. Conditional on a value for r, which let us suppose is greater than K/2, the distribution of the new state, u' = u^(K−r), will be that of the K − r + 1 order statistic of a sample of size r from a uniform distribution over [0, u]. As is well known (e.g., David 1970, p. 11), the k'th order statistic of a sample of size n from a uniform distribution over [0, 1] has a beta(k, n − k + 1) distribution, with density proportional to u^{k−1}(1 − u)^{n−k}, mean k/(n + 1), and variance k(n − k + 1)/((n + 2)(n + 1)²). Applying this result, u' for a given r > K/2 will have a rescaled beta(K − r + 1, 2r − K) distribution, with mean μ(r) = u(K − r + 1)/(r + 1) and variance σ²(r) = u²(K − r + 1)(2r − K)/((r + 2)(r + 1)²).

When K is large, we can get a rough idea of the behaviour of overrelaxation for a uniform distribution by considering the case where u (and hence likely r/K) is significantly greater than 1/2. Behaviour when u is less than 1/2 will of course be symmetrical, and we expect behaviour to smoothly interpolate between these regimes when u is within about 1/√K of 1/2 (for which r/K might be either greater or less than 1/2). When u ≫ 1/2, we can use the Taylor expansion

    μ(Ku + δ) = (Ku − Ku² + u)/(Ku + 1) − ((Ku + 2u)/(Ku + 1)²) δ + ((Ku + 2u)/(Ku + 1)³) δ² + ...                (4)

to conclude that for large K, the expected value of u', averaging over possible values for r = Ku + δ, with δ having mean zero and variance Ku(1 − u), is approximately

    (Ku − Ku² + u)/(Ku + 1) + ((Ku + 2u)/(Ku + 1)³) Ku(1 − u)  ≈  (1 − u) + 1/K                (5)

For u ≪ 1/2, the bias will of course be opposite, with the expected value of u' being about (1 − u) − 1/K, and for u ≈ 1/2, the expected value of u' will be approximately u.
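This approximation is easy to check by simulation. The sketch below (assuming NumPy; K and u are arbitrary illustrative values) performs ordered overrelaxation for the uniform distribution many times from a fixed starting value u and compares the average result with (1 − u) + 1/K.

```python
import numpy as np

rng = np.random.default_rng(6)
K, u, trials = 100, 0.8, 200000

new_values = np.empty(trials)
for t in range(trials):
    values = np.sort(np.append(rng.uniform(size=K), u))    # K uniforms plus the old value
    r = int(np.searchsorted(values, u))                     # rank of the old value
    new_values[t] = values[K - r]                           # mirrored order statistic

print("simulated E[u']:", new_values.mean().round(4))
print("predicted E[u']:", round((1 - u) + 1 / K, 4))
```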
Figure 3. Points representing 5000 ordered overrelaxation updates. The plot on the left shows ordered overrelaxation for a uniform distribution. The horizontal axis gives the starting point, drawn uniformly from [0, 1]; the vertical axis, the point found by ordered overrelaxation with K = 100 from that starting point. The plot on the right shows ordered overrelaxation for a Gaussian distribution. The points correspond to those on the left, but transformed by the inverse Gaussian cumulative distribution function.
The variance of u' will be, to order 1/K, approximately

    σ²(Ku) + Ku(1 − u)[μ'(Ku)]²  ≈  2(1 − u)/K                (6)
By symmetry, the variance of u' when u ≪ 1/2 will be approximately 2u/K. (Incidentally, the fact that u' has greater variance when u is near 1/2 than when u is near 0 or 1 explains how it is possible for the method to leave the uniform distribution invariant even though u' is biased to be closer to 1/2 than u is.)

The joint distribution for u and u' is illustrated on the left in Figure 3. The right of the figure shows how this translates to the joint distribution for the old and new state when ordered overrelaxation is applied to a Gaussian distribution.

4.4. COMPARISONS WITH ADLER'S METHOD AND GIBBS SAMPLING

For Gaussian overrelaxation by Adler's method, the joint distribution of the old and new state is Gaussian. As seen in Figure 3, this is clearly not the case for ordered overrelaxation. One notable difference is the way the tails of the joint distribution flare out with ordered overrelaxation, a reflection of the fact that if the old state is very far out in the tail, the new state will likely be much closer in. This effect is perhaps an advantage of the
ordered overrelaxation method, as one might therefore expect convergence from a bad starting point to be faster with ordered overrelaxation than with Adler's method. (This is certainly true in the trivial case where the state consists of a single variable; further analysis is needed to establish whether it is true in interesting cases.)

Although there is no exact equivalence between Adler's Gaussian overrelaxation method and ordered overrelaxation, it is of some interest to find a value of K for which ordered overrelaxation applied to a Gaussian distribution corresponds roughly to Adler's method with a given α < 0. Specifically, we can try to equate the mean and variance of the new state, x', that results from an overrelaxed update of an old state, x, when x is one standard deviation away from its mean. Supposing without loss of generality that the mean is zero and the variance is one, we see from equations (5) and (6) that when x = 1, the expected value of x' using ordered overrelaxation is Φ^{-1}(Φ(−1) + 1/K) ≈ −1 + (1/K)/φ(−1) ≈ −1 + 4.13/K, and the variance of x' is 2Φ(−1)/(K φ(−1)²) ≈ 5.42/K, where Φ(x) is the Gaussian cumulative distribution function, and φ(x) the Gaussian density function. Since the corresponding values for Adler's method are a mean of α and a variance of 1 − α², we can get a rough correspondence by setting K ≈ 3.5/(1 + α).

For the example of Figure 2, showing overrelaxation by Adler's method with α = −0.89, applied to a bivariate Gaussian with correlation 0.998, ordered overrelaxation should be roughly equivalent when K = 32. Figure 4 shows visually that this is indeed the case. Numerical estimates of autocorrelation times indicate that ordered overrelaxation with K = 32 is about a factor of 22 more efficient, in terms of the number of iterations required for a given level of accuracy, than is Gibbs sampling, when used to estimate E[x_1]. When used to estimate E[x_1²], ordered overrelaxation is about a factor of 14 more efficient. Measured by numbers of iterations, these efficiency advantages are virtually identical to those reported in Section 2.3 for Adler's method.

Plot of x_1 during ordered overrelaxation run with K = 32

Figure 4. Sampling from a bivariate Gaussian with ρ = 0.998 using ordered overrelaxation with K = 32. Compare with the results using Gibbs sampling and Adler's method shown in Figure 2.

Of course, if it were implemented in the most obvious way, with K random variates being explicitly generated in step (1) of the procedure, ordered overrelaxation with K = 32 would require a factor of about 32 more computation time per iteration than would either Adler's overrelaxation method or Gibbs sampling. Adler's method would clearly be preferred in comparison to such an implementation of ordered overrelaxation. Interestingly, however, even with such a naive implementation, the computational efficiency of ordered overrelaxation is comparable to that of Gibbs sampling - the factor of about 32 slowdown per iteration being nearly cancelled by the factor of about 22 improvement from the elimination of random walks. This near equality of costs holds for smaller values of K as well - the improvement in efficiency (in terms of iterations) using ordered overrelaxation
with K = 16 is about a factor of 12 for E[x_1] and 11 for E[x_1²], and with K = 8, the improvement is about a factor of 8 for E[x_1] and 7 for E[x_1²]. We therefore see that any implementation of ordered overrelaxation whose computational cost is substantially less than that of the naive approach of explicitly generating K random variates will yield a method whose computational efficiency is greater than that of Gibbs sampling, when used with any value for K up to that which is optimal in terms of the number of iterations required for a given level of accuracy. Or rather, we see this for the case of a bivariate Gaussian distribution, and we may hope that it is true for many other distributions of interest as well, including those whose conditional distributions are non-Gaussian, for which Adler's overrelaxation method is not applicable.

5. Strategies for implementing ordered overrelaxation
In this section, I describe several approaches to implementing ordered overrelaxation, which are each applicable to some interesting class of distributions, and are more efficient than the obvious method of explicitly generating K random variates. In some cases, there is a bound on the time required for an overrelaxed update that is independent of K; in others the reduction in time is less dramatic (perhaps only a constant factor). As was seen in Section 4.4, any substantial reduction in time compared to the naive implementation will potentially provide an improvement over Gibbs sampling.

There will of course be some distributions for which none of these implementations is feasible; this will certainly be the case when Gibbs sampling itself is not feasible. Such distributions include, for example, the complex posterior distributions that arise with neural network models (Neal 1995). Hybrid Monte Carlo will likely remain the most efficient sampling method for such problems.

5.1. USING THE CUMULATIVE DISTRIBUTION FUNCTION

The most direct method for implementing ordered overrelaxation in bounded time (independently of K) is to transform the problem to one of performing overrelaxation for a uniform distribution on [0, 1], as was done in the analysis of Section 4.3. This approach requires that we be able to efficiently compute the cumulative distribution function and its inverse for each of the conditional distributions for which overrelaxation is to be done. This requirement is somewhat restrictive, but reasonably fast methods for computing these functions are known for many standard distributions (Kennedy and Gentle 1980).

This implementation of ordered overrelaxation produces exactly the same effect as would a direct implementation of the steps in Section 4.1. As there, we aim to replace the current value, x_i, of component i, by a new value, x_i'. The conditional distribution for component i, π(x_i | {x_j}_{j≠i}), is here assumed to be continuous, with cumulative distribution function F(x), whose inverse is F^{-1}(x). We proceed as follows:

1) Compute u = F(x_i), which will lie in [0, 1].

2) Draw an integer r from the binomial(K, u) distribution. This r has the same distribution as the r in the direct procedure of Section 4.1.

3) If r > K − r, randomly generate v from the beta(K − r + 1, 2r − K) distribution, and let u' = uv.
   If r < K − r, randomly generate v from the beta(r + 1, K − 2r) distribution, and let u' = 1 − (1 − u)v.
   If r = K − r, let u' = u.
   Note that u' is the result of overrelaxing u with respect to the uniform distribution on [0, 1].

4) Let the new value for component i be x_i' = F^{-1}(u').
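A minimal sketch of this procedure (assuming NumPy and SciPy; the Gaussian conditional at the end is only an example of a distribution whose CDF and inverse CDF are cheap to evaluate) is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def ordered_overrelax_cdf(x_i, cdf, inv_cdf, K=20):
    """Ordered overrelaxation via the CDF of the full conditional (steps 1-4 above).

    The cost is bounded independently of K: one binomial draw and at most one beta draw."""
    u = cdf(x_i)                                   # step 1
    r = rng.binomial(K, u)                         # step 2
    if r > K - r:                                  # step 3
        u_new = u * rng.beta(K - r + 1, 2 * r - K)
    elif r < K - r:
        u_new = 1.0 - (1.0 - u) * rng.beta(r + 1, K - 2 * r)
    else:
        u_new = u
    return inv_cdf(u_new)                          # step 4

# Example: conditional distribution N(1, 0.5^2).
cond = stats.norm(loc=1.0, scale=0.5)
print(ordered_overrelax_cdf(2.0, cond.cdf, cond.ppf, K=20))
```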
Step (3) is based on the fact (David 1970, p. 11) that the k'th order statistic in a sample of size n from the uniform distribution on [0, 1] has a beta(k, n − k + 1) distribution. Efficient methods are known for generating binomial and beta random variates in expected time that is bounded independently of the values of their parameters (Devroye 1986, Sections IX.4 and X.4). The computation time for an ordered overrelaxation update implemented in this way is therefore bounded, however large K may be, though the time required to compute the cumulative distribution function and its inverse may exceed the time required for a Gibbs sampling update. When the conditional distribution is Gaussian, this implementation allows ordered overrelaxation to be performed in time comparable to that of Adler's method, and in the same spirit as the suggestion of Brown and Woch (1987), a simple transformation may allow it to be applied to some other distributions as well.

5.2. ECONOMIES OF SCALE

In some situations, an ordered overrelaxation update done as in the direct procedure of Section 4.1, by generating K values from the conditional distribution, will itself take less than K times as long as a Gibbs sampling update, because of "economies of scale" in generating K random variates rather than one. Since these K values are all drawn from the same conditional distribution, whatever "setup" computations this distribution requires need be performed only once per update.

One important way in which this can occur is that the dominant contribution to the cost of a Gibbs sampling update may come from computing the parameters of the conditional distribution - which depend on the current values of the other components, and hence must be re-computed whenever another component changes - rather than from generating a single value from this distribution once its parameters are known. If this is so, generating K values rather than one will add relatively little to the total cost of the update, and the dependence of the computation time on K will be much less than proportional.

Other economies of scale can occur whenever a method for generating random values from the conditional distribution becomes more efficient as a succession of values is generated. For example, the adaptive rejection sampling method of Gilks and Wild (1992), which is widely used to implement Gibbs sampling whenever the conditional density is log-concave, operates by constructing a piecewise approximation to the density that is successively refined whenever a rejection occurs. When this scheme is used to generate the K values required for an ordered overrelaxation update, the approximation is refined as the update proceeds, so that values generated later in the update will
take much less time to generate than earlier values. Further time savings can be obtained by noting that the exact numerical values of most of the K values generated are not needed. All that is required is that the number, r, of these values that are less than the current x_i be somehow determined, and that the single value x_i^(K−r) = x_i' be found. In particular, the adaptive rejection sampling method can be modified in such a way that large groups of values are "generated" only to the extent that they are localized to regions where their exact values can be seen to be irrelevant. The cost of ordered overrelaxation can then be much less than K times the cost of a Gibbs sampling update. This is a somewhat complex procedure, however, which I will not present in detail here.

6. Demonstration: Inference for a hierarchical Bayesian model
In this section, I demonstrate the advantages of ordered overrelaxation over Gibbs sampling when both are applied to Bayesian inference for a simple hierarchical model. In this problem, the conditional distributions are non-Gaussian, so Adler's method cannot be applied. The implementation of ordered overrelaxation used is that based on the cumulative distribution function, described in Section 5.1.

For this demonstration, I used one of the models Gelfand and Smith (1990) use to illustrate Gibbs sampling. The data consist of p counts, s_1, ..., s_p. Conditional on a set of unknown parameters, λ_1, ..., λ_p, these counts are assumed to have independent Poisson distributions, with means of λ_i t_i, where the t_i are known quantities associated with the counts s_i. For example, s_i might be the number of failures of a device that has a failure rate of λ_i and that has been observed for a period of time t_i. At the next level, a common hyperparameter β is introduced. Conditional on a value for β, the λ_i are assumed to be independently generated from a gamma distribution with a known shape parameter, α, and the scale factor β. The hyperparameter β is assumed to have an inverse gamma distribution with a known shape parameter, γ, and a known scale factor, δ. The problem is to sample from the conditional distribution for β and the λ_i given the observed s_1, ..., s_p. The joint density of all unknowns is given by the following proportionality:
    P(β, λ_1, ..., λ_p | s_1, ..., s_p)  ∝  P(β) P(λ_1, ..., λ_p | β) P(s_1, ..., s_p | λ_1, ..., λ_p)                (7)

                                          ∝  β^{−γ−1} e^{−δ/β} · ∏_{i=1}^{p} β^{−α} λ_i^{α−1} e^{−λ_i/β} · ∏_{i=1}^{p} λ_i^{s_i} e^{−λ_i t_i}                (8)
The conditional distribution for β given the other variables is thus inverse gamma:
    P(β | λ_1, ..., λ_p, s_1, ..., s_p)  ∝  β^{−pα−γ−1} e^{−(δ + Σ_i λ_i)/β}                (9)
However, I found it more convenient to work in terms of τ = 1/β, whose conditional density is gamma:

    P(τ | λ_1, ..., λ_p, s_1, ..., s_p)  ∝  τ^{pα+γ−1} e^{−τ(δ + Σ_i λ_i)}                (10)
The conditional distributions for the λ_i are also gamma:

    P(λ_i | {λ_j}_{j≠i}, τ, s_1, ..., s_p)  ∝  λ_i^{s_i+α−1} e^{−λ_i(t_i+τ)}                (11)
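As a rough illustration of how these conditionals are used, here is a minimal sketch (assuming NumPy; the synthetic data and initialization follow the description in the next paragraphs, and this is not the S-Plus code used for the experiments) of the Gibbs sampler for this model. An ordered overrelaxation version would simply replace each gamma draw with the CDF-based update of Section 5.1, using the gamma CDF and its inverse.

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic data as described below: p = 100, alpha = 20, delta = 1, gamma = 0.1, true tau = 5.
p, alpha, delta, gamma_shape = 100, 20.0, 1.0, 0.1
t = np.arange(1, p + 1) / p                                   # t_i = i / p
true_lam = rng.gamma(shape=alpha, scale=1.0 / 5.0, size=p)    # lambda_i ~ Gamma(alpha, beta = 0.2)
s = rng.poisson(true_lam * t)                                 # s_i ~ Poisson(lambda_i * t_i)

def gibbs_iteration(tau):
    """One full iteration: update each lambda_i from (11), then tau from (10)."""
    lam = rng.gamma(shape=s + alpha, scale=1.0 / (t + tau))                    # rate t_i + tau
    tau = rng.gamma(shape=p * alpha + gamma_shape, scale=1.0 / (delta + lam.sum()))
    return lam, tau

lam = s / t                                        # initialization described in the text
tau = alpha / np.mean(lam)
for _ in range(600):
    lam, tau = gibbs_iteration(tau)
print("final tau:", round(float(tau), 3))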
In each full iteration of Gibbs sampling or of ordered overrelaxation, these conditional distributions are used to update first the λ_i and then τ. Gelfand and Smith (1990, Section 4.2) apply this model to a small data set concerning failures in ten pump systems, and find that Gibbs sampling essentially converges within ten iterations. Such rapid convergence does not always occur with this model, however. The λ_i and τ are mutually dependent, to a degree that increases as α and p increase. By adjusting α and p, one can arrange for Gibbs sampling to require arbitrarily many iterations to converge.

For the tests reported here, I set p = 100, α = 20, δ = 1, and γ = 0.1. The true value of τ was set to 5 (i.e., β = 0.2). For each i from 1 to p, t_i was set to i/p, a value for λ_i was randomly generated from the gamma distribution with parameters α and β, and finally a synthetic observation, s_i, was randomly generated from the Poisson distribution with mean λ_i t_i. A single such set of 100 observations was used for all the tests, during which the true values of τ and the λ_i used to generate the data were of course ignored.

Figure 5 shows values of τ sampled from the posterior distribution by successive iterations of Gibbs sampling, and of ordered overrelaxation with K = 5, K = 11, and K = 21. Each of these methods was initialized with the λ_i set to s_i/t_i and τ set to α divided by the average of the initial λ_i. The ordered overrelaxation iterations took about 1.7 times as long as the Gibbs sampling iterations. (Although approximately in line with expectations, this timing figure should not be taken too seriously - since the methods were implemented in S-Plus, the times likely reflect interpretative overhead, rather than intrinsic computational difficulty.)

The figure clearly shows the reduction in autocorrelation for τ that can be achieved by using ordered overrelaxation rather than Gibbs sampling. Numerical estimates of the autocorrelations (with the first 50 points discarded) show that for Gibbs sampling, the autocorrelations do not approach
Plot of τ during Gibbs sampling run
Plot of τ during ordered overrelaxation run with K = 5
Plot of τ during ordered overrelaxation run with K = 11
Plot of τ during ordered overrelaxation run with K = 21

Figure 5. Sampling from the posterior distribution for τ using Gibbs sampling and ordered overrelaxation with K = 5, K = 11, and K = 21. The plots show the progress of τ = 1/β during runs of 600 full iterations (in which the λ_i and τ are each updated once).
zero until around lag 28, whereas for ordered overrelaxation with K = 5, the autocorrelation is near zero by lag 11, and for K = 11, by lag 4. For ordered overrelaxation with K = 21, substantial negative autocorrelations are seen, which would increase the efficiency of estimation for the expected value of τ itself, but could be disadvantageous when estimating the expectations of other functions of state. The value K = 11 seems close to optimal in terms of speed of convergence.
7. Discussion

The results in this paper show that ordered overrelaxation should be able to speed the convergence of Markov chain Monte Carlo in a wide range of circumstances. Unlike the original overrelaxation method of Adler (1981), it is applicable when the conditional distributions are not Gaussian, and it avoids the rejections that can undermine the performance of other generalized overrelaxation methods. Compared to the alternative of suppressing random walks using hybrid Monte Carlo (Duane et al. 1987), overrelaxation has the advantage that it does not require the setting of a stepsize parameter, making it potentially easier to apply on a routine basis.

An implementation of ordered overrelaxation based on the cumulative distribution function was described in Section 5.1, and used for the demonstration in Section 6. This implementation can be used for many problems, but it is not as widely applicable as Gibbs sampling. Natural economies of scale will allow ordered overrelaxation to provide at least some benefit in many other contexts, without any special effort. By modifying adaptive rejection sampling (Gilks and Wild 1992) to rapidly perform ordered overrelaxation, I believe that quite a wide range of problems will be able to benefit from ordered overrelaxation, which should often provide an order of magnitude or more speedup, with little effort on the part of the user.

To use overrelaxation, it is necessary for the user to set a time-constant parameter - α for Adler's method, K for ordered overrelaxation - which, roughly speaking, controls the number of iterations for which random walks are suppressed. Ideally, this parameter should be set so that random walks are suppressed over the time scale required for the whole distribution to be traversed, but no longer. Short trial runs could be used to select a value for this parameter; finding a precisely optimal value is not crucial. In favourable cases, an efficient implementation of ordered overrelaxation used with any value of K less than the optimal value will produce an advantage over Gibbs sampling of about a factor of K. Using a value of K that is greater than the optimum will still produce an advantage over Gibbs sampling, up to around the point where K is the square of the optimal value. For routine use, a policy of simply setting K to around 20 may be
reasonable. For problems with a high degree of dependency, this may give around an order of magnitude improvement in performance over Gibbs sampling, with no effort by the user. For problems with little dependency between variables, for which this value of K is too large, the result could be a slowdown compared with Gibbs sampling, but such problems are sufficiently easy anyway that this may cause little inconvenience. Of course, when convergence is very slow, or when many similar problems are to be solved, it will be well worthwhile to search for the optimal value of K.

There are problems for which overrelaxation (of whatever sort) is not advantageous, as can happen when variables are negatively correlated. Further research is needed to clarify when this occurs, and to determine how these situations are best handled. It can in fact be beneficial to underrelax in such a situation - e.g., to use Adler's method with α > 0 in equation (1). It is natural to ask whether there is an "ordered underrelaxation" method that could be used when the conditional distributions are non-Gaussian. I believe that there is. In the ordered overrelaxation method of Section 4.1, step (3) could be modified to randomly set x_i' to either x_i^(r+1) or x_i^(r−1) (with the change being rejected if the chosen r ± 1 is out of range). This is a valid update (satisfying detailed balance), and should produce effects similar to those of Adler's method with α > 0.

Acknowledgements. I thank David MacKay for comments on the manuscript. This work was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Adler, S. L. (1981) "Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions", Physical Review D, vol. 23, pp. 2901-2904.
Barone, P. and Frigessi, A. (1990) "Improving stochastic relaxation for Gaussian random fields", Probability in the Engineering and Informational Sciences, vol. 4, pp. 369-389.
Brown, F. R. and Woch, T. J. (1987) "Overrelaxed heat-bath and Metropolis algorithms for accelerating pure gauge Monte Carlo calculations", Physical Review Letters, vol. 58, pp. 2394-2396.
Creutz, M. (1987) "Overrelaxation and Monte Carlo simulation", Physical Review D, vol. 36, pp. 515-519.
David, H. A. (1970) Order Statistics, New York: John Wiley & Sons.
Devroye, L. (1986) Non-uniform Random Variate Generation, New York: Springer-Verlag.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987) "Hybrid Monte Carlo", Physics Letters B, vol. 195, pp. 216-222.
Fodor, Z. and Jansen, K. (1994) "Overrelaxation algorithm for coupled Gauge-Higgs systems", Physics Letters B, vol. 331, pp. 119-123.
Gilks, W. R. and Wild, P. (1992) "Adaptive rejection sampling for Gibbs sampling", Applied Statistics, vol. 41, pp. 337-348.
Gelfand, A. E. and Smith, A. F. M. (1990) "Sampling-based approaches to calculating marginal densities", Journal of the American Statistical Association, vol. 85, pp. 398-409.
Green, P. J. and Han, X. (1992) "Metropolis methods, Gaussian proposals and antithetic variables", in P. Barone, et al. (editors) Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics, Berlin: Springer-Verlag.
Hastings, W. K. (1970) "Monte Carlo sampling methods using Markov chains and their applications", Biometrika, vol. 57, pp. 97-109.
Kennedy, W. J. and Gentle, J. E. (1980) Statistical Computing, New York: Marcel Dekker.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) "Equation of state calculations by fast computing machines", Journal of Chemical Physics, vol. 21, pp. 1087-1092.
Neal, R. M. (1993) "Probabilistic inference using Markov chain Monte Carlo methods", Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. Obtainable in compressed Postscript by anonymous ftp to ftp.cs.toronto.edu, directory pub/radford, file review.ps.Z.
Neal, R. M. (1995) Bayesian Learning for Neural Networks, Ph.D. thesis, Dept. of Computer Science, University of Toronto. Obtainable in compressed Postscript by anonymous ftp to ftp.cs.toronto.edu, directory pub/radford, file thesis.ps.Z.
Smith, A. F. M. and Roberts, G. O. (1993) "Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods", Journal of the Royal Statistical Society B, vol. 55, pp. 3-23. (See also the other papers and discussion in the same issue.)
Toussaint, D. (1989) "Introduction to algorithms for Monte Carlo simulations and their application to QCD", Computer Physics Communications, vol. 56, pp. 69-92.
Whitmer, C. (1984) "Over-relaxation methods for Monte Carlo simulations of quadratic and multiquadratic actions", Physical Review D, vol. 29, pp. 306-311.
Wolff, U. (1992) "Dynamics of hybrid overrelaxation in the gaussian model", Physics Letters B, vol. 288, pp. 166-170.
Young, D. M. (1971) Iterative Solution of Large Linear Systems, New York: Academic Press.
CHAIN GRAPHS AND SYMMETRIC ASSOCIATIONS
THOMAS
S . RICHARDSON
Statistics Department University of Washington tsr @stat . washington . edu
Abstract . Graphical models based on chain graphs , which admit both directed and undirected edges, were introduced by by Lauritzen , Wermuth and Frydenberg as a generalization of graphical models based on undirected graphs , and acyclic directed graphs . More recently Andersson , Madigan and Perlman have given an alternative Markov property for chain graphs . This raises two questions : How are the two types of chain graphs to be interpreted ? In which situations should chain graph models be used and with which Markov property ? The undirected edges in a chain graph are often said to represent 'symmetric ' relations . Several different symmetric structures are considered , and it is shown that although each leads to a different set of conditional indepen dences, none of those considered corresponds to either of the chain graph Mar kov properties . The Markov properties of undirected graphs , and directed graphs , in cluding latent variables and selection variables , are compared to those that have been proposed for chain graphs . It is shown that there are qualita tive differences between these Markov properties . As a corollary , it is proved that there are chain graphs which do not correspond to any cyclic or acyclic directed graph , even with latent or selection variables .
1. Introduction The use of acyclic directed graphs (often called 'DAG 's) to simultaneously represent causal hypotheses and to encode independence and conditional in dependence constraints associated with those hypotheses has proved fruit ful in the construction of expert systems, in the development of efficient updating algorithms (Pearl [22]; Lauritzen and Spiegelhalter [19]) , and in 231
232
THOMASS. RICHARDSON
inferring causal structure (Pearl and Verma [25]; Cooper and Herskovits [5]; Spirtes, Glymour and Scheines[31]). Likewise , graphical models based on undirected graphs , also known as Mar kov random fields , have been used in spatial statistics to analyze data from field trials , image processing , and a host of other applications (Ham -
mersley and Clifford [13]; Besag [4]; Speed [29]; Darroch et ala [8]). More recently , chain graphs , which admit both directed and undirected edges have been proposed as a natural generalization of both undirected graphs
and acyclic directed graphs (Lauritzen and Wermuth [20]; Frydenberg [11]). Since acyclic directed graphs and undirected graphs can both be regarded as special cases of chain graphs it is undeniable that chain graphs are a generalization in this sense. The introduction of chain graphs has been justified on the grounds that
this admits the modelling of 'simultaneous responses' (Frydenberg [11]), 'symmetric associations' (Lauritzen and Wermuth [20]) or simply 'associative relations ' , as distinct from causal relations (Andersson , Madigan and
Perlman [1)). The existence of two different Markov properties for chain graphs raises the question of what sort of symmetric
relation
is represented
by a chain graph under a given Markov property , since the two properties are clearly
different . A second related
question
concerns
whether
or not
there are modelling applications for which chain graphs are particularly well suited , and if there are , which Markov
property
is most appropriate
.
One possible approach to clarifying this issue is to begin by considering causal systems , or data generating processes, which have a symmetric structure . Three simple , though distinct , ways in which two variables , X
and Y , could be related symmetrically are: (a) there is an unmeasured, ' confounding
' , or ' latent ' variable
that
is a common
cause of both X and
Y ; (b ) X and Yare both causes of some 'selection ' variable (conditioned on in the sample ) ; (c) there is feedback between X and Y , so that X is a cause of Y , and Y is a cause of X . In fact situations (a) and (b ) can easily be represented by DAGs through appropriate extensions of the formalism
(Spirtes, Glymour and Scheines[31]; Cox and Wermuth [7]; Spirtes, Meek and Richardson [32)). In addition , certain kinds of linear feedbackcan also be modelled with directed cyclic graphs (Spirtes [30]; Koster [16); Richardson [26, 27, 28]; Pearl and Dechter [24]). Each of these situations leads to a different set of conditional independences . However , perhaps surprisingly , none of these situations , nor any combination of them , lead in general to either of the Markov properties associated with chain graphs . The remainder of the paper is organized as follows : Section 2 contains definitions of the various graphs considered and their associated Markov properties . Section 3 considers two simple chain graphs , under both the ori ginal Markov property proposed by Lauritzen , Wermuth and Frydenberg ,
CHAIN GRAPHS AND SYMMETRIC ASSOCIATIONS
233
and the alternative given by Andersson , Madigan and Perlman . These are compared to the corresponding directed graphs obtained by replacing the undirected edges with directed edges in accordance with situations (a) , (b) and (c) above. Section 4 generalizes the results of the previous section : two properties are presented , motivated by causal and spatial intuitions , that the set of conditional independences entailed by a graphical model might satisfy . It is shown that the sets of independences entailed by (i ) an undirected graph via separation , and (ii ) a (cyclic or acyclic ) directed graph (possibly with latent and ! or selection variables ) via d-separation , satisfy both pro perties . By contrast neither of these properties , in general , will hold in a chain graph under the Lauritzen - Wermuth -Frydenberg (LWF ) interpreta tion . One property holds for chain graphs under the Andersson -Madigan Perlman (AMP ) interpretation , the other does not . Section 5 contains a discussion of data -generating processes associated with different graphical models , together with a brief sketch of the causal intervention theory that has been developed for directed graphs . Section 6 is the conclusion , while proofs not contained in the main text are given in Section 7.
2. Graphs and Probability Distributions Thissectionintroduces thevariouskindsof graphconsidered in this paper, togetherwith their associated Markovproperties . 2.1. UNDIRECTED ANDDIRECTED GRAPHS An undirected graph , UG , is an ordered pair (V , U ) , where V is a set of vertices and U is a set of undirected edges X - Y between vertices .l Similarly , a directed graph , DG , is an ordered pair (V , D ) where D is a set of directed edges X - + Y between vertices in V . A directed cycle consists of a sequence of n distinct edges Xl - + X2 - + . . . - + Xn - + Xl (n ~ 2) . If a directed graph , DG , contains no directed cycles it is said to be acyclic , otherwise it is cyclic . An edge X - + Y is said to be out of X and into Y ; X and Yare the endpoints of the edge. Note that if cycles are permitted there may be more than one edge between a given pair of vertices e.g~ X t - Y t - X . Figure 1 gives examples of undirected and directed graphs . 2.2. DIRECTED GRAPHS WITH LATENT VARIABLES AND SELECTION
VARIABLES
Cox and Wermuth [7] and Spirtes et al. [32] introduce directed graphs in which V is partitioned into three disjoint sets 0 (Observed), S (Selection) lBold face (X ) denote sets; italics (X ) denote individual vertices; greek letters (7r) denote paths.
234
THOMASS. RICHARDSON UG1
A -
A
C
I B - - D
UG2
B-
(a)
I
t - D
DGI
-t-C A ~C A A DG : \ J --DB-C~-o 1-~ B DG2
(b)
(c)
Figure1. (a) undirected graphs ; (b) a cyclicdirectedgraph; (c) acyclicdirectedgraphs
and L (Latent ), written DG (O, S, L ) (where DG may be cyclic). The interpretation of this definition is that DG representsa causalor data-generating mechanism; 0 representsthe subset of the variables that are observed; S represents a set of selection variables which, due to the nature of the mechanism selecting the sample, are conditioned on in the subpopulation from which the sample is drawn; the variables in L are not observedand for this reason are called latent.2 Example.. Randomized Trial of an Ineffective Drug with Unpleasant Side-Effects3 A simple causal mechanism containing latent and selection variables is given in Figure 2. The graph represents a randomized trial of an ineffective drug with unpleasant side-effects. Patients are randomly assigned to the treat ment or control group (A ). Those in the treatment group suffer unpleasant side-effects , the severity of which is influenced by the patient 's general level of health (H ) , with sicker patients suffering worse side-effects. Those patients who suffer sufficiently severe side-effects are likely to drop out of the study . The selection variable (Sel ) records whether or not a patient remains in the study , thus for all those remaining in the study S el = Stay In . Since unhealthy patients who are taking the drug are more likely to drop out , those patients in the treatment group who remain in the study tend to be healthier than those in the control group . Finally health status (H ) influences how rapidly the patient recovers. This example is of interest because, as should be intuitively clear , a simple comparison of the recovery time of the patients still in the treatment and control groups at the end of the study will indicate faster recovery among those in the treatment group . This comparison falsely indicates that the drug has a beneficial effect , whereas in fact , this difference is due entirely to the side-effects causing the sicker patients in the treatment group to drop out of the study .4 (The only difference between the two graphs in Figure 2 is that in DG1 (01 , 81 , L :J.) 2Note that the terms variable and vertex are used interchangeably. 31 am indebted to Chris Meek for this example. 4For precisely these reasons, in real drug trials investigators often go to great lengths to find out why patients dropped out of the study.
235 R
I
AssignmentA
H
(Treatment /COntrol )" " SideEffects
...............
Selection (StayIn/ Drop
DGl(Ot,SiL.)
01={A,Ef,H,R} 81={Sel} L1=0
02, ={A.Ef,R} 82 , ={Sel} L2,={H}
DG2 (02,sz,Lz)
Figure 2. Randomizedtrial of an ineffectivedrug with unpleasantside effectsleading to drop out. In DG1(Ol , Sl ,Ll ), H E 01 , and is observed , while in DG2(O2, S2, L2) H E L2 and is unobserved(variablesin L are circled; variablesin S are boxed; variables in 0 are not marked) .
A .1/.-C A ..-C IB I -..D B
A - . . c
CG1
A
B- J - D
B - - D
(a)
Figure 3.
CG2
I
(b)
(a) mixed graphs containing partially directed cycles; (b) chain graphs.
health status (H ) is observed so H E 01 , while in DG2 (O2 , 82 , L2 ) it is not observedso H E L2 .) 2 .3 .
MIXED
GRAPHS
AND
CHAIN
GRAPHS
In a mixed graph a pair of vertices may be connected by a directed edge or an undirected edge (but not both ) . A partially directed cycle in a mixed graph G is a sequence of n distinct edges (El " ' " En ) , (n ~ 3) , with endpoints Xi , Xi +l respectively , such that : (a) Xl == Xn + l ,
(b) ~i (1 ~ i ~ n) either Xi - Xi +l or Xi - * Xi +l , and (c) ~j (1 ~ j ::; n) such that Xj - tXj +l . A chain graph C G is a mixed graph in which there are no partially direc -
ted cycles (see Figure 3). Koster [16] considersclassesof reciprocal graphs containing directed and undirected edges in which partially directed cycles are allowed . Such graphs are not considered separately here, though many of the comments which apply to LWF chain graphs also apply to reciprocal graphs since the former are a subclass of the latter . To make clear which kind of graph is being referred to UG will denote undirected graphs , DG directed graphs , CG chain graphs , and G a graph
236
THOMASS. RICHARDSON
which
may
be
whatever
exists
a
has
)
Xi
Y
Xi
the
is
. 4
l
Xi
(
is
path
of
THE
E
X
,
,
to
X
no
( X
1
~
i
~
n
.
.
.
-
,
Xl
i . e
+
is
Y
,
,
.
.
.
.
Ei
occurs
it
+
)
and
( El
: =
vertex
.
.
.
,
,
is
in
En
Xn
-
graph
G
such
l
that
: =
Xi
than
A
a
)
+
Xi
more
cyclic
Y
.
Y
+
)
l
,
once
directed
( of
there
where
Xi
-
on
Ei
+
Xi
+
the
path
l
,
path
from
X
to
.
ASSOCIATED
a
UG2
WITH
I
,
and
{
'
than
X
is
I
C
AlLB
I
C
AlLB
I
C
most
terms
a of
the
I
XJL
Y
Z
instead
Y
{
of
;
;
(
. lL
Z
B
{
V
D
{
is
I
( a
)
sets
from
a
-
of
variable
variable
separation
e
}
a
~
(
Z
tail
in
UG
the
;
I
}
{
C
,
in
Fu
Z
,
)
. 7
following
conditio
-
C
'
f
{
A
,
D
,
C
}
;
;
AlLD
I
{
B
}
;
are
independence
listed
.
relations
For
instance
,
note
.
rather
than
define
a
vertices
.
vertices
vertices
unique
path
( Note
that
in
;
in
,
since
a
a
directed
cyclic
there
chain
may
graph
be there
. )
introduced .
BlLC
.
}
,
;
}
I
)
a
D
D
AlLD
empty
{
of
of
BlLC
' elementary
edges
pair
;
I
general
as
Here
the
a
means
for
global
deriving
property
the
is
conse
defined
-
directly
.
independent
of braces
Z
,
I
criterion
Y
disjoint
path
some
by
}
D
A
}
of
a
,
conditions
' X
JL
C
be
,
are
convenient
}
for
re
.
AlLE
only
pair
Markov
{
I
in
given
,
;
may
{
not
graphical
of
D
sequence
each
When
,
no
include
Z
l
B
D
and
}
that .
.
all
Z
{
I
conditions
means
used
I
B1l
a
relevant
'
not
Figure
AlLB
does
local
is
separated
I
I
between
UG
there
;
Yare
AlLB
between
if
by
AlLD
a
independence
:
A
as
,
does
in
;
C
}
Markov
Z is
{
here
set
Y
I
vertices
edge
7 ' XJL
I
. lL
edge
one
of
D
conditional
Property
and
,
of
global
quences
.
defined
one
that
separation
entails
sequence
6aften
X
AlLD
}
also
5 ' Path
more
if
throughout
form
UGI
,
)
separated
graphs
B1l
Here
Z
of
graph
empty
Markov
undirected
Fu
Y
be
set
undirected
be
E
to
a
an
may
Y
Y
Fu
a
In
Z
via
UG1
graph
. 6
(
said
XJl
the
the
,
variable
Yare
Fu
that
associates
Z
independences
Y
(
If
-
Global
Thus
VJL
l
X
edges
PROPERTY
G
and
and
UG
tion
between
of
otherwise
X
graph
Y
Undirected
in
,
property
a
X
at
. 5
MARKOV
with
then
is
)
form
Markov
lations
of
path
GRAPHS
vertices
nal
+
n
acyclic
GLOBAL
global
X
~
A
sequence
vertices
Xi
i
the
UNDIRECTED
A
~
.
a
distinct
and
1
path
a
.
+
these
of
of
,
~
then
of
consists
sequence
endpoints
or
2
anyone
type
are
Y omitted
given
Z from
' ;
if
Z singleton
=
0
,
the sets
abbrevia {
V
}
,
e
. g
.
2 .5 .
THE
CHAIN
GRAPHS
GLOBAL
MARKOV
DIRECTED
AND
SYMMETRIC
PROPERTY
237
ASSOCIATIONS
ASSOCIATED
WITH
GRAPHS
In a directed graph DG , X is a parent of Y , (and Y is a child of X ) if there
is a directed edgeX -)- Y in G. X is an ancestorof Y (and Y is a descendant of X ) if there is a directed path X -)- . . . -)-Y from X to Y , or X := Y . Thus
'ancestor' ('descendant') is the transitive , reflexive closure of the 'parent' ('child ') relation . A pair of consecutive edges on a path 7r in DG are said to collide at vertex A if both edges are into A , i .e. - +A ~ , in this case A is called
a collider
on 7r, otherwise
A is a non - collider
on 7r. Thus
every
vertex on a path in a directed graph is either a collider , a non-collider , or
an endpoint. For distinct vertices X and Y , and set Z <; V \ { X , Y } , a path 7r between X and Y is said to d-connect X and Y given Z if every collider on
7r is an
disjoint
ancestor
of a vertex
in
Z , and
no
non - collider
on
7r is in
sets X , Y , Z , if there is an X E X , and Y E Y , such that
Z . For
there
is a path which d-connects X and Y given Z then X and Yare said to be d-connected given Z . If no such path exists then X and Yare said to be
d-separatedgiven Z (seePearl [22]). Directed
Global
Markov
Property
; d - separation
(I =DS)
DG I=DSXli Y I Z if X and Yare d-separatedby Z in DG . Thus the directed graphs in Figure 1(b,c) entail the following conditional independences
via d- separation :
DG1
I=DS BJlC I { A , D } ;
DG2
I=DS BJlC I A ; AlLD I { B , C} ;
DG3
I=DS AJlB I C; AlLB I { a , D } ; AlLD I C; AJlD I { B , a } ; BJlD I C; BlLD I { A , C } .
Note that the conditional independences entailed by DG3 under d-separation 2 .6 .
THE
DIRECTED
are precisely GLOBAL GRAPHS
those entailed MARKOV WITH
by U G 2 under separation .
PROPERTY LATENT
AND
ASSOCIATED SELECTION
WITH VARIABLES
The global Markov property for a directed graph with latent and/ or selection variables is a natural extension of the global Markov property for
directed graphs. For DG (O , S, L ), and XUY (JZ ~ 0 define:
DG(O, S, L) FDSXlL Y I Z if andonly if DG FDSXJl Y I Z U S.
238
THOMASS. RICHARDSON
In
by
other
DG
led
words
(
by
O
,
S
the
,
L
)
is
,
are
under
in
0
in
patients
observed
in
S
will
S
are
See
Spirtes
and
the
DGI
.
)
= =
,
DS
the
health
[
,
81
,
(
. 7
.
O2
,
Ll
AJlR
I
DG2
is
]
,
L2
)
GLOBAL
)
,
the
)
{
H
,
Spirtes
,
S
,
}
,
;
unobserved
,
,
by
)
Meek
(
Thus
. 2
,
the
,
81
a
all
the
of
(
O
conditio
I
. .
llR
)
-
S
Richardson
L1
only
variables
set
P
,
not
relation
entails
and
O1
. . llR
I
AJlR
82
L
.
=
[ 32
,
shown
I
{
in
In
]
;
)
.
Cox
Figure
2
,
:
A
O2
.
are
independence
O
,
L
which
2
distribution
DG1
Sel
(
Section
In
conditional
FDS
in
in
Stay
S
variables
variables
=
the
in
observed
in
-
and
variables
only
example
(
,
subpopulation
observed
;
selection
the
a
DG
the
33
the
L
Sel
graph
entailed
82
all
,
independences
graph
status
in
every
in
the
.
Hence
hold
independences
DG2
.
Thus
OI
S
entai
occur
involving
which
conditional
(
I
However
2
]
. g
in
sample
following
DG1
tioned
7
,
variables
from
e
for
Richardson
[
the
since
those
,
upon
in
O
drawn
on
conditioned
hold
Wermuth
since
are
were
)
(
relations
latent
implicitly
DG
entailed
independence
no
relations
samples
which
entails
(
relations
those
which
of
conditioned
independences
and
of
in
independence
to
nal
,
includes
,
be
observed
independence
subset
DG
interpretation
Similarly
variables
conditional
the
conditional
.
of
exactly
always
the
,
observed
set
graph
set
Since
the
directed
conditioning
(
,
I
L2
)
H
{
;
H
O2
graph
A
,
does
tt
the
H
Ef
,
not
,
Sel
}
(
Ef
any
neither
DG1
,
,
,
independences
of
OI
}
.
entail
so
H
81
,
the
,
L1
above
)
is
men
-
entailed
by
.
MARKOV
PROPERTIES
ASSOCIATED
WITH
CHAIN
GRAPHS
There
are
for
2
.
7
A
.
1
The
path
V
I
is
in
from
(
V
a
if
any
graph
are
that
}
both
of
CG
from
in
Markov
be
it
see
V
\
to
Studeny
and
original
Bouckaert
set
W
on
induced
all
have
the
and
W
if
(
1r
,
there
X
-
with
Ant
re
-
formulated
chain
graph
[ 36
]
,
Andersson
on
W
)
of
CG
=
endpoint
in
,
)
(
an
is
tY
subgraph
edges
been
is
property
edges
the
Wand
a
directed
X
denote
properties
applied
(
)
proposed
Markov
to
all
between
W
been
relation
graph
anterior
which
is
(
vertices
may
be
in
Y
Let
these
that
derived
.
to
W
have
independence
chain
said
E
which
conditional
Frydenberg
is
such
all
,
-
W
W
removing
criteria
a
Wermuth
some
to
recently
properties
definitions
graph
to
)
Markov
both
chain
anterior
separation
undirected
In
-
V
by
8More
global
.
Lauritzen
V
7r
obtained
a
.
path
the
different
graphs
vertex
a
{
two
chain
terms
rather
of
than
et
an
ale
[ 2
]
)
.
CHAINGRAPHSAND SYMMETRICASSOCIATIONS
II B - -- D
AI XI B .:_-- : n
(a)
(b)
I
B -
C -
/ 1""
D
B -
C -
(c)
239
D
(d)
Figure 4. Contrast between the LWF and AMP Markov properties. Undirected graphs used to test AJLD I { B , C } in CGl under (a) the LWF property , (b) the AMP property . Undirected graphs used to test BJLD I C in CG2 under (c) the LWF property , (d) the AMP property .
in V \ W . A complex in CG is an induced subgraph with the following form: X - + Vl - . . . - Vn ~ y (n ~ 1). A complex is moralized by adding the undirected edge X - Yo Moral (CG) is the undirected graph formed by moralizing all complexesin CG, and then replacing all directed edgeswith undirected edges. LWF
Global
CG
Markov
FLWF
XlL
graph
Moral
Hence
the
Y
I
( CG
Z
if
( Ant
chain
independences
Property
AJLB
CG2
FLWF
AJLB
;
that
I
,
2 . 7 .2 .
The
In
a
and
chain
has
an
between
all X
and
the
( W
a
) CG
set
on
the
W
on
}
{ a
I
{ A
same
entail
;
BJlC
, D
}
, a
the
I
;
}
and
( W if
path ' !r . g
there are
( See
A is
to
path ( X 5 .)
conditional
;
;
AlLD
I
by
graph be
{ B
CG2
, a
}
;
under
DG3
the
under
Now
Markov
connected
and
d - sep
-
,
Con
subgraph edges V
in
' !r
from
- t
Y
a
in
and
there ( W ,
chain V
)
property if
W
directed
directed Figure
undirected
1 ) .
vertex a
C
}
by
extended all
) ) .
, D
entailed
V
The
contains
( Con W
said
.
the
.
between }
{ A
I
chain
Ware
W
in
following
AJLD
Figure
- Perlman
E
Z
)
:
those
( see
and
by
entailed
as
edges W
in of
edges
I
BJLD
V
some Con
, a
separation
vertices
edges ancestor
;
)
property
AiLB
undirected
set
3 ( b
{ B
- Madigan
to
vertex
which
under
only
undirected be
G2
graph
connected
;
Y
( PLWF
) ) ) .
independences
are
Andersson
containing is
U
C
I
Graphs
from
Z
Markov
conditional
property
aration
C
I
the
Markov
U
Figure
AJLD
Chain
separated
Y
LWF
BJLD
Notice
U
in
the
! = LWF
is
( X
graphs
under
GCl
LWF
X
for
is )
=
Ext
CG
( W
some are
such
path I
( CG
graph to
a { V
V
, W
) ,
) ,
and
all
is
said
to
W
E that
W
in Y
is
let
9Note that other authors, e.g. Lauritzen [17], have used 'ancestral' to refer to the set named ' anterior ) in Section 3.
240
THOMASS. RICHARDSON
A ,B-D -e",oE -. -X
A
A
B -... E -....X
B
I
x
- x
I C- eF
(b)
(a)
(e)
(c)
Figure 5. Constructing an augmented and extended chain graph: (a) a chain graph CG ; (h) directed edgesin Anc ({ A , E , X } ); (c) undirected edgesin Con(Anc ({ A , E , X } )) (d) Ext (CG , Anc ({ A , E , X } )); (e) Aug (Ext (CG , Anc ({ A , E , X } ))). X
Z
J/
X
i/
Z
X
!/
Z
X-
Z
J/
(a)
(b)
X --.. .A
X-
y- !
t~ ~
(c)
A
(d)
Figure 6. (a) Triplexes (X , Y, Z ) and (b) the corresponding augmented triplex . (c) A chain graph with a hi-flag (X , A , B , Y ) and two triplexes (A , B , Y ), (X , A , B ) ; (d) the corresponding
augmented chain graph .
Anc (W ) = { V I V is an ancestor of some W E W } . A triple of vertices (X , Y, Z ) is said to form a triplex in CG if the
induced subgraph CG ({ X , Y, Z } ) is either X - + Y - Z , X - + y ~ Z , or X - Y .(- Z . A triplex is augmented by adding the X - Z edge. A set of four vertices (X , A , B , Y ) is said to form a bi-flag if the edges X - + A , Y - + B ,
and A - B are present in the induced subgraph over { X , A , B , Y } . A bi-flag is augmented by adding the edge X - Yo Aug (CG ) is the undirected graph formed by augmenting all triplexes and bi -flags in CG and replacing all
directed edgeswith undirected edges(seeFigure 6). Now let Aug[CG ; X , Y , Z] = Aug(Ext (CG , Anc(X U Y U Z))). AMP
Global Markov Property
(FAMP)
GG FA MP XJl Y I Z if X is separated from Y by Z in the undirected graph Aug [GG; X , Y , Z]. Hence the conditional
independence
relations
associated with the chain
graphs in Figure 3(b) under the AMP global Markov property are: GGl
FAMP AJlB ; AlLB I C; AJlB I D ; AlLD ; AlLD I B ; BlLC ; BJlC I A ;
~
18
CG2 FAMP AlLB ; AlLB I D ; AlLD ; AlLD I B ; BlLD , { A , a } .
241
:C ASSOCIATIONS CHAINGRAPHS ANDS"YrMMETRJ A- - C I B- - D
A ')[] B--c -eIJ
A - - C ' (f) B - eIJ " """-'
DG1u (Ola,Sla,Lla)
LWF
the
and
special
graph
,
,
directed
. 8
.
X
- -
graphs
said
to
XlL
Y
,
G1
Y
global
XlL
I
Z
,
and
said
.
I
weakly
is
R
is
P
in
said
be
G
R
be
Markov
for
]
;
all
I
3 . Directed
in
.
strongly
FR
X
Y
I
here
[ 12
] ;
and
Graphs
if
Z
if
are
with
Y
[ 36
vertex
and
in
]
dis
-
dashed
]
to
[ 11
] ;
;
Andersson
Symmetric
such
Spirtes
be
V
,
,
a
Y
G
Z
Z
Y
Z
strongly
in
( and
] ;
et
al
G
FR
to
RX1L
Y
be
I
Z
property
[ 2
P
[ 21
] )
.
hence
Meek
.
-
distribution
I
[ 30
,
said
The
- MarkovianR
XlL
sep
given
distribution
~
.
-
a
and
G
I
FR2
d
For
is
Y
G2
under
.
that
XR
a
if
if
DG3
X
,
are
only
property
Z
is
respectively
set
which
only
known
R2
and
,
Markov
there
and
Frydenberg
Bouckaert
,
,
if
subsets
P
complete
XlL
with
global
sets
Z
equivalent
disjoint
A
I
Markov
G
for
P
disjoint
properties
Studeny
Z
using
.
property
all
graph
distribution
( Geiger
[ 31
Y
Y
Markov
are
and
both
[ 7
by
R1
XlL
LWF
if
- MarkovianR
G
,
Wermuth
property
)
of
and
for
directed
generalization
Cox
properties
FRI
separation
XJl
if
G
to
complete
the
)
,
COMPLETENESS
G1
- MarkovianR
implies
which
global
under
under
a
properties
AMP
Markov
if
property
Z
a
global
CG2
UG2
the
AND
under
.
- separation
( acyclic
are
Markov
under
( d
undirected
graphs
AMP
equivalent
complete
there
DG)(,{ Olc,slc,Llc )
separation
an
property
undirected
and
graphs
G2
Thus
to
Y
,
Markov
Markov
is
and
LWF
chain
is
either
EQUIVALENCE
be
aration
in
with
which
with
the
MARKOV
Two
graph
graphs
- -
coincide
chain
graphs
between
lines
ale
a
chain
tinguish
P
properties
of
Thus
acyclic
2
AMP
case
.
lc ={A,B,C,D} SIc=0 Ltc =0
A chain graph and directed graphs in which C and D are symmetrically
Figure 7. related.
Both
B- Jt
tb ={A,B,C,D} SIb={S} Lib =0 DG1h (Olb,slb,Llb)
la ={A,B,C,D} Sla =0 Lla ={ 71
CG )
A - - C
All
of
the
weakly
] ;
Spirtes
)
et
.
Relations
In this section the Markov properties of simple directed graphs with symmetrically related variables are compared to those of the corresponding chain graphs . In particular , the following symmetric relations between variables X and Yare considered : (a) X and Y have a latent common cause; (b ) X and Yare both causes of some selection variable ; (c) X is a cause of Y , and Y is a cause of X , as occurs in a feedback system . The conditional independences relations entailed by the directed graphs
242
THOMAS S. RICHARDSON
A B J D ~~
A B- J - D
CG2
A J B ~~
A B- l ~ D
Z8={A,B,C,D} S2 ,a=0 LZ8={ Tl,TZ}
Zbs(A.B.C.D} SZb s(SI.S2} LZb=0
Ok ={A,B,C,D} ~ =0 L:zc=0
DG2a ( 1a,s2a , L2a )
DG2b ( 2b,s2b , L2~
DG~ 2c,8::zc , L2 .c)
Figure 8. A chain graph and directed graphs in which the pairs of vertices B and C , and C and D , are symmetrically related.
in Figure 7 are:
DG1a(Ola, Sla, L1a) FDSAlLB ; AlLB I C; AiLB I D; AlLD ; AJlD I B; BJlC ; BJlC I A;
DG1b (OIb,SIb,LIb) FnsAJlB I C; AJlB I D; BJlC I D; AJlD I C; AJlB I {a,D};AJlD I {B,a};BJlC I {A,D}; DG1c (OIc , SIc , LIc ) FDS AJlB ; AJlB
I {a ,D }.
It follows that none of these directed graphs is Markov equivalent to Gal under the LWF Markov property . However , DGla (Ola , Sla , L1a ) is Markov equivalent to GGl under the AMP Markov property . Turning now to the directed graphs shown in Figure 8, the following conditional independence relations
are entailed
:
DG2a(O2a,S2a,L2a) FDS AlLB ; AJLB I D; AlLD; AlLD I B; BJLD; BJlD I A; DG2b(O2b, S2b, L2b) FDS ARB I C; ARB I {a , D} ; AlLD I C; ARD I {B , a } ; BlLD I C; BlLD I {A, a } ; DG2c(O2c, S2c, L2c) doesnot entail any conditionalindependences . I t follows that none of these directed graphs is Markov equivalent to CG2 under the AMP Markov property. However, DG2b(O2b, S2b, L2b) is Markov equivalent to CG2 under the LWF Markov property. Further , note that DG2b(O2b, S2b, L2b) is also Markov equivalent to UG2 (under separation ) and DG3 (under d-separation) in Figure l (a). There are two other simple symmetric relations that might be considered: (d) X and Y have a common child that is a latent variable; (e) X and Y have a common parent that is a selection variable. However, without additional edgesX and Y are entailed to be independent (given S)
CHAIN
GRAPHS
AND
SYMMETRIC
243
ASSOCIATIONS
in these configurations , whereas this is clearly not the case if there is an edge between X and Y in a chain graph . Hence none of the simple directed graphs with symmetric relations corresponding to CG1 are Markov equivalent to CG1 under the LWF Markov property , and likewise none of those corresponding to CG2 are Markov equivalent to CG2 under the AMP Markov property . In the next section a stronger result is proved : in fact there are no directed graphs , however complicated , with or without latent and selection variables , that are Markov equivalent to CG1 and CG2 under the LWF and AMP Markov properties , respecti vely . 4 . Inseparability
and Related
Properties
In this section two Markov properties , motivated by spatial and causal intuitions
, are introduced
. It is then shown that
these Markov
properties
hold for all undirected graphs , and all directed graphs (under d-separation ) possibly
with
latent
and selection variables . Distinct
vertices X and Yare
inseparableR in G under Markov Property R if there is no set W such that
G FR XJlY
I W . If X and Yare not inseparableR , they are separableRo Let
[G]knsbe the undirectedgraph in whichthere is an edgeX - Y if and only if X and Yare
inseparable R in Gunder
R . Note that in accord with the
definition of Fvs for DG (O, S, L ), only vertices X , Y E 0 are separableDs
or inseparableDs , thus [DG(O, S, L)]~: is definedto havevertex set 00 For an undirectedgraph model [UG]1nsis just the undirectedgraph U G . For an acyclic , directed graph (without latent or selection variables )
under d-separation , or a chaingraph undereither Markov property [G]~S is simply the undirected graph formed by replacing all directed edges with
undirectededges , hencefor any chaingraph CG, [CG]~n~p = [CGJln ; Fo In any graphical model , if there is an edge (directed or undirected ) bet ween a pair of variables
then those variables
are inseparable R. For undirec -
ted graphs , acyclic directed graphs , and chain graphs (under either Markov
property ), inseparability R is both a necessaryand a sufficient condition for the existence of an edge between a pair of variables . However , in a direc ted graph with cycles, or in a (cyclic or acyclic ) directed graph with latent and / or selection variables , inseparability DSis not a sufficient condition for there to be an edge between a pair of variables (recall that in DG (O , S, L ), the entailed
conditional
independences
are restricted
to those that are ob -
servable) .
An inducing path betweenX and Y in DG (O , S, L ) is a path 7rbetween X and Y on which (i) every vertex in 0 U S is a collider on 7[, and (ii ) every collider is an ancestor of X , Y or S.10 In a directed graph , DG (O , S, L ), loThe notion of an inducing path was first introduced
for acyclic directed graphs with
244
THOMAS S. RICHARDSON A .....- B
A-...c~1B
A ---..0 - - ' B
A ~ 1WB
Figure 9. Examples of directed graphs DG (O , S, L ) in which A and B are inseparableDs.
variables X , Y EO , are inseparable DSif and only if there is an inducing
path between X and Y in DG (O, S, L ).ll For example, C and D were inseparableDs in DGla (Ola , Sla , LIa ) , and DG1b(Olb , SIb , LIb ), while in DG1c (Olc , SIc , L1c ) A and B were the only separableDs variables . Figure 9 contains further examples of graphs in which vertices are inseparable DS. 4 .1 .
' BETWEEN
A vertex
SEPARATED
' MODELS
B will be said to be betweenR X and Y in G under Markov
pro -
perty R , if and only if there exists a sequence of distinct vertices (X ==
Xo, Xl , . . . ,Xn := B , Xn+l , . . . , Xn+m := Y) in [G]~Ssuchthat eachconse cutive
pair of vertices
Xi , Xi + l in the sequence are inseparableR
in G un -
der R . Clearly B will be betweenR X and Y in G if and only if B lies
on a path betweenX and Y in [G]~ns. The set of verticesbetweenX and Y under property R in graph G is denoted BetweenR(G; X , Y ), abbreviated to Between R(X , Y ), when G is clear from context. Note that for any chain graph CG , BetweenLWF (CG ; X , Y ) = BetweenAMP (CG ; X , Y ), for all vertices
X
and
BetweenR
Y .
Separated
Models
A model G is betweenR separated, if for all pairs of vertices X , Y and sets W (X , Y Et W ) :
G FR XlLY I W =~ G Fn XlLY I W n Betweenn(G; X , Y )
(where{X , Y} U W is a subsetof the verticesin [G]~ns). It follows that if G is betweenR separated , then in order to make some
(separable ) pair of vertices X and Y conditionally independent , it is always sufficient to condition on a subset (possibly empty ) of the vertices that lie on paths between X and Y . The intuition that only vertices on paths between X and Yare relevant to making X and Y independent is related to the idea, fundamental to much latent variables in Verma and Pearl [37]; it was subsequently extended to include selection variables in Spirtes, Meek and Richardson [32]. 11Inseparability
DS is a necessary and sufficient condition for there to be an edge between
a pair of variables in a Partial Ancestral Graph (PAG), (Richardson [26, 27]; Spirtes et ale [32, 31]) , which represents structural features common to a Markov equivalence class of directed graphs .
CHAIN
[
G
]
GRAPHS
AND
InS
SYMMETRIC
P
R
S
.
ASSOCIATIONS
X
A
t
\
-
BCY
Figure
10
,
D
of
,
E
.
,
F
BetweenR
,
Q
,
T
graphical
in
the
in
then
)
are
All
:
i
,
)
All
/
Suppose
a
(
X
Y
WnBetweens
lows
is
,
there
on
Xn
1r
+
l
,
arables
V
,
is
then
.
(
E
.
.
,
Xn
+
,
separated
.
m
X
,
,
d
-
S
X
,
A
and
)
.
Y
Y
)
,
the
which
are
shown
by
12Where
' A
of
A
,
or
and
they
B
are
share
,
]
)
)
=
{
A
,
should
be
,
and
dependent
.
.
here
.
It
is
easy
without
'
to
see
selection
separated
'
for
or
by
directed
'
d
-
separ
graphs
-
with
.
I
1r
in
does
W
U
E
,
G
not
V
\
UG
~
s
given
=
pairs
=
XO
of
pair
I
,
X
,
X1
,
Y
,
)
.
vertices
of
and
W
(
X
XlLY
X
Betweens
(
contradiction
but
connecting
connect
W
vertices
it
.
.
.
)
.
fol
,
if
Xn
=
insep
separated
in
I
B
,
BetweenLWF
LWF
(
AlLD
I
or
Figure
{
{
=
-
Hence
0
betweenLWF
AlLD
-
But
are
variables
Y
.
CG2
,
present
directly
separated
each
and
B
12
given
consecutive
but
.
XlLY
of
Y
are
appendix
and
,
)
3
G
}
}
:
,
CGl
G
betweenAMP
;
A
,
D
)
=
{
G
}
and
.
that
FAMP
causally
some
14
proof
path
X
Y
quantities
replacing
s
;
,
separated
The
=
G
X
they
betweens
path
1r
not
~
GG2
cause
I
a
FLWF
note
)
.
;
,
graphs
the
UG
CG1
'
property
0
G
interact
two
between
a
(
(
then
is
sequence
is
CoConR
correspondence
directed
in
edge
,
betweenDS
=
that
an
separableLwF
AMP
are
this
GGl
For
)
on
}
natural
[
is
such
is
)
a
if
'
a
=
L
L
V
Q
CoConR
models
=
Since
graphs
is
Dare
,
there
Y
,
in
Hare
is
Then
F
Hausman
connected
GGl
so
S
,
.
,
regions
to
vertex
=
chain
This
0
'
E
that
directly
=
there
general
,
constitutes
because
(
O
T
dependent
is
graphs
(
,
contiguous
(
by
some
7r
Betweens
In
)
(
that
V
V
D
not
state
contradiction
,
,
are
variables
for
Betweens
given
.
C
graph
selection
,
n
. e
,
This
only
over
i
B
vertices
undirected
'
or
,
.
DG
connected
/
if
"
DF
vertices
undirected
for
(
and
A
are
connected
carries
'
{
,
which
proof
and
=
S
that
variables
'
)
that
graphs
proof
latent
V
(
Y
graphically
causally
The
the
latent
,
,
directed
Proof
that
X
Rand
principles
1
W
;
,
intuition
Theorem
ated
G
P
way
causal
they
ii
while
some
spatial
also
(
,
modelling
connected
in
(
}
Q
/ E
C
245
BJlD
connected
common
I
'
cause
{
means
A
that
(
or
some
,
G
}
either
combination
,
A
is
a
cause
of
of
these
B
)
.
,
B
is
a
THOMASS. RICHARDSON
246
but BetweenAMP (CG2; B , D ) = { a } , and yet CG2 ~ AMPBJlD I { a } .
4 .2 .
A
' CO - CONNECTION
vertex
W
Ware
There
is
does
It
is
easy
separated X
X
Y
BetweenR
Y
is
' in
Co A
model vertices
G
G
FR
principle
Theorem
I W
be
( i )
Directed
)
that
: Again graphs
in
the
FR
the
X
, Yand
[ G ] kns
in
the
which
sequence
[ G ] ~ns
in
which
the
sequence
and
in
[ G ] ~ns } .
X
Y
and
Y
in
[ G ] ~ns , and
for
any
CoConAMP
chain
( CG
(b ) graph
betweenR sets
of
if is
CG
and not
, and
; X , Y ) .
( G ; X , Y ) , so being being
G B
co - connectedR X
vertices
and
Y . Both
which
are
to -
[ G ] ~ns .
tt
XJlY
I W of
the
inclusion is
W
, if
for
all
pairs
): n
COConR
vertices
or
W
determined
W
exclusion
irrelevant
(G ; X , Y )
in
[ G ] ~ns ) .
of to
vertices
whether
that X
are
not
Y
are
and
. models
with
are
latent
co - connections and
/ or
determined
selection
variables
. , are
.
for AMP
) in
, Y ) in
X
to in
co - connectionR
graph
are
proof and
that
(X , Y
set
, possibly
determined graphs
be
a subset
Undirected
if
Models
given
graphs
Chain
G
G
variables
to
X
than
W
some
independent
co - connectionvs ( iii
is
by
CoConR
Y
to sets
* =>
, W
, . . . , Bm of
( G ; X , Y ) are
and
said
and
states
2
( ii )
be
, Y
( X , Y ) from to
Y
; X , Y ) =
CoConR
{ X , Y } U W
CoConR
entailed
will X
, B2
co - connectedR
Determined
XlLY
( where
be
[ G ] kns . Note
' X
in
variables
co - connectedR
requirement
- ConnectionR
of
is
(G ; X , Y ) ~
between
of
pairs
from
( CG
a weaker
Y
R .
will
in
and
, . . . , An
pairs
( W , B1
I V
B
( G ; X , Y ) and
pologically
Proof
by
, A2
consecutive
separated
BetweenR
and
directed
that
not
vertices
{ V
X
R .
Gunder
, Y , CoConLWF
Clearly
This
see
is
from
vertices
X
to B
( X , AI
consecutive
and
R in
to
:
vertices
and
(G ; X , Y ) =
(a )
co - connectedR
Gunder
X
inseparable
if
be
of
contain
CoConR
only
Y R in
a sequence
not
are
to
of
contain
inseparable
There
Let
said
' MODELS
[ G ] ~ns satisfying
a sequence
not
are ( ii )
in
is
does
in
be
vertices
(i )
to
will
DETERMINED
co - connectionAMP undirected chain
graphs
determined graphs
is
given
are
given
in
. here the
. The
appendix
proofs .
for
CHAINGRAPHSAND SYMMETRICASSOCIATIONS
247
SinceBetweens (X , Y) ~ CoCons(X , Y), an argumentsimilar to that usedin the proof of Theorem1 (replacing'Betweens ' with 'CoCons') shows that if UG Fs XJLY I W then UG Fs XJlY I W n CoCons(X , Y). Conversely , if UG Fs XlLY I W n CoCons(X , Y ) then X and Yare separatedby W n CoCons(X , Y) in UG. SinceW n CoCons(X , Y ) <; W , it followsthat X and Yare separatedby W in UG.13 0 For undirectedgraphsUG Fs XJlY I W =~ UG Fs XJlY I W n Betweens (X , Y), i.e. undirectedgraphscouldbe saidto be betweensdetermined. Chain graphsare not co-connectionLwF determined.In CG1 Band Care separableLwF , sinceCG1 FLWFBJlC I {A, D} , but CG1~ LWF BlLC I {D } and CoConLWF (CG1; B , C) = {D} . In contrast, chaingraphsareco-connectionAMP determined. 5.
Discussion
The two Markov properties presented in the previous section are based on the intuition that only vertices which , in some sense, come 'between ' X and Y should be relevant as to whether or not X and Yare entailed to be inde pendent . Both of these properties are satisfied by undirected graphs and by all forms of directed graph model . Since chain graphs are not betweenR separated under either Markov property , this captures a qualitative difference between undirected and directed graphs , and chain graphs . On the other hand since chain graphs are co-connectionAMP determined , in this respect , at least , AMP chain graphs are more similar to directed and undirected graphs .
5.1. DATAGENERATING PROCESSES Since the pioneering work of Sewall Wright [38] in genetics, statistical models based on directed graphs have been used to model causal relations , and data generating processes. Models allowing directed graphs with cycles have been used for over 50 years in econometrics , and allow the possibility of representing linear feedback systems which reach a deterministic equilibrium subject to stochastic boundary conditions (Fisher [10] ; Richardson [27]) . Besag [3] gives several spatial -temporal data generating processes whose limiting spatial distributions satisfy the Markov property with respect to a naturally associated undirected graph . These data generating processes are time - reversible and temporally stationary . Thus there are data generating mechanisms known to give rise to the distributions described by undirected and directed graphs . 13This is the ' Strong Union Property ' of separation in undirected
graphs (Pearl [22]) .
248
THOMAS S. RICHARDSON
Cox [6] states that chain graphs under the LWF Markov property "do not satisfy the requirement of specifying a direct mode of data generation ." However , Lauritzen14 has recently sketched out , via an example , a dynamic data generating process for LWF chain graphs in which a pair of vertices joined by an undirected edge, X - Y , arrive at a stochastic equilibrium , as t - + 00; the equilibrium distribution being determined by the parents of X and Y in the chain graph . A data generation process corresponding to a Gaussian AMP chain graph may be constructed via a set of linear equations with correlated
errors (Andersson et at. [1]). Each variable is given as a linear function of its parents in the chain graph , together with an error term . The distribution over the error terms is given by the undirected edges in the graph , as in a Gaussian undirected graphical model or 'covariance selection model '
(Dempster [9]), for which Besag [3] specifies a data generating process. The
linear
model
constructed
in this
way
differs
from
a standard
linear
structural equation model (SEM): a SEM model usually specifieszeroesin the
covariance
model
sets
to
matrix zero
for
elements
the
error
of the
terms , while inverse
error
the
covariance
covariance
matrix
selection .
The existence of a data generating process for a particular chain graph
(under either Markov property ) is important since it provides a full justifi cation for using this structure . As hM been shown in this paper , the mere fact that
two variables
are 'symmetrically
related ' does not , on its own ,
justify the use of a chain graph model . 5 .2 .
A THEORY
OF
INTERVENTION
IN
DIRECTED
GRAPHS
Strotz and Wold [35], Spirtes et at. [31] and Pearl [23] develop a theory of causal intervention for directed graph models which makes it sometimes possible to calculate
the effect of an ideal intervention
in a causal system .
Space does not permit a detailed account of the theory here, however, the central idea is very simple : manipulating a variable , say X , modifies the structure of the graph , removing the edges between X and its parents , and instead making a 'policy ' variable the sole parent of X . The relationships between
all other variables
and their parents
are not affected ; it is in this
sense that the intervention is 'ideal ' , only one variable is directly affected .IS Example .- Returning to the example considered in section 2.2, hypotheti cally a researcher could intervene to directly determine whether or not the patient suffers the side-effects, e.g. by giving all of the patients (in both the 14Personal communication
.
15It should also be noted that for obvious physical reasons it may not make sense to speak of manipulating certain variables , e.g. the age or sex of an individual .
CHAINGRAPHS ANDSYMMETRIC ASSOCIATIONS R
249
R
Recovery
Assignment A f
Policy t
A
A ~>
(Treatment /contro !)", IH Health
~f
Side Effects ~ f
DG
DGManip (Ef)
DGManip (H) (c)
(b)
(a)
Figure 11. Intervening in a causal system: (a) before intervention ; (b) intervening to directly control side-effects; (c) intervening to directly control health status.
side
-
to
.
initially
suffers
side
the
result
(
treatment
-
effects
of
One
.
The
yet
the
ferent
be
the
.
certain
is
on
true
tain
can
it
is
also
often
may
be
34
]
;
case
present
data
was
even
(
if
a
,
present
which
(
In
the
an
may
be
Richardson
the
sense
of
[ 27
absence
observation
is
(
of
of
that
a
they
is
,
theory
basis
2
]
;
)
of
for
]
)
may
at
.
same
slogan
11
set
of
"
]
a
in
;
va
certain
and
Verma
that
the
equivalence
. class
to
Pearl
[ 23
data
the
predict
]
)
.
where
graphs
the
A
is
Causation
theory
another
feedback
,
is
a
.
not
,
generating
system
chain
-
variables
Spirtes
causal
is
-
.
knowing
distributions
Correlation
-
cer
two
selection
settings
for
thus
share
or
]
;
dif
control
than
enough
[ 31
of
intervention
the
be
to
17
knowledge
. kov
of
.
,
,
Thus
be
et
part
the
/
.
a
hence
clearly
models
behaviour
.
makes
able
more
and
33
shows
and
background
Mar
,
will
are
particular
Spirtes
it
16
)
data
often
[
unknown
importance
.
of
there
Frydenberg
dynamic
represent
the
basis
important
the
that
c
.
experiments
[
a
see
great
Ch
A
latent
;
patient
(
observational
are
when
an
equivalent
equivalent
]
in
model
specification
the
[ 37
directly
statistically
manipulate
Richardson
model
of
are
scientists
when
and
status
,
basis
,
Markov
Pearl
Spirtes
constitutes
;
element
all
even
interventions
intervention
on
:
,
particular
certain
mechansim
17This
;
by
which
of
16In
]
B
such
the
11
is
controlled
out
and
[ 28
the
-
perform
that
Verma
f
:
simplistic
common
generated
results
of
in
to
-
health
-
assigned
Figure
equivalent
on
A
of
whether
in
s
precipi
was
theory
point
ruled
of
statistically
and
result
patient
graph
'
or
the
the
The
to
and
over
the
Richardson
B
the
,
features
t
shows
intervention
are
intervening
be
is
structural
this
misses
often
objection
riables
of
but
directly
models
This
-
)
patient
purely
B
,
variables
the
differentiated
effect
This
.
control
A
b
independent
that
graphs
(
prevents
group
expected
to
either
11
which
becomes
be
models
not
example
and
)
would
objection
could
For
Figure
,
control
as
between
which
in
to
common
which
intervention
!
,
something
graph
the
intervening
distinction
[
effects
After
)
_
the
intervention
groups
_
tates
control
I
and
II1I1I1II
treatment
. "
resear
-
250
THOMASS. RICHARDSON
cher would be unable to answer questions concerning the consequences of intervening in a system with the structure of a chain graph. However, Lauritzen 18has recently given, in outline, a theory of intervention for LWF chain graphs, which is compatible with the data generating processhe has proposed. Such an intervention theory would appear to be of considerable use in applied settings when the substantive researchhypothesesare causal in nature. 6.
Concl usion
The
examples
which
given
a pair
symmetric perties between
Markov
with
chain
graphs
not , in
described
directed
this
general
' weaker
, will
lead
shown
quite are
with
a directed , the
to
graph
any
of an
chain
different
of those
structure
of a chain
relationship
that or
edge , rather
, should
the
pro -
directed
either
marginalizing
model
, than
in
differences and
) , and
Markov
via
ways , different
Markov
undirected
undirected
graph
' , substantively
many
qualitative
symmetric
model
inclusion
are
. Further
variables , the
, correspond
' safer
to
, there
. Consequently
by
there related
associated
reason
' or
that
and ! or selection
edge , in a hypothesized
as being
clear
symmetrically
been
properties latent
. For
make
be
has
with
does be
, ~
( possibly
associated
tioning
paper may
, in general
particular
the
graphs
can
this
relationships . In
graph
in
of variables
not
inclusion
condi
-
than
a
be regarded of
a directed
edge . This not
paper
has
correspond
question full
of
answer
data
Acknow
ledgements like
Peter
Spirtes
ful
on
an
this
England
the
topic
. I am
and Isaac
, UK
chain
also
the
18Personalcommunication .
symmetric
, this
, and
( under
grateful
to
. Finally
correspond
Markov
do
to . A , of a
property
),
.
Glymour
Perlman
for
helpful
anonymous
, I would
Institute
for
Mathematical
revised
version
of this
like
, Steffen
, Richard
Wermuth
three
which interesting
, in general
a given
Cox , Clark
, Michael Nanny
do
the
specification
of intervention
, David
open
graphs
the
graphs
relations
leaves
chain
involve
Meek
suggestions
Newton
, where
Besag
, Chris Studeny
many
relations
theory
Julian
, Milan
comments
ledge
for
Madigan
are
. However
would
associated
to thank
, David
tions
question
process
with
ritzen
there
graphs
symmetric
this
generating
that
chain
which to
together
I would
shown
to
reviewers
to
gratefully
Sciences paper
was
Lau -
Scheines
,
conversa
-
for
acknow
, Cambridge
prepared
use -
.
,
251
CHAIN GRAPHSAND SYMMETRIC ASSOCIATIONS 7 . Proofs
In DG (O , S, L ) supposethat J1 , is a path that d-connects X and Y given Z US , C is a collider on Ji" and C is not an ancestor of S. Let length (C, Z) be 0 if C is a member of Z ; otherwise it is the length of a shortest directed path <5 from C to a member of Z. Let Coll (Jl,) = { C I C is a collider on IL, and C is not an ancestor of S} . Then
let
size(J,L,Z) = IColl(J,L)I +
L
length (C, Z)
CEColl (J-L)
where
d
I Coll
( J. l , )
- connecting
X
and
X
Y
and
d
d
-
Z
connects
tices
and
1
given
Z
that
vertex
j
Proof
a
S
is
not
in
Z
)
..
t5i
S
.
that
contrary
to
Form
than
Ci
Jl "
Suppose
that
19It
X
and
is
J. L '
Z
)
not
Y
<
,
then
Y
B
A
)
path
US
JJ
,
if
Z
)
size
is
given
Z
a
is
( J1 "
at
minimal
acyclic
acyclic
path
<
denotes
is
J. L
acyclic
,
there
JiI
Z
)
least
,
'
.
that
If
d
- connects
d
- connects
there
one
is
a
minimal
path
acyclic
. 19
the
d
a
subpath
of
J. L
between
ver
-
,
Y
there
-
connecting
}
~
is
J1 ,
a
is
point
of
then
Ci
,
in
S
t5i
( } i
and
and
collider
Ci
do
Y
Ci
from
t5 j
from
of
does
X
each
to
not
on
some
intersect
.
path
t5i
between
for
path
at
ancestor
that
,
directed
directed
an
path
0
only
t5i
not
prove
in
the
S
not
intersection
following
be
the
,
a
and
collider
Ci
hence
no
intersect
exists
the
way
vertex
vertex
loss
of
J. L ( X
,
)
)
( Wx
and
because
that
there
c5i
in
is
a
an
8i
jj
,
then
on
J1 ,
vertex
except
J. L
,
Wy
Y
given
contains
DG
acyclic
,
if
Y
X
x
)
,
J. L
is
U
S
is
path
d
a
.
not
to
on
at
is
path
a
t5i
Ci
by
minimal
,
)
8i
.
It
Figure
( } i
t5i
.
Let
J. L '
now
12
and
acyclic
Y
)
given
J1 , .
be
the
easy
. )
and
and
to
Moreover
than
or
X
both
is
colliders
( cyclic
other
both
on
Y
vertex
on
on
y
,
at
is
is
( See
more
- connecting
that
W
J . L ( Wy
no
there
J1 ,
that
after
and
Z
J. L
on
on
W
J. . L'
intersects
to
that
,
if
to
generality
Wx
:
closest
closest
X
Z
prove
then
,
.
X
to
Z
X
acyclic
such
( Jl "
,
{
path
is
- connects
size
d
U
S
any
Ci
W
Z
intersects
shortest
be
hard
given
)
of
be
let
Wy
of
( J. L ' ,
L
t5i
Jl , '
concatenation
size
,
assumption
without
show
other
acyclic
S
now
path
let
,
where
if
then
and
,
.
Z
( J1 , '
and
( A
minimal
on
a
,
the
a
a
vertex
will
showing
X
that
Z
We
no
Z
JiI
( J. l , )
given
size
given
ancestor
be
of
in
is
that
Y
( O
an
no
Let
is
DG
such
and
member
is
and
Y
there
such
between
J. L
in
,
and
Coll
and
.
If
U
,
proofs
B
Lemma
of
X
S
following
A
#
S
U
X
the
cardinality
U
Z
path
In
( i
the
between
given
- connecting
J. L
is
given
Y
that
I
path
J. . L
d
- connecting
Z
.
and
a
252
THOMASS. RICHARDSON , Jl
Figure
12
.
Finding
intersects
a
with
x
a
-
- connecting
path
path
- - - . . . . . C ;
4
-
"
-
d
directed
"
'
tSi
-
- . . . . . Cj
T
Jl
j . L'
from
a
4
of
smaller
collider
-
size
Ci
- Y
X
-
to
-
a
.
C ;
/
than
-
Jl
in
.
-
"
"
j .L,
vertex
-
"
Figure
13
.
Finding
paths
shortest
Jl ,
a
,
hi
T
not
,
remains
.
Let
the
that
because
Ci
not
be
Z
J1 ,
is
Y
in
a
)
(
Proof
0
O
,
..
not
on
path
Jl ,
( O
,
J1 ,
such
,
L
,
L
J1 ,
C
only
on
some
( Ci
Jl ,
colliders
the
length
-
. Y
than
IJ . ,
in
the
case
where
two
T
)
d
a
Xi
-
)
Z
J1 , ' ,
a
uS
+
only
di
,
( 0
: : ;
.
.
Hence
i
( i
#
j
)
do
.
Y
)
Let
.
It
Z
)
Jl , '
is
<
on
from
to
to
a
the
e
size
Jl , '
T
Ci
be
now
~
( Jl "
that
a
y
Z
)
may
member
member
of
Z
.
,
{
Xn
<
X
n
+
X
,
= =
Y
,
B
m
.
0
between
ZU
.
T
,
( Jl , ' ,
path
.
t5j
collider
from
and
Xl
be
size
assumption
JI "
,
l
and
the
and
Jl , ( Cj
path
on
t5j
and
shortest
the
t5i
on
,
path
Xo
than
. )
also
OJ
connecting
= =
Xi
,
shortest
vertex
and
( T
given
to
shorter
then
13
is
t5j
of
a
is
,
Figure
on
contrary
is
minimal
that
,
Z
.
is
Y
of
B
size
of
( See
Ci
,
and
( X
is
a
d
- connecting
of
are
from
,
-
"
smaller
member
.
to
di
vertices
ancestor
that
,
minimal
)
.
,
)
B
and
}
Xn
~
+
are
Y
0
l
,
.
,
.
,
given
then
.
,
there
Xn
+
m
inseparableDs
= =
in
.
Since
an
S
a
of
a
if
not
,
that
)
)
length
is
of
,
S
intersect
vertex
If
DG
and
minimal
sequence
in
DG
is
not
2
is
,
JL
/
'
assumption
false
X
the
to
the
closest
Ci
are
T
X
is
- connects
than
Lemma
ZUS
is
t5i
,
IJ . '
that
this
Cj
Jl ,
less
Hence
d
and
on
is
shown
Jl , ( X
J1 , '
W
to
on
of
show
of
be
vertex
concatenation
to
from
Suppose
where
.
contrary
to
intersect
path
intersect
path
minimal
It
not
,
case
Z
- connecting
hj
directed
is
d
and
the
.
- - . . . . cj
Z
directed
in
Z
not
j
to
S
is
path
an
ancestors
of
some
at
OJ
path
vertex
,
and
6j
S
Z
that
is
given
ancestor
in
of
as
j
01
E
6j
S
.
Z
,
.
and
A
.
It
.
.
U
Z
vertex
,
( j
every
in
Ok
.
Let
follows
6jl
sequence
S
a
Z
8j
by
: #
of
j
' )
vertices
collider
.
be
a
Lemma
do
on
Denote
directed
that
fJ j
intersect
Xi
in
that
colliders
shortest
1
not
J1 ,
the
,
0
,
such
and
J1 ,
and
no
that
CHAIN
GRAPHS
AND
SYMMETRIC
ASSOCIATIONS
253
each Xi is either on J1 , or is on a directed path 8j from OJ to Zj can now be constructed
:
Base Step : Let Xo := X .
Inductive
Step : If Xi is on some path 8j then define Wi+l to be OJ;
otherwise , if Xi is on J1 " then let Wi + l be Xi , Let Vi + l be the next vertex on
J.L, after Wi+l , such that ~ +1 E O . If there is no vertex OJ' between Wi+l and ~ +1 on J1 " then let Xi +l := ~ +1' Otherwise let OJ* be the first collider on J1 " after
Wi + l , that
is not an ancestor
of S , and let Xi + l be the first
vertex in 0 on the directed path 8j* (such a vertex is guaranteed to exist since Zj *, the endpoint of 8j*, is in 0 ). It follows from the construction that if B is on J1 " and B E 0 , then for some i , Xi == B .
Claim : Xi and Xi +l are inseparableDs in DG (O , S, L ) under d-separation . If Xi and Xi + l are both on JL, then JL(Xi , Xi + l ) is a path on which every non-collider is in L , and every collider is an ancestor of S. Thus J1 , (Xi , Xi + l )
d-connects Xi and Xi +l given Z US for any Z ~ O\ { Xi , Xi +l } . So Xi and Xi +l are inseparableDs . If Xi lies on some path 8j, but Xi +l is on Jl, then the path 7r formed by concatenating the directed path Xi +- . . . +- Cj and
l1I(Oj , Xi +l ) again is such that every non-collider on 7r is in L , and every collider is an ancestor of S, hence again Xi and Xi + l are inseparableDs. The cases in which either
Xi + l alone , or both Xi and Xi + l are not on J1 ,
can be handled similarly . Corollary
0
1 If B is a vertex on a minimal
d- connecting
path 7r between
X and Y .qiven Z U S in DG (O, S, L ), Z U { X , Y, B } ~ 0 , then B E BetweenDS(X , Y ) . Proof . This follows directly from Lemma 2
0
Corollary 2 If J.L is a minimal d-connecting path between X and Y given ZUS in DG (O , S, L ) , C is a collider on J.L that is an ancestor ofZ but not S,
l5 is a shortest directedpath from C to some Z E Z , and Z U {X , Y, C} ~ 0 , then Z E COCOllvs(X , Y ). Proof : By Lemma
1, 5 does not intersect
IL except at C . Let the sequence of
vertices on 5that are in 0 be (VI , . . . , Vr .= Z ) . It follows from the construc tion in Lemma 1 that there is a sequence of vertices (X == Xo , Xl , . . . , Xn == VI , Xn + l , . . . , Xn +m . = Y ) in 0 such that consecutive pairs of vertices are inseparableDs, Since, by hypothesis , C is not an ancestor of S, it follows that no vertex on 5is in S. Hence <5(~ , ~ + l ) is a directed path from ~ to ~ + l on which , with the exception
of the endpoints , every vertex is in L and is a non -
collider on 6, it follows that ~ and ~ +I are inseparableDsin DG (O , S, L ). Thus the sequences(X .= Xo, Xl , . ' . ' Xn .= VI, . . . , Vr .= Z ) and (Y .=
254
THOMASS. RICHARDSON
Xn +m, . . . , Xn == VI , . . . , Vr == Z ) establish that Z E CoConDS(X , Y ) in
DG (O , S, L ).
0
Theorem 1 (i ) A directedgraph DG (O , S, L ) is betweenDS separatedunder d-separation .
Proof.- Suppose, for a contradiction , that DG (O , S, L ) FDS XlLY I W , (W U { X , Y } ~ 0 ), but DG (O, S, L ) ~ DSXJlY I W n BetweenDS (X , Y ). In this case there is some minimal path 7r d-connecting X and Y given
S U (WnBetweenDS(X , Y )) in DG (O, S, L ), but this path is not d-connecting given S U W . It is not possible for a collider on 7r to have a descendant in S U (W n BetweenDs(X , Y )), but not in S U W . Hence there is some
non-collider B on 7r, s.t . B E S U W , but B ft S U (W n BetweenDS (X , Y )). This implies B E W \ BetweenDS(X , Y ) , and since W ~ 0 , it follows that
B E O . But in this caseby Corollary 1, B E BetweenDS (X , Y ), which is a contradiction
Theorem mined
.
0
2 ( ii ) A directed graph DG (O , S, L ) is co-connectionvs deter -
.
Proof: Since BetweenDS(X, Y) ⊆ CoConDS(X, Y), the proof of Theorem 1 above, replacing 'BetweenDS' with 'CoConDS', suffices to show that if DG(O, S, L) ⊨DS X ⊥⊥ Y | W then DG(O, S, L) ⊨DS X ⊥⊥ Y | W ∩ CoConDS(X, Y). To prove the converse, suppose, for a contradiction, that DG(O, S, L) ⊨DS X ⊥⊥ Y | W ∩ CoConDS(X, Y), but DG(O, S, L) ⊭DS X ⊥⊥ Y | W, where W ∪ {X, Y} ⊆ O. It then follows that there is some minimal d-connecting path π between X and Y in DG(O, S, L) given W ∪ S. Clearly it is not possible for there to be a non-collider on π which is in S ∪ (W ∩ CoConDS(X, Y)) but not in S ∪ W. Hence there is some collider C on π which has a descendant in S ∪ W, but not in S ∪ (W ∩ CoConDS(X, Y)). Hence C is an ancestor of W \ CoConDS(X, Y), but not of S. Consider a shortest directed path δ from C to some vertex W in W. It follows from Lemma 1 and the minimality of π that δ does not intersect π except at C. It now follows, by Corollary 2, that W ∈ CoConDS(X, Y). Therefore if C is an ancestor of a vertex in S ∪ W, then C is also an ancestor of a vertex in S ∪ (W ∩ CoConDS(X, Y)). Hence π d-connects X and Y given S ∪ (W ∩ CoConDS(X, Y)), which is a contradiction. □

Lemma 3 Let CG be a chain graph with vertex set V, X, Y ∈ V, and W ⊆ V \ {X, Y}. Let H be the undirected graph Aug[CG; X, Y, W]. If there is a path μ connecting X and Y in H, then there is a path μ′ connecting X and Y in H such that if V is a vertex on μ′ then V is on μ, and V ∈ CoConAMP(X, Y) ∪ {X, Y}.

Figure 14. (a) A chain graph CG with CoConAMP(CG; X, Y) = {B, W}; (b) a path μ in Aug[CG; X, Y, {W}]; (c) a path μ′ in Aug[CG; X, Y, {W}] every vertex of which occurs on μ and is in CoConAMP(CG; X, Y).

Proof: If X and Y are adjacent in H then the claim is trivial, since μ′ = (X, Y) satisfies the lemma. Suppose then that X and Y are not adjacent in H. Let the vertices on μ be (X = X_1, …, X_n = Y). Let α be the greatest
j such that X_j is adjacent to X in H. Let β be the smallest k > α such that X_k is adjacent to Y in H. (Since X and Y are not adjacent, α, β < n.) It is sufficient to prove that {X_α, …, X_β} ⊆ CoConAMP(X, Y), since then the path μ′ = (X, X_α, …, X_β, Y) satisfies the conditions of the lemma. This can be proved by showing that there is a path in [CG]~~p from X to each X_i (α ≤ i ≤ β) which does not contain Y. A symmetric argument shows that there is also a path from Y to X_i (α ≤ i ≤ β) in [CG]~~p which does not contain X. The proof is by induction on i.

Base case: i = α. Since X is adjacent to X_α in H, either there is a (directed or undirected) edge between X and X_i in CG, or the edge was added via augmentation of a triplex or bi-flag in Ext(CG, Anc({X, Y} ∪ W)). In the former case there is nothing to prove, since X and X_i are adjacent in [CG]~~p. If the edge was added via augmentation of a triplex, then there is a vertex T such that (X, T, X_i) is a triplex in CG; hence T is adjacent to X and X_i in [CG]~~p. Since X and Y are not adjacent in H, T ≠ Y, so (X, T, X_i) is a path which satisfies the claim. If the edge was added via augmentation of a bi-flag, then there are two vertices T_0, T_1 forming a bi-flag (X, T_0, T_1, X_i). From the definition of augmentation it then follows that T_0 and T_1 are adjacent to X and X_i in H. Since we suppose that X and Y are not adjacent in H, it follows that neither T_0 nor T_1 can be Y. Hence (X, T_0, T_1, X_i) is a path in [CG]~~p satisfying the claim.

Inductive case: i > α; suppose that there is a path from X to X_{i−1} in [CG]~~p which does not contain Y. Since i − 1 < β, X_{i−1} is not adjacent to Y in H. By a similar proof to that in the base case it can easily be shown that there is a path from X_{i−1} to X_i in [CG]~~p which does not contain Y. This path may then be
concatenated with the path from X to X_{i−1} (whose existence is guaranteed by the induction hypothesis) to form a path connecting X and X_i in [CG]~~p which does not contain Y. □

Figure 15. (a) A chain graph CG in which CoConAMP(CG; X, Y) = {U}; (b) the undirected graph Aug[CG; X, Y, {U, W}]; (c) H1, the induced subgraph of Aug[CG; X, Y, {U, W}] over CoConAMP(X, Y) ∪ {X, Y} = {U, X, Y}; (d) the undirected graph H2, Aug[CG; X, Y, {U, W} ∩ CoConAMP(X, Y)].

Lemma 4 Let CG be a chain graph, and let H1 be the induced subgraph of Aug[CG; X, Y, W] over CoConAMP(X, Y) ∪ {X, Y}. Let H2 be the undirected graph Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. H1 is a subgraph of H2.

Proof: We first prove that if a vertex V is in H1 then V is in H2. If V occurs in H1 then V ∈ CoConAMP(X, Y) ∪ {X, Y}. Clearly X and Y occur in both H1 and H2, so suppose that V ∈ CoConAMP(X, Y). It follows from the definition of the extended graph that if V is a vertex in Ext(CG, T) then there is a path consisting of undirected edges from V to some vertex in T. Since V is in Ext(CG, Anc({X, Y} ∪ W)), there is a path π of the form (V = X_0 − ⋯ − X_n → ⋯ → X_{n+m} = W) in CG, where W ∈ {X, Y} ∪ W and n, m ≥ 0. Let X_k be the first vertex on π which is in {X, Y} ∪ W, i.e. X_i ∉ {X, Y} ∪ W for all i with 0 ≤ i < k. Now, if X_k ∈ W then, since (V = X_0, …, X_k) is a path from V to X_k which does not contain X or Y, it follows that X_k ∈ W ∩ CoConAMP(X, Y). Hence V occurs in Ext(CG, Anc(W ∩ CoConAMP(X, Y))), and so also in H2. Alternatively, if X_k ∈ {X, Y}, then again X_k occurs in Ext(CG, Anc({X, Y})), and thus in H2. Hence, if there is an edge A–B in H1, then A and B occur in H2. There are three reasons why there may be an edge in H1:
(a) There is an edge (directed or undirected) between A and B in CG. It then follows immediately that there is an edge between A and B in H2.
(b) The edge between A and B in H1 is the result of augmenting a triplex in Ext(CG, Anc({X, Y} ∪ W)). Then there is some vertex T such that (A, T, B) forms a triplex in Ext(CG, Anc({X, Y} ∪ W)). Since, by hypothesis, A, B ∈ CoConAMP(X, Y), it follows that T ∈ CoConAMP(X, Y) ∪ {X, Y}, and hence T occurs in H1. It then follows by the previous reasoning that T is in H2, and so the triplex is also present in Ext(CG, Anc({X, Y} ∪ (W ∩ CoConAMP(X, Y)))). Hence there is an edge between A and B in H2.
(c) The edge between A and B in H1 is the result of augmenting a bi-flag in Ext(CG, Anc({X, Y} ∪ W)). This case is identical to the previous one, except that there are two vertices T_0, T_1 such that (A, T_0, T_1, B) forms a bi-flag in Ext(CG, Anc({X, Y} ∪ W)). As before, it follows from the hypothesis that A, B ∈ CoConAMP(X, Y) that T_0, T_1 ∈ CoConAMP(X, Y) ∪ {X, Y}; hence T_0 and T_1 occur in H2 and the bi-flag is in Ext(CG, Anc({X, Y} ∪ (W ∩ CoConAMP(X, Y)))). Thus the A–B edge is also present in H2. □
Theorem 2 (iii) Chain graphs are co-connectionAMP determined.
Proof: (CG ⊨AMP X ⊥⊥ Y | W ⇒ CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y)) Let H be Aug[CG; X, Y, W]. Since CG ⊨AMP X ⊥⊥ Y | W, X and Y are separated given W in H. Claim: X and Y are separated in H by W ∩ CoConAMP(X, Y). Suppose, for a contradiction, that there is some path μ in H connecting X and Y on which there is no vertex in W ∩ CoConAMP(X, Y). It then follows from Lemma 3 that there is a path μ′ in H composed only of vertices on μ which are in CoConAMP(X, Y). Since no vertex on μ is in W ∩ CoConAMP(X, Y), it then follows that no vertex on μ′ is in W. So X and Y are not separated by W in H, contradicting the hypothesis. However, Aug[CG; X, Y, W ∩ CoConAMP(X, Y)] is a subgraph of H, so X and Y are separated by W ∩ CoConAMP(X, Y) in Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. Thus CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y).
(CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y) ⇒ CG ⊨AMP X ⊥⊥ Y | W) The proof is by contraposition. Suppose that there is a path μ from X to Y in Aug[CG; X, Y, W]. Lemma 3 implies that there is a path μ′ from X to Y in Aug[CG; X, Y, W] every vertex of which is in {X, Y} ∪ CoConAMP(X, Y). It then follows from Lemma 4 that this path exists in Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. □
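Throughout these proofs, statements of the form "X and Y are separated by W in H" are ordinary vertex-separation statements in an undirected graph. The following Python sketch is not taken from the chapter; the graph representation, function names, and toy example are illustrative assumptions. It shows the reachability test that such statements reduce to:

    from collections import deque

    def separated(adj, x, y, w):
        # Check whether x and y are separated by the vertex set w in an
        # undirected graph given as an adjacency dict {vertex: set(neighbours)}:
        # separation means x cannot reach y once the vertices in w are removed.
        if x in w or y in w:
            raise ValueError("x and y must lie outside the separating set")
        blocked = set(w)
        seen, queue = {x}, deque([x])
        while queue:
            v = queue.popleft()
            if v == y:
                return False          # found a path avoiding w: not separated
            for u in adj.get(v, ()):
                if u not in blocked and u not in seen:
                    seen.add(u)
                    queue.append(u)
        return True                   # y unreachable: x and y are separated by w

    # Toy graph standing in for an augmented graph Aug[CG; X, Y, W]:
    # X - B - Y and X - U - Y.
    aug = {"X": {"B", "U"}, "B": {"X", "Y"}, "U": {"X", "Y"}, "Y": {"B", "U"}}
    print(separated(aug, "X", "Y", {"B"}))        # False: the path X-U-Y remains
    print(separated(aug, "X", "Y", {"B", "U"}))   # True

The results above say that, for checking such separation statements, the conditioning set W can be shrunk to W ∩ CoConAMP(X, Y) without changing the answer.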
THE MULTIINFORMATION FUNCTION AS A TOOL FOR MEASURING STOCHASTIC DEPENDENCE

M. STUDENÝ
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
Pod vodárenskou věží 4, 182 08 Prague

AND

J. VEJNAROVÁ
Laboratory of Intelligent Systems
University of Economics
Ekonomická 957, 148 00 Prague
Czech Republic
Abstract. Given a collection of random variables [ξ_i]_{i∈N}, where N is a finite nonempty set, the corresponding multiinformation function assigns to each subset A ⊂ N the relative entropy of the joint distribution of [ξ_i]_{i∈A} with respect to the product of the distributions of the individual random variables ξ_i for i ∈ A. We argue that it is a useful tool for problems concerning stochastic (conditional) dependence and independence (at least in the discrete case). First, the multiinformation function makes it possible to express the conditional mutual information between [ξ_i]_{i∈A} and [ξ_i]_{i∈B} given [ξ_i]_{i∈C} (for every disjoint A, B, C ⊂ N), which can be considered a good measure of conditional stochastic dependence. Second, one can introduce reasonable measures of dependence of level r among variables [ξ_i]_{i∈A} (where A ⊂ N, 1 ≤ r < card A) which are expressible by means of the multiinformation function. Third, it enables one to derive theoretical results on the (nonexistence of an) axiomatic characterization of stochastic conditional independence models.

1. Introduction
tween two random variables, namely the mutual information [7, 3]. It is always nonnegative and vanishes if the corresponding two random variables 261
, , M. STUDENY ANDJ. VEJNAROVA
262
are stochastically independent . On the other hand it achieves its maximal
value iff one random variable is a function of the other variable [28]. Perez [15] wanted also to express numerically the degree of stochastic dependence among any finite number of random variables and proposed a numerical characteristic called "dependence tightness ." Later he changed the terminology , calling the characteristic multiinformation and encouraging research on asymptotic properties of an estimator of multiinformation
[18]. Note that multiinformation also appeared in various guises in earlier information -theoretical papers. For example, Watanabe [24] called it "total correlation" and Csiszar [2] showedthat the IPFP procedure convergesto the probability distribution minimizing multiinformation within the considered family of distributions having prescribed marginals . Further
prospects
occur
when
one considers
multiinformation
as a set
function . That means if [l;i]iEN is a collection of random variables indexed by a finite set N then the multiinformation function (corresponding to [~i]iEN) assigns the multiinformation of the subcollection [~i]iEA to every A c N . Such a function was mentioned already in sixties by Watanabe
[25] under the name "total cohesion function ." Some pleasant properties of the multiinformation function were utilized by Perez [15] in probabilistic decision -making . Malvestuto named the multiinformation
function "en-
taxy " and applied it in the theory of relational databases [9]. The multiinformation
function plays an important
role in the problem of finding
"optimal dependencestructure simplification " solved in thesis [21], too. Finally , it has appeared to be a very useful tool for studying of formal properties of conditional independence . The
first
author
in modern
statistics
to deal
with
those
formal
prop -
erties of conditional independencewas probably Dawid [5]. He characterized certain statistical concepts (e.g. the concept of sufficient statistics) in terms of generalizedstochastic conditional independence. Spohn [17] studied stochastic conditional independence from the viewpoint of philosophical logic and formulated the same properties as Dawid . The importance of conditional
independence in probabilistic
reasoning was explicitly
discerned
and highlighted by Pearl and Paz [13]. They interpreted Dawid's formal properties in terms of axioms for irrelevance models and formulated a nat ural conjecture that these properties characterize stochastic conditional in -
dependencemodels. This conjecture was refuted in [19] by substantial use of the multiinformation function and this result was later strengthened by showing that stochastic conditional independence models cannot be char-
acterized by a finite number of formal properties of that type [20]. However , as we have already mentioned , the original prospect of multiin formation was to express quantitatively the strength of dependence among random variables . An abstract view on measures of dependence was brought
MULTIINFORMATION
AND
STOCHASTIC
263
DEPENDENCE
by Renyi [16] who formulated a few reasonablerequirements on measures of dependenceof two real-valued random variables. Zvarova [28] studied in more detail information -theoretical measures of dependence including mutual information . The idea of measuring dependence appeared also in nonprobabilistic calculi for dealing with uncertainty in artificial intelligence
[22, 23]. This article is basically an overview paper , but it brings several minor new results which (as we hope ) support our claims about the usefulness of the
multiinformation
function
. The
basic
fact
here
is that
the
multiinfor
-
mation function is related to conditional mutual information . In the first part of the paper we show that
the conditional
mutual
information
com -
plies with severalreasonablerequirements(analogousto Renyi's conditions) which should be satisfied by a measure of degree of stochastic conditional dependence . The second part of the paper responds to an interesting suggestion from Naftali Tishby and Joachim Buhmann . Is it possible to decompose multi -
information (which is considered to be a measure of global dependence) into level-specific measures of dependence among variables ? That means one would like to measure the strength of interactions of the "first level" by a special measure of pairwise dependence, and similarly for interactions of "higher levels." We show that the multiinformation can indeed be viewed as a sum of such level-specific measures of dependence. Nevertheless , we have found recently that such a formula is not completely new : similar level-specific measures of dependence were already considered by Han [8] . Finally , in the third part of the paper , as an example of theoretical use of the multiinformation
function
we recall
the results
about
nonexistence
of
an axiomatic characterization of conditional independence models . Unlike
the original paper [20] we present a long didactic proof emphasizing the essential steps . Note
that
all results
of the
paper
are formulated
for random
taking a finite number of values although the multiinformation can
be used
also
in the
case of continuous
variables
. The
reason
variables
function is that
we wish to present really elementary proofs which are not complicated by measure
2.
- theoretical
Basic
technicalities
.
concepts
We recall well -known information
- theoretical
concepts in this section ; most
of them can be found in textbooks, e.g. [3]. The reader who is familiar with information
theory
can skip the section .
Throughout the paper N denotes a finite nonempty set of factors or in short a factor set. In the sequel, whenever A , B C N the juxtaposition AB
, , M. STUDENYAND J. VEJNAROVA
264 will
be
i
2
EN
used
,
. 1
.
to
the
The
should
nonempty
~
finite
Xi
: f :
B
c
c
By
a
B
Xi
and
a
of
frames
.
x
Or
0
i
Having
N
a
)
=
any
B
b
is
the
( pAB
)
1 b
b
=
a
{
.
C
Nand
probability
P
( a
,
b
)
defines
a
image
is
of
the
A
b
( a
)
;
distribution
( by
b
=
N
the
in
a
A
.
to
a
fixed
symbol
for
set
{
P
( y
set
E
N
}
Y
)
;
N
is
XA
Whenever
XB
we
y
will
be
is
an
understand
E
Y
}
=
1
.
understood
By
any
arbitrary
collection
distribution
of
P
over
E
X
( a
XN
\
A
}
B
b
,
1
a
discrete
for
such
N
the
as
every
a
[ ~
that
pB
( b
over
E
marginal
follows
XA
i ] iEA
:
.
.
In
the
sequel
N
)
\
>
B
0
the
conditional
defined
by
:
) for
c
over
defined
.
every
a
E
XN
\
B
.
)
distribution
B
A
subvector
: =
' ( b
)
is
L
of
N
one
can
frames
[ ~
i ] iEN
use
a
induces
probability
\
the
{
P
( y
)
;
)
.
set
y
E
the
Z
Y
P
transformed
is
,
&
on
Provided
of
a
B
under
the
symbol
condition
pAl
b
transformation
distribution
finite
distribution
f
i
random
E
=
nonempty
probability
P
:
factor
;
distribution
A
P
)
a
Xi
2
joint
p0
between
a
( z
discrete
values
to
denote
.
Supposing
Q
any
.
when
c
frame
distribution
the
b
disjoint
b )
into
}
A
take
projection
distribution
of
a
mapping
mapping
i
.
to
A
the
with
probability
( conditional
For
( pi
Every
tions
.
is
particular
PB
: =
for
{
situation
= f
finite
Y
{
convention
: ; f
describes
has
the
0
nonempty
on
where
distribution
pi
i ] iEB
In
that
over
XN
P
[ ~
.
N
coordinate
a
P
and
L
pi
It
i
E
Nand
,
its
probability
natural
0
and
of
.
c
is
the
distribution
then
on
,
i ] iEN
A
( a
the
E
Xi
function
on
pA
adopt
B
variables
i
for
i
distribution
[ ~
describes
,
equivalently
pA
we
XA
factor
frame
factor
distribution
vector
Having
U
instead
random
a
lliEA
E
probability
distribution
the
every
distribution
random
It
called
real
)
A
i
.
probability
( discrete
union
by
discrete
to
product
nonnegative
probability
to
to
N
x
set
DISTRIBUTIONS
Cartesian
A
the
denoted
corresponding
assigned
by
every
i
set
is
the
denoted
for
sometimes
correspond
variable
denotes
notation
be
PROBABILITY
factors
frame
the
will
DISCRETE
random
0
shorten
singleton
the
f
Z
on
the
( y
.
In
Y
and
f
:
z
E
distribu
Y
-
7
-
Z
is
a
formula
)
=
z
}
such
for
a
of
f
( ~
every
case
distribution
vector
of
)
.
we
a
random
say
Z
that
,
Q
vector
is
~
an
,
Q
MULTIINFORMATION
Supposing say B
A
that - t
It
A
A
( P
is )
on
c
there
exists
= = pB
( a , b )
=
when
E
XB
;
A
, B
, G
Supposing that
we
write
A
is
A
holds
for is
Jl
BIG
every
the
when
the
=
remaining
is
>
a
distribution
is
a
- t
XA
( b ) ,
to
such
b
E
a
and
XB
.
,
b
of
deterministic
E
a
random
vector
function
that
we
write
,
XA
f
N
P
that
XB E
function
outside
over
respect
distribution
the
O } ;
is with
f
the
The
symbol
T
serve
class
is
of
uniquely
set
it
another
determined
can
take
arbitrary
The
of
( N
)
of
Let
a
denote a
i
( [ Xi
P B
pAC
E
is
a
distribution
given
C
(a , c )
. pBC
Xc
. It
over
with
respect
N
to
we
P
and
stochastic
the
[ ~ i ] iEN
the
( b , c )
describes
vector
known
and
values
of
point
of
E
I
T
set
of N
situation
in
view
when
every
[ ~ i ] iEA
situation
and
[ ~ i ] iEB
are
) .
ordered
, where
A
triplets #
0
#
(A
B
independence
model (N
and
those
) , its
their
a
is
, Yi ] iEN
N
, BIG
. These
)
of
triplets
statements
model
induced ( A
a
c
, BIG
T
by
a
)
E
T
( over
within
N
)
are
Y
N
I
)
such
is N
that
an
on Y
i
T
(B c
T
A
1L
( N
)
=
P
( [ Xi ] iEN
)
. Q
( [ Yi ] iEN
E
the
P
over
BIG
N ( P
) .
model
.
independency model
XN
inducing
)
) .
)
is
inducing 3
. Put
mod
for
[ Xi
, Yi ] iEN
-
.
I
and Zi
=
Q Xi
E
ZN
be X
define
)
( N
, AIG
independency
independency
: = lliEN
class
distribution )
probabilistic
probabilistic
the
triplet
.
( N
over
( N
of the
model
probability
distribution on
subset is
images
distribution
, J
a
independency
model
also
is image
symmetric
probability
and
an
triplets
I
J
over
symmetric of
probability
n
N
collection
conditional
closure
be
E
C
factor
of
distribution
R
,
the
of
Supposing I
P
probability every
XB
=
random
( from
)
some
class
(c )
independency
in
2 . 1 the
E a
independency by
Lemma
b
. pC
are
independency
induced
and of
equality
[ ~ i ] iEC
symmetric
probabilistic
Proof
,
will
, BIG
just
~,'joint
.
triplets
The
di
MODELS
, an (A
consists
els
of
N
general
) .
the
identification
set
Supposing
A
XA
subsets
for
factor In
if
of
disjoint
(N
)
unrelated
INDEPENDENCY
the
are
(a , b , c )
E
values
pairwise
N
independent ( P
a
2 .2 .
will
c
distribution
completely
for
a
for
that
( b )
conditionally
pABC
T
for
[ ~ i ] iEA
. Note
pB
: XB
P
subvector [ ~ i ] iEB
P B
f
265
DEPENDENCE
.
say
p
mapping
( b )
situation
{ b
and on
0
random
set
disjoint
a
( a , b )
subvector
the
are
STOCHASTIC
dependent
pAB
whose
values
N
pAB
the
[ ~ i ] iEN
, B
functionally
if
reflects
random
AND
.
a Y
i
, , M. STUDENY ANDJ. VEJNAROVA
266
It is easy to verify that for every (A, B|C) ∈ T(N) one has A ⊥⊥ B | C (R) iff [ A ⊥⊥ B | C (P) & A ⊥⊥ B | C (Q) ]. □
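The definitions of this section are easy to experiment with numerically. The following Python sketch is our own illustration, not part of the paper; the function names and the toy table are assumptions. It checks a statement A ⊥⊥ B | C (P) for a discrete distribution given as a table of probabilities, using exactly the factorization from subsection 2.1:

    from itertools import product

    def marginal(p, keep):
        # p maps full assignment tuples to probabilities; `keep` lists the
        # coordinate positions of the requested marginal, in order.
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def cond_indep(p, a, b, c, tol=1e-12):
        # Check A _||_ B | C (P):
        # P^{ABC}(x) * P^C(x_C) = P^{AC}(x_{AC}) * P^{BC}(x_{BC}) for every x.
        pabc = marginal(p, a + b + c)
        pc, pac, pbc = marginal(p, c), marginal(p, a + c), marginal(p, b + c)
        values = [sorted({x[i] for x in p}) for i in a + b + c]
        for x in product(*values):
            xa, xb, xc = x[:len(a)], x[len(a):len(a) + len(b)], x[len(a) + len(b):]
            lhs = pabc.get(x, 0.0) * pc.get(xc, 0.0)
            rhs = pac.get(xa + xc, 0.0) * pbc.get(xb + xc, 0.0)
            if abs(lhs - rhs) > tol:
                return False
        return True

    # Toy joint distribution over three binary factors (coordinates 0, 1, 2).
    P = {(i, j, k): 0.125 for (i, j, k) in product([0, 1], repeat=3)}
    print(cond_indep(P, [0], [1], [2]))   # True: the uniform distribution factorizes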
2.3. RELATIVE ENTROPY

Supposing Q and R are probability distributions on a nonempty finite set Y, we say that Q is absolutely continuous with respect to R iff R(y) = 0 implies Q(y) = 0 for every y ∈ Y. In that case we can define the relative entropy of Q with respect to R as
H(Q|R) = ∑ { Q(y) · ln( Q(y) / R(y) ) ; y ∈ Y & Q(y) > 0 }.

Lemma 2.2 Suppose that Q and R are probability distributions on a nonempty finite set Y such that Q is absolutely continuous with respect to R. Then
(a) H(Q|R) ≥ 0,
(b) H(Q|R) = 0 iff Q = R.
Proof: Consider the real function φ(t) = t · ln t for t > 0, φ(0) = 0, and the function h on Y given by h(y) = Q(y)/R(y) if R(y) > 0, h(y) = 0 otherwise. Since φ is convex one can use Jensen's inequality [3] with respect to R and write:
0 = φ(1) = φ( ∑_{y∈Y} h(y) · R(y) ) ≤ ∑_{y∈Y} φ(h(y)) · R(y) = H(Q|R).
Owing to strict convexity of φ the equality holds iff h is constant on the set {y ∈ Y ; R(y) > 0}. That means h ≡ 1 there, i.e. Q = R. □

Supposing that (A, B|C) ∈ T(N) and P is a probability distribution over N, the formula
P̂(x) = P^{AC}(x_{AC}) · P^{BC}(x_{BC}) / P^C(x_C)  for x ∈ X_{ABC} with P^C(x_C) > 0,
P̂(x) = 0  for the remaining x ∈ X_{ABC},     (1)
defines a probability distribution P̂ on X_{ABC}. Evidently, P^{ABC} is absolutely continuous with respect to P̂. The conditional mutual information between A and B given C with respect to P, denoted by I(A; B|C ‖ P), is the relative entropy of P^{ABC} with respect to P̂. In case P is known from the context we write just I(A; B|C).
Consequence
2 . 1 Supposing
ity
over N
distribution
that
( A , BIG ) E T ( N ) and P is a probabil
I ( A ; BIG
II P ) ~ 0 ,
(b)
I ( A ; BIG
liP ) = 0 iff A ..l1- BIG
Owing
nothing
to Lemma
but
the
267
DEPENDENCE
-
one has
(a )
Proof
STOCHASTIC
(P ) .
2 .2 it suffices
corresponding
to realize
conditional
that
pABa
independence
=
P means
statement
.
0
distribution
P
2.4. MULTIINFORMATION FUNCTION The
multiinformation
(over a
factor as follows :
function
set
induced
N ) is a real
M ( D liP ) = 1i ( PDI
by
function
a probability
on the
power
set of N
defined
p {i}) for 0 # D c N, M (011 P) = o.
n iED
We again omit the symbol of P when the probability distribution is clear from the context. It follows from Lemma 2.2(b) that M (D ) = 0 whenever card D = 1 . Lemma
2
over
N
.
.
3
Let
(
A
,
BIG
)
E
T
(
N
)
and
P
be
a
probability
distribution
Then
I
(
A
;
BIG
)
=
=
M
(
ABG
)
+
M
(
G
)
-
M
(
AG
)
-
M
(
BG
)
(2)
.
Proof. Let us write 1l(pABCIP) as
'"L,."",{pABC pABC (X pC ((XC ));XEXABC XBC (X).InpAC (XAC ))'.PBC &pABC (X)>O }.
Now tor
lli
we of
~ A
can the
p
artificially ratio
{ i } ( Xi
strIctly
)
. lliEB
positive
erties
of
multiply in
for
logarithm
the
both
argument
p
{ i } ( Xi
)
any
considerea
one
can
. lli
c
pABC
( x )
write
. In
+
}: : : ; {
pABC
( x )
pC II
-
2: : {
pABC
( X )
. In
pAC n
-
~
{
pABC
( X )
iEC
. In
iEAC
pBC n
iEBC
as
and
logarithm
{ i } ( Xi
)
by
. lliEC
p X
a
sum
of
.
{ i } ( Xi Using
four
the a
)
;
x
E
XABC
&
pABC
is
- known
( x )
>
( XC
)
;
x
E
XABC
&
pABC
( x )
>
O
O
}
}
{ i } ( x '. )
( XAC
)
;
X
E
XABC
&
pABC
( x )
>
O }
;
X
E
XABC
&
pABC
( x )
>
a
p { i } ( x I. )
( XBC
)
P { i } ( X '' )
always prop
:
p { i } ( X ', )
p
-
product
which
well
terms
denomina
special
-r \ \ Xl
iEABC
. In
p
it
.& II
numerator
the
configuration
pABCI }: : : ; {
the
of
} .
-
, , M. STUDENYAND J. VEJNAROVA
268 The
first
ABC
term
. To
is nothing
see that
configurations
for
that
of xs
is groups
but
the
the
second
which
the
having
the
for
the
2 .5 .
ENTROPY
If
is a discrete
Q
of
other
AND
logarithm to
( y , xc ) . In
function
sum
there
has
for
in groups
the
same
of
value
,
C : ,:/
pc ( XC ) = lliEC P { i } ( Xi )
0
terms
.
L pABC JlEXAB pABC (JI, ZC) > 0
( y , xc ) =
probability
0
ENTROPY
distribution the
. pC ( xc ) .
.
CONDITIONAL
by
can
L In PC ( x ~ ) zcE Xc lliEC p { t } ( Xi ) pC (zc O
two
Q is defined
multiinformation
projection
pABC
=
entropy
same
L In PC ( x ~ ) II iEC P { t } ( x 1..) zc E Xc pc (zc ) > 0
Similarly
of the
is M ( C ) one
corresponding
L L oeCEXC YE XAB pc (zc 0 pABa (JI,ZC
=
value
term
on
a nonempty
finite
set
Y
the
formula 1
1 (Q ) = Lemma
2 .4
nonempty
1l ( Q ) ~ 0 , 1l ( Q ) = . Since
every
y ; the
factor
0 iff
equality
set
N
will
- the see that using
&
Q (y ) > O } .
probability
here
distribution
the
procedure
by
on
)
for
symbol
of
as in the
function
on
a
1 . It
power
0 #
D
when proof
set
one
gives
has
In Q ( y ) - l :2:: 0
0 for
both
of N
is clear
of Lemma
defined
from
the is not
0
over
as follows
M(D) = −H(D) + ∑_{i∈D} H({i})  for every D ⊂ N.
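The identity just stated gives a direct way to compute the multiinformation from marginal entropies. A minimal Python sketch (ours; the helper names are assumptions, and distributions are tables mapping assignment tuples to probabilities) is:

    import math

    def entropy(q):
        # Entropy of a distribution given as {value_tuple: probability}.
        return -sum(pr * math.log(pr) for pr in q.values() if pr > 0.0)

    def marginal(p, keep):
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def multiinformation(p, d):
        # M(D) = -H(D) + sum_{i in D} H({i}), with H(D) the entropy of P^D.
        h_d = entropy(marginal(p, d))
        return -h_d + sum(entropy(marginal(p, [i])) for i in d)

    # Two perfectly dependent binary factors: M({0, 1}) = ln 2.
    P = {(0, 0): 0.5, (1, 1): 0.5}
    print(multiinformation(P, [0, 1]))   # about 0.693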
such
(b ) . P
H ( 011 P ) =
2 .3 it
every
( a ) and
distribution
C N ,
it
1.
(y ) - l ~
a probability
the
P
Q (y ) =
Q ( y ) . InQ
if Q ( y ) =
induced
1l ( pD
that
real
O. Hence
only
function
II P ) =
such
increasing
Q (y ) >
function
omit
same
a discrete
y E Y
is an
with
is a real
often
is
exists
occurs
H (D
We
there
entropic
Q
; y E Y
Y . Then
logarithm
y E Y
The
that
set
(b) Proof
{ Q ( y ) . In Q( Y )
Suppose
.finite
(a )
for
L
a :
o .
context difficult
. By to
MULTIINFORMATION
Hence
, using
the
formula
I ( A ; BIG
Supposing is defined
AND
( 2 ) from
) =
A ,B
c
are
use
the
symbol
distribution
Proof
P
) =
One the
can
be
is nothing of pAl
) -
entropy
of
(3)
A
given
B
H (B ) .
indicate
the
corresponding
distribution
easily
see using
pB
but
gives
H ( AIB write
it
the
Measure this
over
probability
N , A ,B
section
c
N
are
sequence
used
the
in the
pB
(b
proof
O} .
(4 )
of Lemma
2 .3
are
for
2 . 1 zero one
from
let
, one
pAlb
& can
P
( ab ) >
utilize
the
a }
definition
dependence why
conditional
quantitative
mutual measure
of
informa
-
degree
of
. the
random known
following
vectors ( fixed
mutual
bound find
,
0
us consider
conditional
always
( a ) . ln ~
(4 ) .
arguments
discrete
is a lower can
hand
stochastic
already
the
b E XB
O
as a suitable
topic
~ c are
~ BC
other
L aE XA pAl b(a
dependence
this
values
since
(b) .
several
stochastic
and
the
&
form
considered
motivate
; a E XA
conditional
~ A , ~ B , and
possible
method
&
AB
( ab )
II P ) . On in
we give be
conditional
~ AC
the
) ; bEXB
(b)
expression
of
should
To
( b ) oH ( Allplb
( ab ) . In pAB
b and
which
bound
H ( AB
II P ) to
} : { pB
L pB bE XB pB (b O
of
) =
a probability
AB
L...., { P
that
conditional
) .
expression
~
tion
the
) + H ( BG
. Then
H ( AIBIIP
In
derives
H ( G ) + H ( AG
disjoint
H ( AIB
2 . 5 Let
disjoint
3.
2 . 3 one
269
P .
Lemma
that
) -
DEPENDENCE
difference
H ( AIB
We
Lemma
- H ( ABC
N
as a simple
STOCHASTIC
for
or
and
a distribution
the
values having
task
joint
prescribed
information those
specific
. Suppose
distributions
) . What
then
I ( A ; B I C ) ? By , and
it
prescribed
is the
are Con -
precise
marginals
, , M. STUDENYAND J. VEJNAROVA
270
for A AC and BC such that I (A ; BIG ) = 0 (namely the " conditional P given by the formula ( 1) ) . 3.1. MAXIMAL
DEGREE OF CONDITIONAL
product "
DEPENDENCE
But one can also find an upper bound . Lemma 3 .1 Let over N . Then
(A , BIG ) E T (N ) and P be a probability
distribution
I (A ; BIG ) ~ mill { H (AIG ) , H (BIC ) } .
Proof that :
It follows from ( 3) with help of the definition
of conditional
entropy
I (A ; BIG ) = H (AIG ) - H (AIBG ) . Moreover , 0 ~ H (AIBC ) follows from (4 ) with Lemma 2.4 ( a) . This implies I (A ; BIG ) ::::; H (AIG ) , the other estimate with H (BIC ) is analogous . 0 The following proposition generalizes an analogous result obtained in the unconditional case by Zvarova ( [28) , Theorem 5) and loosely corre sponds to the condition E ) mentioned by Renyi [16] . Proposition 3 . 1 Supposing tribution over N one has
(A , BIG ) E T (N ) and P is a probability
I (A ; BIG liP ) = H (AIG II P )
ProD/: By the formula
mentioned
iff
dis -
BG - t A (P ) .
in the proof of Lemma 3.1 the considered
equality occurs just in case H (AIBC II P ) = O. Owing to the formula (4 ) and Lemma 2.4 (a) this is equivalent to the requirement H (A II pi bc) = 0 for every (b, c) E XBC with pBC (b, c) > O. By Lemma 2.4 (b ) it means just that for every such a pair (b, c) E XBC there exists a E XA with pAl bc(a ) = 1. Of course , this a E XA is uniquely determined . This enables us to define the required function from XBC to XA . 0 A natural question that arises is how tight is I ( A ; BIG ) from Lemma 3.1? More exactly , we ask ways find a distribution having prescribed marginals I ( A ; BIG ) = min { H (AIG ) , H (BIG ) } . In general , the shown by the following example .
the upper bound for whether one can al for AC and BC with answer is negative as
MULTIINFORMATION
AND
STOCHASTIC
DEPENDENCE
271
Example 3.1 Let us put XA = XB = Xc = { O, 1} and define PAC and PBC as follows
PAC(O,0) = ~, PAC(O,1) = PAc(1, 1) = ! ' PAc(l ,0) = 1' PBC(O,O) = PBC(O, 1) = PBc(l , 0) = PBc(l , 1) = i . Since (PAC)C = (PBC)C there exists a distribution on XABC having them as marginals . In fact , any such distribution P (O, 0, 0)
=
a,
P (O, 0, 1)
=
(3,
P(O,l ,O) P(O, 1, 1) P(l , 0, 0) P(l , 0, 1) P(l , 1,0)
= = = = =
! ~~~a-
P (l , 1, 1)
=
(3,
P can be expressed as follows
a, (3, a, {3, ~,
wherea E [1\ , ! ]"B E [0, ! ]. It is easyto showthat H(AIC) < H (BIC). On the other hand, for every parameter a either P (O, 0, 0) and P (l , 0, 0) are simultaneously nonzero or P (O, 1, 0) and P (l , 1, 0) are simultaneously nonzero . Therefore A is not functionally dependent on BC with respect to P and by Proposition 3.1 the upper bound H (AIC ) is not achieved. <> However , the upper bound given in Lemma 3.1 can be precise for specific
prescribed marginals. Let us provide a general example. Example 3.2 Supposethat PBG is given, consider an arbitrary function 9 : XB - t XA and define PAC by the formula PAc(a, c) = L { PBC(b, c) ; bE XB & g(b) = a }
for a E XA , c E Xc .
One can always find a distribution P over ABC having such a pair of distri butions PAC, PBC as marginals and satisfying I (A ; BIG liP ) = H (AIG II P ). Indeed, define P over ABC as follows: P (a, b, c) = PBc (b, c) P (a, b, c) = 0
if g(b) = a, otherwise.
This ensuresthat BC - t A (P ), then use Proposition 3.1.
<>
3.2. MUTUAL COMPARISONOF DEPENDENCEDEGREES A natural intuitive requirement on a quantitative characteristic of degreeof dependenceis that a higher degreeof dependenceamong variables should
, , M. STUDENYAND J. VEJNAROVA
272
be reflected by a higher value of that characteristic . Previous results on conditional mutual information are in agreement with this wish : its minimal value characterizes independence , while its maximal values more or less correspond to the maximal degree of dependence. Well , what about the behavior "between" these "extreme " cases? One can imagine two "comparable " nonextreme cases when one case represents evidently a higher degree of dependence among variables than the other case. For example , let us consider two random vectors ~AB resp. 1]AB (take C = 0) having distributions PAB resp. QAB depicted by the following dia grams .
PAB
0
~
~
QAB
0
0
~
1
1
1
'7
'7
'7
!7
~ 7
0
!
!
0
!
!
0
7
7
7
7
Clearly , (PAB)A = (QAB )A and(PAB)B = (QAB)B. But intuitively , QAB expresses a higherdegree of stochastic dependence between 1JA = ~A and 1JB= ~B thanPAB. ThedistributionQABis more"concentrated " than PAB:QABisanimage ofPAB.Therefore , wecananticipate I (A;BI011 P) ~ I (A; BI011 Q), whichis indeed thecase . Thefollowing proposition saysthatconditional mutualinformation has the desired property . Notethat thepropertyis not derivable fromother properties of measures of dependence mentioned eitherbyRenyi[16] or by Zvarova [28] (in theunconditional case ). Proposition3.2 Suppose that(A,BIG) E T(N) andP, Q areprobability distributions overN suchthatpAC= QAC , pBC= QBCandQABC is an imageof pABC.Then I (A; BIG liP) :::; I (A; BIG IIQ) .
Proof Let us write P instead of pABG throughout the proof and similarly for Q. Supposethat Q is an image of P by f : XABC - t XABC. For every
MULTIINFORMATION AND STOCHASTIC DEPENDENCE
273
x E XABC with Q(x) > 0 put T = {y E XABC; f (y) = x & P(y) > O} and write (owing to the fact that the logarithm is an increasingfunction):
LP{y).lnP {y)~yET LP{y).In(L P{Z)) yET zET
= Q(x) . In Q(x) .
We can sum it over all such xs and derive
L P(y) .1nP(y) ~ L Q(x) .1nQ(x) . yEXABC zEXABC P(y O Q(z O Hence
- H(ABCIIP) ::::; - H(ABCIIQ) . Owingto the assumptionspAG = QAG, pBG = QBGonehasH (AC IIP) = H (AC IIQ), H (BC IIP) = H (BC IIQ) and H (C IIP) = H (C IIQ) . The formula (3) then givesthe desiredclaim. D Nevertheless
hold
laxed
,
when
,
the
as
mentioned
assumption
that
demonstrated
Example
depicted
the
3
by
. 3
the
by
Take
C
following
=
the
0
inequality from Proposition 3.2 may not marginals for AC and EC coincide is refollowing
and
consider
example
the
.
distri',but ionsPABand QAB
diagrams :
QAB 0
~
"18
"38
Evidently , QAB is an image of PAB, but I (A; BI011P ) > I (A ; BI011Q). 0 Remark One can imagine more general transformations of distributions : instead of "functional " transformations introduced in subsection 2.1 one can consider transformations by Markov kernels. However, Proposition 3.2 cannot be generalizedto such a case. In fact, the distribution PAB from the motivational example starting this subsection can be obtained from QAB by an "inverse" transformation realized by a Markov kernel.
, , M. STUDENYAND J. VEJNAROVA
274
3.3. TRANSFORMED DISTRIBUTIONS Renyi's condition F) in [16] states that a one-to-one transformation of a random variable does not change the value of a measure of dependence. Similarly , Zvarova [28] requires that restrictions to sub-u-algebras (which somehow correspond to separate simplifying transformations of variables) decreasethe value of the measureof dependence. The above mentioned requirements can be generalizedto the "conditional" case as shown in the following proposition. Note that the assumption of the proposition means (under the situation when P is the distribution of a random vector [~i]iEN) simply that the random subvector [~i]iEA is transformed while the other variables ~i , i E BG are preserved. Proposition 3.3 Let (A , BIG ) , (D , BIG ) E 7 (N ), P, Q be probability distributions over N . Suppose that there exists a mapping 9 : XA - t XD such that QDBC is an image of pABC by the mapping f : XABC - t XDBC defined by f (a, b, c) = [g(a), b, c]
for a E XA , (b, c) E XBC .
Then I (A ; BIG IIP ) ~ I (D ; BIG II Q) . Proof Throughout the proof we write P instead of pABa and Q instead of QDBC. Let us denote by Y the class of all (c, d) E XCD such that P (g- l (d) x XB x { c} ) > 0 where g- l (d) = { a E XA ; g(a) = d} . For every (c, d) E Y introduce a probability distribution RCdon 9- 1(d) x XB by the formula: Rcd(a, b) =
P (a, b, c) P (g- l (d) X XB X { c} )
for a E 9- 1(d), b E XB .
It can be formally considered as a distribution on XA x XB . Thus, by Consequence2.1(a) we have 0 ~ I (A ; BI011Rcd) for every (c, d) E Y . One can multiply this inequality by P (g- l (d) X XB x { c} ), sum over Y and obtain by simple cancellation of P (g- l (d) X XB X { c} ):
o~ L L (c,d)EY(a,b}E9 -1 (}> d)x P(abc OXB (abc ).P(g-l(d)XXBX{C}) P(abc ).InP({a}P xXBx{C}).P(g_l(d) x{b} x{c}) .
, , M. STUDENY ANDJ. VEJNAROVA
276
where the remaining values of P zero. Since A 1L BIG (P ) one has by Consequence2.1(b) I (A ; BIG liP ) = O. Let us consider a mapping 9 : XAC ~ XDE defined by
9(0,0) = 9(1,0) = (0,0)
9(0,1) = 9(1,1) = (1,0) .
Thenthe imageof P by the mappingf : XABC-t XDBEdefinedby f (a, b, c) = [g(a, c), b] for (a, c) E XAC, b E XB , is the followingdistributionQ on XDBE : 1 Q(O,0, 0) = Q(I , 1,0) = "2' Q(O, 1,0) = Q(I , 0, 0) = 0 . EvidentlyI (D; BIE IIQ) = In2. 4 . Different
levels of stochastic
<>
dependence
Let us start this section with some motivation . A quite common "philosoph ical " point of view on stochastic dependence is the following : The global strength of dependence among variables [~i ]iEN is considered as a result of various interactions among factors in N . For example , in hierarchical log-linear models for contingency tables [4] one can distinguish the first -order interactions , i .e. interactions of pairs of factors , the second-order interactions , i .e. interactions of triplets of factors , etc . In substance , the first -order interactions correspond to pairwise dependence relationships , i .e. to (unconditional ) dependences between ~i and ~j for i , j E N , i :tf j . Similarly , one can (very loosely ) imagine that the second-order interactions correspond to conditional dependences with one conditioning variable , i .e. to conditional dependences between ~i and ~j given ~k where i , j , kEN are distinct . An analogous principle holds for higher -order interactions . Note that we have used the example with loglinear models just for motivation - to illustrate informally the aim of this section . In fact , one can interpret only special hierarchical log-linear models in terms of conditional (in )dependence. This leads to the idea of distinguishing different "levels" of stochastic dependence. Thus , the first level could "involve " pairwise (unconditional ) dependences. The second level could correspond to pairwise conditional dependences between two variables given a third one, the third level to pairwise conditional dependences given a pair of variables , etc . Let us give a simple example of a probability distribution which exhibits different behavior for different levels. The following construction will be used in the next section , too .
MULTIINFORMATION
Construction distribution
AND
STOCHASTIC
DEPENDENCE
A Supposing A c N , card A ~ 2 , there P over N such that M ( B II P ) = In 2
whenever
M ( B II P ) = a
otherwise
exists
277
a probability
A c B c N , .
Proof Let us put Xi = {a, I } for i E A, Xi = {a} for i E N \ A. DefineP on XN as
follows
P([Xi]iEN) = 21-cardA P([Xi]iEN)= a
whenever EiEN Xi is even, otherwise . 0
The
distribution
dependences one
can
easily
ditionally to
is that
in
learning
the
per haps
help
the basis
order
and
subsets
of
get
N . In
a measure
[~ k ] kEK to
[~j ] jEB
given the
the
K
c
degree
models
degree
[26 ]
level of
provide
for
network
.
of depen
-
dependence is similar
a good
to
theoretical
that
the
\ { i , j } . This of dependence
mentioned
together
classification
with for
the
each
conditional
possibility
level
mutual
conditional
where
case
above
.
DEPENDENCE
of stochastic
[~ k ] kEC
a suitable
of each
of dependence
OF
argued
special
N
of the
log - linear
we
of
algorithms
measures
-
.
I ( i ; j I K ) of conditional
, where
measure
tests
is arbi
distributions
distribution
can
[~ i ] iEA D
conclusion
find
strength
. They
an analogue
measure
to
( with
A \ { i , j } . Or
variables
standard
a considered
MEASURES
section
the
model
the
main
level - specific
whether
in
- SPECIFIC
fail
i is con -
A \ { i ,j }
given
." Such
. The model
to measure
statistical
I ( A ; B I C ) is a good [~ i ] iEA
underlying
of
- level
E A , i i= j ,
[~ i ] i .ED , where
independent models
quantitative
to have
C
P , the
variables
highest
i ,j
2 . 3 ) that
ofj
distribution the
approximations
numerically
previous
the
the
pair
subset
independent
has
an
- independent
by
LEVEL
the
such
only
every Lemma
proper
" completely
a wish
necessary
any
- independent
recognize
, we wish
of expressing
In
to
interactions
4 .1.
of
, for
2 . 1 and
" although
. Good
pseudo
for Thus
[~ i ] iEN
justifies
one
fearful
given
network
separately
may
j
of A , are
case
A exhibits
A . Indeed
conditionally
[26 ] pseudo
Bayesian
dence
of
subset in
set
Consequence
dependent
proper
This
( by
, supposing
called
Construction
factor
i is not
" collectively
are
of
verify
P ) but
equivalently
trary
from
the
independent
respect
are
P
within
A ,B ,C
when
A
and
dependence leads for
directly a specific
C B
information
dependence N
are
are
singletons
between to level
pairwise
our .
~ i and proposal
between disjoint , we
will
~j given of how
278
, , M. STUDENY ANDJ. VEJNAROVA
Suppose that P is a probability distribution over N, A ⊂ N with card A ≥ 2. Then for each r = 1, …, card A − 1 we put:
Δ(r, A ‖ P) = ∑ { I(a; b | K ‖ P) ; {a, b} ⊂ A, K ⊂ A \ {a, b}, card K = r − 1 }.
If thedistribution P isknownfromthecontext , wewrite~(r, A) instead of ~(r,A IIP). Moreover , wewill occasionally writejust ~(r) asa shorthand for Ll(r, N). Weregardthisnumber asa basisof a measure of dependence of levelr among factorsfromA. Consequence 2.1 directlyimplies : Proposition4.1 Let P be a probabilitydistributionoverN, A c N, cardA~ 2, 1 ~ r ~ cardA- 1. Then (a) Ll(r, A IIP) ~ 0, (b) Ll(r, A IIP) ==0 iff [V(a, blK) E T(A) cardK = r - 1 a 11blK(P)]. So, the number d (r ) is nonnegative and vanishes just in case when there are no stochastic dependences of level r . Particularly , Ll (1) can be regarded as a measure of degree of pairwise unconditional dependence. The reader can ask whether there are different measures of the strength of level-specific interactions . Of course, one can find many such information -theoretical measures. However , if one is interested only in symmetric measures (i .e. measures whose values are not changed by a permutation of variables ) based on entropy , then (in our opinion ) the corresponding measure must be nothing but a multiple of d (r ). We base our conjecture on the result of Han [8] : he introduced certain level-specific measures which are positive multiples of ~ (r ) and proved that every entropy -based measure of mul tivariate "symmetric " correlation is a linear combination of his meMures with nonnegative coefficients . Of course, owing to Lemma 2.3 the number Ll (r ) can be expressed by means of the multiinformation function . To get a neat formula we introduce a provisional notation for sums of the multiinformation function over sets of the same cardinality . We denote for every A c N , card A ~ 2:
a (i , A ) = L { M (D II P ) ; DcA
, card D = i }
for i = 0, ..., card A .
Of course o-(i ) will be a shorthand for o-(i , N ). Let us mention that 0-(0) = 0' (1) = 0 . Lemma 4.1 For every r = 1, . . . , n - 1 (where n = cardN ~ 2)
Δ(r) = ½(r+1)r · σ(r+1) − r(n−r) · σ(r) + ½(n−r+1)(n−r) · σ(r−1).
279 MULTIINFORMATION AND STOCHASTIC DEPENDENCE Proof Let us fix1~r~n- 1and write byLemma 2.3 2Ll (r)=(a,bL)IK {M(abK )+M(K)- M(aK )- M(bK )}, (7) EC where.c is the classof all (a, blK) E T (N ) wherea, b are singletonsand cardK = r - 1. Note that in .c the triplets (a, blK) and (b, alK ) are distinguished: hencethe term 2d (r ) in (7). Evidently, the sumcontainsonly the terms M (D) suchthat r - 1 :::; cardD :::; r + 1 , and onecan write /),.(r ) = L { k(D) . M (D) ; D c N, r - 1 ~ cardD ~ r + 1 } , wherek(D) are suitable coefficients . However , sinceeverypermutation7r of factors in N transforms (a, blK) E .c into (7r(a),7r(b)I7r(K )) E .c the coefficientk(D) dependsonly on cardD . Thus, if one dividesthe number of overall occurrencesof terms M (E) with cardE = cardD in (7) by the number of sets E with cardE = cardD, the absolutevalue of 2k(D) is obtained. Sincecard.c = n . (n - 1) . (~--~) onecanobtain for cardD = r + 1 that k(D ) = ~.n(n- 1)(~=;)/ (r~1) = (r! l ). Similarly, in casecardD = r - 1 onehask(D ) = ! .n(n- l )(~--i )/ (r~l ) = (n- ; +l ). Finally, incasecardD = r onederives- k(D ) = ! . 2n(n - 1)(; --; )/ (~) = r (n - r ). To get the desired formula it sufficesto utilize the definitionsof a(r - 1), a(r ), a(r + 1). 0 Lemma4.1 providesa neat formula for ~ (r ), but in the casewhen a great numberof conditionalindependence statementsare known to hold, the definition formula is better from the computationalcomplexityviewpoint. 4.2. DECOMPOSITION OF MULTIINFORMATION Thus, for a factor set N , cardN ~ 2, the numberM (N ) quantifiesglobal dependence amongfactors in N and the numbers~ (r, N ) quantify levelspecificdependences . So, oneexpectsthat the multiinformationis at least a weightedsumof thesenumbers.This is indeedthe case,but asthe reader can expect, the coefficientsdependon cardN . For everyn ~ 2 and r E { I , . . . , n - I } we put (3(r, n) = 2 . r - l .
(~)-1,
Evidently , ,a(r , n) is always a strictly positive rational number.
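The level-specific sums Δ(r, N) and the coefficients β(r, n) are purely combinatorial, so the decomposition of the multiinformation can be checked numerically. The following Python sketch is ours, not part of the paper; it assumes the reading β(r, n) = 2 / (r · C(n, r)) of the definition above, which is the choice consistent with the identities used in the proof of Proposition 4.2 below, and all helper names are illustrative:

    import math
    from itertools import combinations, product
    from math import comb

    def marginal(p, keep):
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def entropy(q):
        return -sum(pr * math.log(pr) for pr in q.values() if pr > 0.0)

    def cmi(p, a, b, c):
        # I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C).
        h = lambda d: entropy(marginal(p, sorted(d)))
        return h(a + c) + h(b + c) - h(a + b + c) - h(c)

    def delta(p, n, r):
        # Delta(r, N): sum of I(a;b|K) over pairs {a,b} and sets K with card K = r-1.
        total = 0.0
        for a, b in combinations(range(n), 2):
            rest = [i for i in range(n) if i not in (a, b)]
            for k in combinations(rest, r - 1):
                total += cmi(p, [a], [b], list(k))
        return total

    def beta(r, n):
        # Assumed reading of the definition: beta(r, n) = 2 / (r * C(n, r)).
        return 2.0 / (r * comb(n, r))

    def multiinformation(p, d):
        return -entropy(marginal(p, d)) + sum(entropy(marginal(p, [i])) for i in d)

    # Check M(N) = sum_r beta(r, n) * Delta(r, N) on a small non-uniform example.
    n = 3
    P = {x: w / 36.0 for x, w in zip(product([0, 1], repeat=n), [1, 2, 3, 4, 5, 6, 7, 8])}
    lhs = multiinformation(P, list(range(n)))
    rhs = sum(beta(r, n) * delta(P, n, r) for r in range(1, n))
    print(abs(lhs - rhs) < 1e-10)   # True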
,
280
,
M . STUDENY
AND
J . VEJNAROVA
over Proposition 4.2 Let P be a probability distribution Then n- l M (N IIP) = L (3(r, n) . ~ (r, N IIP) . r=l
N , card N ~ 2 .
Proof . UsingLemma 4.1wewrite(notethatthesuperfluous symbol of P isomitted throughout theproofand,a(r) isused instead of,a(r,n)) n~ l ,e(r) 0~(r) = n~ l ,e(r) 0( r ; 1) 0u(r + 1) - n~ l f3(r) 0r 0(n - r) 00'(r) + n~ l f3(r) 0( n - ; + 1) 0u(r - 1) 0 Letusrewrite thisintoa moreconvenient form: t=2(3(j - l )o(;) ou(j )- ~ .1 .1=1{3(j )ojo(n- j )ou(j )+"t' .1=0{3(j +l )o( n; j ) ou(j )o Thisis, in fact, Ej=oI(j ) . a(j ), where I(j ) aresuitable coefficients . Thus , l(n) = ,fi(n - 1) . (~) = 1, I(n - 1) = ,B(n - 2) . (n; l) - {3(n - 1) . (n - 1) = ~ - ~ = 0, andmoreover , forevery2 :::;j :::; n - 2 onecanwrite l(j ) = (3(j - 1) . (~) - (3(j ) .j . (n - j ) + (3(j + 1) . (n2j) = = (j )-1. {(n - j + 1) - 2(n - j ) + (n - j - 1)} = O. Hence , owingto 0'(0) = 0'(1) = 0 andn ~ 2 weobtain
n-l E (3(r) .Ll(r) = r=l
n
L
j
l
=
(
j
)
.
u
(
j
)
=
u
(
n
)
=
M
(
N
)
.
2
0 If one considers
a subset A c
N
in
the
role
of
N
in the preceding
statement , then one obtains cardA
M(AIIP) = L
- l
(3(r, cardA) . ~ (r, A liP)
(8)
r = l
for every A c N , card A ~ 2. One can interpret this in the following way. Whenever [~i]iEA is a random subvector of [~i]iEN, then M (A II P ) is a measure of global dependenceamong factors in A , and the value {3(r , card A ) . ~ (r , A IIP ) expressesthe contribution of dependencesof level
MULTIINFORMATION AND STOCHASTICDEPENDENCE
281
r among factors in A . In this sense, the coefficient f3(r , card A ) then reflects the relationship between the level r and the number of factors. Thus, the "weights" of different levels (and their mutual ratios, too) depend on the number of factors in consideration. The formula (8) leads to the following proposal. We proposeto measure the strength of stochastic dependenceamong factors A c N (card A ~ 2) of level r (1 ~ r ~ card A - I ) by meansof the number:
A(r, A IIP) = (j (r, cardA) . d (r, A IIP) . The symbol of P is omitted whenever it is suitable. By Proposition 4.1 A(r , A) is nonnegative and vanishesjust in caseof absenceof interactions of degree r within A . The formula (8) says that M (A) is just the sum of A(r , A)s. To have a direct formula one can rewrite the definition of "\(r , A) using Lemma 4.1 as follows:
=(a-r).(r:1)-1 -2.(a-r).(;)-1.a (r,A)+(a-r).(r:1)-1
A(r, A)
oa(r + l , A )
oa(r - l , A ) ,
where a = card A , 1 :s; r :s; a - I . Let us clarify the relation to Han's measure[8] ~ 2e~n) of level r among n = card N variables .
We have:
A(r , N ) = (n - r ) . Ll2e~n) for every1 ~ r ~ n - 1, n 2: 2 . We did not study the computational complexity of calculating the particular characteristics introduced in this section - this can be a suhiect of .. future , more applied research. 5. Axiomatic
characterization
The aim of this section is to demonstrate that the multiinformation func tion can be used to derive theoretical results concerning formal properties of conditional independence . For this purpose we recall the proof of the result from [20] . Moreover , we enrich the proof by introducing several concepts which (as we hope ) clarify the proof and indicate which steps are substan tial . The reader may surmise that our proof is based on Consequence 2.1 and the formula from Lemma 2.3. However , these facts by themselves are not sufficient , one needs something more . Let us describe the structure of this long section . Since the mentioned result says that probabilistic independency models cannot be characterized
282
, , M. STUDENYAND J. VEJNAROVA
by means of a finite number of formal properties of (= axioms for ) indepen dency models one has to clarify thoroughly what is meant by such a formal property . This is done in subsection 5.1: first (in 5.1.1) syntactic records of those properties are introduced and illustrated by examples , and then (in 5.1.2) their meaning is explained . The aim to get rid of superfluous for mal properties motivates the rest of the subsection 5.1: the situation when a formal property of independency models is a consequence of other such formal properties is analyzed in 5.1.3; "pure " formal properties having in every situation a nontrivial meaning are treated in 5.1.4. The subsection 5.2 is devoted to specific formal properties of probabilis tic independency models . We show by an example that their validity (= probabilistic soundness) can be sometimes derived by means of the multi information function . The analysis in 5.2.1 leads to the proposal to limit attention to certain "perfect " formal properties of probabilistic indepen dency models in 5.2.2. Finally , the subsection 5.3 contains the proof of the nonaxiomatizability result . The method of the proof is described in 5.3.1: one has to find an infinite collection of perfect probabilistically sound formal properties of independency models . Their probabilistic soundness is verified in 5.3.2, their perfectness in 5.3.3. 5.1. FORMAL PROPERTIES OF INDEPENDENCY MODELS We have already introduced the concept of an independency model over N as a subset of the class T (N ) (see subsection 2.2.) . This is too general a concept to be of much use. One needs to restrict oneself to special independency models which satisfy certain reasonable properties . Many authors dealing with probabilistic independency models have formulated certain reasonable properties in the form of formal schemata which they named axioms . Since we want to prove that probabilistic independency models cannot be characterized by means of a finite number of such axioms we have to specify meticulously what is the exact meaning of such formal schemata . Thus , we both describe the syntax of those schemata and explain their semantics . Let us start with an example . A semigraphoid [14] is an independency model which satisfies four formal properties expressed by the following schemata having the form of inference rules .
(A,BIG) -+ (B, AIC) (A,BOlD) -+ (A, OlD) (A,BCID) -t (A, BICD) [(A,BICD) A (A, OlD)] -t
symmetry decomposition weak union (A , BOlD )
contraction .
Roughly, the schematashould be understood as follows : if an independency
MULTIINFORMATION
AND
STOCHASTIC
DEPENDENCE
283
model contains the triplets before the arrow , then it contains the triplet after the arrow . Thus , we are interested in formal properties of independency models of such a type .
5.1.1. Syntax of an inference rule
Let us start with a few technical definitions. Supposing S is a given fixed nonempty finite set of symbols, the formulas (K1, K2|K3), where K1, K2, K3 are disjoint subsets of S represented by juxtapositions of their elements, will be called terms over S. We write K ≈ L to denote that K and L are juxtapositions of all elements of the same subset of S (they can differ in their order). We say that a term (K1, K2|K3) over S is an equivalent version of the term (L1, L2|L3) over S if Ki ≈ Li for every i = 1, 2, 3. We say that (K1, K2|K3) is a symmetric version of (L1, L2|L3) if K1 ≈ L2, K2 ≈ L1, K3 ≈ L3. For example, the term (AE, BC|D) over S = {A, B, C, D, E, F} is an equivalent version of the term (AE, CB|D) and a symmetric version of the term (BC, EA|D).
A regular inference rule with r antecedents and s consequents is specified by
(a) positive integers r, s,
(b) a finite set of symbols S, possibly including a special symbol ∅,
(c) a sequence of ordered triplets [S1^k, S2^k, S3^k], k = 1, ..., r+s, of nonempty subsets of S such that for every k the sets S1^k, S2^k, S3^k are pairwise disjoint.
Moreover, we have several technical requirements:
- S has at least three symbols,
- if Si^k contains the symbol ∅, then no other symbol from S is involved in Si^k (for every k = 1, ..., r+s and every i = 1, 2, 3),
- if k, l ∈ {1, ..., r+s}, k ≠ l, then Si^k ≠ Si^l for some i ∈ {1, 2, 3},
- every σ ∈ S belongs to some Si^k,
- there is no pair of different symbols σ, τ ∈ S such that
  ∀k = 1, ..., r+s  ∀i = 1, 2, 3   [σ ∈ Si^k  ⇔  τ ∈ Si^k].
A syntactic record of the corresponding inference rule is then
[(S1^1, S2^1|S3^1) ∧ ... ∧ (S1^r, S2^r|S3^r)] → [(S1^{r+1}, S2^{r+1}|S3^{r+1}) ∨ ... ∨ (S1^{r+s}, S2^{r+s}|S3^{r+s})]
where each Si^k is represented by a juxtaposition of the involved symbols. Here the terms (S1^k, S2^k|S3^k) for k = 1, ..., r are the antecedent terms, while those for k = r+1, ..., r+s are the consequent terms.
Example 5.1 Take r = 2, s = 1, and S = {A, B, C, D}. Moreover, let us put [S1^1, S2^1, S3^1] = [{A}, {B}, {C, D}], [S1^2, S2^2, S3^2] = [{A}, {C}, {D}], [S1^3, S2^3, S3^3] = [{A}, {B, C}, {D}]. All our technical requirements are satisfied. One possible corresponding syntactic record was already mentioned under the label "contraction" in the definition of a semigraphoid. Thus, contraction is a regular inference rule with two antecedents and one consequent. Note that another possible syntactic record can be obtained, for example, by replacing the first antecedent term by its equivalent version:
[(A, B|DC) ∧ (A, C|D)] → (A, BC|D).  ◇
Of course, the remaining semigraphoid schemata are also regular inference rules in the sense of our definition.

Remark Our technical requirements in the above definition anticipate the semantics of the symbols. The symbols from S are interpreted as (disjoint) subsets of a factor set N, and the special symbol ∅ is reserved for the empty set. Terms are interpreted as elements of T(N). The third requirement ensures that no term in a syntactic record of an inference rule is an equivalent version of another (different) term. The further requirements avoid redundancy of symbols in S: the fourth one means that no symbol is unused, while the fifth one prevents their doubling, as for example in the "rule"
[(A, BE|CD) ∧ (A, C|D)] → (A, EBC|D)
where the symbol B is doubled by the symbol E.
5.1.2. Semantics of an inference rule
Let us consider a regular inference rule ω with r antecedents and s consequents. What is its meaning for a fixed nonempty factor set N? A substitution mapping (for N) is a mapping m which assigns a set m(σ) ⊂ N to every symbol σ ∈ S in such a way that:
- m(∅) is the empty set,
- {m(σ); σ ∈ S} is a disjoint collection of subsets of N,
- ⋃{m(σ); σ ∈ S1^k} ≠ ∅ for every k = 1, ..., r+s,
- ⋃{m(σ); σ ∈ S2^k} ≠ ∅ for every k = 1, ..., r+s.
Of course, it may happen that no such substitution mapping exists for a factor set N; for example, in the case of contraction for N with card N = 2. However, in case such a mapping m exists, an inference instance of the considered inference rule (induced by m) is the (r+s)-tuple [t1, ..., tr+s] of elements of T(N) defined as follows:
tk = ( ⋃{m(σ); σ ∈ S1^k} , ⋃{m(σ); σ ∈ S2^k} | ⋃{m(σ); σ ∈ S3^k} )   for k = 1, ..., r+s.
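To make the notion of an inference instance concrete, here is a small Python sketch; it is our own illustration, not part of the original text, and the helper names and the encoding of symbol images as frozensets are ours. Under these assumptions it enumerates the substitution mappings of the contraction rule for N = {1, 2, 3} and prints the induced inference instances; it finds exactly the six instances discussed in Example 5.2 below.

    from itertools import product

    N = [1, 2, 3]
    SYMBOLS = ['A', 'B', 'C', 'D']
    # Contraction, [(A,B|CD) and (A,C|D)] -> (A,BC|D), written as triplets of symbol sets.
    RULE = [({'A'}, {'B'}, {'C', 'D'}),    # antecedent 1
            ({'A'}, {'C'}, {'D'}),         # antecedent 2
            ({'A'}, {'B', 'C'}, {'D'})]    # consequent

    def contraction_instances():
        found = set()
        # A substitution mapping assigns pairwise disjoint (possibly empty) subsets of N
        # to the symbols; equivalently, each element of N is given to at most one symbol,
        # which is what we enumerate here.
        for assignment in product(SYMBOLS + [None], repeat=len(N)):
            m = {s: frozenset(x for x, owner in zip(N, assignment) if owner == s)
                 for s in SYMBOLS}
            inst = tuple((frozenset().union(*(m[s] for s in s1)),
                          frozenset().union(*(m[s] for s in s2)),
                          frozenset().union(*(m[s] for s in s3)))
                         for s1, s2, s3 in RULE)
            # the first and second component of every term must be nonempty
            if all(term[0] and term[1] for term in inst):
                found.add(inst)
        return found

    instances = contraction_instances()
    print(len(instances))                      # 6 instances for N = {1, 2, 3}
    for t1, t2, t3 in sorted(instances, key=repr):
        print(t1, t2, '->', t3)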
The (r+s)-tuple [t1, ..., tr | tr+1, ..., tr+s] is formally divided into the r-tuple made of the triplets t1, ..., tr, which are called antecedents, and the s-tuple made of the triplets tr+1, ..., tr+s, which are called consequents.

Example 5.2 Let us continue with Example 5.1 and consider the factor set N = {1, 2, 3}. Put
m(A) = {1},  m(B) = {2},  m(C) = {3},  m(D) = ∅.
It is a substitution mapping for N. The corresponding inference instance of contraction is [t1, t2 | t3] where
t1 = ({1}, {2} | {3}),  t2 = ({1}, {3} | ∅),  t3 = ({1}, {2, 3} | ∅).
Here t1, t2 are the antecedents and t3 is the consequent. Of course, other inference instances, induced by other substitution mappings for N, are possible. In this case one finds 5 other ones:
t1 = ({1}, {3} | {2}),  t2 = ({1}, {2} | ∅),  t3 = ({1}, {2, 3} | ∅),
t1 = ({2}, {1} | {3}),  t2 = ({2}, {3} | ∅),  t3 = ({2}, {1, 3} | ∅),
t1 = ({2}, {3} | {1}),  t2 = ({2}, {1} | ∅),  t3 = ({2}, {1, 3} | ∅),
t1 = ({3}, {1} | {2}),  t2 = ({3}, {2} | ∅),  t3 = ({3}, {1, 2} | ∅),
t1 = ({3}, {2} | {1}),  t2 = ({3}, {1} | ∅),  t3 = ({3}, {1, 2} | ∅).
However, the number of possible substitution mappings for a fixed factor set N is always finite. Therefore, the number of inference instances of a regular inference rule for a given fixed factor set is finite. ◇

Having a fixed factor set N and a regular inference rule ω with r antecedents and s consequents, we say that an independency model M ⊂ T(N) is closed under ω iff for every inference instance [t1, ..., tr+s] ∈ T(N)^{r+s} (of ω for N)
{t1, ..., tr} ⊂ M   implies   {tr+1, ..., tr+s} ∩ M ≠ ∅.

Example 5.3 Let us continue with the preceding example. The independency model I over N = {1, 2, 3} consisting of the triplet ({1}, {2} | ∅) only is closed under contraction, since no inference instance of contraction for N has both antecedents in I. On the other hand, the independency model M = {({1}, {2} | ∅), ({1}, {3} | {2})} is not closed under contraction. Indeed, one has i1, i2 ∈ M but i3 ∉ M for the inference instance [i1, i2 | i3] with
i1 = ({1}, {3} | {2}),  i2 = ({1}, {2} | ∅),  i3 = ({1}, {2, 3} | ∅).  ◇
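The closedness condition is easy to test mechanically for a small factor set. The following sketch is again our own illustration (the six contraction instances for N = {1, 2, 3} are written out by hand); it confirms the two claims of Example 5.3.

    # Triplets (X, Y | Z) are encoded as (frozenset, frozenset, frozenset).
    def T(x, y, z=()):
        return (frozenset(x), frozenset(y), frozenset(z))

    # The six inference instances of contraction for N = {1, 2, 3}
    # (antecedent 1, antecedent 2, consequent), cf. Example 5.2.
    INSTANCES = []
    for a, b, c in [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]:
        INSTANCES.append((T([a], [b], [c]), T([a], [c]), T([a], [b, c])))

    def closed_under_contraction(model):
        # M is closed iff whenever both antecedents lie in M, so does the consequent.
        return all(t3 in model
                   for t1, t2, t3 in INSTANCES
                   if t1 in model and t2 in model)

    I = {T([1], [2])}                          # closed: no instance has both antecedents in I
    M = {T([1], [2]), T([1], [3], [2])}        # not closed: the triplet ({1},{2,3}|emptyset) is missing
    print(closed_under_contraction(I))   # True
    print(closed_under_contraction(M))   # False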
5.1.3. Logical implication of inference rules
Many authors wish or hope to characterize the class of probabilistic independency models [20, 9], or various other reasonable classes of independency models, for example the class of graph-isomorphic independency models [14] or the classes of possibilistic and EMVD independency models [1, 6]. Such an approach hides a deeper wish: to characterize the respective class of independency models as the class of independency models closed under a finite collection of regular inference rules, that is, to give a finite axiomatic characterization. Such a characterization would be an ideal solution. Indeed, the verification whether a given independency model is closed under a finite number of regular inference rules is a laborious but completely automatic process which can in principle be done by a computer, without the need to construct an inducing probability distribution. Of course, one would be interested in a minimal collection of such rules; superfluous inference rules should be recognized and removed. For this purpose we need the following relation among collections of inference rules.

We say that a collection T of regular inference rules logically implies a regular inference rule ω if the following holds: whenever an independency model M over a (nonempty finite) factor set N is closed under every inference rule υ ∈ T, then M is closed under ω.

Usually, an easy way to show such a logical implication is to construct a derivation sequence. We hope that an illustrative example gives better insight than a pedantic (syntactic) definition of the derivability relation which we have in mind, and which would be too complicated.

Example 5.4 Let us show that the following inference rule ω with three antecedents and one consequent
[(A, B|E) ∧ (A, C|BE) ∧ (A, D|CE)] → (A, D|E)
is logically implied by the semigraphoid inference rules. The set of symbols is S = {A, B, C, D, E}. Here is the corresponding derivation sequence of terms over S:
1. (A, B|E) is an antecedent term,
2. (A, C|BE) is an antecedent term,
3. (A, D|CE) is an antecedent term,
4. (A, BC|E) is directly derived from 2. and 1. by contraction,
5. (A, C|E) is directly derived from 4. by decomposition,
6. (A, CD|E) is directly derived from 3. and 5. by contraction,
7. (A, D|E) is directly derived from 6. by decomposition, and it is the consequent term.
Every term in the derivation sequence is either an antecedent term or is "directly derived" from preceding terms in the sequence by one of the semigraphoid inference rules.
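A derivation of this kind can also be checked by brute force at the level of a small factor set: starting from the images of the antecedent terms one repeatedly applies the semigraphoid rules until a fixed point is reached, and then tests whether the image of the consequent term has been produced. The sketch below is our own illustration (helper names are ours); it assumes the substitution m(A) = {1}, m(B) = {2}, m(C) = {3}, m(D) = {4}, m(E) = ∅ over N = {1, 2, 3, 4} for the rule of Example 5.4.

    from itertools import combinations

    def T(x, y, z=()):
        return (frozenset(x), frozenset(y), frozenset(z))

    def nonempty_proper_subsets(s):
        return [frozenset(c) for r in range(1, len(s))
                for c in combinations(sorted(s), r)]

    def semigraphoid_closure(triplets):
        closure = set(triplets)
        while True:
            new = set()
            for (a, b, c) in closure:
                new.add((b, a, c))                                   # symmetry
                for b1 in nonempty_proper_subsets(b):
                    new.add((a, b1, c))                              # decomposition
                    new.add((a, b - b1, c | b1))                     # weak union
            for (a, b, cd) in closure:                               # contraction
                for (a2, c, d) in closure:
                    if a2 == a and d <= cd and cd - d == c:
                        new.add((a, b | c, d))
            if new <= closure:
                return closure
            closure |= new

    # Example 5.4 with m(A)={1}, m(B)={2}, m(C)={3}, m(D)={4}, m(E)=empty set:
    antecedents = [T([1], [2]), T([1], [3], [2]), T([1], [4], [3])]
    print(T([1], [4]) in semigraphoid_closure(antecedents))   # True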
Now, let us show that every independency model M closed under the semigraphoid inference rules is closed under ω as well. To this end consider an inference instance [t1, t2, t3 | t4] of ω for a factor set N, induced by a substitution mapping m, such that the antecedents t1, t2, t3 belong to M. One constructs a sequence u1, ..., u7 of elements of T(N) which "copies" the derivation sequence:
u1 = (m(A), m(B) | m(E)) := t1,
u2 = (m(A), m(C) | m(B) ∪ m(E)) := t2,
u3 = (m(A), m(D) | m(C) ∪ m(E)) := t3,
u4 = (m(A), m(B) ∪ m(C) | m(E)),
u5 = (m(A), m(C) | m(E)),
u6 = (m(A), m(C) ∪ m(D) | m(E)),
u7 = (m(A), m(D) | m(E)) := t4.
By induction on j = 1, ..., 7 one shows that every uj belongs to M: each uj is either an antecedent, and hence in M by assumption, or arises from earlier elements of the sequence by an inference instance of contraction or decomposition, under which M is closed. In particular t4 = u7 ∈ M, which was the desired conclusion. ◇

5.1.4. Pure inference rules
For technical reasons (which will become clear later, in 5.2.2) we wish to avoid inference rules having "trivial" inference instances, namely instances in which a consequent carries no new information because it coincides with an antecedent or with its symmetric image. It may happen that a substitution mapping maps some symbol to the empty set and thereby "collapses" a consequent term onto an antecedent term, as the following example demonstrates.

Example 5.5 Let us consider the following regular inference rule:
[(A, BC|D) ∧ (B, D|AC)] → (B, A|D).
Take the factor set N = {1, 2, 3} and put m(A) = {1}, m(B) = {2}, m(C) = ∅, m(D) = {3}. It induces the inference instance [t1, t2 | t3] with
t1 = ({1}, {2} | {3}),  t2 = ({2}, {3} | {1}),  t3 = ({2}, {1} | {3}).
Here the consequent t3 is the symmetric image of the antecedent t1. ◇

We say that a regular inference rule ω is pure if there is no inference instance [t1, ..., tr | tr+1, ..., tr+s] of ω (for any factor set N) in which some consequent coincides either with an antecedent or with a symmetric image of an antecedent.
Such a condition, being formulated in terms of inference instances and substitution mappings, is not convenient to verify directly. The following lemma gives a simple sufficient criterion formulated in terms of the syntactic record of the rule: roughly, it requires that every consequent term be distinguished, by suitable symbols of S, both from every antecedent term and from the symmetric versions of the antecedent terms.

Lemma 5.1 A regular inference rule ω is pure whenever every consequent term of ω is distinguished in this sense from all antecedent terms of ω and from their symmetric versions.

Proof. One first verifies that under any substitution mapping the images of distinguished terms are distinct elements of T(N); the distinguishing symbols guarantee that the corresponding components of the images differ. We leave the details to the reader. □

5.2. PROBABILISTICALLY SOUND INFERENCE RULES

We say that a regular inference rule ω is probabilistically sound if every probabilistic independency model is closed under ω. For example, each of the four semigraphoid inference rules is probabilistically sound; this expresses the well-known formal properties of conditional independence. The multiinformation function can be regarded as a good tool for verifying probabilistic soundness: this is the way the probabilistically sound inference rules in [10, 11] were found and verified. On the other hand, perhaps not every probabilistically sound inference rule can be verified by means of the multiinformation function: lately there have appeared certain nontrivial conditional inequalities
for the multiinformation (or entropic) function [27, 12]. Thus, the question whether every probabilistically sound inference rule can be derived by means of the multiinformation function remains open. However, to support our arguments about its usefulness we give an illustrative example. We believe that an example is more didactic than a technical description of the method.
Example 5.6 To show the probabilistic soundness of weak union one has to verify, for an arbitrary factor set N, for any probability distribution P over N, and for any collection of disjoint sets A, B, C, D ⊂ N which are nonempty with the possible exceptions of C and D, that
A ⊥ BC | D (P)   ⇒   A ⊥ B | CD (P).
The assumption A ⊥ BC | D (P) can be rewritten by Consequence 2.1(b) and Lemma 2.3 in terms of the multiinformation function M induced by the distribution P:
0 = M(ABCD) + M(D) - M(AD) - M(BCD).
Then one can "artificially" add and subtract the terms M(CD) - M(ACD) and by Lemma 2.3 derive:
0 = {M(ABCD) + M(CD) - M(ACD) - M(BCD)} + {M(ACD) + M(D) - M(AD) - M(CD)}
  = I(A; B|CD) + I(A; C|D).
By Consequence 2.1(a) both I(A; B|CD) and I(A; C|D) are nonnegative, and therefore they vanish! But that implies by Consequence 2.1(b) that A ⊥ B | CD (P). ◇
Note that, using the method shown in the preceding example, one can easily see that every semigraphoid inference rule is probabilistically sound.

5.2.1. Redundant rules
However, some probabilistically sound inference rules are superfluous for the purposes of providing an axiomatic characterization of probabilistic independency models. The following consequence follows directly from the given definitions.

Consequence 5.1 If ω is a regular inference rule which is logically implied by a collection of probabilistically sound inference rules, then ω is probabilistically sound.
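The identity exploited in Example 5.6 is also easy to check numerically. The sketch below is our own illustration; it assumes four binary variables, natural logarithms, and a randomly generated distribution satisfying A ⊥ BC | D. It computes conditional mutual information from the multiinformation function as in Lemma 2.3 and confirms that both I(A;B|CD) and I(A;C|D) vanish.

    import itertools, math, random

    random.seed(0)
    VALS = [0, 1]
    A, B, C, D = 0, 1, 2, 3          # variable positions in the joint tuple

    # A joint distribution over four binary variables in which A is independent
    # of (B, C) given D:  p(a, b, c, d) = p(d) p(a|d) p(b, c|d).
    p_a_d, p_bc_d = {}, {}
    for d in VALS:
        q = random.random()
        p_a_d[d] = {0: q, 1: 1.0 - q}
        w = [random.random() for _ in range(4)]
        p_bc_d[d] = {(b, c): w[2 * b + c] / sum(w) for b in VALS for c in VALS}
    joint = {(a, b, c, d): 0.5 * p_a_d[d][a] * p_bc_d[d][(b, c)]
             for a, b, c, d in itertools.product(VALS, repeat=4)}

    def entropy(indices):
        marg = {}
        for x, p in joint.items():
            key = tuple(x[i] for i in indices)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log(p) for p in marg.values() if p > 0)

    def M(indices):                  # multiinformation of the selected variables
        return sum(entropy([i]) for i in indices) - entropy(list(indices))

    def I(X, Y, Z):                  # conditional mutual information via Lemma 2.3
        return M(X + Y + Z) + M(Z) - M(X + Z) - M(Y + Z)

    print(round(I([A], [B], [C, D]), 12))   # ~ 0, i.e. A independent of B given CD
    print(round(I([A], [C], [D]), 12))      # ~ 0, i.e. A independent of C given D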
A clear example of a superfluous rule is an inference rule with redundant antecedent terms.

Example 5.7 The inference rule [(A, BC|D) ∧ (C, B|A)] → (A, B|CD) is a probabilistically sound regular inference rule. But it can be ignored since it is evidently logically implied by weak union. ◇

Therefore we should limit ourselves to "minimal" probabilistically sound inference rules, i.e. to probabilistically sound inference rules such that no antecedent term can be removed without violating the probabilistic soundness of the resulting reduced inference rule. However, even such a rule can be logically implied by probabilistically sound rules with fewer antecedents. We need the following auxiliary construction of a probability distribution to give an easy example.

Construction B Supposing A ⊂ N, card A ≥ 2, there exists a probability distribution P over N such that
M(B||P) = max{0, card(A ∩ B) - 1} · ln 2   for B ⊂ N.
Proof. Let us put Xi = {0, 1} for i ∈ A, Xi = {0} for i ∈ N \ A. Define P on X_N as follows:
P([xi]_{i∈N}) = 1/2   whenever   [∀i, j ∈ A   xi = xj],
P([xi]_{i∈N}) = 0     otherwise.  □
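Construction B can be implemented and tested directly. The sketch below is our own (multiinformation is computed in natural-logarithm units, so the claimed value appears as (card(A ∩ B) - 1) · ln 2); it builds P for N = {1, 2, 3, 4} and A = {1, 4} and compares M(B||P) with the formula above for several sets B.

    import itertools, math

    N = [1, 2, 3, 4]
    A = {1, 4}

    # Construction B: X_i = {0,1} for i in A, X_i = {0} otherwise; P puts mass 1/2
    # on each of the two configurations in which all coordinates in A agree.
    states = {i: [0, 1] if i in A else [0] for i in N}
    joint = {}
    for x in itertools.product(*(states[i] for i in N)):
        cfg = dict(zip(N, x))
        all_equal_on_A = len({cfg[i] for i in A}) == 1
        joint[x] = 0.5 if all_equal_on_A else 0.0

    def entropy(B):
        marg = {}
        for x, p in joint.items():
            if p == 0.0:
                continue
            key = tuple(v for i, v in zip(N, x) if i in B)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log(p) for p in marg.values())

    def M(B):
        return sum(entropy({i}) for i in B) - entropy(B)

    for B in [{1}, {1, 2}, {1, 4}, {1, 2, 4}, set(N)]:
        expected = max(0, len(A & B) - 1) * math.log(2)
        print(sorted(B), round(M(B), 6), round(expected, 6))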
Example 5.8 We have already verified that the inference rule ω from Example 5.4 is logically implied by the semigraphoid inference rules. Hence, ω is probabilistically sound by Consequence 5.1. Let us consider a "reduced" inference rule made from ω by the removal of an antecedent term:
[(A, B|E) ∧ (A, C|BE)] → (A, D|E).
This is a regular inference rule with 2 antecedents and one consequent which is not probabilistically sound. To disprove its probabilistic soundness one has to find a probabilistic independency model which is not closed under this "reduced" inference rule. Use Construction B with the factor set N = {1, 2, 3, 4} and A = {1, 4}. By Consequence 2.1 one verifies for the constructed distribution P that {1} ⊥ {2} | ∅ (P) and {1} ⊥ {3} | {2} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. As concerns an alternative "reduced" inference rule
[(A, B|E) ∧ (A, D|CE)] → (A, D|E),
use Construction B with A = {1, 3, 4}: one has a distribution P over N such that {1} ⊥ {2} | ∅ (P) and {1} ⊥ {4} | {3} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. As concerns the third possible "reduced" inference rule
[(A, C|BE) ∧ (A, D|CE)] → (A, D|E),
use again Construction B with A = {1, 2, 3, 4}. Thus, one has a distribution P with {1} ⊥ {3} | {2} (P) and {1} ⊥ {4} | {3} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. ◇

5.2.2. Perfect rules
Thus, one should search for conditions which ensure that an inference rule is not logically implied by probabilistically sound inference rules with fewer antecedents. We propose the following condition. We say that a probabilistically sound regular inference rule with r antecedents (and s consequents) is perfect if there exists a factor set N and an inference instance [t1, ..., tr | tr+1, ..., tr+s] ∈ T(N)^{r+s} such that the symmetric closure of every proper subset of {t1, ..., tr} is a probabilistic independency model over N.
Lemma 5.2 Let ω be a perfect probabilistically sound pure regular inference rule with r antecedents (and s consequents). Then there exists a factor set N and an independency model M over N such that M is closed under every probabilistically sound inference rule with at most r - 1 antecedents, but M is not closed under ω.

Proof. By the definition of perfectness there exist a factor set N and an inference instance [t1, ..., tr | tr+1, ..., tr+s] ∈ T(N)^{r+s} of ω such that the symmetric closure of every proper subset of {t1, ..., tr} is a probabilistic independency model over N. Define M as the symmetric closure of {t1, ..., tr}. Let υ be a probabilistically sound inference rule with at most r - 1 antecedents and consider an inference instance of υ (for N) whose antecedents all belong to M. These antecedents, being at most r - 1 elements of M, are contained in the symmetric closure M' of a proper subset of {t1, ..., tr}. By the perfectness assumption M' is a probabilistic independency model, and therefore M' is closed under υ; hence some consequent of the considered inference instance belongs to M' ⊂ M. Thus M is closed under υ. However, M is not closed under ω: the antecedents t1, ..., tr of the above inference instance belong to M, while {tr+1, ..., tr+s} ∩ M = ∅ because ω is pure, so that no consequent tr+j coincides with some ti or with a symmetric version of some ti. □
The preceding lemma implies the following consequence with help of the definition of logical implication.
Consequence 5.2 No perfect probabilistically sound pure inference rule is logically implied by a collection of probabilistically sound inference rules with fewer antecedents.

Contraction
is an example of a perfect pure regular
inference rule .
5.3. NO FINITE AXIOMATIC CHARACTERIZATION

5.3.1. Method of the proof
It is clear in the light of Consequence 5.2 how we wish to disprove the existence of a finite axiomatic characterization of probabilistic independency models: it suffices to find, for every r ≥ 3, a perfect probabilistically sound pure regular inference rule with at least r antecedents.

Lemma 5.3 Let us suppose that for every r ≥ 3 there exists a perfect probabilistically sound pure regular inference rule with at least r antecedents. Then there is no finite system T of regular inference rules such that an independency model M (over a factor set N) is a probabilistic independency model iff it is closed under all rules in T.

Proof. Suppose for contradiction that such a finite system T exists. Since every probabilistic independency model is closed under every rule in T, every rule in T is probabilistically sound. Choose r ≥ 3 which exceeds the maximal number of antecedents of the rules in T, and take (by assumption) a perfect probabilistically sound pure regular inference rule ω with at least r antecedents. By Lemma 5.2 and Consequence 5.2, ω is not logically implied by the collection T: there exists a factor set N and an independency model M over N which is closed under every rule in T but is not closed under ω. By the assumed characterization M is a probabilistic independency model. However, ω is probabilistically sound, and therefore every probabilistic independency model, in particular M, is closed under ω, which is a contradiction. □
Thus, we need to verify the assumptions of the preceding lemma. Let us consider for each n ≥ 3 the following inference rule γ(n) with n antecedents and one consequent:
γ(n):   [(A, B1|B2) ∧ ... ∧ (A, Bn-1|Bn) ∧ (A, Bn|B1)] → (A, B2|B1).
It is no problem to verify that each γ(n) is indeed a regular inference rule. Moreover, one can verify easily using Lemma 5.1 that each γ(n) is a pure rule.
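The regularity of γ(n) can indeed be checked mechanically against the technical requirements of 5.1.1. The following sketch is our own illustration; the string encoding of the symbols and the helper names gamma_rule and is_regular are ours, and the check mirrors the requirements only in the straightforward way indicated by the comments.

    def gamma_rule(n):
        # Terms of gamma(n) as triplets (S1, S2, S3) of sets of symbols,
        # antecedents first, the single consequent last.
        S = ['A'] + ['B%d' % i for i in range(1, n + 1)]
        terms = [({'A'}, {'B%d' % k}, {'B%d' % (k % n + 1)}) for k in range(1, n + 1)]
        terms.append(({'A'}, {'B2'}, {'B1'}))
        return S, terms

    def is_regular(S, terms, empty_symbol='0'):
        ok = len(S) >= 3                                              # at least three symbols
        ok &= all(s1 and s2 and s3 for s1, s2, s3 in terms)           # nonempty components
        ok &= all(not (s1 & s2 or s1 & s3 or s2 & s3)                 # pairwise disjoint
                  for s1, s2, s3 in terms)
        ok &= all(comp == {empty_symbol} or empty_symbol not in comp  # special symbol stands alone
                  for term in terms for comp in term)
        ok &= all(any(t1[i] != t2[i] for i in range(3))               # no equivalent versions
                  for a, t1 in enumerate(terms)
                  for b, t2 in enumerate(terms) if a != b)
        used = [frozenset((k, i) for k, term in enumerate(terms)
                          for i in range(3) if s in term[i]) for s in S]
        ok &= all(u for u in used)                                    # every symbol is used
        ok &= len(set(used)) == len(S)                                # no doubled symbols
        return ok

    for n in range(3, 8):
        print(n, is_regular(*gamma_rule(n)))    # True for every n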
Proof. It suffices to find a probabilistic independency model Mt with M ⊂ Mt and t ∉ Mt for every t ∈ T(N) \ M. Indeed, then M = ⋂_{t ∈ T(N)\M} Mt, and by Lemma 2.1 M is a probabilistic independency model.
Moreover, one can limit oneself to the triplets of the form (a, b|C) ∈ T(N) \ M where a, b are singletons. Indeed, for a given general (A, B|C) ∈ T(N) \ M choose a ∈ A, b ∈ B and find the respective probabilistic independency model Mt for t = (a, b|C). Since Mt is a semigraphoid, t ∉ Mt implies (A, B|C) ∉ Mt.
In the sequel we distinguish 5 cases for a given fixed (a, b|C) ∈ T(N) \ M. Each case requires a different construction of the respective probabilistic independency model Mt, that is, a different construction of a probability distribution P over N such that {0} ⊥ {i} | {i+1} (P) for i = 1, ..., n-1, but ¬[ {a} ⊥ {b} | C (P) ]. One can verify these statements about P through the multiinformation function induced by P. If the multiinformation function is known (as it is in the case of our constructions), one can use Consequence 2.1(b) and Lemma 2.3 for this purpose. We leave this to the reader. Here is the list of cases.
I. ∀i = 1, ..., n-1: {a, b} ≠ {0, i} (C arbitrary). In this case use Construction A where A = {a, b}.
II. [∃j ∈ {1, ..., n-1} {a, b} = {0, j}] and C \ {j-1, j+1} ≠ ∅. In this case choose r ∈ C \ {j-1, j+1} and use Construction A where A = {0, j, r}.
III. [∃j ∈ {2, ..., n-1} {a, b} = {0, j}] and C = {j-1, j+1}. In this case use Construction A where A = {0, j-1, j, j+1}.
IV. [∃j ∈ {2, ..., n-1} {a, b} = {0, j}] and C = {j-1}. Use Construction B where A = {0, j, j+1, ..., n}.
V. [∃j ∈ {1, ..., n-1} {a, b} = {0, j}] and C = ∅. Use Construction B where A = N.  □
Consequence 5.3 Each above-mentioned rule γ(n) is perfect.

Proof. Let us fix n ≥ 3, put N = {0, 1, ..., n} and tj = ({0}, {j} | {j+1}) for j = 1, ..., n (convention n+1 := 1), tn+1 = ({0}, {2} | {1}). Evidently, [t1, ..., tn | tn+1] is an inference instance of γ(n). To show that the symmetric closure of every proper subset of {t1, ..., tn} is a probabilistic independency model it suffices to verify it only for every subset of cardinality n-1 (use Lemma 2.1). However, owing to possible cyclic re-indexing of N, it suffices to prove (only) that the symmetric closure M of {t1, ..., tn-1} is a probabilistic independency model. This follows from Lemma 5.5. □
Proposition 5.1 There is no finite system T of regular inference rules characterizing probabilistic independency models as the independency models closed under the rules in T.

Proof. An easy consequence of Lemmas 5.3, 5.4 and Consequence 5.3. □
Conclusions
Let us summarize the paper. Several results support our claim that conditional mutual information I(A;B|C) is a good measure of stochastic conditional dependence between random vectors ξA and ξB given ξC. The value of I(A;B|C) is always nonnegative and vanishes iff ξA is conditionally independent of ξB given ξC. On the other hand, the upper bound for I(A;B|C) is min{H(A|C), H(B|C)}, and the value H(A|C) is achieved just in case ξA is a function of ξBC. A transformation of ξABC which saves ξAC and ξBC increases the value of I(A;B|C); on the other hand, if ξA is transformed while ξBC is saved, then I(A;B|C) decreases. Note that the paper [29] deals with a more practical use of conditional mutual information: it is applied to the problem of finding relevant factors in medical decision-making.
Special level-specific measures of dependence were introduced. While the value M(A) of the multiinformation function is viewed as a measure of global stochastic dependence within [ξi]_{i∈A}, the value of λ(r, A) (for 1 ≤ r ≤ card A - 1) is interpreted as a measure of the strength of dependence of level r among the variables [ξi]_{i∈A}. The value of λ(r, A) is always nonnegative and vanishes iff ξi is conditionally independent of ξj given ξK for arbitrary distinct i, j ∈ A, K ⊂ A, card K = r - 1. And of course, the sum of the λ(r, A)'s is just M(A). Note that the measures λ(r, A) are certain multiples of Han's [8] measures of multivariate symmetric correlation.
Finally, we have used the multiinformation function as a tool to show that conditional independence models have no finite axiomatic characterization. A didactic proof of this result, originally shown in [20], is given. We analyze thoroughly the syntax and semantics of inference rule schemata (= axioms) which characterize formal properties of conditional independence models. The result of the analysis is that two principal features of such schemata are pointed out: the inference rules should be (probabilistically) sound and perfect. To derive the nonaxiomatizability result one has to find an infinite collection of sound and perfect inference rules. In the verification of both soundness and perfectness the multiinformation function proved to be an effective tool.
Let us add a remark concerning the concept of a perfect rule. We have used this concept only in the proof of the nonaxiomatizability result. However, our aim is a bit deeper, in fact. We (vaguely) guess that probabilistic
independency models have a certain uniquely determined "minimal" axiomatic characterization, which is of course infinite. In particular, we conjecture that the semigraphoid inference rules and the perfect probabilistically sound pure inference rules together form the desired axiomatic characterization of probabilistic independency models.

Acknowledgments We would like to express our gratitude to our colleague Frantisek Matus, who directed our attention to the paper [8]. We also thank both reviewers for their valuable comments and corrections of grammatical errors. This work was partially supported by the grant VS 96008 of the Ministry of Education of the Czech Republic and by the grant 201/98/0478 "Conditional independence structures: information theoretical approach" of the Grant Agency of the Czech Republic.

References
1. de Campos, L.M. (1995) Independence relationships in possibility theory and their application to learning in belief networks, in G. Della Riccia, R. Kruse and R. Viertl (eds.), Mathematical and Statistical Methods in Artificial Intelligence, Springer Verlag, 119-130.
2. Csiszar, I. (1975) I-divergence geometry of probability distributions and minimization problems, Ann. Probab., 3, 146-158.
3. Cover, T.M., and Thomas, J.A. (1991) Elements of Information Theory, John Wiley, New York.
4. Darroch, J.N., Lauritzen, S.L., and Speed, T.P. (1980) Markov fields and log-linear interaction models for contingency tables, Ann. Statist., 8, 522-539.
5. Dawid, A.P. (1979) Conditional independence in statistical theory, J. Roy. Stat. Soc. B, 41, 1-31.
6. Fonck, P. (1994) Conditional independence in possibility theory, in R.L. de Mantaras and D. Poole (eds.), Uncertainty in Artificial Intelligence: proceedings of the 10th conference, Morgan Kaufmann, San Francisco, 221-226.
7. Gallager, R.G. (1968) Information Theory and Reliable Communication, John Wiley, New York.
8. Han, T.S. (1978) Nonnegative entropy of multivariate symmetric correlations, Information and Control, 36, 113-156.
9. Malvestuto, F.M. (1983) Theory of random observables in relational data bases, Inform. Systems, 8, 281-289.
10. Matus, F., and Studeny, M. (1995) Conditional independencies among four random variables I., Combinatorics, Probability and Computing, 4, 269-278.
11. Matus, F. (1995) Conditional independencies among four random variables II., Combinatorics, Probability and Computing, 4, 407-417.
12. Matus, F. (1998) Conditional independencies among four random variables III., submitted to Combinatorics, Probability and Computing.
13. Pearl, J., and Paz, A. (1987) Graphoids: graph-based logic for reasoning about relevance relations, in B. Du Boulay, D. Hogg and L. Steels (eds.), Advances in Artificial Intelligence - II, North Holland, Amsterdam, pp. 357-363.
14. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: networks of plausible inference, Morgan Kaufmann, San Mateo.
15. Perez, A. (1977) ε-admissible simplifications of the dependence structure of a set of random variables, Kybernetika, 13, 439-449.
16. Renyi, A. (1959) On measures of dependence, Acta Math. Acad. Sci. Hung., 10, 441-451.
17. Spohn, W. (1980) Stochastic independence, causal independence and shieldability, J. Philos. Logic, 9, 73-99.
18. Studeny, M. (1987) Asymptotic behaviour of empirical multiinformation, Kybernetika, 23, 124-135.
19. Studeny, M. (1989) Multiinformation and the problem of characterization of conditional independence relations, Problems of Control and Information Theory, 18, 3-16.
20. Studeny, M. (1992) Conditional independence relations have no finite complete characterization, in S. Kubik and J.A. Visek (eds.), Information Theory, Statistical Decision Functions and Random Processes: proceedings of the 11th Prague conference - B, Kluwer, Dordrecht (also Academia, Prague), pp. 377-396.
21. Studeny, M. (1987) The concept of multiinformation in probabilistic decision-making (in Czech), PhD thesis, Institute of Information Theory and Automation, Czechoslovak Academy of Sciences, Prague.
22. Vejnarova, J. (1994) A few remarks on measures of uncertainty in Dempster-Shafer theory, Int. J. General Systems, 22, pp. 233-243.
23. Vejnarova, J. (1997) Measures of uncertainty and independence concept in different calculi, accepted to EPIA'97.
24. Watanabe, S. (1960) Information theoretical analysis of multivariate correlation, IBM Journal of Research and Development, 4, pp. 66-81.
25. Watanabe, S. (1969) Knowing and Guessing: a qualitative study of inference and information, John Wiley, New York.
26. Xiang, Y., Wong, S.K.M., and Cercone, N. (1996) Critical remarks on single link search in learning belief networks, in E. Horvitz and F. Jensen (eds.), Uncertainty in Artificial Intelligence: proceedings of the 12th conference, Morgan Kaufmann, San Francisco, 564-571.
27. Zhang, Z., and Yeung, R. (1997) A non-Shannon type conditional information inequality, to appear in IEEE Transactions on Information Theory.
28. Zvarova, J. (1974) On measures of statistical dependence, Casopis pro pestovani matematiky, 99, 15-29.
29. Zvarova, J., and Studeny, M. (1997) Information-theoretical approach to constitution and reduction of medical data, Int. J. Medical Informatics, 45, 65-74.
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
DAVID HECKERMAN
Microsoft Research, Bldg 98
Redmond WA, 98052-6399
heckerma@microsoft.com
Abstract. A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

1. Introduction

A Bayesian network is a graphical model for probabilistic relationships among a set of variables. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman et al., 1995a). More recently, researchers
have developed methods for learning Bayesian networks from data . The techniques that have been developed are new and still evolving , but they have been shown to be remarkably effective for some data -analysis prob lems. In this paper , we provide a tutorial on Bayesian networks and associated Bayesian techniques for extracting and encoding knowledge from data . There are numerous representations available for data analysis , including rule bases, decision trees, and artificial neural networks ; and there are many techniques for data analysis such as density estimation , classification , regression , and clustering . So what do Bayesian networks and Bayesian methods have to offer ? There are at least four answers. One, Bayesian networks can readily handle incomplete data sets. For example , consider a classification or regression problem where two of the explanatory or input variables are strongly anti -correlated . This correlation is not a problem for standard supervised learning techniques , provided all inputs are measured in every case. When one of the inputs is not observed , however , most models will produce an inaccurate prediction , because they do not encode the correlation between the input variables . Bayesian networks offer a natural way to encode such dependencies. Two , Bayesian networks allow one to learn about causal relationships . Learning about causal relationships are important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain , for example , during exploratory data analysis . In addi tion , knowledge of causal relationships allows us to make predictions in the presence of interventions . For example , a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product . To answer this question , the analyst can determine whether or not the advertisement is a cause for increased sales, and to what degree. The use of Bayesian networks helps to answer such questions even when no experiment about the effects of increased exposure is available . Three , Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data . Anyone who has performed a real-world analysis knows the importance of prior or domain knowledge , especially when data is scarce or expensive . The fact that some commercial systems (i .e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge . Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward . In addition , Bayesian networks encode the strength of causal relationships with probabilities . Consequently , prior knowledge and data can be combined with well-studied techniques from Bayesian statistics .
Four, Bayesian methods in conjunction with Bayesian networks and other types of models offer an efficient and principled approach for avoiding the overfitting of data. As we shall see, there is no need to hold out some of the available data for testing. Using the Bayesian approach, models can be "smoothed" in such a way that all available data can be used for training.
This tutorial is organized as follows. In Section 2, we discuss the Bayesian interpretation of probability and review methods from Bayesian statistics for combining prior knowledge with data. In Section 3, we describe Bayesian networks and discuss how they can be constructed from prior knowledge alone. In Section 4, we discuss algorithms for probabilistic inference in a Bayesian network. In Sections 5 and 6, we show how to learn the probabilities in a fixed Bayesian-network structure, and describe techniques for handling incomplete data including Monte-Carlo methods and the Gaussian approximation. In Sections 7 through 12, we show how to learn both the probabilities and structure of a Bayesian network. Topics discussed include methods for assessing priors for Bayesian-network structure and parameters, and methods for avoiding the overfitting of data including Monte-Carlo, Laplace, BIC, and MDL approximations. In Sections 13 and 14, we describe the relationships between Bayesian-network techniques and methods for supervised and unsupervised learning. In Section 15, we show how Bayesian networks facilitate the learning of causal relationships. In Section 16, we illustrate techniques discussed in the tutorial using a real-world case study. In Section 17, we give pointers to software and additional literature.
.
2 . The Bayesian To understand
Approach
to Probability
Bayesian networks
and Statistics
and associated learning
techniques , it is
important to understand the Bayesian approach to probability and statis tics . In this section , we provide an introduction to the Bayesian approach for those readers familiar only with the classical view . In a nutshell , the Bayesian probability of an event x is a person 's degree of beliefin that event . Whereas a classical probability is a physical property
of the world (e.g., the probability that a coin will land heads), a Bayesian probability is a property of the person who assignsthe probability (e.g., your degree of belief that the coin will land heads) . To keep these two concepts of probability distinct , we refer to the classical probability of an event as the true or physical probability of that event , and refer to a degree of belief in an event as a Bayesian or personal probability . Alternatively , when the meaning is clear , we refer to a Bayesian probability simply as a probability . One important difference between physical probability and personal
304
DAVIDHECKERMAN
probability is that , to measure the latter , we do not need repeated tri als. For example , imagine the repeated tosses of a sugar cube onto a wet surface . Every time the cube is tossed, its dimensions will change slightly . Thus , although the classical statistician has a hard time measurin ~ th ~ ......,
probability that the cube will land with a particular face up , the Bayesian simply restricts his or her attention to the next toss, and assigns a proba bility . As another example , consider the question : What is the probability
that the Chicago Bulls will win the championship in 2001? Here, the classical statistician
must remain silent , whereas the Bayesian can assign a
probability (and perhaps make a bit of money in the process). One common criticism of the Bayesian definition of probability is that probabilities seem arbitrary . Why should de~rees of belief satisfy the rules of -
-
probability ? On what scale should probabilities be measured? In particular ,
it makes senseto assign a probability of one (zero) to an event that will (not) occur, but what probabilities do we aBsignto beliefs that are not at the extremes ? Not surprisingly , these questions have been studied intensely . With regards to the first question , many researchers have suggested different sets of properties that should be satisfied by degrees of belief
(e.g., Ramsey 1931, Cox 1946, Good 1950, Savage1954, DeFinetti 1970). It turns out that each set of properties
leads to the same rules : the rules of
probability . Although each set of properties is in itself compelling , the fact that different sets all lead to the rules of probability provides a particularly strong argument for using probability to measure beliefs. The answer to the question of scale follows from a simple observation : people find it fairly easy to say that two events are equally likely . For exam-
ple, imagine a simplified wheel of fortune having only two regions (shaded and not shaded) , such M the one illustrated in Figure 1. Assuming everything about the wheel as symmetric (except for shading) , you should conclude that it is equally likely for the wheel to stop in anyone position .
From this judgment and the sum rule of probability (probabilities of mutually exclusive and collectively exhaustive sum to one) , it follows that your probability
that the wheel will stop in the shaded region is the percent area
of the wheel that is shaded (in this case, 0.3). This probability wheel now provides a reference for measuring your probabilities of other events. For example , what is your probability that Al Gore will run on the Democratic
ticket in 2000 ? First , ask yourself
the question :
Is it more likely that Gore will run or that the wheel when spun will stop in the shaded region ? If you think that it is more likely that Gore will run , then imagine another wheel where the shaded region is larger . If you think that it is more likely that the wheel will stop in the shaded region , then imagine another wheel where the shaded region is smaller . Now , repeat this process until you think that Gore running and the wheel stopping in the
Figure 1. The probability wheel: a tool for assessing probabilities.
shaded region are equally likely . At this point , yo.ur probability
that Gore
will run is just the percent surface area of the shaded area on the wheel .
In general , the process of measuring a degree of belief is commonly referred to as a probability assessment. The technique for assessment that we have just
described
is one of many available
techniques
discussed in
the Management Science, Operations Research, and Psychology literature . One problem with probability assessment that is addressed in this litera ture is that of precision . Can one really say that his or her probability for event
x is 0 .601 and
not
0 .599 ? In most
cases , no . Nonetheless
, in most
cases, probabilities are used to make decisions, and these decisions are not sensitive to small variations in probabilities . Well -established practices of sensitivity
analysis help one to know when additional
precision
is unneces -
sary (e.g., Howard and Matheson , 1983) . Another problem with probability assessment is that of accuracy . F'or example , recent experiences or the way a question is phrased can lead to assessmentsthat do not reflect a person 's true beliefs (Tversky and Kahneman , 1974) . Methods for improving accu-
racy can be found in the decision-analysis literature (e.g, Spetzler et ale (1975)) . Now let us turn
to the issue of learning
with
data . To illustrate
the
Bayesian approach , consider a common thumbtack - .- -one with a round , flat head that c.an be found in most supermarkets . If we throw the thumbtack
up in the air, it will come to rest either on its point (heads) or on its head (tails) .1 Supposewe flip the thumbtack N + 1 times, making sure that the physical properties of the thumbtack and the conditions under which it is flipped remain stable over time . From the first N observations , we want to determine the probability of heads on the N + 1th toss. In the classical analysis of this problem , we assert that there is some physical probability of heads, which is unknown . We estimate .this physical probability from the N observations using c,riteria such as low bias and low variance . We then use this estimate as our probability for heads on the N + 1th toss. In the Bayesian approach , we also assert that there is 1This example is taken from Howard (1970).
306
DAVIDHECKERMAN
some physical probability
of heads, but we encode our uncertainty about
this physical probability using (Bayesian) probabilities, and use the rules of probability to compute our probability of heads on the N + Ith toss.2 To examine the Bayesian analysis of this problem , we need some nota -
tion . We denote a variable by an upper-case letter (e.g., X , Y, Xi , 8 ), and the state or value of a corresponding variable by that same letter in-- lower -- ---
-
case (e.g., x , Y, Xi, fJ) . We denote a set of variables by a bold-face uppercase letter (e.g., X , Y , Xi ). We use a corresponding bold-face lower-case letter (e.g., x , Y, Xi) to denote an assignmentof state or value to each variable in a given set. We say that variable set X is in configuration x . We
use p(X = xl ~) (or p(xl~) as a shorthand) to denote the probability that X == x of a person with state of information ~. We also use p(x It;) to denote the probability
distribution
for X (both mass functions and density ~
functions) . Whether p(xl~) refers to a probability , a probability density, or a probability distribution will be clear from context . We use this notation for probability throughout the paper . A summary of all notation is given at the end of the chapter . Returning to the thumbtack problem , we define e to be a variable3 whose values () correspond to the possible true values of the physical probability . We sometimes refer to (J as a parameter . We eXDress the uncerA
tainty about e using the probability density function p(()I~) . In addition , we use Xl to denote the variable representing the outcome of the Ith flip ,
1 == 1, . . ., N + 1, and D = { X I = Xl , . . . , XN = XN} to denote the set of our observations . Thus , in Bayesian terms , the thumbtack
problem
reduces
to computing p(XN+IID , ~) from p((}I~). To do so, we first use Bayes' rule to obtain the probability for e given D and background knowledge ~:
distribution
p((}ID ,~)=!J!I~ p(DI p)(DI ~)(}~
(1)
where
p(DI~)=Jp(DlfJ ,~)p(fJl ~)dfJ
(2)
Next, we expand the term p(DlfJ, ~) . Both Bayesiansand claBsicalstatisti cians agree on this term : it is the likelihood function for binomial sampling . 2Strictly speaking, a probability belongsto a singleperson, not a collectionof people. Nonetheless , in parts of this discussion , we refer to "our " probability English .
to avoid awkward
3Bayesianstypically refer to e as an uncertain variable, becausethe value of e is uncertain
. In contrast
, classical
statisticians
often
refer to e as a random
variable . In this
text , we refer to e and all uncertain / random variables simply as variables.
A TUTORIALON LEARNINGWITH BAYESIANNETWORKS 307 In particular, giventhe valueof 8 , the observationsin D are mutually independent , and the probability of heads(tails) on anyone observationis () (1 - lJ). Consequently , Equation 1 becomes p(fJID,~) == !!(~I~) Oh(1 - _Of p(DI~)
(3)
wherehand t arethe numberof headsandtails observedin D, respectively . The probability distributions p(OI~) and p(OID,~) are commonlyreferred to as the prior and posteriorfor 8 , respectively . The quantitieshand tare said to be sufficientstatisticsfor binomialsampling, becausethey provide a summarizationof the data that is sufficientto computethe posterior from the prior. Finally, we averageoverthe possiblevaluesof 8 (usingthe expansionrule of probability) to determinethe probabilitythat the N + Ith tossof the thumbtackwill comeup heads: p(XN+l == headsID,~) == J p(XN+l == headsIO , ~) p(OID,~) dO = J 0 p(OID,~) dO== Ep(8ID,<)(0)
(4)
whereEp((JID,~)(fJ) denotesthe expectationof fJwith respectto the distribution p(lJlD , ~). To completethe Bayesianstory for this example, we needa methodto assessthe prior distribution for 8 . A commonapproach , usually adopted for convenience , is to assumethat this distribution is a betadistribution: p
(
fJl
~
)
=
Beta
(
fJiah
,
at
)
=
:
-
r
r
where
ah
ah
+
r
(
l
to
at
)
>
,
=
0
and
=
1
and
r
.
(
.
The
be
of
The
ID
,
~
)
ah
at
-
r (
(
ah
+
are
a
+ h
)
so
also
N r
(
that
the
_ _ +
t
( JQh
.
in
l
l
(
1
-
(
l
-
beta
)
( X.
t
-
l
(
distribution
r
(
to
Figure
)
x
+
,
l
)
=
as
hyperparameters
can
be
xr
(
5
Qt
+
t
-
a
)
=
and
and
normalized
.
By
Equation
3
,
the
:
l
=
=
Beta
(
( }
lah
+
h
,
at
+
tions
say
that
for
the
binomial
set
of
sampling
beta
distributions
.
Also
t
)
6
)
)
(
We
)
.
.
tion
x
ah
2
bu
( J
fJ
hyperparameters
distri
-
-
reasons
beta
h
h
the
The
several
+
( X.
distribution
shown
a
O )
referred
( )
be
) at
_
at
satisfies
often
for
will
_ (
of
parameter
convenient
tion
) r
which
are
is
bu
r
and
the
zero
a )
parameters
distributions
prior
=
the
(
ah
function
than
distri
( }
are
from
beta
beta
posterior
0
Gamma
greater
Examples
(
>
the
them
must
p
is
quantities
distinguish
at
at
)
(
is
,
the
expectation
a
conjugate
family
of
( J
with
of
respect
distribu
-
to
this
308
DAVIDHECKERMAN
B
o [ ZJ
Bela ( I , I )
Beta ( 2,2 )
Figure 2.
distribution
hag
a
simple
form
Beta ( 3,2 )
Beta ( 19,39 )
Several beta distributions .
:
J IJBeta(IJIG G 'h ' h, G 't) dIJ= -;; Hence of
, heads
given in
a the
beta N
prior +
Ith
, toss
(7)
we have a simple expressionfor the probability :
P(XN +l=headsID ,~)=~ ~.:!:_~ o:+N
(8)
Assuming p ((JI~) is a beta distribution , it can be assessedin a number of ways . For example , we can assessour probability for heads in the first toss of the thumbtack (e.g., using a probability wheel) . Next , we can imagine having seen the outcomes of k flips , and reassessour probability for heads in the next toss . From Equation 8, we have (for k = 1)
G 'h ah+1 p(X1=headsl ~)=G =heads ,~)=G 'h+at p(X2=headslX1 'h+at+1 Given these probabilities, we can solve for ah and at . This assessment technique is known as the method of imagined future data. Another assessmentmethod is basedon Equation 6. This equation says that , if we start with a Beta(O, O) prior4 and observe ah heads and at tails , then our posterior (i.e., new prior) will be a Beta(ah, at ) distribution . Recognizingthat a Beta(O, 0) prior encodesa state of minimum information , we can assessO'h and at by determining the (possibly fractional) number of observations of heads and tails that is equivalent to our actual knowledge about flipping thumbtacks. Alternatively , we can assessp(Xl = headsl~) and 0' , which can be regarded as an equivalent sample size for our current knowledge. This technique is known as the method of equivalent samples.
4Technically ,be the hyperparameters prior should besmall positive numbers so that p(81 ~)can normalized . ofthis
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 309 Other techniques for assessing beta distributions
are discussed by Winkler
(1967) and Chaloner and Duncan (1983) . Although the beta prior is convenient , it is not accurate for some prob lems. For example , suppose we think that the thumbtack may have been purchased at a magic shop . In this case, a more appropriate prior may be a mixture of beta distributions - for example ,
p((JI~) = 0.4 Beta(20, 1) + 0.4 Beta(l , 20) + 0.2 Beta(2, 2) where 0.4 is our probability that the thumbtack is heavily weighted toward
heads (tails) . In effect, we have introduced an additional hidden or unobserved variable H , whose states correspond to the three possibilities: (1) thumbtack is biased toward heads, (2) thumbtack is biased toward tails ,
and (3) thumbtack is normal; and we have assertedthat (J conditioned on each state of H is a beta distribution . In general , there are simple methods
(e.g., the method of imagined future data) for determining whether or not a beta prior is an accurate reflection of one's beliefs . In those cases where the beta prior
introducing
is inaccurate , an accurate
prior
can often
be assessed by
additional hidden variables , as in this example .
So far , we have only considered observations
drawn from a binomial
dis -
tribution . In general , observations may be drawn from any physical proba -
bility distribution :
p(xllJ, ~) = f (x , lJ)
where f (x , 6) is the likelihood function with parameters 6. For purposes of this discussion , we assume that
the number
of parameters
is finite . As
an example , X may be a continuous variable and have a Gaussian physical probability distribution with mean JLand variance v :
p(xI9,~) ==(27rV )-1/2e-(X-J.L)2/2v where (J == { J.L, v } . Regardless of the functional form , we can learn about the parameters given data using the Bayesian approach . As we have done in the binomial case, we define variables corresponding to the unknown parameters , assign priors to these variables , and use Bayes' rule to update our beliefs about these parameters given data :
p(8ID,~)
(pJ(,~ ) p ( 81 ~ ) --p(DI DI ~)
We then average over the possible values of e to make predictions.
(9) For
example ,
P(XN+lID,~) = J P(XN+119 ,~) p(9ID,~) d8
(10)
310
DAVIDHECKERMAN
For a class of distributions known a8 the exponential family , these computations can be done e-fficiently and in closed form .5 Members of this claBSinclude the binomial, multinomial , normal, Gamma, Poisson, and multivariate-normal distributions . Each member of this family has sufficient statistics that are of fixed dimension for any random sample, and a simple conjugate prior .6 Bernardo and Smith (pp. 436- 442, 1994) havecompiled the important quantities and Bayesian computations for commonlv ., used members of the exponential family. Here, we summarize these items for multinomial sampling, which we use to illustrate many of the ideas in this paper. In multinomial sampling, the observedvariable X is discrete, having r possible states xl , . . . , xr . The likelihood function is given by p(x = XkIlJ, ~) = {}k,
k = 1, . . . , r
where () = { (}2, . . . , Or} are the parameters. (The parameter (JI is given by 1 - )=:%=2 (Jk.) In -this case, as in the case of binomial sampling, the parameters correspond to physical probabilities. The sufficient statistics for data set D = { Xl = Xl , . . . , XN = XN} are { NI , . . . , Nr } , where Ni is the number of times X = xk in D . The simple conjugate prior used with multinomial Jampling is the Dirichlet distribution : p ( 81 ~ ) == Dir where p ( 8ID
0
=
distribution lent
Ei
, ~ ) == Dir samples
( 81G: l , . . . , G: r ) =
= l Ok , and ( 8lo1
, including , can
also
Q' k >
I1r k =r l ( rG: )( Ok ) kIIlJ =l
0 , k == 1 , . . . , r . The
+ N1 , . . . , Or + Nr ) . Techniques the be
methods used
to
conjugate prior and data set D , observation is given by
posterior for
imagined
future
assess
Dirichlet
distributions
probability
distribution
assessing
of
the
(11)
~k- l
distribution
data
and
the
. Given for
beta
equiva
the
-
this next
p(XN +l == x k ID , ~) == J (Jk Dlr :'l + Nl , . . ., O :'r + Nr ) d8 == O . (810 :'k Nk a+ N (12) As we shall see, another important quantity in Bayesian analysis is the marginal likelihood or evidencep(D I~). In this case, we have p(DI~) :=
r (a) - . II r (O :'k + N~l r (a: + N ) k=l r (O :'k)
(13)
5Recent advances in Monte-Carlomethodshavemadeit possibleto workefficiently with manydistributionsoutsidethe exponential family. See , for example , Gilkset al. (1996 ). 6Infact, exceptfor a few, well-characterized exceptions , theexponential familyis the only classof distributionsthat havesufficientstatisticsof fixeddimension(Koopman , 1936 ; Pitman, 1936 ).
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 311 We note
that
the explicit
mention
cause it reinforces
the notion
once
is firmly
the
this
concept
remainder
of this
In closing classical
this
section
probability
opposite
of the classical
Namely
, in the
imagine all the binomial some
and
data
will
adds
same
from
the
clutter
Bayesian
prediction
are
data . As an illustration
in a manner
" estimate
that
,
" for the
is essentially
, fJ is fixed
( albeit
unknown
the
) , and
may be generated by sampling by fJ. Each data set D will occur
produce
an estimate
the expectation
and variance
we
from with
fJ* ( D ) . To evaluate of the estimate
with
sets : L p ( DI (}) ()* ( D ) D
Varp (DIB) ( (}* )
==
L p ( DI (}) ( ()* ( D ) - Ep (DIB) ( (}* ) ) 2 D
an estimator
of these
and
, they
==
choose
. In
.
Ep (DIB) ( (}* )
variance
,
.
approach
p ( DlfJ ) and
the
. Nonetheless
~ explicitly
. Here , the Bayesian
is obtained
approach
classical
, we compute
to all such
We then
of heads
simply
that , although
for learning
problem
~ is useful , be -
are subjective mention
yield
data sets of size N that distribution determined
probability
an estimator respect
methods
of knowledge
notation
not
sometimes
the thumbtack
physical
, we shall
, we emphasize
may
different
let us revisit
of the state probabilities
in place , the
tutorial
approaches
fundamentally
that
that
estimates
somehow over the
balances possible
(14)
the bias ( () - Ep (D 18) ( (}* ) ) values
for fJ.7 Finally
, we
apply this estimator to the data set that used estimator is the maximum - likelihood
we actually observe . A commonly (ML ) estimator , which selects the
value
p ( D I 0 ) . For binomial
of lJ that
maximizes
the likelihood
sampling
, we
have
OML (D) == ~ r~ N -k L-Ik=l Forthis(andothertypes ) ofsampling , theML estimator is unbiased . That is, for all valuesof 0, theML estimator haszerobias. In addition , for all values of (J, thevariance of theML estimator is nogreater thanthatof any otherunbiased estimator (see , e.g., Schervish , 1995 ). In contrast , in the Bayesian approach , D is fixed, andweimagine all possible valuesof (Jfromwhichthisdataset couldhavebeen generated . Given(J, the "estimate " of thephysical probability of heads isjust (Jitself. Nonetheless , weareuncertain about(J, andsoour finalestimate is the expectation of (Jwithrespect to ourposterior beliefs aboutits value : Ep(BID,~)(O) = J 0 p (OID, E) d(}
(15)
7Low bias and varianceare not the only desirablepropertiesof an estimator. Other desirablepropertiesinclude consistencyand robustness .
312
DAVIDHECKERMAN
The expectations in Equations 14 and 15 are different and , in many cases, lead to different "estimates " . One way to frame this difference is to say that the classical and Bayesian approaches have different definitions for what it means to be a good estimator . Both solutions are "correct " in that they are self consistent . Unfortunately , both methods have their draw backs, which h~ lead to endless debates about the merit of each approach . For example , Bayesians argue that it does not make sense to consider the expectations in Equation 14, because we only see a single data set . If we saw more than one data set, we should combine them into one larger data set . In contrast , cl~ sical statisticians argue that sufficiently accurate priors can not be assessed in many situations . The common view that seems to be emerging is that one should use whatever method that is most sensible for the task at hand . We share this view , although we also believe that the Bayesian approach has been under used, especially in light of its advantages mentiol )ed in the introduction (points three and four ) . Consequently , in this paper , we concentrate on the Bayesian approach . 3 . Bayesian
N etwor ks
So far , we have considered only simple problems with one or a few variables . In real learning problems , however , we are typically interested in looking for relationships among a large number of variables . The Bayesian network is a representation suited to this task . It is a graphical model that efficiently encodes the joint probability distribution (physical or Bayesian ) for a large set of variables . In this section , we define a Bayesian network and show how one can be constructed from prior knowledge . A Bayesian network for a set of variables X = { Xl , . . . , Xn } consists of (1) a network structure S that encodes a set of conditional independence assertions about variables in X , and (2) a set P of local probability distri butions associated with each variable . Together , these components define the joint probability distribution for X . The network structure S is a di rected acyclic graph . The nodes in S are in one-to- one correspondence with the variables X . We use Xi to denote both the variable and its correspond ing node , and PSi to denote the parents of node Xi in S as well as the variables corresponding to those parents . The lack of possible arcs in S encode conditional independencies . In particular , given structure S , the joint probability distribution for X is given by
n p(x) ==i=l IIp (xilpai )
(16)
The local probabili ty distributions P are the distributions corresponding to the terms in the product of Equation 16. Consequently, the pair (8 , P)
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
313
encodesthe joint distribution p(x ). The probabilities encoded by a Bayesian network may be Bayesian or physical . When building Bayesian networks from prior knowledge alone , the probabilities will be Bayesian . When learning these networks from data , the
probabilities will be physical (and their values may be uncertain) . In subsequent sections , we describe how we can learn the structure and probabilities of a Bayesian
network
from data . In the remainder
of this section , we ex-
plore the construction of Bayesian networks from prior knowledge . As we shall see in Section 10, this procedure can be useful in learning Bayesian networks
~
well .
To illustrate the process of building a Bayesian network , consider the problem of detecting credit -card fraud . We begin by determining the vari ables to model . One possible choice of variables for our problem is Fraud
(F ) , Gas (G) , Jewelry (J ), Age (A), and Sex (8 ) , representing whether or not the current purchase is fraudulent , whether or not there was a gaB purchase in the last 24 hours , whether or not there w~ a jewelry purch ~ e in the last 24 hours , and the age and sex of the card holder , respectively . The states of these variables are shown in Figure 3. Of course, in a realistic problem , we would include many more variables . Also , we could model the
states
of one or more
of these
variables
at a finer
level
of detail
. For
example , we could let Age be a continuous variable . This initial task is not always straightforward . As part of this task we must (1) correctly identify the goals of modeling (e.g., prediction versus ex-
planation versus exploration), (2) identify many possibleobservationsthat may be relevant to the problem, (3) determine what subset of those observations is worthwhile to model, and (4) organize the observations into variables having mutually exclusive and collectively exhaustive states . Diffi culties here are not unique to modeling with Bayesian networks , but rather are common to most approaches. Although there are no clean solutions , some guidance is offered by decision analysts (e.g., Howard and Matheson ,
1983) and (when data are available) statisticians (e.g., Tukey, 1977). In the next phase of Bayesian-network construction , we build a directed acyclic graph that encodes assertions of conditional independence . One approach for doing so is based on the following observations . From the chain rule of probability , we have n
p(x) ==II p(xilxl, . . ., Xi- I)
(17)
i = 1
Now, for every Xi , there will be some subset IIi <; { XI , . . ., Xi - I } such that Xi and { X I , . . . , X i- I } \ lli are conditionally independent given Ili . That
is , for any x ,
p(xilxl , . . . , Xi- I ) = p(xil7ri)
(18)
314
DAVIDHECKERMAN p(a=<30) = 0.25 p(a=30- 50) = 0.40
p(g=yeslf=yes} = 0.2 p(g=yeslj=no} = 0.01
Figure 9 .
A Bayesian -network
p(j =yeslj=yes,a= *,s= *) = 0.05 p(j =yeslj=no,a=<30,s=male) = 0..0001 p(j =yeslj=no,a=30-50,s=male) = 0.0004 p(j =yeslj=no,a=>50,s=male) = 0.0002 p(j =yeslj=no,a=<30,s=jemale) = 0..0005 p(j =yeslj=no,a=30-50,s=female) = 0.002 p(j =yeslj=no,a=>50,s=female) = 0.001 for detecting
credit -card fraud . Arcs are drawn from
cause to effect. The local probability distribution (s) associated with a node are shown adjacent to the node . An asterisk is a shorthand
for "any state ."
Combining Equations 17 and 18, we obtain n
p(x) = IIp (xil7ri)
(19)
i= l
Comparing Equations 16 and 19, we seethat the variables sets (III , . . . , lln ) correspond to the Bayesian-network parents (Pal , . . . , Pan) , which in turn fully specify the arcs in the network structure S .
Consequently, to determine the structure of a Bayesian network we (1) order the variables somehow, and (2) determine the variables sets that satisfy Equation 18 for i = 1, . . . , n . In our example , using the ordering
(F, A , S, G, J ) , we have the conditional independencies p(alf ) = p(slf , a) = p(glf , a, s) = p(jlf , a, s, g) =
p(a) p(s) p(glf ) p(jlf , a, s)
(20)
Thus , we obtain the structure shown in Figure 3. This approach
has a serious drawback . If we choose the variable
order
carelessly, the resulting network structure may fail to reveal many conditional independenciesamong the variables. For example, if we construct a Bayesian network for the fraud problem using the ordering (J, G, S, A , F ) ,
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
315
we obtain a fully connected network structure . Thus , in the worst case, we have to explore n ! variable orderings to find the best one. Fortunately , there is another technique for constructing Bayesian networks that does not require an ordering . The approach is based on two observations : (1) people can often readily assert causal relationships among variables , and (2) causal relationships typically correspond to assertions of conditional dependence. In particular , to construct a Bayesian network for a given set of variables , we simply draw arcs from cause variables to their immediate effects . In almost all cases, doing so results in a network structure that satisfies the definition Equation 16. For example , given the assertions that Fraud is a direct cause of Gas, and Fraud , Age, and Sex are direct causes of Jewelry , we obtain the network structure in Figure 3. The causal semantics of Bayesian networks are in large part responsible for the success of Bayesian networks as a representation for expert systems (Heckerman et al ., 1995a) . In Section 15, we will see how to learn causal relationships from data using these causal semantics . In the final step of constructing a Bayesian network , we assessthe local probability distribution (s) p (xilpai ) . In our fraud example , where all vari ables are discrete , we assessone distribution for Xi for every configuration of P ~ . Example distributions are shown in Figure 3. Note that , although we have described these construction steps as a simple sequence, they are often intermingled in practice . For example , judg ments of conditional independence and/ or cause and effect can influence problem formulation . Also , assessments of probability can lead to changes in the network structure . Exercises that help one gain familiarity with the practice of building Bayesian networks can be found in Jensen (1996) .
4. Inference in a Bayesian Network Once we have constructed a Bayesian network (from prior knowledge , data , or a combination ) , we usually need to determine various probabilities of interest from the model . For example , in our problem concerning fraud detection , we want to know the probability of fraud given observations of the other variables . This probability is not stored directly in the model , and hence needs to be computed . In general , the computation of a probability of interest given a model is known as probabilistic inference . In this section we describe probabilistic inference in Bayesian networks . Because a Bayesian network for X determines a joint probability distri bution for X , we can- in principle - use the Bayesian network to compute any probability of interest . For example , from the Bayesian network in Fig ure 3, the probability of fraud given observations of the other variables can
316
DAVIDHECKERMAN
be computed
as follows :
(f,a ,g (f,a p(IIa,s,g,).)=pP (a,s,s,g .) =Llf ~p'P (f',s ,),j) ,a,g ,s,j) ,g,).)
(21)
For problems with many variables , however, this direct approach is not practical . Fortunately , at least when all variables are discrete , we can exploit the conditional independencies encoded in a Bayesian network to make this computation more efficient . In our example , given the conditional independencies in Equation 20, Equation 21 becomes
as
.
-
p(JI , , g,J) -
p(J)p(a)p(s)p(gIJ )p(jlf , a, s)
-
p(J )p(glf )p(jlf , a, s)
-
LI ' p(f ')p(glf ')pUlf ', a, s)
Several researchers have developed probabilistic for Bayesian
networks
with
-
(22)
LI ' p(J')p(a)p(s)p(gIJ')p(jIJ ', a, s)
discrete
variables
that
inference algorithms exploit
conditional
in -
dependence roughly as we have described , although with different twists . For example , Howard and Matheson (1981) , Olmsted (1983) , and Shachter
(1988) developed an algorithm that reversesarcs in the network structure until the answer to the given probabilistic query can be read directly from the graph . In this algorithm , each arc reversal corresponds to an applica -
tion of Bayes' theorem. Pearl (1986) developed a message - passing scheme that updates the probability distributions work
in response
to observations
for each node in a Bayesian net-
of one or more
variables
. Lauritzen
and
Spiegelhalter (1988) , Jensenet al. (1990) , and Dawid (1992) created an algorithm that first transforms the Bayesian network into a tree where each node in the tree corresponds to a subset of variables in X . The algorithm then exploits several mathematical properties of this tree to perform proba -
bilistic inference. Most recently, D 'Ambrosio (1991) developedan inference algorithm that simplifies sums and products symbolically , as in the trans formation from Equation 21 to 22. The most commonly used algorithm for
discrete variables is that of Lauritzen and Spiegelhalter (1988), Jensen et al (1990) , and Dawid (1992) . Methods multivariate
for
exact
- Gaussian
inference
or Gaussian
in
Bayesian
- mixture
networks
distributions
have
that
encode
been
devel -
oped by Shachter and Kenley (1989) and Lauritzen (1992), respectively. These methods also use assertions of conditional independence to simplify inference . Approximate methods for inference in Bayesian networks with other distributions , such as the generalized linear -regression model , have
also been developed (Saul et al., 1996; Jaakkola and Jordan, 1996) .
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
317
Although we use conditional independence to simplify probabilistic inference, exact inference in an arbitrary Bayesian network for discrete vari -
ables is NP-hard (Cooper, 1990). Even approximate inference (for example, Monte-Carlo methods) is NP-hard (Dagum and Luby, 1993) . The source of the difficulty
lies in undirected cycles in the Bayesian-network structure -
cycles in the structure where we ignore the directionality of the arcs. (If we add an arc from Age to Gas in the network structure of Figure 3, then we
obtain a structure with one undirected cycle: F - G - A - J - F .) When a Bayesian-network structure contains many undirected cycles, inference is intractable . For many applications , however, structures are simple enough
(or can be simplified sufficiently without sacrificing much accuracy) so that inference is efficient . For those applications where generic inference meth ods are impractical , researchers are developing techniques that are custom
tailored to particular network topologies (Heckerman1989; Suermondt and Cooper, 1991; Saul et al., 1996; Jaakkola and Jordan, 1996) or to particular inference queries (Ramamurthi and Agogino, 1988; Shachter et al., 1990; Jensen and Andersen, 1990; Darwiche and Provan, 1996) . 5 . Learning In the
next
probability
Probabilities
several
sections
distributions
set of techniques
in a Bayesian , we show
Network
how to refine
the structure
and local
of a Bayesian network given data . The result is
for data analysis that combines prior knowledge with data
to produce improved knowledge . In this section , we consider the simplest version of this problem : using data to update the probabilities of a given Bayesian
network
structure
.
Recall that , in the thumbtack problem , we do not learn the probability of heads. Instead , we update our posterior distribution for the variable that represents the physical probability of heads. We follow the same approach for probabilities in a Bayesian network . In particular , we assume- perhaps from causal knowledge about the problem - that the physical joint proba bility
distribution
for X can be encoded in some network
structure
S . We
write n
p(xllJs,sh) = IIp (xilpai, lJi, Sh)
(23)
i= l
where 0 i is the vector of parameters for the distribution p(Xi Ipai ' Oi, Sh) , Os is the vector of parameters (01' . . . , On), and Sh denotes the event (or "hypothesis" in statistics nomenclature) that the physical joint probability distribution
can be factored according to 5 .8 In addition , we assume
BAs defined here, network-structure hypotheses overlap. For example, given X = { X1 , X2 } , any joint distribution for X that can be factored according the network
318
DAVIDHECKERMAN
that
we
have
a
probability
in
Section
ing
2
a
(
(
can
(
}
encode
)
ISh
)
.
be
We
,
a
pervised
ing
more
than
/
ships
.
tic
a
Examples
of
(
e
. g
.
,
,
)
bilities
in
.
Buntine
In
a
learning
e
,
principle
,
)
Neal
In
this
and
,
,
,
each
tions
,
,
)
and
Xi
pat
by
1
,
=
-
(
E
~
.
(
~
2
.
,
ijk
Xl
pari
)
(
~
( Jijk
~
.
2
)
containing
-
-
t
Therefore
and
X
2
,
Geiger
.
)
3
1
Xi
=
=
we
overlap
should
(
1996
add
)
describe
,
. g
.
(
Ji
,
(
( } ij2
be
to
set
-
,
proba
-
deci
1996
)
,
to
-
kernel
(
learn
Fried
-
proba
-
techniques
for
models
include
the
Herskovits
1994
.
;
,
,
1992
)
Heckerman
,
and
MacKay
,
1992a
and
.
ideas
for
learning
probabilities
distribution
Ti
possible
collection
S
)
=
)
.
values
of
xt
multinomial
Pai
=
( Jijk
.
In
,
.
this
.
.
,
X
~
i
distribu
Namely
,
we
,
-
assume
we
of
(
define
,
.
.
.
0
configurations
The
the
,
( } ijri
of
parameter
( Jijl
vector
of
is
24
)
Pai
,
given
parameters
)
according
model
to
averaging
definition
conditions
(
the
.
factored
the
>
denote
for
conditions
such
used
. g
-
probabilistic
and
having
ri
also
,
Bayesian
e
,
-
probabilis
,
,
-
Thus
methods
Cooper
parameters
,
can
.
regression
Buntine
-
noth
relation
)
,
su
is
produce
be
(
)
of
for
classifi
dictionary
,
,
function
independence
studied
.
. g
a
Goldszmidt
and
of
problems
one
,
basic
I1XiEPat
presents
-
the
h
,
=
net
probabilistic
1992b
cases
e
e
1996
convenience
arc
,
regression
,
.
:
-
compute
methods
that
most
is
the
,
function
-
configuration
are
As
defin
function
of
most
,
Ipat
.
by
Bayesian
as
with
can
the
(
.
viewed
multinomial
each
qi
~
For
no
Such
in
discrete
lJij
structure
,
al
,
regression
forms
function
(
)
and
(
is
for
.
( }
Sh
1992a
)
the
X
a
D
linear
unrestricted
E
distribution
8i
,
generalized
noise
k
and
Ji
collection
1994
,
distribution
case
density
in
conditional
these
illustrate
P
where
(
Friedman
of
the
a
probability
distribution
a
linear
we
joint
as
88
prior
models
,
et
D
sample
local
,
;
Saul
using
variable
local
one
Book
physical
of
parameters
or
Nonetheless
and
tutorial
each
and
a
MacKay
generalized
;
,
regression
Gaussian
and
1993
structure
case
(
the
Xl
familiar
distribution
with
1996
;
,
any
multinomial
1992b
.
;
.
regression
Geiger
I pai
,
network
available
unrestricted
linear
. g
from
random
Readers
by
1993
,
Bayesian
are
Xi
regression
methods
1995
(
as
/
}
.
organized
(
XN
a
classification
classification
estimation
man
)
viewed
linear
,
element
the
a
.
be
networks
trees
.
probabilities
that
,
include
density
Sh
recognize
can
neural
,
.
assessing
Given
p
models
outputs
sion
siD
probabilistic
regression
bilistic
J
.
an
learning
junction
network
cation
(
and
:
distribution
will
Bayesian
,
of
distribution
learning
'
to
about
simply
the
Xl
uncertainty
stated
to
local
{
refer
8s
(
=
We
problem
p
refer
aE
.
variable
The
distribution
8i
D
X
our
valued
s
now
posterior
(
sample
of
we
-
p
work
,
vector
function
a
random
distribution
to
.
the
,
insure
network
described
no
structure
.
overlap
in
.
Section
Heckerman
7
.
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 319
sample 1
sample 2 . . .
Figure
. 4
.
A
pendence
and
Y
the
two
for
all
with
for
Bayesian
for
are
.
of
and
j
,
Given
use
We
x
use
and
x
the
to
class
The
D
is
.
that
assumption
of
structure
the
"
two
X
states
We
unrestricted
local
(
-
BsID
,
Sh
of
that
)
-
X
,
the
7
Y
parameter
.
and
inde
Both
-
variables
yand
OsISh
contrast
i )
X
to
denote
=
=
we
can
are
of
compute
Pai
the
form
no
D
Oij
distribution
-
.
closed
are
sample
vectors
)
in
there
random
this
functions
,
and
that
,
(
to
dimensional
model
efficiently
parameter
p
-
functions
is
the
low
regression
distribution
assumption
say
"
are
linear
first
sample
assumption
the
network
that
of
p
.
the
denote
term
generalized
distribution
is
depicting
of
.
the
this
sumptions
That
.
structure
parameters
distributions
example
dom
We
Y
multinomial
terior
network
the
binary
states
i
-
learning
missing
is
mutually
under
two
data
complete
pos
in
.
independent
The
the
-
as
ran
-
-
second
. 9
n qi II II p(Oij~Sh) i=1j =1
We refer to this assumption , which was introduced by Spiegelhalter and Lauritzen (1990) , as parameter independence. Given that the joint physical probability distribution factors according to some network structure S , the assumption of parameter independence can itself be represented by a larger Bayesian-network structure . For example , the network structure in Figure 4 represents the ~ sumption of parameter independence for X == { X , Y } (X , Y binary ) and the hypothesis that the network structure X ~ Y encodes the physical joint probability distribution for X . 9The computation is also straightforwardif two or more parametersare equal. For details, seeThiesson(1995).
320
DAVIDHECKERMAN
Under the assumptions of complete data and parameter independence , the parameters remain independent given a random sample :
p(lJsID , Sh) = Thus
, we
the
one
Dir
( Oij
can
update
- variable I O: ' ijl
each case
p ( OijID
where
Nijk As
rations
that
the of
p ( XN
Thus
is
in
+ IID ,
in
.
, . . . , O: ' ijri
85 , Sh
CaEe
of
Assuming
) '
, Sh
the
vector
we
)
=
number
parameters
each
obtain
( ( JijIO
cases
in , we
obtain
predictions
of
) ,
where
XN
XN
+ l ,
to
Xi
==
+ l x7
is and
+
the
D
in
Pai
which
Xi
average
pal
prior
to ,
+
=
over
. For case
==
the
, . . . , O: ' ijri
interest next
has
, just
as
in
distribution
distribution
Nijl
can
(25)
independently
( Jij
posterior
: ' ijl
example
Oij
vector
the
Dir
of
thumbtack
n qi IIlp{lJijID ,Sh ) iII=lj=
xf the
example be
k
and
Pai
=
possible
after and
(26)
)
, let
seen
where
Nijri
us D
j
pal
.
configu
-
compute .
depend
Suppose on
i .
,
p(XN +IID,Sh)= Ep (6sID ,Sh ) (g (Jijk ) To compute
this expectation
, we first
n
use the fact that
the parameters
n
J 1J1 ; II1J (Jijk p (8ijID , Sh) d8ij .=1(Jijk p (8sID , Sh) d8s = t=
fI O :'ijk + Nijk i=1 O :'ij + N ij
(27)
r' """",,r " where Qij = Lk ,: 1 Gijk and Nij = Lik -=1 Nijk These computations are simple because the unrestricted multinomial distributions are in the exponential family - Computations for linear re gression with Gaussian noise are equally straightforward (Buntine , 1994 ; Heckerman and Geiger , 1996) -
6. Methods for Incomplete
Data
Let us now discuss methods for learning about parameters when the random sample is incomplete (i .e., some variables in some cases are not observed) . An important distinction concerning missing data is whether or not the
A TUTORIAL
ON LEARNING
WITH
BAYESIAN
NETWORKS
321
absence of an observation is dependent on the actual states of the vari ables. For example , a missing datum in a drug study may indicate that a patient became too sick- perhaps due to the side effects of the drug - to
continue in the study. In contrast, if a variable is hidden (i.e., never observed in any case) , then the absenceof thi ~ data is independent of state. Although Bayesian methods and graphical models are suited to the analysis of both situations , methods for handling missing data where absence is independent of state are simpler than those where absence and state are dependent . In this tutorial , we concentrate on the simpler situation only .
Readers interested in the more complicated caseshould see Rubin (1978) , Robins (1986) , and Pearl (1995) . Continuing with our example using unrestricted multinomial distribu tions , suppose we observe a single incomplete case. Let Y C X and Z C X denote the observed and unobserved variables in the case, respectively . Under the assumption of parameter independence , we can compute the
posterior distribution of Oij for network structure S as follows:
p(OijIy, Sh) == L p(zly, Sh) p(OijIy, z, Sh)
(28)
z
== (1- p(patIY ,Sh )){P(OijISh )} + ri
L p(x7,paily ,Sh ) {p(Oijlxr ,pat,Sh )}
k= l
(See Spiegelhalter and Lauritzen (1990) for a derivation.) Each term in curly brackets in Equation 28 is a Dirichlet distribution . Thus , unless both Xi and all the variables in Pai are observed in case y , the posterior dis-
tribution of Oij will be a linear combination of Dirichlet distributions -
that is, a Dirichletmixturewithmixingcoefficients (1- p(pa1IY , Sh)) and p(xf,pa1Iy,Sh),k ==1,...,Ti. When we observe a second incomplete case, some or all of the Dirichlet components in Equation 28 will again split into Dirichlet mixtures . That is,
the posterior distribution for Oij we becomea mixture of Dirichlet mixtures. As we continue to observe incomplete cases, each missing values for Z , the
posterior distribution for Oij will contain a number of components that is exponential in the number of cases. In general , for any interesting set of local likelihoods and priors , the exact computation of the posterior distribution for Os will be intractable . Thus , we require an approximation for incomplete data .
322
DAVIDHECKERMAN
6.1. MONTE-CARLOMETHODS One class of approximations is based on Monte -Carlo or sampling methods . These approximations can be extremely accurate , provided one is willing to wait long enough for the computations to converge. In this
section
, we discuss
one of many
Monte
- Carlo
methods
known
as
Gibbs sampling, introduced by Geman and Geman (1984). Given variables X = { Xl , . . . , Xn } with some joint distribution p(x ) , we can use a Gibbs sampler to approximate the expectation of a function J (x ) with respect to p(x ) as follows. First , we choose an initial state for each of the variables in X somehow (e.g., at random). Next, we pick some variable Xi , unassign its current state , and compute its probability distribution given the states of the other n - 1 variables . Then , we sample a state for Xi
based on this probability distribution , and compute f (x ). Finally, we iterate the previous two steps, keeping track of the averagevalue of f (x ) . In the limit , as the number of cases approach infinity , this average is equal to
Ep(x) (f (x )) provided two conditions are met. First , the Gibbs sampler must be irreducible: The probability distribution p(x ) must be such that we can eventually sample any possible configuration of X given any possible ini -
tial configuration of X . For example, if p(x ) contains no zero probabilities, then the Gibbs sampler will be irreducible . Second, each Xi must be chosen infinitely often . In practice , an algorithm for deterministically rotating through the variables is typically used. Introductions to Gibbs sampling and other Monte -Carlo methods - including methods for initialization and
a discussion of convergence - are given by Neal (1993) and Madigan and York (1995) . To illustrate
Gibbs sampling , let us approximate the probability
den-
sity p(lJsfD, Sh) for someparticular configuration of lJs, given an incomplete data set D == { Yl , . . . , YN} and a Bayesian network for discrete variables with independent Dirichlet priors . To approximate p (8 siD , Sh) , we first ini tialize
the
states
of the
unobserved
variables
in each case somehow
. As a
result , we have a complete random sample Dc . Second, we choose some vari -
able Xii (variable Xi in case1) that is not observedin the original random sample D , and reassign its state according to the probability
p(xillDc \ Xii, Sh) =
distribution
P(x~l' Dc \ XilISh)
Lx ~~p(xii , Dc \ xillSh )
where Dc \ Xii denotes the data set Dc with observation Xii removed, and the sum in the denominator shall
see in Section
7 , the terms
runs over all states of variable in the numerator
Xii . As we
and denominator
can be
computed efficiently (seeEquation 35). Third , we repeat this reassignment for all unobserved variables in D , producing a new complete random sample
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
323
D~. Fourth , we compute the posterior density p(8sID~, Sh) as described in Equations 25 and 26. Finally , we iterate the previous three steps, and use the averageof p(8sID~, Sh) as our approximation.
6.2. THEGAUSSIAN APPROXIMATION Monte -Carlo methods yield accurate results , but they are often intractable for example , when the sample size is large . Another approximation that is more efficient than Monte -Carlo methods and often accurate for relatively large samples is the Gaussian approximation (e.g., Kass et al ., 1988; Kass and Raftery , 1995) . The idea behind this approximation is that , for large amounts of data , p ( 8sID cx
, Sh
p ( DI8s
) , Sh
Gaussian
)
. p ( 8s
ISh
distribution
.
can often be approximated as a multivariate-
) In
particular
g ( 8s
)
, let
= = log
( p ( DI8s
, Sh
)
. p ( 8sISh
(29)
) )
Also
,
define
9
s
configuration
be
also
posteriori
( MAP
nomial
to
of
g
)
configuration
maximizes )
( 8s
the
p
configura
( 8sID
! ion
about
the
8s
to
where
( 8s
negative and
using
p
-
8s
) t
Hessian
( 8sID
is of
Equation
, sh
)
( 8s
~
g
the g 29
<X
)
p
,
)
p
)
-
we
and
that is
Using
a g
maximizes
( 8s
of
-
ro at
Os
known
_w 8s
vector .
as
second
Raising
the
( 8s
) ,
( 8s
-
we
g
s ) .
This a
Taylor
poly
-
obtain
t
8s
( 8s
( 9
maximum
degree
) A
9
(30)
)
( 8s
lis )
) , to
and the
A power
is
the of
e
obtain
( DI8s
( DI8s
) ,
s
-
2
evaluated
, Sh -
~
.
1
( 8s
transpose
( 9s
8s
8
approximate
g
, Sh
of
of
)
p
h , S
( 8sISh -
)
p
( 8slS
) h )
( 31
)
1 - t exp{ - 2(9s- 9s)A(9s- 9s) }
Hence, p(9sID , Sh) is approximately Gaussian. To compute the Gaussian approximation,_we must compute Os as well as the negative Hessianof 9 (~s) evaluated at 9 s. In the following section, we discussmethods for finding 8s. Meng and Rubin (1991) describea numerical technique for computing the secondderivatives. Raftery (1995) shows how to approximate the Hessian using likelihood-ratio tests that are available in many statistical packages. Thiesson (1995) demonstrates that , for unrestricted multinomial distributions , the secondderivatives can be computed using Bayesian-network inference.
324
6
DAVID
. 3
.
THE
As
MAP
the
sample
sharper
,
limit
,
we
simply
A
further
increases
approximate
of
( J s
ML
size
tending
do
make
size
AND
of
to
not
a
need
APPROXIMATIONS
the
lis
increases
function
the
by
the
of
maximum
the
the
the
the
based
on
or
maximum
( ( J s
peak
will
)
become
-
( J s
expectations
.
.
In
Instead
this
,
we
.
observation
I Sh
ALGORITHM
configuration
configuration
the
p
EM
Gaussian
MAP
prior
THE
MAP
averages
on
is
effect
,
at
compute
based
approximation
,
AND
data
delta
to
predictions
HECKERMAN
that
diminishes
likelihood
,
.
( ML
as
the
Thus
)
sample
,
we
can
configuration
:
Os =arg max ,Sh )} Os{p(DIOs One class of techniques for finding a ML or MAP is gradient -based optimization . For example , we can use gradient ascent, where we follow the derivatives of g (88) or the likelihood p (DI88 , sh ) to a local maximum . Russell et ale ( 1995) and Thiesson (1995) show how to compute the deriva tives of the likelihood for a Bayesian network with unrestricted multino mial distributions . Buntine (1994) discusses the more general case where the likelihood function comes from the exponential family . Of course, these gradient -based methods find only local maxima . Another technique for finding a local ML or MAP is the expectation maximization (EM ) algorithm (Dempster et al ., 1977) . To find a local MAP or ML , we begin by assigning a configuration to 88 somehow (e.g., at ran dom ) . Next , we compute the expected sufficient statistics for a complete data set , where expectation is taken with respect to the joint distribution for X conditioned on the assigned configuration of 88 and the known data D . In our discrete example , we compute
N ,Os ,Sh ) Ep (xID ,Os,Sh )(Nijk) = L 1=1p(xf,pa1lYl
(32)
where Yl is the possibly incomplete lth case in D . When Xi and all the variables in Pai are observed in case Xl , the term for this case requires a trivial computation : it is either zero or one. Otherwise , we can use any Bayesian network inference algorithm to evaluate the term . This computa tion is called the expectation step of the EM algorithm . Next , we use the expected sufficient statistics as if they were actual sufficient statistics from a complete random sample Dc . If we are doing an ML calculation , then we determine the configuration of Os that maximize p (DcIOs, Sh) . In our discrete example , we have
()'l..Jk -Ek x(x,9 ID ,S s),S (N )) riEp =l(Ep ,9 ID sh )ijk (Nijk h
A TUTORIALON LEARNINGWITH BAYESIANNETWORKS 325 If we are doing a MAP calculation, then we determinethe configurationof Osthat maximizesp(OsIDc,Sh). In our discreteexample, we havel0
(JJok i=Lk rG Ep x(x,9 ID .,,S (Nijk h ,=:lijk (G :'i+jk +(Ep O ID "),S )(N h)ijk )) This assignment is called the maximization step of the EM algorithm . Dempster et ale (1977) showed that , under certain regularity conditions , iteration of the expectation and maximization steps will converge to a local maximum . The EM algorithm is typically applied when sufficient statistics exist (i .e., when local distribution functions are in the exponential fam ily ) , although generalizations of the EM algroithm have been used for more complicated local distributions (see, e.g., Saul et ale 1996) . 7 . Learning
Parameters
and Structure
Now we consider the problem of learning about both the structure and probabilities of a Bayesian network given data . Assuming we think structure can be improved , we must be uncertain about the network structure that encodes the physical joint probability distribution for X . Following the Bayesian approach , we encode this uncertainty by defining a (discrete ) variable whose states correspond to the possible network -structure hypotheses Sh, and assessing the probabilities p (Sh) . Then , given a random sample D from the physical probability distri bu tion for X , we compute the posterior distribution p (sh ID ) and the posterior distributions p (9 siD , Sh) , and use these distributions in turn to compute expectations of interest . For example , to predict the next case after seeing D , we compute
p(XN+IID ) =
Ep ) jP (XN+118s ,Sh) p(8sID,Sh) d8s Sh (ShID
In performing the sum , we assume that the network -structure are mutually exclusive . We return to this point in Section 9 .
(33)
hypotheses
lOThe MAP configuration lis depends on the coordinate system in which the parameter variables are expressed. The expression for the MAP configuration given here is obtained by the following procedure. First , we transform each variable set (Jij = ((Jij2, . . . , fJijri ) to the new coordinate system cPij = (cPij2, . . . , cPijri) ' where cPijk = log((Jijk/ fJijl ) , k = 2, . . . , ri . This coordinate system, which we denote by cPs , is sometimes referred to as the canonical coordinate system for the multinomial distribution (see, e.g., Bernardo and Smith , 1994, pp . 199- 202). Next , we determine the configuration of cPsthat maximizes p( cpsiD c, Sh) . Finally , we transform this MAP configuration to the original coordinate system. Using the MAP configuration corresponding to the coordinate system cPshas several advantages, which are discussedin Thiesson (1995b) and MacKay (1996) .
326
DAVIDHECKERMAN
The computation of p (OsID , Sh) is as we have described in the previous two sections . The computation of p (ShID ) is also straightforward , at least in principle . From Bayes' theorem , we have
p(ShlD ) = p(Sh)p(DISh )jp(D) where
p ( D ) is
ture we
. Thus need
to
possible
normalization
compute
nomial
the
the
an
plete
data
rate
we
marginal
parameter
( Jij
- sided
of
the
i- j
p ( DISh
was
first ,
impractical
data
pair
models
over
with
more
than
not
the
almost
problem
all
of :
proach
is
possi
is
ble
lated
?
whether The several good
models
, and
to
approaches
so , how or
not
question researchers hypothesis
each
do
of
models
of
use
often
it
as
a
sepa
-
, the
the
marginal
if
it
were
these
results
" good
accuracy
is
of
good
applied
good
for to .
to
models
in this
former ) from
ap among
. The
from
exhaustive
latter
among
all
. These
particular
And
.
address
The
Bayesian ?
can
decades
model
. In
is
user
intractable
models are
questions
when for
is
the
problem
correct
by
- network
hypotheses
hypothesis
the
is
produced
where
averaging
models
important
is
approaches
model
described
Bayesian
approach
two
) .
have
structure
this
(35)
)
( 1992 we
situations
by
number
search
yields
-
, each
have
of
consider
( i . e . , structure
that
shown
we
, the
selective
is
have
product
possible
, use
model
several
a model
data
, we
-
com
j . Consequently
that
confronted
and
we
Sec -
, and
missing
Herskovits
, in
hypotheses
been
accurate
no effect
bottleneck
of
these
pretend
raise
and
number
a manageable
and
in
multi
priors
. II r ( Qijk + Nijk k= 1 r ( G' ijk )
)
33 . If
of
"
detail
13 ) :
computation
the
" good
select
yield If
,
) ) for
unrestricted
and
the
approach
n . Consequently
selection
models
just
Cooper
in
types
a
.i
Equation
Equation
have
select
approaches
tures
all
other
model to
possible
approach
,
, who
context
-
ri
by
in
are . In
every
the by
important
variables
Statisticians
is
Bayesian
models
n
for
in
with
there
qi
full
exponential
exclude
struc
structures
( p ( DISh
, Dirichlet
when
( given
derived
the
. One
average
upon
network
likelihoods
) == II II r ( Qij ) i = 1 j = 1 r ( G' ij + Nij
Unfortunately
the
depend
data
example
independently
problem
of
each
,
updated
n
often
the
independence
thumbtack
for
formula
of
our
discussed
is
likelihood
This
not for
marginal
, consider
have
vector
likelihoods
does
distribution
likelihood
computation
,
. As
multi
that
posterior
marginal
introduction
distributions
parameter
the
.
discuss
9 . As
constant
determine
structure
We tion
a
, to
( 34 )
, do
- network how
re -
do
these struc
we
-
decide
" ? difficult
to
answer
experimentally accurate
predictions
that
in the
theory
. Nonetheless
selection
( Cooper
and
of
a single
Herskovits
,
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
327
1992; Aliferis and Cooper 1994; Heckermanet al., 1995b) and that model averaging using Monte -Carlo methods can sometimes be efficient and yield
even better predictions (Madigan et al., 1996). These results are somewhat surprising , and are largely responsible for the great deal of recent interest in learning with Bayesian networks . In Sections 8 through 10, we consider different definitions of what is means for a model to be "good" , and discuss the computations entailed by some of these definitions . In Section 11, we discuss
model
search .
We note that model averaging and model selection lead to models that generalize well to new data . That is , these techniques help- us to avoid the ~ overfitting of data . As is suggested by Equation 33 , Bayesian methods for model averaging and model selection are efficient in the sense that all cases . III D can be used to both smooth and train the model . As we shall see
. III
the
following
approach
in
8 .
Criteria
two general
for
sections
,
this advantage
holds true for the Bayesian
.
Model
Selection
Most of the literature on learning with Bayesian networks is concerned with model selection . In these approaches, some criterion is used to measure the degree to which a network structure (equivalence class) fits the prior knowl edge and data . A search algorithm is then used to find an equivalence class that receives a high score by this criterion . Selective model averaging is more complex , because it is often advantageous to identify network struc tures that are significantly different . In many cases, a single criterion is unlikely to identify such complementary network structures . In this section , we discuss criteria for the simpler problem of model selection . For a discussion of selective model averaging , see Madigan and Raftery (1994) .
8.1. RELATIVE POSTERIOR PROBABILITY A criterion that is often used for model selection is the log of the relative posterior probability log p (D , Sh) = log p (Sh) + log p (DISh ) .11 The logarithm is used for numerical convenience. This criterion has two components : the log prior and the log marginal likelihood . In Section 9, we examine the computation of the log marginal likelihood . In Section 10.2, we discuss the assessment of network -structure priors . Note that our comments about these terms are also relevant to the full Bayesian approach . l1An
equivalent
criterion
that
is
often used is log(p(ShID)/p(S~ID))
=
log(p(Sh)/ p(S~)) + log(p(DISh)/ p(DIS~)). The ratio p(DISh)/p(DIS~) is knownas a Bayes ' factor .
328
DAVID
The
log
marginal
described
by
HECKERMAN
likelihood
Dawid
ha . s the
( 1984
) . From
following
the
chain
log
p ( xllxl
interesting rule
of
interpretation
probability
, we
have
N
log
p ( DISh
)
=
L
, . . . , Xl
- I , Sh
)
( 36
)
1= 1
The
term
after
p ( xllx1
, . . . , Xl
averaging
of
as
log
over
the
utility
p ( x ) . 12
highest
or
Thus
,
model
that
utility
function Dawid
the
,
omitted
Finally
the
of
Xl
this
made
log
the
priors
predictor
can
of
function ( or
structure
)
data
D
this
criterion
Sh
thought
likelihood on
the
model be
utility
marginal
equal
by
term
under
highest
assuming
-
, we
,
of
v,
=
a { Xl
reward
this
each is
validation on
this
, known
all
' . . . ' X ' - l , Xl
but
is
under
for
p ( x ) , we
under
every .
case
If
the
obtain
of } .
the
also
a
the
log
in
the
cross
- one
cases
Then
,
- out
in
we
utility
the
predict function
random is
cross
and leave
the
some
prediction the
as
one
+ l , . . . , XN
prediction
prediction
log
between
cross
model
procedure
for
function
relationship
form train
and
repeat
rewards
utility
first
say ,
the
one
we
case
the
notes
using
sample
sum
for
log
prediction
sequential
) also
validation
random
The
this the
,
prediction
.
with
best
. When
cross
the
. ( 1984
validation
model
the
is
for
probability is
)
parameters
reward
a
posterior
- I , Sh
its
sample
. , and
probabilistic
- validation
and
criterion
N
CV
( Sh
, D
)
== E
logp
( x / IVz
, Sh
)
( 37
)
1= 1
which
is
similar
training p ( xIIVI
, Sh
Whereas
)
data
likelihood ,
8 .2 .
we
the
and
this
.
utility to
,
we
. use
to
Various we the
x2 , Sh
lead
with
For for
) , we
the
from
problem
this
example
selection
for of
a
Xl
that
altogether
. Namely test
this
that
cases
testing
and
model
the
log
, when
that
compute
for
attenuating 36
and
is we
training
Equation
training
when
and
Xl
for
criterion
,
training use
approaches see
interchange
problem
findings
12This
) . but
problem
.
x2
for
over
fits
problem - marginal using
this
.
CRITERIA
Consider
people
One
p ( x2IV2
avoids
never
LOCAL
of
1984 ,
criterion
criterion
37
can
described
.
interchanged
compute
interchanges ,
36
are
Equation we
( Dawid
been
Equation cases
in
. Such
have
set
test
, when
testing the
to
and
function
assess rule
of
in
diagnosing
Suppose
their particular
that
is true
known
an the
as
probabilities , see
Bernardo
set
a
ailment of
proper .
For ( 1979
given
ailments
scoring a ) .
the
under
rule
characterization
observation
of
consideration
, because of
its proper
use
a are
encourages scoring
rules
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 329
.
Figure 5.
.
.
A Bayesian -network structure
for medical diagnosis .
mutually exclusive and collectively exhaustive , so that we may represent these ailments using a single variable A . A possible Bayesian network for this classification problem is shown in Figure 5. The posterior -probability criterion is global in the sense that it is equally sensitive to all possible dependencies. In the diagnosis problem , the posterior probability criterion is just as sensitive to dependencies among the finding variables as it is to dependencies between ailment and findings . Assum ing that we observe all (or perhaps all but a few ) of the findings in D , a more reasonable criterion would be local in the sense that it ignores dependencies among findings and is sensitive only to the dependencies among the ailment and findings . This observation applies to all classification and regression problems with complete data . One such local criterion , suggested by Spiegelhalter et ale (1993) , is a variation on the sequential log-marginal -likelihood criterion :
N LC(Sh,D) ==}: logp " D" Sh) 1=1 (aIIF
(38)
where al and Fl denote the observation of the ailment A and findings F in the lth case, respectively . In other words , to compute the lth term in the product , we train our model S with the first 1 - 1 cages, and then determine how well it predicts the ailment given the findings in the lth case. We can view this criterion , like the log-marginal -likelihood , ag a form of cross validation where training and test cages are never interchanged . The log utility function has interesting theoretical properties , but it is sometimes inaccurate for real-world problems . In general , an appropriate reward or utility function will depend on the decision-making problem or problems to which the probabilistic models are applied . Howard and Math eson (1983) have collected a series of articles describing how to construct utility models for specific decision problems . Once we construct such util -
330
DAVIDHECKERMAN
ity models , we can use suitably modified forms of Equation 38 for model selection . 9 . Computation
of the Marginal
Likelihood
As mentioned , an often -used criterion for model selection is the log relative posterior probability log p (D , Sh) == log p (Sh) + log p (DISh ) . In this section , we discuss the computation of the second component of this criterion : the log marginal likelihood . Given (1) local distribution functions in the exponential family , (2) mutual independence of the parameters (Ji, (3) conjugate priors for these parameters , and (4) complete data , the log marginal likelihood can be computed efficiently and in closed form . Equation 35 is an example for unrestricted multinomial distributions . Buntine (1994) and Heckerman and Geiger (1996) discuss the computation for other local distribution func tions . Here , we concentrate on approximations for incomplete data . The Monte -Carlo and Gaussian approximations for learning about parameters that we discussed in Section 6 are also useful for computing the marginal likelihood given incomplete data . One Monte -Carlo approach , described by Chib (1995) and Raftery (1996) , uses Bayes' theorem : p ( DISh
For
any
uated
configuration directly
.
computed the
using
can
6 . 1 . Other
DiCiccio As
et
we
have
on
the
tationally based
as
Recall be
be
( 1995
Monte
that
approximated
- network
be
sophisticated
, Monte
into
Equation
obtain
the
- Carlo for
large
the
can
- Carlo
be
numerator
, the
sampling
posterior , as
we
methods
large
amounts
of
a multivariate
) =
eval
-
can
be
term
in
described are
closed
form
40 , integrating
large data
( DIOs . In
, and
efficient
data
sets
, p ( DllJs
in
described
particular
contrast ,
compu
-
, methods
and
can
, Sh ) . p ( lJsISh
) can
. Consequently
) dOs
be
as
logarithm
often ,
( 40 )
, substituting the
but
.
distribution
, Sh ) p ( OsISh
taking
accurate . In
more
- Gaussian
Jp
are
databases are
on
approximation
numerator in
methods
approximation methods
in
the
. Finally Gibbs
Monte
- Carlo
as
evaluated
inference using
, especially
, for
in term
( 39 )
) .
p ( DISh can
term
likelihood
computed
discussed
Gaussian
prior
, the
, more
ale
inefficient
accurate
Os , the
addition
Bayesian
denominator
Section by
of
In
) == p ( OsISh ) p ( DIOs , Sh ) p ( OsID , Sh )
Equation of
the
result
31 , we
:
h - h - h d 1 logp(DIS) ~ logp(DIOs ,S ) +logp(OsIS) + 2 log(27r ) - 2 logIAI (41)
A
where
TUTORIAL
d
is
ON
the
dimension
multinomial
1
)
Sometimes
Geiger
et
This
,
ale
(
errors
1996
we
)
.
,
1995
)
,
tering
who
(
We
1989
the
use
see
Thus
lin
for
,
large
we
~
N
,
can
derived
The
by
BIC
on
,
predicts
the
it
1987
_so
in
more
Kass
et
-
ale
,
,
1996
)
.
we
shown
see
,
.
)
,
Becker
efficient
for
Stutz
data
by
with
,
.
and
approximation
N
which
by
that
. g
clus
-
)
increase
IAI
e
Another
program
1996
the
incorrectly
Cheeseman
AutoClass
-
only
have
(
Carlo
large
using
doing
by
log
Monte
for
researchers
described
and
to
IAI
,
(
(
dj2
.
logp
.
(
Thus
,
provides
1978
:
log
p
ML
retaining
(
increases
the
DI98
,
as
d
Sh
)
log
configuration
13
of
,
N
.
88
.
)
interesting
in
,
Second
,
.
Third
Sh
connection
Sh
,
,
)
the
and
log
several
N
(
(
MDL
use
.
the
BIC
)
,
42
)
and
the
that
quite
described
9
,
does
intuitive
the
is
validation
it
.
model
punishes
criterion
cross
,
parameterized
approximation
Section
First
approximation
is
term
)
in
respects
can
well
BIC
(
between
~
approximation
a
the
discussion
-
criterion
we
how
)
)
.
measuring
DI8s
the
a
,
information
)
Length
recalling
DI8s
Consequently
term
logN
(
Bayesian
(
Description
.
-
cases
For
intensive
accurate
logp
the
prior
a
data
model
)
called
~
is
a
Minimum
likelihood
)
prior
contains
the
the
DISh
Schwarz
the
assessing
Namely
example
Kass
relative
of
.
circumstances
approximated
approximation
depend
without
(
(
is
first
not
for
in
that
,
number
approximate
the
less
N
be
.
' s
.
,
relative
,
41
-
obtain
approximation
Wag
but
with
s
logp
This
(
-
Heckerman
in
arly
8
to
Heckerman
Equation
the
nevertheless
some
is
efficient
is
efficient
is
and
and
in
increases
,
in
Chickering
ri
Laplace
accurate
Although
approximation
very
N
parameters
approximation
terms
which
Also
.
accurate
Chickering
a
those
the
be
s
the
also
I
is
A
as
conditions
extremely
see
(
lower
approximation
where
be
qi
.
I A
Hessian
and
'
obtain
only
,
Laplace
)
simplification
of
,
is
among
of
)
l
is
known
Laplace
,
1995
of
can
Cun
ljN
=
.
regularity
approximation
independencies
Le
of
(
One
approximation
and
(
Ei
dimension
point
the
by
this
is
approximation
computation
elements
variant
as
can
this
s
this
certain
O
Raftery
'
the
models
impose
41
unrestricted
given
,
integration
under
331
with
typically
of
are
Laplace
,
the
is
Equation
of
network
variables
for
,
NETWORKS
Bayesian
discussion
that
and
Although
dimension
a
approximation
Kass
approaches
For
hidden
a
to
Laplace
and
.
BAYESIAN
dimension
approximation
the
diagonal
this
for
shown
discussions
1988
)
are
)
refer
have
this
,
detailed
(
(
88
technique
)
Thus
(
there
ale
and
1988
in
.
when
approximation
method
et
(
9
,
,
WITH
of
distributions
.
See
D
LEARNING
we
see
complexity
exactly
minus
by
that
the
and
Rissanen
marginal
MDL
.
130ne of the technical assumptions used to derive this approximation is that the prior ,. is non-zero around (Js.
332
DAVIDHECKERMAN
10. Priors To
compute
must
the
assess
( unless
the
we
are
parameter tions
prior
priors
large
,
we
network
structures
Several
authors
have
ods
deriving
priors
for
SpiegeIhaIter 1996
et
) . In
10 . 1 .
First
this
PRIORS
, let
us
structures
.
address
the
holds
and
tures
for
the
example
as
of
of
, under
direct
, 1992
for
assessments
; Buntine
; Heckerman
these
-
cer -
priors
corresponding
, 1991
al . , 1995b
func
structures
parameter
and
Herskovits et
scoring
. Nonetheless
number
)
) . The
network
and
, we
p ( fJsISh
BICjMDL
alternative
intractable
some
approaches
.
meth
-
, 1991
;
and
Geiger
of
network
,
.
PARAMETERS
of
the
approach
the
local
and
approach
X
Y X
priors
for
of
Heckerman
the
distribution
the
X
parameters et
ale
functions
assumption
Z
structure
is one
conditional
if and
ordered to
of
( 1995b
are
)
who
unrestricted
parameter
. In
independence
between distribution
consideration
have per
that and
- 4
Y
Pearl
is
an
) . For - 4
, these
arc
network assertion are
each
possible
p ( x ) are
are
independence arc
) . A
from
of
n ! pos -
for
ignoring
, 1990
X
Z ,
assertion
no
structures
of
inde
-
direc
-
v - structure to
Y
and
is from
Z .
that
is all
closely
Bayesian
functions F
X
for
structure
and
equivalence
distribution
se , because
1990
structure
same
there
,
, there
structures
network
( Verma
. Suppose local
network
the
Pearl
encodes
variables
-
set
, a complete
is , it n
network
, two
X
that
same
Y . Consequently
example
-
struc
independence
given
contains
have
the
and
the
equiva
- network
structures
another
: one
, Y , Z ) such
of
equivalence
,
they
arc
concept
( Verma
only
-
independence
Bayesian represent
network
complete
general if
X
edge X
v - structures (X
no
missing
. All
dependence
restriction
no
:
two
they
represent
. As
structures
tuple
if for
. When
only
concepts that
independent
variables
same
Y , but
The
Z
network
and
the
f -
equivalent
has
equivalent
equivalent tions
Y
conditionally
that
the
say
equivalent
independence
of
pendence
f -
are
complete
key
. We
{ X , Y , Z } , the
X
are
two
assertions =
Z , and
and
on
independence
given
- t
structures
ordering
based equivalence
are
network
sible
is
- independence ,
f -
that
a
and
structure priors
many
assumptions
assessment
where
distribution
conditional
Z
such
examine
consider
the
structure
; Heckerman
distributions
Their
an
be the
network
.
lence
X
for
, when
a manageable
NETWORK
case
multinomial
will
( Cooper
, we
a
parameter such
required
derive from
consider We
also
discussed
al . , 1993
ON
the
8 . Unfortunately
can
section
of
and
approximations
assessments
assumptions
many
- sample
Section
, these
probability p ( Sh )
p ( 6 s I s h ) are in
possible
tain
posterior
using
discussed
are
relative structure
can
be
a
in large
related
to
networks the family
family .
F We
that
for
X
. This say
that
of
in -
under is
not two
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
Bayesian-network structures 51 and 52 for X are distribution
333
equivalent
with respect to (wrt) F if they represent the same joint probability distributions for X - that is, if, for every θs1, there exists a θs2 such that p(x|θs1, S1^h) = p(x|θs2, S2^h), and vice versa. Distribution equivalence wrt some F implies independence equivalence, but the converse does not hold. For example, when F is the family of generalized linear-regression models, the complete network structures for n ≥ 3 variables do not represent the same sets of distributions. Nonetheless, there are families F - for example, unrestricted multinomial distributions and linear-regression models with Gaussian noise - where independence equivalence implies distribution equivalence wrt F (Heckerman and Geiger, 1996).

The notion of distribution equivalence is important, because if two network structures S1 and S2 are distribution equivalent wrt a given F, then the hypotheses associated with these two structures are identical - that is, S1^h = S2^h. Thus, for example, if S1 and S2 are distribution equivalent, then their probabilities must be equal in any state of information. Heckerman et al. (1995b) call this property hypothesis equivalence.

In light of this property, we should associate each hypothesis with an equivalence class of structures rather than a single network structure, and our methods for learning network structure should actually be interpreted as methods for learning equivalence classes of network structures (although, for the sake of brevity, we often blur this distinction). Thus, for example, the sum over network-structure hypotheses in Equation 33 should be replaced with a sum over equivalence-class hypotheses. An efficient algorithm for identifying the equivalence class of a given network structure can be found in Chickering (1995).

We note that hypothesis equivalence holds provided we interpret Bayesian-network structure simply as a representation of conditional independence. Nonetheless, stronger definitions of Bayesian networks exist, where arcs have a causal interpretation (see Section 15). Heckerman et al. (1995b) and Heckerman (1995) argue that, although it is unreasonable to assume hypothesis equivalence when working with causal Bayesian networks, it is often reasonable to adopt a weaker assumption of likelihood equivalence, which says that the observations in a database can not help to discriminate two equivalent network structures.
let
us return
to the
main
issue
of this
section
: the
derivation
of
priors from a manageable number of assessments. Geiger and Heckerman
(1995) show that the assumptionsof parameter independenceand likelihood equivalence imply that the parameters for any complete network structure Sc must have a Dirichlet distribution with constraints on the hyperparam eters given by
α_ijk = α p(x_i^k, pa_i^j | S_c^h)        (43)
where α is the user's equivalent sample size,14 and p(x_i^k, pa_i^j | S_c^h) is computed from the user's joint probability distribution p(x|S_c^h). This result is rather remarkable, as the two assumptions leading to the constrained Dirichlet solution are qualitative.

To determine the priors for parameters of incomplete network structures, Heckerman et al. (1995b) use the assumption of parameter modularity, which says that if Xi has the same parents in network structures S1 and S2, then
p(θ_ij | S_1^h) = p(θ_ij | S_2^h) for j = 1, ..., q_i. They call this property parameter modularity, because it says that the distributions for parameters θ_ij depend only on the structure of the network that is local to variable Xi - namely, Xi and its parents.

Given the assumptions of parameter modularity and parameter independence,15 it is a simple matter to construct priors for the parameters of an arbitrary network structure given the priors on complete network structures. In particular, given parameter independence, we construct the priors for the parameters of each node separately. Furthermore, if node Xi has parents Pai in the given network structure, we identify a complete network structure where Xi has these parents, and use Equation 43 and parameter modularity to determine the priors for this node. The result is that all terms α_ijk for all network structures are determined by Equation 43. Thus, from the assessments α and p(x|S_c^h), we can derive the parameter priors for all possible network structures. Combining Equation 43 with Equation 35, we obtain a model-selection criterion that assigns equal marginal likelihoods to independence-equivalent network structures.

We can assess p(x|S_c^h) by constructing a Bayesian network, called a prior network, that encodes this joint distribution. Heckerman et al. (1995b) discuss the construction of this network.
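As an illustration of Equation 43, the following sketch derives the Dirichlet hyperparameters for a single node from a user-supplied joint distribution p(x|S_c^h) and an equivalent sample size α. The table representation of the joint distribution, the function name, and the example values are illustrative assumptions, not part of the tutorial.

```python
from itertools import product

# Hypothetical example: three binary variables X0, X1, X2, with X2's parents {X0, X1}.
# prior_joint maps each full configuration (x0, x1, x2) to p(x | S_c^h), as might be
# encoded by a prior network.  The uniform values here are made up for illustration.
prior_joint = {cfg: 1.0 / 8 for cfg in product([0, 1], repeat=3)}

def dirichlet_hyperparams(prior_joint, alpha, child, parents, states):
    """Return alpha_ijk = alpha * p(x_i^k, pa_i^j | S_c^h) for one node (Equation 43)."""
    hyper = {}
    for pa_cfg in product(*[states[p] for p in parents]):   # parent configurations j
        for x_k in states[child]:                            # child states k
            mass = sum(p for cfg, p in prior_joint.items()
                       if cfg[child] == x_k
                       and all(cfg[p] == v for p, v in zip(parents, pa_cfg)))
            hyper[(pa_cfg, x_k)] = alpha * mass
    return hyper

states = {0: [0, 1], 1: [0, 1], 2: [0, 1]}
print(dirichlet_hyperparams(prior_joint, alpha=10, child=2, parents=(0, 1), states=states))
```

By parameter modularity, the same hyperparameters would be reused for node X2 in any structure in which it has parents {X0, X1}.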
10.2. PRIORS ON STRUCTURES

Now, let us consider the assessment of priors on network-structure hypotheses. Note that the alternative criteria described in Section 8 can incorporate prior biases on network-structure hypotheses. Methods similar to those discussed in this section can be used to assess such biases.

The simplest approach for assigning priors to network-structure hypotheses is to assume that every hypothesis is equally likely. Of course,

14 Recall the method of equivalent samples for assessing beta and Dirichlet distributions discussed in Section 2.
15 This construction procedure also assumes that every structure has a non-zero prior probability.
this assumption is typically inaccurate and used only for the sake of convenience. A simple refinement of this approach is to ask the user to exclude various hypotheses (perhaps based on judgments of cause and effect), and then impose a uniform prior on the remaining hypotheses. We illustrate this approach in Section 12.

Buntine (1991) describes a set of assumptions that leads to a richer yet efficient approach for assigning priors. The first assumption is that the variables can be ordered (e.g., through a knowledge of time precedence). The second assumption is that the presence or absence of possible arcs are mutually independent. Given these assumptions, n(n-1)/2 probability assessments (one for each possible arc in an ordering) determine the prior probability of every possible network-structure hypothesis. One extension to this approach is to allow for multiple possible orderings. One simplification is to assume that the probability that an arc is absent or present is independent of the specific arc in question. In this case, only one probability assessment is required.

An alternative approach, described by Heckerman et al. (1995b), uses a prior network. The basic idea is to penalize the prior probability of any structure according to some measure of deviation between that structure and the prior network. Heckerman et al. (1995b) suggest one reasonable measure of deviation.

Madigan et al. (1995) give yet another approach that makes use of imaginary data from a domain expert. In their approach, a computer program helps the user create a hypothetical set of complete data. Then, using techniques such as those in Section 7, they compute the posterior probabilities of network-structure hypotheses given this data, assuming the prior probabilities of hypotheses are uniform. Finally, they use these posterior probabilities as priors for the analysis of the real data.
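The following is a minimal sketch of the arc-independence structure prior described above (Buntine, 1991), under the simplification that a single probability applies to every possible arc in a fixed ordering. The function name, the ordering, and the example arcs are hypothetical.

```python
import math

def log_structure_prior(arcs, ordering, p_arc):
    """Log prior of a network structure under the arc-independence assumption:
    given a variable ordering, each of the n(n-1)/2 possible arcs is present
    independently, here with a single shared probability p_arc."""
    log_p = 0.0
    for i, child in enumerate(ordering):
        for parent in ordering[:i]:            # only arcs consistent with the ordering
            present = (parent, child) in arcs
            log_p += math.log(p_arc if present else 1.0 - p_arc)
    return log_p

# Hypothetical three-variable example with ordering (Age, Sex, Fraud).
ordering = ("Age", "Sex", "Fraud")
arcs = {("Age", "Fraud")}
print(log_structure_prior(arcs, ordering, p_arc=0.2))
```

Allowing a distinct probability per arc, or averaging over several orderings, are the straightforward extensions mentioned in the text.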
11. Search Methods

In this section, we examine search methods for identifying network structures with high scores by some criterion. Consider the problem of finding the best network from the set of all networks in which each node has no more than k parents. Unfortunately, the problem for k > 1 is NP-hard even when we use the restrictive prior given by Equation 43 (Chickering et al., 1995). Thus, researchers have used heuristic search algorithms, including greedy search, greedy search with restarts, best-first search, and Monte-Carlo methods.

One consolation is that these search methods can be made more efficient when the model-selection criterion is separable. Given a network structure for domain X, we say that a criterion for that structure is separable if it
can be written as a product of variable-specific criteria:

C(S^h, D) = ∏_{i=1}^{n} c(X_i, Pa_i, D_i)        (44)
where D_i is the data restricted to the variables X_i and Pa_i. An example of a separable criterion is the BD criterion (Equations 34 and 35) used in conjunction with any of the methods for assessing structure priors described in Section 10.

Most of the commonly used search methods for Bayesian networks make successive arc changes to the network, and employ the property of separability to evaluate the merit of each change. The possible changes that can be made are easy to identify. For any pair of variables, if there is an arc connecting them, then this arc can either be reversed or removed. If there is no arc connecting them, then an arc can be added in either direction. All changes are subject to the constraint that the resulting network contains no directed cycles. We use E to denote the set of eligible changes to a graph, and Δ(e) to denote the change in log score of the network resulting from the modification e ∈ E. Given a separable criterion, if an arc to X_i is added or deleted, only c(X_i, Pa_i, D_i) need be evaluated to determine Δ(e). If an arc between X_i and X_j is reversed, then only c(X_i, Pa_i, D_i) and c(X_j, Pa_j, D_j) need be evaluated.

One simple heuristic search algorithm is greedy search. First, we choose a network structure. Then, we evaluate Δ(e) for all e ∈ E, and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive value for Δ(e). When the criterion is separable, we can avoid recomputing all terms Δ(e) after every change. In particular, if neither X_i, X_j, nor their parents are changed, then Δ(e) remains unchanged for all changes e involving these nodes as long as the resulting network is acyclic. Candidates for the initial graph include the empty graph, a random graph, a graph determined by one of the polynomial algorithms described previously in this section, and the prior network.

A potential problem with any local-search method is getting stuck at a local maximum. One method for escaping local maxima is greedy search with random restarts. In this approach, we apply greedy search until we hit a local maximum. Then, we randomly perturb the network structure, and repeat the process for some manageable number of iterations.

Another method for escaping local maxima is simulated annealing. In this approach, we initialize the system at some temperature T0. Then, we pick some eligible change e at random, and evaluate the expression p = exp(Δ(e)/T0). If p > 1, then we make the change e; otherwise, we make the change with probability p. We repeat this selection and evaluation process α times or until we make β changes. If we make no changes in α repetitions,
then we stop searching. Otherwise, we lower the temperature by multiplying the current temperature T0 by a decay factor 0 < γ < 1, and continue the search process. We stop searching if we have lowered the temperature more than δ times. Thus, this algorithm is controlled by five parameters: T0, α, β, γ, and δ. To initialize this algorithm, we can start with the empty graph, and make T0 large enough so that almost every eligible change is made, thus creating a random graph. Alternatively, we may start with a lower temperature, and use one of the initialization methods described for local search.

Another method for escaping local maxima is best-first search (e.g., Korf, 1993). In this approach, the space of all network structures is searched systematically using a heuristic measure that determines the next best structure to examine. Chickering (1996) has shown that, for a fixed amount of computation time, greedy search with random restarts produces better models than does either simulated annealing or best-first search.

One important consideration for any search algorithm is the search space. The methods that we have described search through the space of Bayesian-network structures. Nonetheless, when the assumption of hypothesis equivalence holds, one can search through the space of network-structure equivalence classes. One benefit of the latter approach is that the search space is smaller. One drawback of the latter approach is that it takes longer to move from one element in the search space to another. Work by Spirtes and Meek (1995) and Chickering (1996) confirms these observations experimentally. Unfortunately, no comparisons are yet available that determine whether the benefits of equivalence-class search outweigh the costs.
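To make the use of separability concrete, here is a minimal sketch of greedy search over single-arc additions and deletions (arc reversals are omitted for brevity). It assumes the separable score is supplied as a function local_score(child, parent_set); the acyclicity check and the toy score at the end are illustrative assumptions, not part of the tutorial.

```python
def is_acyclic(parents):
    """Check that the parent-set map encodes a DAG by repeatedly peeling parentless nodes."""
    remaining = {x: set(ps) for x, ps in parents.items()}
    while remaining:
        leaves = [x for x, ps in remaining.items() if not ps]
        if not leaves:
            return False
        for leaf in leaves:
            del remaining[leaf]
        for ps in remaining.values():
            ps.difference_update(leaves)
    return True

def greedy_search(variables, local_score):
    """Greedy arc search: repeatedly apply the single-arc change with the largest
    positive score improvement Delta(e).  Separability means only the modified
    node's local score needs to be re-evaluated for each candidate change."""
    parents = {x: set() for x in variables}                     # start from the empty graph
    current = {x: local_score(x, frozenset()) for x in variables}
    while True:
        best_delta, best_move = 0.0, None
        for y in variables:
            for x in variables:
                if x == y:
                    continue
                new_pa = set(parents[y])
                new_pa.symmetric_difference_update({x})          # toggle the arc x -> y
                if not is_acyclic({**parents, y: new_pa}):
                    continue
                delta = local_score(y, frozenset(new_pa)) - current[y]
                if delta > best_delta:
                    best_delta, best_move = delta, (y, new_pa)
        if best_move is None:
            return parents
        y, new_pa = best_move
        parents[y] = new_pa
        current[y] = local_score(y, frozenset(new_pa))

# Illustrative (made-up) separable score that mildly rewards the single arc A -> B.
def toy_score(child, pa):
    return 1.0 if (child == "B" and pa == frozenset({"A"})) else 0.0

print(greedy_search(["A", "B", "C"], toy_score))
```

Greedy search with random restarts, as described above, would simply perturb the returned structure at random and call greedy_search again from the perturbed starting point.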
12. A Simple Example

Before we move on to other issues, let us step back and look at our overall approach. In a nutshell, we can construct both structure and parameter priors by constructing a Bayesian network (the prior network) along with additional assessments such as an equivalent sample size and causal constraints. We then use either Bayesian model selection, selective model averaging, or full model averaging to obtain one or more networks for prediction and/or explanation. In effect, we have a procedure for using data to improve the structure and probabilities of an initial Bayesian network.

Here, we present two artificial examples to illustrate this process. Consider again the problem of fraud detection from Section 3. Suppose we are given the database D in Table 12, and we want to predict the next case - that is, compute p(x_{N+1}|D). Let us assert that only two network-structure hypotheses have appreciable probability: the hypothesis corresponding to the network structure in Figure 3 (S1), and the hypothesis corresponding
TABLE 12. A database for the fraud problem.

Case   Fraud   Gas   Jewelry   Age     Sex
1      no      no    no        30-50   female
2      no      no    no        30-50   male
3      yes     yes   yes       >50     male
4      no      no    no        30-50   male
5      no      yes   no        <30     female
6      no      no    no        <30     female
7      no      no    no        >50     male
8      no      no    yes       30-50   female
9      no      yes   no        <30     male
10     no      no    no        <30     female
to the same structure with an arc added from Age to Gas (S2). Furthermore, let us assert that these two hypotheses are equally likely - that is, p(S1^h) = p(S2^h) = 0.5. In addition, let us use the parameter priors given by Equation 43, where α = 10 and p(x|S_c^h) is given by the prior network in Figure 3. Using Equations 34 and 35, we obtain p(S1^h|D) = 0.26 and p(S2^h|D) = 0.74. Because we have only two models to consider, we can model average according to Equation 33:

p(x_{N+1}|D) = 0.26 p(x_{N+1}|D, S1^h) + 0.74 p(x_{N+1}|D, S2^h)

where p(x_{N+1}|D, S^h) is given by Equation 27. (We don't display these probability distributions.) If we had to choose one model, we would choose S2, assuming the posterior-probability criterion is appropriate. Note that the data favor the presence of the arc from Age to Gas by a factor of three. This is not surprising, because in the two cases in the database where fraud is absent and gas was purchased recently, the card holder was less than 30 years old.
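As a minimal illustration of the model averaging just performed, the sketch below combines two structure-specific predictive distributions with the posterior weights 0.26 and 0.74. The per-model predictive values are placeholders, since the tutorial does not display them.

```python
def model_average(predictives, posteriors, x):
    """Bayesian model averaging over the structure hypotheses with
    appreciable posterior mass (Equation 33 restricted to two models)."""
    return sum(post * pred(x) for pred, post in zip(predictives, posteriors))

# Placeholder per-model predictive probabilities for one next case x_{N+1};
# these numbers are made up purely for illustration.
pred_s1 = lambda x: 0.10      # stands in for p(x_{N+1} | D, S1^h)
pred_s2 = lambda x: 0.04      # stands in for p(x_{N+1} | D, S2^h)
print(model_average([pred_s1, pred_s2], [0.26, 0.74], x=None))   # 0.0556
```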
An application of model selection, described by Spirtes and Meek (1995), is illustrated in Figure 6. Figure 6a is a hand-constructed Bayesian network for the domain of ICU ventilator management, called the Alarm network (Beinlich et al., 1989). Figure 6c is a random sample from the Alarm network of size 10,000. Figure 6b is a simple prior network for the domain. This network encodes mutual independence among the variables, and (not shown) uniform probability distributions for each variable.

Figure 6d shows the most likely network structure found by a two-pass greedy search in equivalence-class space. In the first pass, arcs were added until the model score did not improve. In the second pass, arcs were deleted
until the model score did not improve. Structure priors were uniform; and parameter priors were computed from the prior network using Equation 43 with α = 10. The network structure learned from this procedure differs from the true network structure only by a single arc deletion. In effect, we have used the data to improve dramatically the original model of the user.

13. Bayesian Networks for Supervised Learning
As we discussed in Section 5, the local distribution functions p(x_i | pa_i, θ_i, S^h) are essentially classification/regression models. Therefore, if we are doing supervised learning where the explanatory (input) variables cause the outcome (target) variable and data is complete, then the Bayesian-network and classification/regression approaches are identical.

When data is complete but input/target variables do not have a simple cause/effect relationship, tradeoffs emerge between the Bayesian-network approach and other methods. For example, consider the classification problem in Figure 5. Here, the Bayesian network encodes dependencies between findings and ailments as well as among the findings, whereas another classification model such as a decision tree encodes only the relationships between findings and ailment. Thus, the decision tree may produce more accurate classifications, because it can encode the necessary relationships with fewer parameters. Nonetheless, the use of local criteria for Bayesian-network model selection mitigates this advantage. Furthermore, the Bayesian network provides a more natural representation in which to encode prior knowledge, thus giving this model a possible advantage for sufficiently small sample sizes. Another argument, based on bias-variance analysis, suggests that neither approach will dramatically outperform the other (Friedman, 1996).

Singh and Provan (1995) compare the classification accuracy of Bayesian networks and decision trees using complete data sets from the University of California, Irvine Repository of Machine Learning databases. Specifically, they compare C4.5 with an algorithm that learns the structure and probabilities of a Bayesian network using a variation of the Bayesian methods we have described. The latter algorithm includes a model-selection phase that discards some input variables. They show that, overall, Bayesian networks and decision trees have about the same classification error. These results support the argument of Friedman (1996).

When the input variables cause the target variable and data is incomplete, the dependencies between input variables become important, as we discussed in the introduction. Bayesian networks provide a natural framework for learning about and encoding these dependencies. Unfortunately, no studies have been done comparing these approaches with other methods.
Figure 7. A Bayesian-network structure for AutoClass. The variable H is hidden. Its possible states correspond to the underlying classes in the data.
14. Bayesian Networks for Unsupervised Learning
The techniques described in this paper can be used for unsupervised learning. A simple example is the AutoClass program of Cheeseman and Stutz (1995), which performs data clustering. The idea behind AutoClass is that there is a single hidden (i.e., never observed) variable that causes the observations. This hidden variable is discrete, and its possible states correspond to the underlying classes in the data. Thus, AutoClass can be described by a Bayesian network such as the one in Figure 7. For reasons of computational efficiency, Cheeseman and Stutz (1995) assume that the discrete variables (e.g., D1, D2, D3 in the figure) and user-defined sets of continuous variables (e.g., {C1, C2, C3} and {C4, C5}) are mutually independent given H. Given a data set D, AutoClass searches over variants of this model (including the number of states of the hidden variable) and selects a variant whose (approximate) posterior probability is a local maximum.

AutoClass is an example where the user presupposes the existence of a hidden variable. In other situations, we may be unsure about the presence of a hidden variable. In such cases, we can score models with and without hidden variables to reduce our uncertainty. We illustrate this approach on a real-world case study in Section 16. Alternatively, we may have little idea about what hidden variables to model. The search algorithms of Spirtes et al. (1993) provide one method for identifying possible hidden variables in such situations.

Martin and VanLehn (1995) suggest another method. Their approach is based on the observation that if a set of variables are mutually dependent, then a simple explanation is that these variables have a single hidden common cause rendering them mutually independent. Thus, to identify possible hidden variables, we first apply some learning technique to select a model containing no hidden variables. Then, we look for sets of mutually dependent variables in this learned model. For each
such set of variables (and combinations thereof), we create a new model containing a hidden variable that renders those variables conditionally independent, and we then score the new and original models, possibly finding that one of the new models has a better score than the original. For example, the network structure in Figure 8a suggests the network structure containing a hidden variable shown in Figure 8b.

Figure 8. (a) A Bayesian-network structure learned for a set of observed variables. (b) A Bayesian-network structure containing a hidden variable (shaded) suggested by the structure in (a).

15. Learning Causal Relationships

As we have mentioned, the causal semantics of a Bayesian network provide a means by which we can learn causal relationships. In this section, we examine these semantics, and provide a basic discussion of how causal relationships can be learned from data. (The use of observational data to learn causal relationships is controversial; for critical discussions of this issue, see Spirtes et al. (1993), Pearl (1995), and Humphreys and Freedman (1995).)

For purposes of illustration, suppose we are marketing analysts who want to know whether or not we should increase, decrease, or leave alone the exposure of a particular advertisement in order to maximize our profit from the sales of a product. Let variables Ad (A) and Buy (B) represent whether or not an individual has seen the advertisement and has purchased the product, respectively. In one component of our analysis, we would like to learn the physical probability that B = true given that we force A to be true, and the physical probability that B = true given that we force A
to be false.16 We denote these probabilities p(b|â) and p(b|ā̂), respectively, where a circumflex indicates that the state of A is set by intervention. One method that we can use to learn these probabilities is to perform a randomized experiment: select two similar populations at random, force A to be true in one population and false in the other, and observe B. This method is conceptually simple, but it may be difficult or expensive to find two similar populations that are suitable for the study.

An alternative method follows from causal knowledge. In particular, suppose A causes B. Then, whether we force A to be true or simply observe that A is true in the current population, the advertisement should have the same causal influence on the individual's purchase. Consequently, p(b|â) = p(b|a), where p(b|a) is the physical probability that B = true given that we observe A = true in the current population. Similarly, p(b|ā̂) = p(b|ā). In contrast, if B causes A, forcing A to some state should not influence B at all. Therefore, we have p(b|â) = p(b|ā̂) = p(b). In general, knowledge that X causes Y allows us to equate p(y|x) with p(y|x̂), where x̂ denotes the intervention that forces X to be x. For purposes of discussion, we use this rule as an operational definition for cause. Pearl (1995) and Heckerman and Shachter (1995) discuss versions of this definition that are more complete and more precise.

In our example, knowledge that A causes B allows us to learn p(b|â) and p(b|ā̂) from observations alone - no randomized experiment is needed. But how are we to determine whether or not A causes B? The answer lies in an assumption about the connection between causal and probabilistic dependence known as the causal Markov condition, described by Spirtes et al. (1993). We say that a directed acyclic graph C is a causal graph for variables X if the nodes in C are in a one-to-one correspondence with X, and there is an arc from node X to node Y in C if and only if X is a direct cause of Y. The causal Markov condition says that if C is a causal graph for X, then C is also a Bayesian-network structure for the joint physical probability distribution of X. In Section 3, we described a method based on this condition for constructing Bayesian-network structure from causal assertions. Several researchers (e.g., Spirtes et al., 1993) have found that this condition holds in many applications.

Given the causal Markov condition, we can infer causal relationships from conditional-independence and conditional-dependence relationships that we learn from the data.17 Let us illustrate this process for the marketing example. Suppose we have learned (with high Bayesian probability)

16 It is important that these interventions do not interfere with the normal effect of A on B. See Heckerman and Shachter (1995) for a discussion of this point.
17 Spirtes et al. (1993) also require an assumption known as faithfulness. We do not need to make this assumption explicit, because it follows from our assumption that p(θs|S^h) is a probability density function.
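The following Monte Carlo sketch illustrates the distinction drawn above, under the made-up assumption that A causes B with the parameters shown: conditioning on observing A = true and forcing A = true then give the same probability for B, whereas if B caused A the interventional probability would instead equal p(b). All names and numbers are illustrative only.

```python
import random
random.seed(0)

# Illustrative (made-up) parameters for a two-variable domain: Ad (A) and Buy (B).
p_a = 0.3                                # p(A = true)
p_b_given = {True: 0.6, False: 0.2}      # p(B = true | A) when A causes B

def sample_a_causes_b(force_a=None):
    """Draw one case from the 'A causes B' model, optionally forcing A's value."""
    a = force_a if force_a is not None else (random.random() < p_a)
    b = random.random() < p_b_given[a]
    return a, b

def estimate(n, force_a=None):
    """Estimate p(B=true | A=true) by conditioning (force_a=None),
    or p(B=true | â) by forcing A=true in every sampled case."""
    samples = [sample_a_causes_b(force_a) for _ in range(n)]
    if force_a is None:
        obs = [b for a, b in samples if a]
        return sum(obs) / len(obs)
    return sum(b for _, b in samples) / n

n = 200_000
print(round(estimate(n), 3), round(estimate(n, force_a=True), 3))  # both close to 0.6
```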
Figure 9. (a) Causal graphs showing four explanations for an observed dependence between Ad and Buy. The node H corresponds to a hidden common cause of Ad and Buy. The shaded node S indicates that the case has been included in the database. (b) A Bayesian network for which A causes B is the only causal explanation, given the causal Markov condition.
that the physical probabilities p(b|a) and p(b|ā) are not equal. Given the causal Markov condition, there are four simple causal explanations for this dependence: (1) A is a cause for B, (2) B is a cause for A, (3) there is a hidden common cause of A and B (e.g., the person's income), and (4) A and B are causes for data selection. This last explanation is known as selection bias. Selection bias would occur, for example, if our database failed to include instances where A and B are false. These four causal explanations for the presence of the arcs are illustrated in Figure 9a. Of course, more complicated explanations - such as the presence of a hidden common cause and selection bias - are possible.

So far, the causal Markov condition has not told us whether or not A causes B. Suppose, however, that we observe two additional variables: Income (I) and Location (L), which represent the income and geographic location of the possible purchaser, respectively. Furthermore, suppose we learn (with high probability) the Bayesian network shown in Figure 9b. Given the causal Markov condition, the only causal explanation for the conditional-independence and conditional-dependence relationships encoded in this Bayesian network is that Ad is a cause for Buy. That is, none of the other explanations described in the previous paragraph, or combinations thereof, produce the probabilistic relationships encoded in Figure 9b. Based on this observation, Pearl and Verma (1991) and Spirtes et al. (1993) have created algorithms for inferring causal relationships from dependence relationships for more complicated situations.
TABLE 2. Sufficient statistics for the Sewell and Shah (1968) study.

  4  349  13   64   9  207  33   72
 12  126  38   54  10   67  49   43
  2  232  27   84   7  201  64   95
 12  115  93   92  17   79 119   59
  8  166  47   91   6  120  74  110
 17   92 148  100   6   42 198   73
  4   48  39   57   5   47 123   90
  9   41 224   65   8   17 414   54
 24    5 454    9  44    5 312   14
 47    8 216   20  35   13  96   28
 11  285  29   61  19  236  47   88
 12  164  62   85  15  113  72   50
  7  163  36   72  13  193  75   90
 12  174  91  100  20   81 142   77
  6   50  36   58   5   70 110   76
 12   48 230   81  13   49 360   98
Reproduced by permission from the University of Chicago Press. © 1968 by The University of Chicago. All rights reserved.

16. A Case Study: College Plans
Real-world applications of techniques that we have discussed can be found
in Madigan and Raftery (1994), Lauritzen et al. (1994), Singh and Provan (1995), and Friedman and Goldszmidt (1996). Here, we consider an application that comes from a study by Sewell and Shah (1968), who investigated factors that influence the intention of high school students to attend college. The data have been analyzed by several groups of statisticians, including Whittaker (1990) and Spirtes et al. (1993), all of whom have used non-Bayesian techniques.

Sewell and Shah (1968) measured the following variables for 10,318 Wisconsin high school seniors: Sex (SEX): male, female; Socioeconomic Status (SES): low, lower middle, upper middle, high; Intelligence Quotient (IQ): low, lower middle, upper middle, high; Parental Encouragement (PE): low, high; and College Plans (CP): yes, no. Our goal here is to understand the (possibly causal) relationships among these variables.
The data are described by the sufficient statistics in Table 2. Each entry denotes the number of cases in which the five variables take on some particular configuration. The first entry corresponds to the configuration SEX = male, SES = low, IQ = low, PE = low, and CP = yes. The remaining entries correspond to configurations obtained by cycling through the states of each variable such that the last variable (CP) varies most quickly. Thus, for example, the upper (lower) half of the table corresponds to male (female) students.
Figure 10. The a posteriori most likely network structures without hidden variables. For the most likely structure, log p(D|S^h) = -45653 and p(S^h|D) ≈ 1.0; for the second most likely, log p(D|S^h) = -45699 and p(S^h|D) = 1.2 × 10^-10.

As a first pass, we analyzed the data assuming no hidden variables. To generate priors for network parameters, we used the method described in Section 10.1 with an equivalent sample size of 5 and a prior network where
p(x|S_c^h) is uniform. (The results were not sensitive to the choice of parameter priors. For example, none of the results reported in this section changed qualitatively for equivalent sample sizes ranging from 3 to 40.) For structure priors, we assumed that all network structures were equally likely, except we excluded structures where SEX and/or SES had parents, and/or CP had children. Because the data set was complete, we used Equations 34 and 35 to compute the posterior probabilities of network structures. The two most likely network structures that we found after an exhaustive search over all structures are shown in Figure 10. Note that the most likely graph has a posterior probability that is extremely close to one.

If we adopt the causal Markov assumption and also assume that there are no hidden variables, then the arcs in both graphs can be interpreted causally. Some results are not surprising - for example the causal influence of socioeconomic status and IQ on college plans. Other results are more interesting. For example, from either graph we conclude that sex influences college plans only indirectly through parental influence. Also, the two graphs differ only by the orientation of the arc between PE and IQ. Either causal relationship is plausible. We note that the second most likely graph was selected by Spirtes et al. (1993), who used a non-Bayesian approach with slightly different assumptions.

The most suspicious result is the suggestion that socioeconomic status has a direct influence on IQ. To question this result, we considered new models obtained from the models in Figure 10 by replacing this direct influence with a hidden variable pointing to both SES and IQ. We also considered models where the hidden variable pointed to SES, IQ, and PE, and none, one, or both of the connections SES - PE and PE - IQ were removed. For each structure, we varied the number of states of the hidden variable from two to six. We computed
the posterior probability of these models using the Cheeseman-Stutz (1995) variant of the Laplace approximation. To find the MAP parameter configuration, we used the EM algorithm, taking the largest local maximum from among 100 runs with different random initializations of θs.

The model with the highest posterior probability among those we considered is shown in Figure 11. This model is 2 × 10^10 times more likely than the best model containing no hidden variable. The next most likely model containing a hidden variable, which has one additional arc from the hidden variable to PE, is much less likely than the model shown. Thus, if we adopt the causal Markov assumption and also assume that there are no hidden variables other than the one we have considered, then we have strong evidence that a hidden variable is influencing both socioeconomic status and IQ in this population - although we can not say anything about the identity of the variable. An examination of the probabilities in Figure 11 suggests that the hidden variable corresponds to some measure of "parent quality".

Figure 11. The a posteriori most likely network structure containing a hidden variable H. Probabilities shown are MAP values; some probabilities are omitted for lack of space. log p(S^h|D) = -45629.

  p(H=0) = 0.63, p(H=1) = 0.37

  p(SES=high|H):  H=0: 0.088;  H=1: 0.51

  p(IQ=high|PE,H):  PE=low, H=0: 0.098;  PE=low, H=1: 0.22;  PE=high, H=0: 0.21;  PE=high, H=1: 0.49

  p(PE=high|SES,SEX):  low, male: 0.32;  low, female: 0.166;  high, male: 0.86;  high, female: 0.81

  p(CP=yes|SES,IQ,PE):  low, low, low: 0.011;  low, low, high: 0.170;  low, high, low: 0.124;  low, high, high: 0.53;  high, low, low: 0.093;  high, low, high: 0.39;  high, high, low: 0.24;  high, high, high: 0.84

17. Pointers to Literature and Software

Like all tutorials, this one is incomplete. For those readers interested in learning more about graphical models and methods for learning them, we offer the following additional references and pointers to software. Buntine (1996) provides a guide to the literature on learning graphical models. Spirtes et al. (1993) and Pearl (1995) use methods based on large-sample approximations to learn causal relationships from observational data, as we have discussed.

In addition to directed models, researchers have explored network structures containing undirected edges as a knowledge representation. These representations are discussed in, for example, Lauritzen (1982), Verma and Pearl (1990), Frydenberg (1990), Whittaker (1990), and Richardson (1997). Bayesian methods for learning such models from data are described by Dawid and Lauritzen (1993) and Buntine (1994).

Finally, several research groups have developed software systems for learning graphical models. For example, Scheines et al. (1994) have developed a software program called TETRAD II for learning about cause and effect. Badsberg (1992) and Højsgaard et al. (1994) have built systems that can learn with mixed graphical models using a variety of criteria for model selection. Thomas, Spiegelhalter, and Gilks (1992) have created a system called BUGS that takes a learning problem specified as a Bayesian network and compiles this problem into a Gibbs-sampler computer program.

Acknowledgments

I thank Max Chickering, Usama Fayyad, Eric Horvitz, Chris Meek, Koos Rommelse, and Padhraic Smyth for their comments on earlier versions of this manuscript. I also thank Max Chickering for implementing the software used to analyze the Sewell and Shah (1968) data, and Chris Meek for bringing this data set to my attention.
Notation

X, Y, Z, ...       Variables or their corresponding nodes in a Bayesian network
X, Y, Z, ...       Sets of variables or corresponding sets of nodes
X = x              Variable X is in state x
X = x              The set of variables X is in configuration x
x, y, z            Typically refer to a complete case, an incomplete case, and missing data in a case, respectively
X \ Y              The variables in X that are not in Y
D                  A data set: a set of cases
D_l                The first l - 1 cases in D
p(x|y)             The probability that X = x given Y = y (also used to describe a probability density, probability distribution, and probability density function)
E_p(.)(x)          The expectation of x with respect to p(.)
S                  A Bayesian network structure (a directed acyclic graph)
Pa_i               The variable or node corresponding to the parents of node X_i in a Bayesian network structure
pa_i               A configuration of the variables Pa_i
r_i                The number of states of discrete variable X_i
q_i                The number of configurations of Pa_i
S_c                A complete network structure
S^h                The hypothesis corresponding to network structure S
θ_ijk              The multinomial parameter corresponding to the probability p(X_i = x_i^k | Pa_i = pa_i^j)
θ_ij               = (θ_ij2, ..., θ_ijr_i)
θ_i                = (θ_i1, ..., θ_iq_i)
θ_s                = (θ_1, ..., θ_n)
α                  An equivalent sample size
α_ijk              The Dirichlet hyperparameter corresponding to θ_ijk
α_ij               = Σ_{k=1}^{r_i} α_ijk
N_ijk              The number of cases in data set D where X_i = x_i^k and Pa_i = pa_i^j
N_ij               = Σ_{k=1}^{r_i} N_ijk
= Lk '= l Nijk
350
DAVIDHECKERMAN
References Aliferis , C. and Cooper, G. ( 1994). An evaluation of an algorithm for inductive learning of Bayesian belief networks using simulated data sets . In Proceedings of Tenth Con ference on Urtcertainty in Artificial Intelligence , Seattle , WA , pages 8- 14. Morgan Kaufmann
.
Badsberg, J. ( 1992). Model search in contingency tables by CoCo. In Dodge, Y . and Whittaker , J ., editors , Computational delberg .
Statistics , pages 251- 256. Physica Verlag , Hei -
Becker, S. and LeCun, Y . (1989). Improving the convergenceof back-propagation learning with second order methods . In Proceedings of the 1988 Connectionist School , pages 29- 37 . Morgan Kaufmann .
Models Summer
Beinlich, I ., Suermondt, H ., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system : A case study with two probabilistic inference techniques for belief networks . In Proceedings of the Second European Conference on Artificial Intelli gence in Medicine , London , pages 247- 256. Springer Verlag , Berlin .
Bernardo, J. ( 1979). Expected information as expected utility . Annals of Statistics , 7 :686 - 690 .
Bernardo, J. and Smith , A . (1994) . Bayesian Theory. John Wiley and Sons, New York . Buntine , W . ( 1991). Theory refinement on Bayesian networks. In Proceedingsof Seventh Conference on Uncertainty Morgan Kaufmann .
in Artificial
Intelligence , Los Angeles , CA , pages 52- 60 .
Buntine , W . (1993). Learning classification trees. In Artificial Intelligence Frontiers in Statistics : AI and statistics III . Chapman and Hall , New York .
Buntine , W . (1996) . A guide to the literature on learning graphical models. IEEE Transactions
on Knowledge and Data Engineering , 8:195- 210.
Chaloner, K . and Duncan, G. (1983). Assessmentof a beta prior distribution : PM elicitation
.
The
Statistician
, 32 : 174 - 180 .
Cheeseman, P. and Stutz , J. ( 1995). Bayesian classification (AutoClass) : Theory and results . In Fayyad , U ., Piatesky -Shapiro , G ., Smyth , P., and Uthurusamy , R ., editors , Advances in Knowledge Discovery and Data Mining , pages 153- 180. AAAI Press , Menlo
Park
, CA .
Chib , S. (1995). Marginal likelihood from the Gibbs output . Journal of the American Statistical
Association
, 90 : 1313 - 1321 .
Chickering, D . (1995). A transformational characterization of equivalent Bayesian network structures . In Proceedings of Eleventh Conference on Uncertainty Intelligence , Montreal , QU , pages 87- 98 . Morgan Kaufmann .
in Artificial
Chickering, D . (1996). Learning equivalence classesof Bayesian-network structures . In Proceedings of Twelfth Conference on Uncertainty 0 R . Morgan Kaufmann .
in Artificial
Intelligence , Portland ,
Chickering, D ., Geiger, D ., and Heckerman, D. ( 1995) . Learning Bayesian networks: Search methods and experimental results . In Proceedings of Fifth Conference on Artificial Intelligence and Statistics , Ft . Lauderdale , FL , pages 112- 128. Society for Artificial
Intelli
~ ence
in
Statistics
.
Chickering, D . and Heckerman, D . (Revised November, 1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network . Technical Report MSR - TR -96- 08 , Microsoft Research , Redmond , W A .
Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks (Research note). Artificial Intelligence, 42:393- 405. Cooper, G. and Herskovits, E. (1992) . A Bayesian method for the induction of probabilistic
networks from data . Machine Learning , 9:309- 347.
Cooper, G. and Herskovits, E. (January, 1991). A Bayesian method for the induction
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 351 Technical Report SMI-91- 1, Section on Medical
I
American Journal of
pages
Heckerman, D., The combination
Heckerman,
networks.
Heckerman,
D.
soning.
Journal
Hłjsgaard, report,
Geiger, D., and of knowledge
D.,
Bayesian
Mamdani,
and
of
S., Skjłth, Department
Howard, mentation.
HECKERMAN
A.,
Communications
Shachter,
Artificial
Strategic
of
Jaakkola,
in
T.
intractable
Artificial Jensen, F. Jensen, F.
Decision of the
IEEE,
Kass,
R.,
pages
M.
(1996).
In
Tierney,
L.,
J.,
DeGroot,
B.
Mathematical
and
M.,
Oxford
(1993).
B,
Pe
P
(1996).
Computing
Proceedings
repor K. (
University
(1936).
On
Lindley,
structures
J. (1988) D.,
Press.
an
distributions adm 39:399 409.
Society,
Linear-space
50:157 224.
Computational Bayes facto
Kadane,
Lauritzen, S. (1982). Lectures Aalborg, Denmark. Lauritzen, S. (1992). Propagation cal association models. Journal Lauritzen, S. and Spiegeihalter, graphical Society
(1 Com
Menlo
D.
based Denmark. systems. Technical Lauritzen, S., and Olesen,
261 278.
R.
and
B.
analysis: 58:632 643.
Group,
Freedman, 47:113 118.
Jordan,
Resear
Aalborg,
Koopman, the American Korf,
Decisions
networks.
A
Dec
J. (1981). Influence on the Principles Strategic Decisions J., editors (1983).
by local computations. and 90:773Raftery, A. (1995). 795.
Bernardo, 3,
and
the
(1995).
Intelligence, Portland, OR, pages (1996). An Introduction to Bayesian and Andersen, S. (1990). Approximat
knowledge University, Jensen, F.,
models Kass, R. Association,
P. and Science,
R.
Weilman
of
Intelligence
Howard, R. and Matheson, son, J., editors, Readings volume II, pages 721 762. Howard, R. and Matheson, Analysis.
and
F., and Thiesson, of Mathematics
R. Proceedings (1970).
Humphreys, Philosphy
Chickering, and statistical
and
best-first
search
on
Contingenc of probabi of the America D. (1988).
their
application
Lauritzen, S., Thiesson, B., and Spiegethalter, model selection methods: A case study. Al and NewStatistics IV, volume Lecture Notes Verlag, York. MacKay,
MacKay, Neural MacKay, Cavendish
D.
Madigan, D., the predictive tics:
Madigan, tainty
(1992a).
D. (1992b). Computation, D.Laboratory, (1996).
Theory
in
Bayesian
Garvin, J., performance
and
D. and graphical
interpolation.
A practical 4:448 472. Choice of Cambridge,
Methods,
Raftery, models
and
Bayesian basis for th UK.
Raftery, of Bayesian
24:2271 2292.
A.
(1995) graphic
A. (1994). Model using Occam s win
Interna -
909. Neal, R. (1993 ). Probabilistic inference usingMarkovchainMonteCarlomethods . TechnicalReportCRG-TR-93-1, Department of ComputerScience , Universityof Toronto. Olmsted , S. (1983 ). Onrepresenting andsolvingdecision problems . PhDthesis , Departmentof Engineering -EconomicSystems , StanfordUniversity . Pearl, J. (1986 ). Fusion , propagation , and structuringin beliefnetworks . Artificial Intelligence , 29:241-288. Pearl, J. (1995 ). Causaldiagramsfor empiricalresearch . Biometrika , 82:669- 710. Pearl, J. andVerma,T. (1991 ) . A theoryof inferredcausation . In Allen, J., Fikes, R., andSandewall , E., editors, Knowledge Representation andReasoning : Proceedings of theSecond InternationalConference , pages441-452. MorganKaufmann , NewYork. Pitman, E. (1936 ) . Sufficientstatisticsandintrinsicaccuracy . Proceedings of the CambridgePhilosophy Society , 32:567-579. Raftery, A. (1995 ). Bayesian modelselection in socialresearch . In Marsden , P., editor, Sociological Methodology . Blackwells , Cambridge , MA. Raftery, A. (1996 ). Hypothesis testingandmodelselection , chapter10. Chapmanand Hall. Ramamurthi , K. and Agogino , A. (1988 ). Realtime expertsystemfor fault tolerant supervisory control. In Tipnis, V. andPatton, E., editors , Computers in Engineering , pages333-339. AmericanSocietyof Mechanical Engineers , CorteMadera , CA. Ramsey , F. (1931 ). Truth andprobability . In Braithwaite , R., editor, TheFoundations of Mathematics andotherLogicalEssays . HumanitiesPress , London . Reprintedin KyburgandSmokIer , 1964 . Richardson , T. (1997 ) . Extensions of undirectedand acyclic , directedgraphicalmodels. In Proceedings of SixthConference on Artificial Intelligence andStatistics , Ft. Lauderdale , FL, pages407-419. Societyfor Artificial Intelligence in Statistics . Rissanen , J. (1987 ) . Stochastic complexity(with discussion ). Journalof theRoyalStatisticalSociety , SeriesB, 49:223-239and253-265. Robins , J. (1986 ). A newapproach to causalinterence in mortalitystudieswith sustained exposure results. Mathematical Modelling , 7:1393 - 1512 . Rubin, D. (1978 ). Bayesian inference for causaleffects : Theroleof randomization . Annals of Statistics , 6:34-58. Russell , S., Binder, J., Koller, D., andKanazawa , K. (1995 ). Locallearningin probabilis tic networkswith hiddenvariables . In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence , Montreal , QU, pages1146 - 1152 . Morgan Kaufmann , SanMateo, CA. Saul, L., Jaakkola , T., and Jordan , M. (1996 ). Meanfield theoryfor sigmoidbelief networks . Journalof Artificial Intelligence Research , 4:61- 76. Savage , L. (1954 ) . TheFoundations of Statistics . Dover,NewYork. Schervish , M. (1995 ). Theoryof Statistics . Springer -Verlag . Schwarz , G. (1978 ). Estimatingthedimension of a model.AnnalsofStatistics , 6:461-464. Sewell , W. andShah,V. (1968 ). Socialclass , parentalencouragement , andeducational aspirations . AmericanJournalof Sociology , 73:559-572. Shachter , R. (1988 ). Probabilistic inference andinfluence diagrams . Operations Research ,
36 : 589 Shachter
- 604
.
, R . , Andersen
posable
graphs
, S . , and .
In
Intelligence
, Boston
Intelligence
, Mountain
Shachter , R . and 35 : 527 - 550 . Silverman and Singh
, M . and
Provan
classifiers
Spiegelhalter expert
Density .
, D . and on
directed
, P . and
data
.
Data
In
Meek
Mining
Suermondt
, H . and
MS
Truesson .
Data
In
Mining
Truesson
, B .
complete Aalborg ,
( 1995b
) .
A .,
) .
A . and . Science
, T . and
, J . ( 1990
ings of Sixth Conference 220 - 227 . Morgan Kaufmann , J . ( 1990 Sons .
) .
Graphical
Artificial
-
,
Chapman
Bayesian
Information
net
-
Science
. in
, R . ( 1993
) .
of
- 605
decision
analysis
Bayesian
.
analysis
conditional
in
probabili
-
.
Causation
, Prediction
networks
Kaufmann
) .
A
and
,
and
Search
:
- 311
( 1992
In
Bernardo
Judgment
and
Uncertainty .
in
in
and
Applied
Bugs
:
A
- 842
. Oxford
- Wesley
.
synthesis
of
Multivariate
-
incomplete
models
and
with
causal
to
, J . , Dawid
perform
Press
Heuristics
models
.
, Boston
Statistics
,
, A . , and
University
:
in -
University
program
uncertainty
Intelligence
, 5 : 521
Discovery
, Aalborg
, J . , Berger
under
Artificial
with
exponential
) .
inference
.
Systems
837
for
I ( nowledge
Kaufmann
recursive
. Addison
) . Equivalence
from
Reasoning
networks on
Electronic
. 4, pages
) .
variables Discovery
algorithms
Bayesian
W . .
Analysis
Models
exact Approximate
. Morgan
of
,
Statistics
on
of
for
Gilks
, D . ( 1974 - 1131 .
of of
Conference
306
discrete
I ( nowledge
.
Journal
sampling
Data
with on
combination
, Institute
Gibbs
, Bayesian
Kahneman , 185 : 1124
in
.
selective
encoding
Conference
information
D .,
Exploratory
Pearl
Bayesian
, pages
and
,
Artificial
Science
Analysis
and
updating
) .
quantification
report
using
, A . , editors
, J . ( 1977
Tversky , biases
Score
of
, PA
, 20 : 579
( 1993
International
, QU
Spiegelhalter inference
, R .
. International
First
, Montreal
learning
) . Probability
Networks
. Morgan
Accelerated of
decom
in
Management
Data
- 95 - 36 , Computer
International
, G . ( 1991
data . Technical , Denmark .
Bayesian Smith
) .
) . Efficient
) . Sequential .
.
and
, S . , and Cowell , 8 : 219 - 282 .
) . Learning
, QU
Proceedings
Uncertainty
diagrams
, Philadelphia
Scheines
networks
, B . ( 1995a
data
for
Statistics
- CIS
, C . ( 1975 - 358 .
First
Cooper
belief
and
Uncertainty
.
, C . ( 1995
, Montreal
on Bayesian 542 .
Whittaker and
, 1995
, S . ( 1990
of
Association
for
structures
, C . , and , New York
algorithms
on
influence
Pennsylvania
Lauritzen
.
) . Gaussian
Report
graphical
Proceedings
- 244
reduction
Conference
.
, G . ( November
of
) . Directed
Sixth
Estimation
Stael von Holstein Science , 22 : 340
Spirtes , P . , Glymour Springer - Verlag
Verma
237
, CA
, C . ( 1989
the
, D . , Dawid , A . , Lauritzen systems . Statistical Science
ties
Thomas
, pages
, University
Spiegelhalter
Spirtes
MA
. Technical
Department
, K . ( 1990 of
View
Kenley
Spetzler , C . and Management
Tukey
,
, B . ( 1986 ) . Hall , New York
work
Poh
Proceedings
.
In
.
and
Proceed
, MA
, pages
John
Wiley
-
Winkler, R. (1967). The assessment of prior distributions in Bayesian analysis. American Statistical Association Journal, 62:776-800.
A VIEW OF THE EM ALGORITHM THAT JUSTIFIES INCREMENTAL, SPARSE, AND OTHER VARIANTS

RADFORD M. NEAL
Dept. of Statistics and Dept. of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~radford/

GEOFFREY E. HINTON
Department of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~hinton/

Abstract. The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that resembles negative free energy and show that the M step of the algorithm maximizes this function with respect to the model parameters and the E step maximizes it with respect to the distribution over the unobserved variables. From this perspective, it is easy to justify an incremental variant of the EM algorithm in which the distribution for only one of the unobserved variables is recalculated in each E step. This variant is shown empirically to give faster convergence in a mixture estimation problem. A variant of the algorithm that exploits sparse conditional distributions is also described, and a wide range of other variant algorithms are also seen to be possible.

1. Introduction

The Expectation-Maximization (EM) algorithm finds maximum likelihood parameter estimates in problems where some variables were unobserved. Special cases of the algorithm date back several decades, and its use has grown even more widespread since its generality and wide applicability were discussed by Dempster, Laird, and Rubin (1977). The scope of the algorithm's applications is evident in the book by McLachlan and Krishnan (1997).
The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an Expectation (E) step, which finds the distribution for the unobserved variables, given the known values for the observed variables and the current estimate of the parameters, and a Maximization (M) step, which re-estimates the parameters to be those with maximum likelihood, under the assumption that the distribution found in the E step is correct. It can be shown that each such iteration improves the true likelihood, or leaves it unchanged (if a local maximum has already been reached, or in uncommon cases, before then).

The M step of the algorithm may be only partially implemented, with the new estimate for the parameters improving the likelihood given the distribution found in the E step, but not necessarily maximizing it. Such a partial M step always results in the true likelihood improving as well. Dempster et al. refer to such variants as "generalized EM (GEM)" algorithms. A sub-class of GEM algorithms of wide applicability, the "Expectation-Conditional Maximization (ECM)" algorithms, have been developed by Meng and Rubin (1992), and further generalized by Meng and van Dyk (1997).

In many cases, partial implementation of the E step is also natural. The unobserved variables are commonly independent, and influence the likelihood of the parameters only through simple sufficient statistics. If these statistics can be updated incrementally when the distribution for one of the variables is re-calculated, it makes sense to immediately re-estimate the parameters before performing the E step for the next unobserved variable, as this utilizes the new information immediately, speeding convergence. An incremental algorithm along these general lines was investigated by Nowlan (1991). However, such incremental variants of the EM algorithm have not previously received any formal justification.

We present here a view of the EM algorithm in which it is seen as maximizing a joint function of the parameters and of the distribution over the unobserved variables that is analogous to the "free energy" function used in statistical physics, and which can also be viewed in terms of a Kullback-Liebler divergence. The E step maximizes this function with respect to the distribution over unobserved variables; the M step with respect to the parameters. Csiszar and Tusnady (1984) and Hathaway (1986) have also viewed EM in this light.

In this paper, we use this viewpoint to justify variants of the EM algorithm in which the joint maximization of this function is performed by other means - a process which must also lead to a maximum of the true likelihood. In particular, we can now justify incremental versions of the algorithm, which in effect employ a partial E step, as well as "sparse"
versions, in which most iterations update only that part of the distribution for an unobserved variable pertaining to its most likely values, and "winner-take-all" versions, in which, for early iterations, the distributions over unobserved variables are restricted to those in which a single value has probability one. We include a brief demonstration showing that use of an incremental algorithm speeds convergence for a simple mixture estimation problem.
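The following is a minimal sketch (not the authors' code) of standard batch EM for a two-component, unit-variance Gaussian mixture, the kind of model used in the demonstration mentioned above; all names and parameter values are illustrative. An incremental variant in the spirit of this paper would, after updating the responsibility of a single case, immediately refresh the sufficient statistics and re-estimate the parameters before moving on to the next case.

```python
import math, random
random.seed(1)

# Synthetic one-dimensional data from two components (made-up parameters).
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(4.0, 1.0) for _ in range(200)]

def em_mixture(data, iters=50):
    """Standard (batch) EM for a two-component Gaussian mixture with unit variances.
    Each E step computes responsibilities for every case; each M step then
    re-estimates the mixing proportion and the two component means."""
    pi, mu = 0.5, [min(data), max(data)]            # crude initialization
    for _ in range(iters):
        # E step: posterior probability that each case came from component 1.
        resp = []
        for x in data:
            p1 = pi * math.exp(-0.5 * (x - mu[1]) ** 2)
            p0 = (1 - pi) * math.exp(-0.5 * (x - mu[0]) ** 2)
            resp.append(p1 / (p0 + p1))
        # M step: maximize the expected complete-data log likelihood.
        n1 = sum(resp)
        pi = n1 / len(data)
        mu[1] = sum(r * x for r, x in zip(resp, data)) / n1
        mu[0] = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - n1)
    return pi, mu

print(em_mixture(data))   # roughly (0.5, [0.0, 4.0]) on this synthetic data
```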
2. General theory

Suppose that we have observed the value of some random variable, Z, but not the value of another variable, Y, and that based on this data, we wish to find the maximum likelihood estimate for the parameters of a model for Y and Z. We assume that this problem is not easily solved directly, but that the corresponding problem in which Y is also known would be more tractable. For simplicity, we assume here that Y has a finite range, as is often the case, but the results can be generalized.

Assume that the joint probability for Y and Z is parameterized using θ, as P(y, z | θ). The marginal probability for Z is then P(z | θ) = Σ_y P(y, z | θ). Given observed data, z, we wish to find the value of θ that maximizes the log likelihood, L(θ) = log P(z | θ).

The EM algorithm starts with some initial guess at the maximum likelihood parameters, θ^(0), and then proceeds to iteratively generate successive estimates, θ^(1), θ^(2), ..., by repeatedly applying the following two steps, for t = 1, 2, ...

E Step: Compute a distribution P̃^(t) over the range of Y such that P̃^(t)(y) = P(y | z, θ^(t-1)).        (1)

M Step: Set θ^(t) to the θ that maximizes E_{P̃^(t)}[log P(y, z | θ)].

Here, E_P̃[·] denotes expectation with respect to the distribution over the range of Y given by P̃. Note that in preparation for the later generalization, the standard algorithm has here been expressed in a slightly non-standard fashion.

The E step of the algorithm can be seen as representing the unknown value for Y by a distribution of values, and the M step as then performing maximum likelihood estimation for the joint data obtained by combining this with the known value of Z, an operation that is assumed to be feasible.

As shown by Dempster et al., each EM iteration increases the true log likelihood, L(θ), or leaves it unchanged. Indeed, for most models, the algorithm will converge to a local maximum of L(θ) (though there are exceptions to this). Such monotonic improvement in L(θ) is also guaranteed for any
GEM algorithm, in which the M step is only partially performed, in the sense that θ^(t) is simply set to some value of θ such that E_{P̃^(t)}[log P(y, z | θ^(t))] is greater than E_{P̃^(t)}[log P(y, z | θ^(t-1))] (or equal, if a local maximum has already been reached).

In this paper, we view the E and M steps of the EM algorithm, and of its variants, as each maximizing (or at least increasing) the same function, F, of both the parameters θ and of a distribution P̃ over the range of the unobserved variables. The E step maximizes F with respect to P̃, and the M step with respect to θ; a partial E step or partial M step only partially performs the corresponding maximization. A local maximum of F occurs at a local maximum of L as well. From this perspective, we can therefore contemplate a wide variety of algorithms that maximize F by other means, among which are incremental algorithms, in which the E step updates the distribution for only one of the unobserved variables (corresponding to one data item) at a time.

The function F(P̃, θ) is defined as follows:

F(P̃, θ) = E_P̃[log P(y, z | θ)] + H(P̃)        (2)

where H(P̃) = -E_P̃[log P̃(y)] is the entropy of the distribution P̃. Note that F is a function of θ and of the distribution P̃ over the values of Y; the observed data z is fixed throughout. For simplicity, we assume that P(y, z | θ) is non-zero for the values of y under consideration, so that F is well defined, and that P(y, z | θ) is a continuous function of θ, so that F varies continuously with θ.

Apart from a change of sign, F is analogous to the "free energy" function used in statistical physics, with P̃ corresponding to a distribution over physical states, -log P(y, z | θ) corresponding to the energy of state y, and the maximizing distribution corresponding to the Boltzmann distribution, whose normalizing constant is the partition function. One can also relate F to the Kullback-Liebler divergence between P̃ and the distribution P_θ given by P_θ(y) = P(y | z, θ):

F(P̃, θ) = -D(P̃ ‖ P_θ) + L(θ)        (3)

where D(P̃ ‖ P_θ) = E_P̃[log P̃(y) - log P_θ(y)] is the Kullback-Liebler divergence, which is always non-negative and is zero only when P̃ = P_θ. From this, one can see that, for a fixed θ, the value of P̃ that maximizes F is P_θ, and that the maximum value of F is then L(θ). The following two lemmas state the essential properties of F that we need.

Lemma 1 For a fixed value of θ, there is a unique distribution, P̃_θ, that maximizes F(P̃, θ), given by P̃_θ(y) = P(y | z, θ). Furthermore, this P̃_θ varies continuously with θ.
A VIEW OF THE EM ALGORITHM AND ITS VARIANTS
359
,..., PROOF. In maximizing F with respect to P , we are constrained by the requirement that P (y) ~ 0 for all y . Solutions with P (y) == 0 for some yare not possible , however - one can easily show that the slope of the entropy is infinite at such points , so that moving slightly away from the boundary will increase F . Any maximum of F ,.,.-must therefore occur at a critical point subject to the constraint that Ly P (y) == 1, and can be found using a Lagrange multiplier . At -such a maximum , Po, the gradient of F with respect to the components of P will be normal to the constraint surface , i .e. for some A and for all y ,
8F A=8P ,.,(y)(p(}) =log P(y,zIfJ )- log Po (y)- 1
(4)
From this, it follows that Pe(y) must be proportional to P (y, z 18) . Normalizing so that Ey p()(y) == 1, we have p()(y) == P (y I z, 8) as the unique solution. That Po varies continuously with () follows immediately from our
assumption thatP(y, z I (}) does . .., .., Lemma 2 If P(y) ==P(yIz, (J) ==p(}(y) then F(P,0) ==logP(zI0) ==L((J). .., PROOF . If P(y) ==P(yIz, (J), then "" "" F(P,O) == Ep[logP(y, zI0)] + H(P) == Ep[logP(y, z10)] - Ep[logP(yIz, 0)] == Ep[logP(y, zI0) - logP(yIz, 0)] == Ep[logP(z10)] == logP(zIlJ) Aniteration ofthethestandard EMalgorithm cantherefore beexpressed interms ofthefunction F asfollows : E Step: Setp(t) totheP thatmaximizes F(P,f}(t-l)). ,., (5) M Step: Setf}(t) tothef} thatmaximizes F(P(t),8). Theorem 1 Theiterations given by(1) andby(5) areequivalent . PROOF . ThattheE steps oftheiterations areequivalent follows directly fromLemma 1. ThattheMsteps areequivalent follows fromthefactthat the entropy term in the definition of F in equation (2) does not depend on (}. Once the EM iterations have been expressed in the form (5) , it is clear that the algorithm converges to values P * and (}* that locally maximize
360
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
,.., F ( P , (J) lowing will
( ignoring
the
theorem
shows
yield
a local
also
algorithm
of
formed
that
, in
well
of
P
and
If
F ( P , fJ )
to
, finding
for
as
to
convergence
general
variants
,,.., as
respect
of
maximum
( 5 ) , but
partially
with
possibility
a saddle
a local
it
in
which in
(J simultaneously
not
the
E
which
a
2
local
and
maximum
0 * , then
PROOF
. By
To
show
L
any
that
near
to
then
we
is
In
of
typical
applications
Z , can Y , as
then
I
fJ )
An be
==
incremental
independent ,.., P
that
maximum
on , we
factor . We
can
then
to
be
that
( Zl
need of
to
maximize
0 ( 0 ) , and
that
that
,..,near
guess
L (8)
-
has ,..,
at
P *
P (z I 0) =
=
is no
(Jt
a (} t existed
,
Pot
. But
since
P * , contradicting (J* . The
to this
there
,..,if such pt
to
P * and
for
the
proof
nearby
the for
global
of
(J. The
values
latter
result
.
=
IIi
,
the
F
. The
in
Y
the
as
identically
this
that of
will ,..,
, ,
factored
also
exploits
Note
P
be
variable variable
.
a maximum ,..,
form
can are
here that
2 .
Z
items
algorithm
parameter
observed
unobserved
and
that
Theorem for
the
data
assume
Pi ( Yi ) , since
write
likelihood
items
for
EM
of
maximum
, . . . , Zn ) , and
the " " search
the
to
distributions
F
have
structure
since
this
Yi
form ,..,
at
are
the
F ( P , fJ) == L: : i Fi ( Pi , fJ ) , where ,..,
==
algorithm F , and
some
done
maximum
L ( O ) == log
,..,
incremental
8 * , then
show
, note
data
to
the
basis
Fi { Pi , (J)
An
per
is
F { P {}* , {} * ) == F { P * , {} * ) .
to
restriction
find
I 0 ) . Often
can "" restrict P (y )
and
a global
need
at
probability
not
the
as
must
independent
joint
variant
justified
see
see ,..,this
the
as
will
has
F ( P * , (J* ) , where
unnecessary
wish
of
P ( Yi , Zi we
are
hms
, we
IIi
L , we
maximum
decomposed
, but
P *
, L {8* ) =
of
fJ , pt
F
2 , we
, fJt ) >
are
( Y1 , . . . , Y n ) . The
P (y , z
steps
maximization
(J* .
particular
without
a number be
distributed
can
, but
algorit
given
at
L ( (J* ) . To ,..,
a local
continuity
Incremental
estimate
>
F ( pt
at
, if
1 and
, in
with haB
analogous
aBsumptions
. Similarly
maximum
have
F
maximum
maximum
that
continuously
maxima
well
L ( (Jt )
also
M
F (P , 8 ) standard
.
local
Lemmas
0 , and
that
a
a global
which
would
assumption
3.
as
{} * is a local
(J* for
varies
8*
has
the
,..,fol -
,.., has
combining
F ( P {} , O ) , for
Po
at
for
only
and
the
,.., Theorem
) . The
maximum
L ( 8 ) , justifying
algorithms
point
E ~ [ logP
using
the
hence
L , starting
at
distribution
the
( Yi ' ziI8
following from
)]
+
H
( Pi )
iteration some
, ~ ( O) , which
(6 )
can
guess might
at
the or
then
be
used
parameters might
not
, be
A VIEW OF THE EM ALGORITHMAND ITS VARIANTS
361
consistentwith (}(O): E Step: Choosesomedata item, i , to be updated. Set pit ) = pJt- l ) for j =I i . (This takesno time). Set Pi(t) to the Pi that maximizesPi (Pi, (j(t- l )), given by Pi(t) (Yi) = P(Yi I Zi, (}(t- l )).
(7)
M Step: Set (}(t) to the (} that maximizesF (P(t), (}), or, equivalently , that maximizesEp(t)[logP(y, Z I (})]. Data items might be selected for updating in the Estep cyclicly , or by some scheme that gives preference to data items for which Pi has not yet -
stabilized
.
Each E step of the above algorithm requires looking at only a single data item , but , ~ written , it appears that the M step requires looking at -
all components
-
of P . This can be avoided in the common
case where the
inferential import of the complete data can be summarized by a vector of sufficient statistics that can be incrementally updated , as is the case with models in the exponential family .
Letting this vector of sufficient statistics be s(y, z) = Ei Si(Yi, Zi) , the standard EM iteration of (1) can be implemented ~ follows:
E Step: Set .s'(t) = Ep[s(y, z)], whereP(y) = P(y I Z, (}(t- l ). (In detail set s(t) = ~ . ~t) with ~ t) = E- [s'(y' z.))
- ' L..,t 1. , t where Pi (Yi) = P (Yi I Zi, (}(t- l )).)
Pi t 1., 't ,
(8)
M Step : Set (}(t) to the {} with maximum likelihood given s( t) . Similarly , the iteration of (7) can be implemented using sufficient statistics
thataremaintained incrementally , starting withaninitialguess , ~O), which mayor follows
may not be consistent with (}(O). Subsequent iterations proceed ag :
E Step :
Choose some data item , i , to be updated .
Sets)t) = ~t-l) forj =I i. (Thistakes notime .) Set~t) = Epi[Si(Yi, Zi)], for Pi(Yi) = P(YiI Zi, (}(t- l)).
(9)
Sets(t) = s
In iteration (9), both the E and the M steps take constant time, independent of the number of data items . A cycle of n such iterations , visiting
362
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
each data item once, will sometimes take only slightly more time than one iteration of the standard algorithm , and should make more progress, since the distributions for each variable found in the partial E steps are utilized immediately , instead of being held until the distributions for all the unobserved variables have been found . Nearly as fast convergence may be obtained with an intermediate variant of the algorithm , in which each E
step recomputes the distributions for several data items (but many fewer than n) . Use of this intermediate variant rather than the pure incremental algorithm reduces the amount of time spent in performing the M steps.
Note that an algorithm basedon iteration (9) must save the last value computed for each Si, so that its contribution a new value for Si is computed . This
to S may be removed when
requirement
will
onerous . The incremental update of s could potentially with
cumulative
avoided
round - off
error . If
in several ways -
necessary , this
one could
generally
lead to problems
accumulation
use a fixed - point
not be can
representation
be
of
5, in which addition and subtraction is exact , for example , or recompute 5 non-incrementally at infrequent intervals . An incremental variant of the EM algorithm somewhat similar to that
of (9) was investigated by Nowlan (1991). His variant does not maintain strictly accurate sufficient statistics , however . Rather , it uses statistics computed as an exponentially decaying average of recently -visited data points , with iterations of the following form : E Step :
Select the next data item , i , for updating .
Set~t) == Ep[Si(Yi, Zi)], forPi(Yi) ==P(YiIZi, (J(t- l)). a
t
Sets(t) == , s(t- l) + ~ ).
(10)
M Step : Set (J(t) to the () with maximum likelihood given s(t). where 0 < , < 1 is a decay constant . The above algorithm will not converge to the exact answer, at least not if , is kept at some fixed value. It is found empirically , however , that it can converge to the vicinity of the correct answer more rapidly than the standard EM algorithm . When the data set is large and redundant , one might expect that , with an appropriate value for " this algorithm could be
faster than the incremental algorithm of (9) , since it can forget out-of-date statistics more rapidly .
4. Demonstration
for a mixture
model
In order to demonstrate that the incremental algorithm of (9) can speed convergence, we have applied it to a simple mixture of Gaussians problem .
The algorithm using iteration (10) was also tested.
A VIEW OF THE EM ALGORITHMAND ITS VARIANTS
363
Given s(y, z) == LiSi (Yi, Zi) == (no, mo, qo, nl , ml , ql ), the maximum likelihood parameter estimates are given by 0: = nl / (no + nl ) , /-Lo = moina ,
0"5 == qolno- (molno)2, /-Ll == ml / nl , and ai == ql/ nl - (ml / nl )2. We synthetically tribution
with
generated a sample of 1000 points , Zi, from this dis-
0: == 0 .3 , /-Lo = 0 , 0"0 == 1 , /-Ll == - 0 .2 , and al == 0 . 1 . We then
applied the standard algorithm of (8) and the incremental algorithm of (9)
to thisdata. Asinitialparameter values , weused0:(0) = 0.5, /-L~O) ==+1.0, O "~O) = 1, /-L~O) = - 1, andaiD) = 1. Fortheincremental algorithm , a single iteration of the standard algorithm was then performed to initialize the distributions for the unobserved variables . This is not necessarily the best procedure , but was done to avoid any arbitrary selection for the starting distributions , which would affect the comparison with the standard algorithm . The incremental algorithm visited data points cyclicly . Both algorithms converged to identical maxima of L , at which 0: * = 0.269, /-La = - 0.016, 0' 0 = 0.959, /-Li = - 0.193, and 0' ; = 0.095. Special measures to control round -off error in the incremental algorithm were found
to be unnecessaryin this case (using 54-bit floating-point numbers) . The rates of convergence of the two algorithms are shown in Figure 1, in which the log likelihood , L , is plotted as a function of the number of " passes" - a pass being one iteration for the standard algorithm , and n iterations
for the incremental algorithm . (In both case, a pass visits each data point once.) As can be seen, the incremental algorithm reached any given level of L in about half as many passes as the standard algorithm . Unfortunately , each pass of the incremental algorithm required about twice as much computation time as did a pass of the standard algorithm , due primarily to the computation required to perform an M step after visiting every data point . This cost can be greatly reduced by using an
364
RADFORDM. NEALAND GEOFFREYE. HINTON
- 1080 . . .
- 1100
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
'
.
- 1120 - 1140
. . . . . . .
- 1160
.'
- 1180 - 1200 - 1220
-
- 1240 0
10
20
30
40
Figure 1. Comparison of convergencerates for the standard EM algorithm (solid line) and the incremental algorithm (dotted line). The log likelihood is shown on the vertical axis, the number of passesof the algorithm on the horizontal axis.
- 1080 ,
- 1100
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
'
- 1120 - 1140 - 1160 - 1180 - 1200 - 1220 - 1240 0
5
10
15
20
25
Figure 2. Convergencerates of the algorithm using exponentially decayed statistics with 'Y= 0.99 (dashed line) and 'Y = 0.95 (dotted line) . For comparison, the performance of the incremental algorithm (solid line) is reproduced as well (as in Figure 1) .
A VIEW OF THE EM ALGORITHM AND ITS VARIANTS
365
intermediate algorithm in which each E step recomputes the distributions for the next ten data points . The rate of convergence with this algorithm is virtually indistinguishable from that of the pure incremental algorithm , while the time required for each pass is only about 10% greater than for the standard algorithm , producing a substantial net gain in speed. The algorithm of iteration (10) was also tested . The same initialization procedure was used, with the elaboration that the decayed statistics were computed , but not used, during the initial standard iteration , in order to initialize them for use in later iterations . Two runs of this algorithm are shown in Figure 2, done with 'Y == 0.99 and with 'Y == 0.95. Also shown is the run of the incremental algorithm (as in Figure 1) . The run with I == 0.99 converged to a good (but not optimal ) point more rapidly than the incremental algorithm , but the run with 'Y = 0.95 converged to a rather poor point . These results indicate that there may be scope for improved algorithms that combine such fast convergence with the guarantees of stability and convergence to a true maximum that the incremental algorithm provides .
5. A sparse algorit hm A "sparse" variant of the EM algorithm may be advantageous when the unobserved variable , Y , can take on many possible values, but only a small set of " plausible " values have non-negligible probability (given the observed data and the current parameter estimate ) . Substantial computation may sometimes be saved in this case by "freezing" the probabilities of the im plausible values for many iterations , re- computing only the relative proba bilities of the plausible values. At infrequent intervals , the probabilities for all values are recomputed , and a new set of plausi ble values selected (which may differ from the old set due to the intervening change in the parameter estimate ) . This procedure can be designed so that F is guaranteed to increase with every iteration , ensuring stability , even though some iterations may decrease L . In detail , the sparse algorithm represents p (t) as follows :
p(t)(y)
q ~t )
if y ~ S ( t )
Q ( t ) r ~t )
if y E S (t )
(13)
Here , S (t) is the set of plausible values for Y , the qtt) are the frozen probabil ities for implausible values, Q (t) is the frozen total probability for plausible values, and the r ~t) are the relative probabilities for the plausible values, which are updated
every iteration
.
366
RADFORD
Most iterations E Step :
M . NEAL AND GEOFFREY
of the sparse algorithm
E . HINTON
go as follows :
Set S (t) == S (t - l ) , Q (t ) = Q (t - l ) , and qtt ) = qtt - l ) for all y ~ S (t) . (This takes no time .)
(14)
Set rtt ) == P (y I z , (}(t- l )) / P (y E S (t ) I z , (}(t - l )) for all y E S (t) . M
Step : Set (}(t ) to the () that maximizes
F (P (t) , (}) .
It can easily be shown that the above E step selects those rtt ) that maximize F (P (t) , e(t - l )) . For suitable models , this restricted E step will take time proportional only to the size of S (t), ind .ependent of how many values are in the full range for Y . For the method to be useful , the model must also be such that the M step above can be done efficiently , as is discussed below . On occasion , the sparse algorithm E Step :
performs
a full iteration
, as follows :
Set s (t) to those y for which P (y I z , (}(t - l ) ) is non - negligible . For all y ~ s (t) , set qtt ) == P (y I z , 8(t - l )) .
(15)
Set Q (t) == P (y E s (t) I z , (}(t - l ) ) . For all y E s (t ) , set rtt ) == P (y I z , O(t - l )) / Q (t). M Step : Set (}(t) to the 0 that maximizes
F (P (t), 8) .
The decisions as to which values have " non - negligible " probability can be made using various heuristics . One could take the N most probable values , for some predetermined N , or one could take as many values as are needed to account for some predetermined fraction of the total probability . The choice made will affect only the speed of convergence , not the stability of the algorithm - even with a bad choice for S (t) , subsequent iterations cannot decrease F . For problems with independent each data item , i , can be treated " plausible " values , s1t ) , and with
observations , where Y == (Y1, . . . , Yn) , independently , with a separate set of distributions
Pi(t ) expressed in terms of
~,y ", ) . For efficient implementation quantities q~ t) , Q ~t) , and r ~ty of the M step in ( 14) , it is probably necessary for the model to have simple sufficient statistics . The contribution to these of values with frozen probabilities can then be computed once when the full iteration of ( 15) is performed , and saved for use in the M step of ( 14) , in combination with the statistics for the plausible
values in Si(t )
A VIEWOFTHEEMALGORITHM ANDITSVARIANTS The
Gaussian
mixture
usefulness
of
mixture
,
having
the
each
data
come
the
algorithm
that
applied
in
.
distant
a
non
-
potential
in
negligible
the
probability
means
are
.
the
effect
of
nearby
avoids
negligible
the
components
components
have
of
many
whose
that
the
sparse
on
Freezing
continual
the
re
course
of
-
the
and
can
be
,
find
the
employ
For
,
but
(
,
,
stages
true
maximum
an
the
and
,
be
EM
-
can
)
.
"
winner
easily
be
-
"
-
,
-
-
all
-
all
method
also
,
,
but
which
they
though
In
'
t
L
in
seen
in
this
light
applied
proportions
Hidden
is
L
guaranteed
regard
sensible
the
the
estimating
to
the
.
as
instance
one
There
finding
mixing
used
.
in
algorithm
and
find
of
converged
be
EM
more
don
variant
variant
of
can
lead
appear
they
a
this
has
neither
might
such
capable
often
.
to
the
maximum
using
the
be
by
maximizes
variances
is
may
represented
variant
of
with
be
that
variant
all
version
(
view
probability
unconstrained
fJ
algorithm
recognition
hoc
even
-
clustering
take
iteration
take
,
This
algorithm
,
the
a
.
zero
,
of
to
jointly
EM
to
switching
fJ
Obviously
to
value
EM
could
methods
assign
course
.
converge
winner
means
speech
ad
F
,
and
the
to
of
One
.
advantages
take
each
,
a
P
of
- -
of
.
optimization
procedures
variant
find
problem
-
completely
maximizing
"
like
variants
F
-
P
,
not
winner
for
with
all
-
one
general
the
K
standard
to
can
F
,
increase
-
computational
known
Models
as
in
mixture
The
Markov
EM
only
maximizing
respect
probability
when
Gaussian
with
the
of
of
distribution
maximizing
incremental
ods
of
not
terms
distribution
need
only
well
fixed
to
,
)
take
assigned
hence
of
The
to
,
fJ
other
a
is
however
early
as
)
,
-
Such
in
variety
the
that
fJ
-
P
winner
cannot
P
might
"
.
algorithm
wide
into
a
value
value
F
(
constraining
one
single
a
F
insight
by
it
of
of
example
are
viewing
any
provide
obtained
the
variants
algorithms
by
maximum
also
all
incremental
sparse
justified
example
can
and
.
incremental
for
of
have
components
the
example
are
variants
that
of
there
typically
few
quantities
combination
Other
The
to
an
If
.
Note
6
a
for
of
.
will
only
probabilities
computation
provides
algorithm
point
from
small
problem
sparse
367
these
when
meth
seen
unconstrained
in
-
terms
maximum
.
ACKNOWLEDGEMENTS
We
thank
cke
,
This
Wray
and
work
of
tu
.
te
for
Geoffrey
Advanced
,
Bill
Titterington
was
Council
Centre
Buntine
Mike
Byrne
for
supported
by
Canada
and
Hinton
is
Research
the
by
,
Mike
Jordan
comments
the
the
Natural
,
an
Jim
Kay
earlier
Sciences
Ontario
Nesbitt
.
on
and
Burns
of
the
Stol
this
-
paper
.
Research
Technology
fellow
Andreas
of
Engineering
Information
-
,
version
Research
Canadian
Insti
-
368
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
REFERENCES Csiszar I . and Tusnady, G. (1984) "Information geometry and alternating minimization procedures" , in E. J. Dudewicz, et al (editors) RecentResults in Estimation Theory and Related Topics (Statistics and Decisions, Supplement Issue No. 1, 1984) . Dempster, A . P., Laird , N. M ., and Rubin, D . B. (1977) "Maximum likelihood from incomplete data via the EM algorithm" (with discussion), Journal of the Royal Statistical Society B , vol . 39, pp . 1-38.
Hathaway, R. J. (1986) "Another interpretation of the EM algorithm for mixture distributions " , Statistics and Probability Letters , vol . 4, pp . 5356 .
McLachlan, G. J. and Krishnan,. T . (1997) The EM Algorithm and Extensions , New York : Wiley .
Meng, X . L. and Rubin, D. B. (1992) "Recent extensionsof the EM algorithm (with discussion)" , in J. M . Bernardo, J. O. Berger, A . P. Dawid, and A . F . M . Smith (editors ) , Bayesian Statistics 4, Oxford : Clarendon Press .
Meng , X . L . and van Dyk , D . (1997) "The EM algorithm -
an old folk -
song sung to a fast new tune" (with discussion), Journal of the Royal Statistical Society B , vol . 59, pp . 511-567.
Nowlan, S. J. (1991) Soft Competitive Adaptation.. Neural Network Learning Algorithms based on Fitting Statistical Mixtures , Ph . D . thesis , School of Computer Science, Carnegie Mellon University , Pittsburgh .
LATENT VARIABLE MODELS
CHRISTOPHER M. BISHOP Microsoft Research St. GeorgeHouse 1 Guildhall Street Cambridge CB2 3NH, U.K.
Abstract . A powerful approach to probabilistic modelling involves supplementing a set of observed variables with additional latent , or hidden , variables . By defining a joint distribution over visible and latent variables , the corresponding distribution of the observed variables is then obtained by marginalization . This allows relatively complex distributions to be expressed in terms of more tractable joint distributions over the expanded variable space. One well -known example of a hidden variable model is the mixture distribution
in which the hidden variable is the discrete component
the case of continuous
latent
variables
we obtain
models
label . In
such as factor
ana -
lysis . The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation , usually in terms of a directed acyclic graph , or Bayesian network . In this chapter we provide an overview
of latent variable
models for representing
continuous
variables .
We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well -known technique of princi -
pal components analysis (PCA ). By extending this technique to mixtures, and hierarchical mixtures , of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization . We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (G TM ) . Finally , we show how GTM can itself be extended to model temporal data . 371
372
CHRISTOPHER M. BISHOP
1. Density
Modelling
One of the central problems in pattern recognition and machine learning is that of density estimation , in other words the construction of a model of a probability distribution given a finite sample of data drawn from that distribution . Throughout this chapter we will consider the problem of modelling the distribution of a set of continuous variables tl , . . . , td which we will collectively denote by the vector t . A standard approach to the problem of density estimation involves parametric models in which a specific form for the density is proposed which contains a number of adaptive parameters . Values for these parameters are
then determined from an observeddata set D = {tl , . . . , tN } consisting of N data vectors . The most widely used parametric model is the normal , or Gaussian , distribution
given by
p(tlJ -L,~)=(27r )-d/21 ~1-1/2exp {-~(t - J-L)~-l(t - J-L)T} (1) where JL is the mean , ~ is the covariance matrix , and I~ I denotes the determinant of E . One technique for setting the values of these parame ters is that of maximum likelihood which involves consideration of the log probability of the observed data set given the parameters , i .e.
N (I.L, ~ ) = Inp(DII.L, ~ ) = LInp .l"E) n=l (tnIJ
(2)
in which it is assumed that the data vectors tn are drawn independently from the distribution . When viewed as a function of IL and E , the quantity p (DIIL , E ) is called the likelihood function . Maximization of the likelihood (or equivalently the log likelihood ) with respect to IL and E leads to the set of parameter values which are most likely to have given rise to the observed data set. For the normal distribution (1) the log likelihood (2) can be maximized analytically , leading"""- to the intuitive result [1] that the maximum likelihood solutions [L and E are given by
... ~
-
1
N
NLtn
(3)
n = l
... E
-
1
N
N L (tn - j:L)(tn - j:L)T
(4)
n = l
corresponding to the sample mean and sample covariance respectively . As an alternative to maximum likelihood , we can define priors over J..I, and E use Bayes' theorem , together with the observed data , to determine
LATENTVARIABLEMODELS
373
the posterior distribution . An introrlluction to Bayesian inference for the normal distribution is given in rS]. While the simple normal distribution (1) is widely used, it suffers from some significant limitations . In particular , it can often prove to be too flexible in that the number of independent parameters in the model can be excessive. This problem is addressed through the introduction of conti nuous latent variables . On the other hand , the normal distribution can also be insufficiently flexible since it can only represent uni -modal distributions . A more general family of distributions can be obtained by considering mix tures of Gaussians , corresponding to the introduction of a discrete latent variable . We consider each of these approaches in turn .
1.1. LATENTVARIABLES Consider
the
~
is
a
further
symmetric
.
large
data
E
d
free
assumption and
, such
a
different
can
now
be
to
that
data
d and
the
to
reduce
of to
.
3 ) /
2
t
are
There
are in
large
numbers
maximum
likelihood
the
number
of
matrix
corresponds
to
a
statistically
capture
Since
parameters
covariance
,
components
+
excessively
diagonal
however
( d
that way
unable
the
while
number still
' hidden
the
first
,
very
free
which strong
independent
the
p
latent
to
)
of
assume
,
so
( t
)
correlations
that
the
( x
of
the
the
of
( Xl
, between
joint
within
of
be
a
'
tl
. . . ' Xq
)
conditional
, . . . , td
, x
variables given
the
latent
by
)
intro
terms <
into
-
model
in q
and
distribution
distribution
,
model
variable
where
( t
latent
the
captured
latent
distributionp the
variables the
goal variables
=
joint )
data that
freedom to
The
x
the p
of
correlations
.
variables
decomposing
( tlx
degrees
variables
distribution p
of
allowing
' )
distribution of
variables
parameters
,
ensure One
( 1 ) .
.
,
convenient
the
distribution
making d2
a ,
normal
independent JL ,
consider This
therefore
how
( or
distribution
to
to
the
like
.
is
2
in
required
.
is
show
by
often
the
namely
marginal
1 ) /
determined
model
number
the
+
grows
parameters
model
express
achieved
nal
the
latent
smaller
of
be well
controlled
ducing
( d
number
components
We
is
this
is
in just
d
in
parameters
may
for
parameters
contains
d
points
parameters has
it
free
independent
For
solution
of
, d
total of
number
d .
the the variables factorizes
of
a
This
is
product conditio
.
It
is
over
becomes
d p (t , x ) = p (x )p (tlx ) = p (X ) n p (tiIX ) . i=l
(5)
This factorization property can be expressedgraphically in terms of a Bayesian network, as shown in Figure 1.
374
CHRISTOPHER M. BISHOP p(X)
p(tdlx) -
-
-
-
-
-
-
Figure 1. Bayesian network representation of the latent variable distribution given by (5) , in which the data variables tl , . . . , td are independent given the latent variables x .
11 y(x;w)
s
., ~~""-'- - - """"'------.... Xz
Xl
13
t2
Figure2. The non-linearfunctiony (x ; w) definesa manifoldS embeddedin data spacegivenby the imageof the latent spaceunderthe mappingx -t y . We next expressthe conditional distribution p(tlx ) in terms of a mapping from latent variables to data variables, so that t = y (x ; w ) + u
(6)
where y (x ; w ) is a function of the latent variable x with parameters w , and u is an x -independent noise process. If the components of u are uncorrela ted , the conditional distribution for t will factorize as in (5) . Geometrically the function y (x ; w ) defines a manifold in data space given by the image of the latent space, as shown in Figure 2. The definition of the latent variable model is completed by specifying the distribution p (u ) , the mapping y (x ; w ) , and the marginal distribution p (x ) . As we shall see later , it is often convenient to regard p (x ) as a prior distribution over the latent variables .
LATENT VARIABLE MODELS
375
The desired model for the distribution p(t ) of the data is obtained by marginalizing over the latent variables p(t ) = jP (tlx )p(x ) dx .
(7)
This integration will , in general, be analytically intractable except for specific forms of the distributions p(tlx ) and p(x ). One of the simplest latent variable models is called factor analysis [3, 4] and is based on a linear mapping y (x ; w ) so that t = Wx
+ Jl, + u ,
(8)
in which Wand I'" are adaptive parameters . The distribution p (x ) is chosen to be a zero- mean unit covariance Gaussian distribution N (O, I ), while the noise model for u is also a zero mean Gaussian with a covariance matrix 'li which is diagonal . Using (7) it is easily shown that the distribution p (t ) is also Gaussian , with mean J.L and a covariance matrix given by 'l' + WWT . The parameters of the model , comprising W , 'Ii and Jl" can again be determined by maximum likelihood . There is, however, no longer a closedform analytic solution , and so their values must be determined by iterative procedures . For q latent variables , there are q x d parameters in W together with d in 'Ii and d in J-L. There is some redundancy between these para meters , and a more careful analysis shows that the number of independent degrees of freedom in this model is given by
(d + 1)(q + 1) - q(q + 1)/ 2.
(9)
The number of independent parameters in this model therefore only grows linearly with d, and yet the model can still capture the dominant correlations between the data variables. We consider the nature of such models in more detail in Section 2.
1.2. MIXTUREDISTRIBUTIONS The density models we have considered so far are clearly very limited in terms of the variety of probability distributions which they can model since they can only represent distributions which are uni - modal . However , they can form the basis of a very general framework for density modelling , obtained by considering probabilistic mixtures of M simpler parametric distributions . This leads to density models of the form
p(t ) =
M L (tli) i==l7rip
(10)
376 in
CHRISTOPHER M. BISHOP
which
the
and
p
might
each
(
7ri
integrate
have
10
~
to
these
example
)
1
,
of
are
Bayesian
called
normal
7ri
=
can
network
so
,
)
will
individual
represent
as
( t
shown
in
form
.
the
non
-
1
)
The
requi
negative
-
and
densities
also
distribution
3
(
Ei
satisfy
be
mixture
Figure
n
the
component
the
mixture
matrix
and
p
the
of
covariance
that
the
We
and
coefficients
1
( assuming
.
of
distributions
lLi
mixing
Ei
unity
)
components
mean
and
properties
simple
individual
independent
in
~
the
for
own
7ri
0
represent
,
its
rements
will
)
consist
with
parameters
a
( tli
(
10
)
as
.
. I
p
Figure
3
The
.
Bayesian
mixing
values
of
to
can
label
i
.
evaluate
tli
For
a
the
be
of
value
takes
to
ofp
for
( iltn
)
can
' explaining
reverse
the
The
log
'
Maximization
of
of
due
powerful
to
the
the
for
based
Zni
point
of
on
the
specifying
tn
)
then
,
}
)
=
In
the
] ,
log
likelihood
)
if
and
we
i
~
was
( tli
would
' comp ({7ri, ILi, Ei } ) =
Bayes
the
)
}
(
.
the
introductory
a
single
An
[ 5
a
] .
set
for
The
of
EM
12
)
com
elegant
-
and
expectation
account
in
-
of
EM
in
algorithm
indicator
generating
NM L=liL=lZni In{7riP (tli)} n the
theorem
form
for
logarithm
responsible
take
i
'
.
called
given
component
using
than
the
were
)
.
7riP
given
by
11
is
complex
an
given
'
.
optimization
is
,
{
the
Bayes
( 1 J
takes
inside
this
use
which
3
more
sum
for
then
,
responsibility
Figure
E
can
)
tn
distribution
distributions
that
Ii (
P
this
in
[ 11
component
the
is
algorithm
( tn j
Effectively
arrow
Ei
of
we
probabilities
7r
.
probabilities
tn
~ L. . . . . , j
mixture
performing
observation
the
.
likelihood
mixture
which
,
JLi
=
as
the
presence
( EM
context
,
)
distribution
prior
point
7rip
( 1, ltn
tn
the
log
technique
maximization
7ri
this
p
regarded
for
mixture
posterior
point
direction
likelihood
( {
ponent
be
data
simple
as
data
. = =
a
interpreted
given
corresponding
Rni
The
)
representation
coefficients
the
theorem
network
(
is
variables
each
data
form
(13)
LATENTVARIABLEMODELS
377
and its optimization would be straightforward , with the result that each component is fitted independently to the correspondinggroup of data points, and the mixing coefficientsare given by the fractions of points in eachgroup. The { Zni} are regarded as 'missing data', and the data set { tn } is said to be 'incomplete'. Combining { tn } and { Zni} we obtain the corresponding 'complete' data set, with a log likelihood given by (13). Of course, the values of { Zni} are unknown, but their posterior distribution can be computed using Bayes' theorem, and the expectation of Zni under this distribution is just the set of responsibilities Rni given by (11). The EM algorithm is based on the maximization of the expected complete-data log likelihood given from (13) by
N M ( comp ({1Ti , .ui, ~i})) == L L RniIn{7riP (tli )} . n==li ==l
(14)
It alternates between the E-step, in which the Rni are evaluated using (11) , and the M -step in which (14) is maximized with respect to the model parameters to give a revised set of parameter values. At each cycle of the EM algorithm the true log likelihood is guaranteed to increase unless it is already at a local maximum [11]. The EM algorithm can also be applied to the problem of maximizing the likelihood for a single latent variable model of the kind discussed in Section 1.1. We note that the log likelihood for such a model takes the form N
N
(W,J.L,"lI) ==L Inp (tn)==L In{! p(tnlxn )p(xn)dxn }. n == l
(15)
n == l
Again , this is difficult to treat because of the integral inside the logarithm . In this case the values of xn are regarded as the missing data . Given the prior distribution p (x ) we can consider the corresponding posterior distri bution obtained through Bayes' theorem
p(xnltn) =
p(tnIxn)p(Xn) p(tn)
(16)
and the sufficient statistics for this distribution are evaluated in the E-step . The M -step involves maximization of the expected complete -data log like lihood and is generally much simpler than the direct maximization of the true log likelihood . For simple models such as the factor analysis model discussed in Section 1.1 this maximization can be performed analytically . The EM (expectation -maximization ) algorithm for maximizing the likelihood function for standard factor analysis was derived by Rubin and Thayer
[23].
378
CHRISTOPHER M. BISHOP
We can combine the technique of mixture modelling with that of latent variables , and consider a mixture of latent -variable models . The correspon ding Bayesian network is shown in Figure 4. Again , the EM algorithm
-
-
-
-
-
-
Figure .4. Bayesian network representation of a mixture of latent variable models. Given the values of i and x , the variables tl , . . . , td are conditionally independent.
provides
a
and
riable
of
to
the
be
the
values
treated
to
and
obtain
a
clude
data
data
,
on
linear
and
a
,
most
set
and
the
largest
axes
d
v
j
,
j
variance
vectors
v
associated
j
are
the
eigenvalues
.
be
subject
maximizes
the
- dimensional
I
,
.
.
the
a
-
concepts
fruitful
part
density
-
modelling
for
be
its
data
PCA
found
,
dimensio
in
many
-
practically
applications
visualization
is
,
q
}
,
the
,
are
terms
in
those
of
the
vectors
is
q
in
variance
projection
by
may
of
data
.
va
in
-
exploratory
.
of
{
,
latent
how
in
technique
,
derivation
E
see
used
for
Examples
processing
under
given
shall
can
- established
recognition
which
observed
retained
well
on
pattern
the
Analysis
a
image
parameters
of
.
analysis
,
model
and
.
algorithms
visualization
chapter
common
of
principal
the
a
compression
projection
For
powerful
is
the
i
we
Component
multivariate
analysis
The
of
data
analysis
reduction
text
data
chapter
distributions
range
component
of
label
missing
this
Principal
Principal
every
component
of
and
Probabilistic
nality
determination
mixture
classification
.
for
the
as
sections
variables
pattern
of
together
subsequent
latent
nership
q
framework
both
x
In
2
natural
allows
{
}
dominant
)
of
It
eigenvectors
the
,
n
sample
space
E
{
axes
.
can
( i . e
N S=~n L= Il(tn-j:L)(tn-j'L)T Aj
standardized
projected
tn
orthonormal
maximal
a
covariance
.
.
.
[ 14
N
onto
be
.
I
those
}
,
] .
the
which
shown
that
with
the
matrix
(17)
such that SVj = AjVj. Here ~ is the samplemean, given by (3). The q principal componentsof the observedvector tn are given by the vector
LATENTVARIABLE MODELS
379
Un = yT (tn - Ii ) , where yT = (VI , . . . , Vq) T , in which the variables Uj are decorellated such that the covariance matrix for U is diagonal with elements { Aj } . A complementary property of FaA , and that most closely related to the original discussions of Pearson [20] , is that , of all orthogonal linear projections Xn = yT (tn - fL) , the principal component projection minimizes the squared reconstruction error "- En IItn - in 112 , where the optimal linear reconstruction of tn is given by tn = V Xn + it . One serious disadvantage
of both these definitions
of PCA is the absence
of a probability density model and associated likelihood measure . Deriving PCA from the perspective of density estimation would offer a number of important
advantages , including
the following :
. The corresponding likelihood measure would permit comparison with other density - estimation techniques and would facilitate statistical tes ting . . Bayesian inference methods could be applied (e.g . for model comparison ) by combining the likelihood with a prior . . If PCA were used to model the class - conditional densities in a classifica tion problem , the posterior computed .
probabilities
of class membership
could be
. The value of the probability density function would give a measure of the novelty of a new data point . . The single PCA model could be extended to a mixture of such models . In this section we review the key result of Tipping and Bishop [25] , which shows that principal component analysis may indeed be obtained from a probability model . In particular we show that the maximum - likelihood estimator of W in ( 8) for a specific form of latent variable models is given by the matrix of (scaled and rotated ) principal axes of the data .
2.1. RELATIONSHIP TO LATENTVARIABLES Links between principal component analysis and latent variable models have already been noted by a number of authors . For instance Anderson [2] observed that principal components emerge when the data is assumed to comprise a systematic component , plus an independent error term for each variable having common variance 0-2. Empirically , the similarity between the columns of Wand the principal axes has often been observed in situa tions in which the elements of 'l1 are approximately equal [22]. Basilevsky [4] further notes that when the model WwT + 0-21 is exact , and therefore equal to S, the matrix W is identifiable and can be determined analytically through eigen-decomposition of S, without resort to iteration .
380
CHRISTOPHER M. BISHOP
As
well
consider of
the
that
' I1
= be
likelihood
analysis 0- 21 , we
terms
now
principal
THE
the
a probability
show
WML
that
is
of
this
density
component
analysis
PROBABILITY
the
the
that
noise
when
the
is
are
may refer
to
so matrix
maximum
the
scaled
matrix
PCA shall
isotropic
0- 21 , the
columns
that we
is
not case
covariance
+
covariance
, which
do
a particular
data
wwT
sample
derivation
observations
covariance
whose
the
( PPCA
, such considering
form
matrix
model
exact
. By
even
of
a probability
is
which
using
eigenvectors
isotropic
model context
in
exactly
consequence of
the
model
expressed
principal
important
that - likelihood
estimator
rotated
For
assuming
maximum
factor
cannot
2 .2 .
as
the
S
and
[ 25 ] . An
be
expressed
as
probabilistic
in
) .
MODEL
noise
model
distribution
u
over
f "V N
( O , a2I
t - space
for
) , equations a given
p(tlx ) = (27ra2 )- d/2exp{ - ~ llt In the case of an isotropic
x
Wx
( 6 ) and given
(8 )
by
- JL112 }.
Gaussian prior over the latent
imply
variables
(18) defined
by
p(x)=(27r )-q/2exp {_~xTx } we then obtain the marginal distribution
(19 )
of t in the form
p(t) = jP(tlx)p(X)dX
(20 )
= (27r )-d/2ICI -1/2exp {-~(t - J.I,)TC -1(t - J.I,)} (21 ) w here the model
covariance
is
C = a2I + WWT . U sing Bayes ' theorem , the posterior
distribution
(22) of the latent
variables
x given the observed t is given by p (xlt )
= (27r )- q/2\a2M\- 1/2 x
exp{ - ~(x - (x))T(a2M)- I(X- (x))}
(23)
where the posterior covariancematrix is given by
a2M = a2(u2I + WTW)- l
(24)
LATENTVARIABLE MODELS and the mean of the distribution
381
is given by
(x) = M- 1WT(t - JL).
(25)
Note that M has dimension . q x q while C has dimension d x d. The
log - likelihood
for the observed
data
under
this
model
is given
by
N .c
where
=
L In { p ( tn ) } n= l
=
- Nd 2
the sample
In principle mizing
the
covariance
, we could
of the form
for the model
parameters
Our key result span
derivative
show
S of the observed the EM
that , for
model
of Rubin
is an exact
by ( 17 ) . by maxi -
and
case of an isotropic
, there
MAXIMUM
- LIKELIHOOD
the log - likelihood
the principal
subspace
of ( 26 ) with
{ tn } is given for this
algorithm
the
(26)
-N Tr { C - I S } 2
Thayer
noise
analytical
cova -
solution
.
OF THE
is that
-
the parameters
we are considering
2 .3 . PROPERTIES
of W
matrix , using
, we now
-N lnICI 2
determine
log - likelihood
[23 ] . However riance
In ( 27r ) -
respect
SOLUTION
( 26 ) is maximized
when
of the data . To show this
the columns
we consider
the
to W :
a .c == N ( C - 1SC - 1W
aw which
may
be obtained
from
standard
[ 19 ] , pp 133 ) . In [25 ] it is shown zero stationary
points
W where
the q column
eigenvalues rotation
matrix
point
corresponding
U q comprises to the
eigenvectors
q largest
represent
( 28 ) , the
columns
principal
eigenvectors
eigenvalues
together
matrix
to the
(.28)
Aq , and
global
, it
is also
maximum
eigenvectors
of S , with shown
of the
) and
that
likelihood
maximum
- likelihood scalings
the parameter
that
q x q ortho the
likelihood
stationary
occurs
of S ( i .e . the eigenvectors
of the
of S , with
corresponding
R is an arbitrary
eigenvalues
with
( see non -
for :
saddle - points
of the
results
by ( 22 ) , the only
= U q (Aq - a2J ) 1/ 2R
. Furthermore
the principal
( 27 )
differentiation
C given
in U q are eigenvectors
in the diagonal
gonal
ponding
vectors
C - 1W )
matrix
that , with
of ( 27 ) occur
-
all
other
estimator
determined a2 , and with
when corres -
combinations
of
surface . Thus , from WML
contain
the
by the corresponding arbitrary
rotation
.
382
CHRISTOPHER M. BISHOP
I t may also be shown that for W estimator for a2 is given by
W ML, the maximum-likelihood
d 2-- ~ Lq+lAj aML d-qj=
(29)
which has a clear interpretation as the variance'lost' in the projection, averagedover the lost dimensions . Note that the columnsof W ML are not orthogonalsince (30) WT MLW ML - RT(A q - a2I)R , which in generalis not diagonal. However , the columnsof W will be orthogonalfor the particular choiceR f I . In summary, we can obtain a probabilistic principal componentsmodel by finding the q principal eigenvectorsand eigenvalues of the sample covariancematrix. The density model is then given by a Gaussiandistribution with meanJ.t given by the samplemean, and a covariancematrix WwT + a2I in which W is givenby (28) and a2 is givenby (29). 3. Mixtures of Probabilistic PCA We now extend the latent variable model of Section 2 by considering a mix ture of probabilistic principal component analysers [24], in which the model distribution is given by (10) with component densities given by (22) . It is straightforward to obtain an EM algorithm to determine the parameters 7ri , JLi, W i and at . The E-step of the EM algorithm involves the use of the current parameter estimates to evaluate the responsibilities of the mixture components i for the data points tn , given from Bayes' theorem by
Rni =p(ptnli ))1Ti (tn .
(31)
In .the M-step, the mixing coefficientsand componentmeansare re-estimated usIng -
-
7ri
-
ILi ==
1N Nn L=lRni E ~=lRnitn - --NLn =l Rni
(32) (33)
while the parametersWi and at areobtainedby first evaluatingthe weighted covariance matrices given by
Si = }:::;:=1Rni (tN- [L)(tn- [L)T Ln =l Rni
(34 )
LATENTVARIABLEMODELS
383
and then applying (28) and (29).
3.1. EXAMPLE APPLICATION : HAND -WRITTENDIGIT CLASSIFICATION One
ten
will
potential
application
digit
recognition
generally
lie
geometry
of
scaling
' similar
Hinton
et
applied
set
digit
.
=
the
best
for
used
the
the
split
the
is
to
of
the
a
digit
,
as
,
of
model
of
model
such
each
to
the
rotation
classification
the
-
given
such
a
to
handwrit
manifold
digit
build
according
problem
CEDAR
digit
which
into
11
10
training
to
they
the
Hinton
et
to
,
but
also
,
set
rather
ale
be
the
than
[ 12
a
]
same
reported
a
single
of
individually
,
of
)
. 64
error
the
,
gray
] .
of
set
the
the
'
'
-
values
the
. 91
%
localized
- chosen
bs
probabi
used
of
4
The
data
parameter
%
-
each
,
and
digits
.
We
in
would
clustering
- estimated
arbitrarily
'
and
was
4
partly
of
' br
classification
an
8
[ 15
the
using
choice
of
-
reconstructed
data
misclassified
result
use
best
,
-
- by
database
of
sets
same
method
validation
service
subset
the
problem
l ' econstruction
8
validation
the
same
soft
smoothed
model
with
The
postal
- digit
which
utilizing
.
.
using
and
and
experiment
)
. S
, OOO
digit
,
scaled
U
an
handwritten
models
of
approach
=
of
PCA
according
while
model
]
images
to
)
is
continuous
of
best
conventional
improvement
[ 12
models
pixel
approach
digits
using
on
component
One
classification
from
q
model
the
in
to
mixture
PPCA
scale
properties
the
of
the
and
,
.
discussed
classified
set
each
]
'
repeated
test
-
smooth
the
constructed
was
10
expect
the
[ 12
,
PCA
( M
stroke
unseen
further
We
listic
the
ale
taken
was
gray
by
necessarily
' mixture
images
( which
of
density
' .
clustering
were
- dimensional
determined
classify
a
models
high
- dimensional
of
and
most
test
lower
not
,
scale
Examples
is
( although
based
.
a
thickness
separately
and
on
which
and
digits
are
for
,
of
values
of
global
value
a
[
.
One of the advantages of the PPCA methodology is that the definition of the density model permits the posterior probabilities of class membership to be computed for each digit and utilized for subsequent classification . After optimizing the parameters M and q for each model to obtain the best performance on the validation set, the model misclassified 4.61% of the test set. An advantage of the use of posterior probabilities is that it is possible to reject (using an optimal criterion ) a proportion of the test samples about which the classifier is most 'unsure ' , and thus improve the classification performance on the remaining data . Using this approach to reject 5% of the test examples resulted in a misclassification rate of 2.50%.
384
CHRISTOPHER
4
.
Hierarchical
An
Mixtures
interesting
,
is
to
extension
problem
a
spaces
.
of
1
.
standard
principal
orthogonal
leading
1
-
,
and
tion
of
*
(
is
,
)
is
is
an
not
into
because
of
.
W
=
xn
is
derived
mation
of
in
to
a2
>
The
onto
set
as
a
with
developed
-
in
(
,
-
of
the
of
tn
When
then
onto
prior
over
We
x
note
since
,
.
0
,
-
may
the
is
,
infor
point
taking
this
that
still
shrinkage
given
by
1
}
M
variables
(
xn
)
,
(
convey
data
*
-
of
,
reconstruction
WML
-
manifold
Because
data
by
2
becomes
the
however
each
-
0 -
projec
model
projection
,
.
posterior
orthogonal
density
.
)
the
projection
the
It
an
the
by
that
tn
variable
latent
vector
the
necessary
optimally
Figure
.
that
.
and
,
35
)
infor
even
two
.
the
.
The
benefits
3
.
A
in
-
the
case
in
to
two
other
the
this
model
while
toy
(
has
hierarchical
-
small
spaced
third
has
data
Gaus
closely
the
set
.
three
flat
data
latent
a
of
are
close
using
are
interactive
dimensional
also
relatively
,
which
are
clusters
of
plot
space
mixture
is
these
each
-
data
a
point
latent
visualization
which
this
data
dimensional
of
space
Gaussian
structure
-
type
of
of
each
two
from
two
of
single
mapping
the
points
latent
Each
parallel
first
in
this
properties
,
4
)
generated
planes
Section
xn
that
in
points
space
demonstrate
Note
points
dimension
the
by
(
in
to
data
principal
in
5
dimensional
from
visualized
mean
visualization
450
one
their
to
0
required
be
map
the
three
)
>
WML
property
will
of
in
1WT
by
spanned
this
seen
In
visualized
(
model
be
-
latent
original
therefore
topographic
illustrate
separated
order
the
in
consisting
variance
{
posterior
close
We
M
shrinkage
the
WML
in
.
are
becomes
result
this
Thus
can
illustrated
sufficiently
sians
.
corresponding
,
2
the
the
data
satisfies
set
]
=
data
.
the
space
[ 25
=
reconstruct
O
=
T
tn
-
visualization
points
may
the
"
and
)
0 -
WML
of
plane
it
although
a
inter
framework
data
data
then
For
from
With
for
the
)
projection
reconstructed
account
25
1WT
as
orthogonal
lost
optimally
origin
.
further
powerful
structure
PCA
(
(
)
a
probabilistic
components
(
-
recovered
the
not
,
by
WM
undefined
towards
xn
mation
be
is
model
and
given
and
to
the
probabilistic
)
PPCA
a
PCA
our
23
of
.
principal
For
(
is
l
thus
.
From
tn
-
PCA
and
shrunk
(
)
so
singular
W
WTW
a
into
analysis
)
.
mixtures
considering
led
retains
PPCA
the
eigenvectors
slightly
]
single
onto
projection
-
a
component
modified
mean
M
10
and
By
are
PROBABILISTIC
of
projection
two
is
use
we
insight
[
the
,
which
USING
first
,
.
model
dimensionality
Consider
model
considerable
VISUALIZATION
BISHOP
visualization
visualization
provide
high
PPCA
data
mixture
for
can
.
Visualization
the
of
hierarchical
algorithm
which
Data
for
the
to
active
4
for
application
models
and
M
is
been
well
chosen
approach
variable
model
is
385
Figure 5. Illustration of the projection of a data vector tn onto the point on the principal subspace corresponding to the posterior mean.
trained on this data set, and the result of plotting the posterior the data points is shown in Figure 6.
means of
Figure 6. Plot of the posterior means of the data points from the toy data set, obtained from the probabilistic PCA model, indicating the presence of (at least) two distinct clusters.
4.2. MIXTURE MODELS FOR DATA VISUALIZATION Next we consider the application of a simple mixture of PPCA models to data visualization . Once a mixture of probabilistic PCA models has been fitted to the data set, the procedure for visualizing the data points involves plotting each data point tn on each of the two-dimensional latent spaces at the corresponding posterior mean position (Xni ) given by
(Xni) = (W; Wi + a[I)- lW; (tn - J.Li)
(36)
as illustrated in Figure 7. As a further refinement, the density of 'ink ' for each data point tn is weighted by the corresponding responsibility Rni of model i for that data point , so that the total density of 'ink ' is distributed by a partition of
386
CHRISTOPHER M. BISHOP tn
~ ~ ~
.
, ~
~
" " ~
~ ~ ~ ~
" " ~ ~
" " ~
~
" " ~ ~
Figure 7. Illustration of the projection probabilistic PCA mixture model .
of a data vector onto two principal
surfaces in a
unity across the plots . Thus , each data point is plotted on every compone ~t model projection , while if a particular model takes nearly all of the posterior probability for a particular data point , then that data point will effectively be visible only on the corresponding latent space plot . We shall regard the single PPCA plot introduced in Section 4.1 as the top
level
in a hierarchical
visualization
model , in which
the mixture
model
forms the second level . Extensions to further levels of the hierarchy will be developed
in Section 4 .3.
The model can be extended
to provide
an interactive
data exploration
tool as follows . On the basis of the single top -level plot the user decides on an appropriate number of models to fit at the second level , and selects points x (i) on the plot , corresponding , for example , to the centres of apparent clusters . The resulting points y (i) in data space, obtained from y (i ) = W x (i ) + Jl" are then used to initialize
the means .t i of the respective
sub-models . To initialize the matrices Wi we first assign the data points to their nearest mean vector JLi and then compute the corresponding sample covariance matrices . This is a hard clustering analogous to K -means and represents
an approximation
to the posterior
probabilities
Rni in which the
largest posterior probability is replaced by 1 and the remainder by O. For each of these clusters we then find the eigenvalues and eigenvectors of the sample covariance matrix and hence determine the probabilistic PCA density model . This initialization is then used as the starting point for the EM algorithm . Consider the application of this procedure to the toy data set intro duced in Section 4 .1. At the top level we observed two apparent clusters ,
and so we might select a mixture of two models for the second level , with centres
initialized
somewhere
near
the
centres
of the
two
clusters
seen at
the top level . The result of fitting this mixture by EM leads to the two- level visualization plot shown in Figure 8. The visualization process can be enhanced further by providing infor -
LATENTVARIABLEMODELS and
the
were
model
given
second become
would
collapse
to a simple
a set of indicator
level
generated
variables
each N
data
we only
posterior points ding
have
tn , obtained
to the posterior
Rni
distribution
reduces
to the
Maximization as shown mixture
that
likelihood
would
.
( 39 )
in the
i having
form
generated
of the hierarchy
the expectation
of the
the
. The
data
correspon
of ( 39 ) with
-
respect
7ri L 7rjliP ( tli , j ) jEQi
as constants
( 40 )
. In the particular
to complete
certainty
for each data
point
case in which about
which
the
model
, the log likelihood
( 40 )
( 39 ) .
of ( 40 ) can again
, discussed
, we
i at the
Mo
is responsible
in [10 ] . This
probability
level
= L L Rni In n= l i= l
form
log
model
of the Zni to give
are all 0 or 1 , corresponding level
the
, information model
by taking
the Rni are treated
in the second
each
the second
N
in which
for
is obtained
.
then
7ri L 7rjliP (tli , j ) jEQi
, probabilistic
Rni
from
log likelihood
tn
which
Mo
partial
responsibilities
model . If , however
zni specifying
point
.c = L L Zni In n= l i = l In fact
mixture
389
in Section model
be performed
has the same form 1 .2 , except
(i , j ) generated
using
as the EM that
data
in the
point
the EM
algorithm
algorithm
E - step , the
tn is given
,
for a simple posterior
by
Rni ,j = RniRnjli
( 41 )
in which R
This
result
automatically
. .= nJ 17 , satisfies
7rjliP (tnli , j ) ~ L. "j ' 7rj ' liP (t n 10 1" J0' ) .
( 42 )
the relation
L Rni ,j = Rni jEQi so that point
the responsibility n is shared
of offspring
models
hierarchical
approach
The Figure
result 10 .
of each model
by a partition at the
third this
at the second
level
of unity
between
level . It
is straightforward
to any desired
of applying
( 43 )
number
approach
for a given
the corresponding
data group
to extend
this
of levels .
to the
toy
data
set is shown
in
LATENTVARIABLE MODELS
391
consisting of 1000 points is obtained synthetically by simulating the phy sical processes in the pipe , including the presence of noise determined by photon statistics . Locally , the data is expected to have an intrinsic dimen sionality of 2 corresponding to the 2 degrees of freedom given by the fraction
of oil and the fraction of water (the fraction of gas being redundant). However , the presence of different configurations , as well as the geometrical interaction between phase boundaries and the beam paths , leads to numerous distinct clusters . It would appear that a hierarchical approach of the kind discussed here should be capable of discovering this structure . Results from fitting the oil flow data using a 3-level hierarchical model are shown in Figure 11.
Figtl.re 11. Results of fitting the oil data. The symbols denote different multi -phase flow configurations corresponding to homogeneous(e), annular (0) and laminar (+ ). Note, for example, how the apparently single cluster, number 2, in the top level plot is revealed to be two quite distinct clusters at the second level.
In the case of the toy data , the optimal choice of clusters and subclusters is relatively unambiguous and a single application of the algorithm is sufficient to reveal all of the interesting structure within the data . For more complex data sets, it is appropriate to adopt an exploratory perspective and investigate alternative hierarchies through the selection of differing numbers of clusters and their respective locations . The example shown in Figure 11 has clearly been highly successful. Note how the apparently single
392
CHRISTOPHER M. BISHOP
cluster
,
number
clusters
at
guration
a2
.
- linear
The
latent from
,
third
that
this
cluster
the
physics
is
data
and
the
revealed
points
can .
is
to
from
be
level
of
y
( x 2
seen
be
the to
on
of to
a
diagnostic
two
quite
distinct
' homogeneous
lie
Inspection
confined
the
then
models
An
alternative
; w
a
the
'
two
confi
-
- dimensional
corresponding
nearly
va
planar
data
is
for
sub
the
-
- space
,
homogeneous
non
.
derived
Thus
the
a
far
based
form S
in
data
is
a
the
,
shown hyper
Section
3 . 1 )
variable
consider
map
not
in
latent
a
which
space
considered
to
in
which
linear
be
on
( 6 )
manifold
of
would
are
the
data
mixture ,
Mapping
manifold on
digits a
models
latent
.
variable
.
a
non
- linear
integration ,
called
so of
living
- written
using
However ,
.
however
- linear
the
considered
Data
hand
,
with
model
x .
Topographic
variables
using
general
intractable
have data
in
approach
in
linear
linear
the
difficulty
that
we
hyperplane
is
Generative
to
approximated
which
The
)
a
example be
The
variables
is
( for
can
model
:
variable
Figure
planar
Models
latent
function
by
making
the
mapping
over
x
in
careful
Generative
function ( 7 )
will
model
y
choices
Topographic
( x
become a
Mapping
; w
)
in
( 6 )
analytically tractable or
,
GTM
,
non
can
be
[8 ] .
The a
plot
Also
.
ping
is
in
from
Non
in
- level
.
isolated
confirms
expected
top
level
been
configurations
5
the
structure of
as
in
second
have
triangular lue
2 ,
the
central
sum
of
concept
delta
is
to
functions
introduce
centred
a on
the
prior
distribution
nodes
of
a
p regular
( x
)
grid
given in
by latent
space
1
p
in
which
case
linear
the
an
isotropic
to
deal
y
data
in
; w
) .
point ,
Xi
which
Figure space
. then
=
is
can
From takes
the
)
( 44
analytically
.
( Note
and a
that
a
Gaussian
( 44
)
we
see
this
is
data
is
for
chosen
)
easily
the
function distribution
be
generalized
considering
y
non to
the
distributions point
density that
)
by
multinomial
corresponding
of
even ( tlx
categorical
Gaussian
and
Xi
performed
a2
to
-
distributionp
and
centre ( 7 )
l5 ( x l
be
variance
mapped
the
L i =
conditional
of then
K
K
continuous
forms 12
)
( 7 )
The
with
mixed product
latent space
( x
Gaussian with
corresponding
in
integral
functions
( x
( Xi ,
; w as
)
. )
Each
in
data
illustrated function
in
form
K 2 1~ p(tIW ,a)=KL ..,p ,W,a2) i= l (tIXi
(45)
LATENTVARIABLEMODELS
t1
y(x;w) .
.
393
.
' -_'~ - _..._..'_ ...'~ _.'----.-..... .
X2
.
.
.
.
.
.
t2
t3
Xl
Figure 12. In order to formulate a tractable non-linear latent variable model, we consider a prior distribution p(x ) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node Xi is mapped to a corresponding point Y(Xi ; w ) in data space, and forms the centre of a corresponding Gaussian distribution .
which corresponds to a constrained Gaussian mixture model [13] since the centresof the Gaussians, given by Y(Xi; w ), cannot move independently but are related through the function y (x ; w ). Note that , provided the mapping function Y(X; w ) is smooth and continuous, the projected points Y(Xi; w ) will necessarilyhave a topographicordering in the sensethat any two points xA and xB which are close in latent spacewill map to points Y(XA; w ) and Y(XB; w ) which are close in data space.
5.1. ANEMALGORITHM FORGTM Since
G
for
TM
is
a
form
maximizing
form
for
M
- step
by
a
of
the the
has
mixture
y
simple
form
generalized
linear
log
( x
; w .
)
In
we
is
the a
d
x
elements M
universal
cJ > ( x
the
tion
of
basis such
typically the
models
present
consist
)
of
however
are
,
exponentially context
) is
that
with this
is
choosing
significant
particular
( x
; w
)
which to
the
be
given
( 46
functions
>j
models - layer
number
basis q
problem
) ,
The
)
W same ,
pro
usuallimita functions
of
and
the
networks .
of
( X
possess
adaptive
dimensionality a
a in
y
appropriately the
the not
basis
multi
algorithm
)
fixed
chosen
EM
form
regression as
( X
the
an
algorithm
choose
q, ( x
M
linear
>j ,
w
seek
By
EM
shall of
=
capabilities functions
grow In
)
Generalized
approximation
vided
[5 ] .
.
an
we
; w
to .
obtain
model
( x
natural
likelihood
can
regression
of
matrix
is
particular
y
where
it
corresponding
mapping a
model
-
must
the
input
since
the
space dimen
-
394
CHRISTOPHER M. BISHOP
sionality is governed by the number of latent variables which will typically be small . In fact for data visualization applications we generally use q = 2. In the E-step of the EM algorithm we evaluate the posterior probabilities for each of the latent points i for every data point tn using
~n
-
p(xiltn, W, 0'2) P(tnIXi,W,0'2)
Then in the M-step we obtain a revisedvalue coupledlinear equationsof the form
for W
(47) (48) by solving
~TG~WT = ~TRT
a set of
(49)
where tl is a K x M matrix with elements ij = cPj(Xi ), T is a N x d matrix with elements tnk , R is a K x N matrix with elements ~ n, and G is a K x K diagonal matrix with elements Gii =
N L=l~n(W ,0-2). n
(50)
We can now solve (49) for W using singular value decomposition to allow for possible ill -conditioning . Also in the M -step we update 0-2 using the following fe-estimation formula
NK 2 1 """"" 2 2 a = NdnL,., LII Rin ( W , a ) IIW l/J (Xi ) tnll =li=l
(51)
Note that the matrix ~ is constant throughout the algorithm , and so need only be evaluated once at the start . When using G TM for data visualization we can again plot each data point at the point on latent space corresponding to the mean of the posterior distribution , given by
(xltn,W,0-2)
-
f xp(xltn, W , 0-2) dx
K = L RinXi . i=l It should
(52) (53)
be borne in mind ,. however ,. that as a consequence of the non linear mapping from latent space to data space the posterior distribution can be multi -modal in which case the posterior mean can potentially give a
LATENTVARIABLE MODELS
395
very misleading summary of the true distribution . An alternative approach is therefore to evaluate the mode of the distribution , given by
=arg max . '/.,max {i} Rin
(54)
In practice it is often convenient to plot both the mean and the mode for each data point , as significant differences between them can be indicative of a multi -modal distribution . One of the motivations for the development of the GTM algorithm was to provide a principled alternative to the widely used 'self-organizing map ' (SaM ) algorithm [17] in which a set of unlabelled data vectors tn (n = 1, . . . , N ) in a d-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally ) two -dimensional sheet. These reference vectors are analogous to the projections of the latent points into data space given by Y(Xi ; w ) . While the SaM algorithm has achieved many successesin practical appli cations , it also suffers from some significant deficiencies , many of which are highlighted in [18]. These include : the absence of a cost fuIJ.ction , the lack of any guarantee of topographic ordering , the absence of any general proofs of convergence, and the fact that the model does not define a probability density . These problems are all absent in GTM . The computational complexities of the GTM and SOM algorithms are similar , since the dominant cost in each case is the evaluation of the Euclidean distanced between each data point and each reference point in data space, and is the same for both algorithms . Clearly , we can easily formulate a density model consisting of a mixture of GTM models , and obtain the corresponding EM algorithm , in a prin cipled manner . The development of an analogous algorithm for the SaM would necessarily be somewhat ad-hoc.
5.2. GEOMETRYOF THE MANIFOLD An additional advantageof the GTM algorithm (compared with the SOM) is that the non-linear manifold in data spaceis defined explicitly in terms of the analytic function y (x ; w ). This allows a whole variety of geometrical properties of the manifold to be evaluated [9]. For example, local magnification factors can be expressedin terms of derivatives of the basis functions appearing in (46). Magnification factors specify the extent to which the area of a small patch of the latent spaceof a topographic mapping is magnified on projection to the data space, and are of considerable interest in both neuro-biological and data analysis contexts. Previous attempts to consider magnification factors for the SOM were been hindered becausethe manifold is only defined at discrete points (given by the referencevectors).
396
CHRISTOPHER M. BISHOP
We tion
can
determine
factors
der
a
each
, using
standard point
in
techniques
of
coordinates
of
P
~ 1, in
'
is
data
space
manifold
differential
in
x '" of
P
, as
Y
xi by
a
, the
mapping
including
which
each
; W
as in
[9 ] .
latent
continuous
space
function
point in
magnifica
follows
the
defines
illustrated
( X
,
geometry
mapped
manifold .
~ ." =
the
coordinates
space in
the .
values
of
Cartesian
latent
p ~ int
coordinate
properties
set P
responding
the
a P
'
Figure
is
of
.
-
Since a
cor
-
curvilinear
labelled
13
Consi .
to
set
-
with
the
Throughout
this
)
~ --, , '- ' ~ - - - - - - - - - --~. . .. 2
X
~
dA
Figure
Xl
13 .
Xi in manifold
This
latent
diagram
space
shows
onto
a
the
mapping
curvilinear
of
the
coordinate
1
Cartesian
system
~i
coordinate in
the
system
q - dimensional
S .
section
we
raised
indices
shall
use
components
covariant
- contravariant
We
the
denote
covariant
local
~
2
first
,
discuss
coordinates
coordinates
is
given
d
2 s
where
gij
is
the
notation
-
of
differential
components
with
an
implicit
indices
transformation
tesian
standard
contravariant
geometry
and
lowered
summation
over
in
indices pairs
of
which denote
repeated
.
the
metric
,
at
some
properties
( i
=
( i (~ ) .
point
P '
Then
in
the
of
the
S ,
to
manifold a
squared
set
S . of
length
Consider
a
rectangular element
Car in
-
these
by
~ dl " J1. dl " V u J1.l.I ~ ~
metric
tensor
-
~ 8 ( J-L 8 ( 11 d (: id u J1.V W W ~
, which
is
(: j ~
-
therefore
d (: id ~
gij
given
(: j ~
( 55
)
( 56
)
by
8 ( JL 8 ( II gij
We
now
seek
Consider Since to
S the
an
again is
expression the
embedded
squared
for
squared
9ij
length
within length
=
k i-
in
terms
element
the
element
c} JLIIWW
Euclidean of
the
.
of
the
ds2
non
lying
data
space
- linear
mapping
within , this
the
manifold
also
y
(x ) . S .
corresponds
form
~ .ayl
.
.
.
.
ds2 == <5kidy dy - <5ki8xi axJ.dxzdxJ = 9ijdxzdxJ
(57)
LATENTVARIABLE MODELS and
397
so we have
oyk oyl gij = <5klo XZ00xJ o.
(58)
Using (46) the metric tensor can be expressedin terms of the derivatives of the basis functions cPj(x ) in the form g = nTwTwn
(59)
where n has elements nji = 8>j / 8xi . It should be emphasizedthat , having obtained the metric tensor as a function of the latent space coordinates, many other geometrical properties are easily evaluated, such as the local curvatures of the manifold. Our goal is to find an expression for the area dAI of the region of S corresponding to an infinitesimal rectangle in latent spacewith area dA == IIi dx't ~ shown in Figure 13. The area element in the manifold S can be related to the corresponding area element in the latent space by the Jacobian of the transformation ~ - + ( dA' = II d( J-L= J II d~i = J II dxi = JdA J-L i i
(60)
where the Jacobian J is given by 8( J-L) = det ( ~ 8( J-L) . J = det ( ~
(61)
We now introduce the determinant 9 of the metric tensor which we can write in the form 9 = det (gij ) = det ( <5J-LII~ 8( IJ.& 8(J11 ) = det ( & 8(i IJ.) det ( & 8(J11 ) = J2
(62)
.
and so, using (60), we obtain an eXpreSSIC.n for the local mag-nification factor in the form
dA ' =J=det dA 1/2g.
(63)
Although the magnification factor represents the extent to which areas are magnified on projection to the data space, it gives no information about which directions in latent space correspond to the stretching . We can recover this information by considering the decomposition of the metric tensor g in terms of its eigenvectors and eigenvalues. This information can be conveniently displayed by selecting a regular grid in latent space (which could correspond to the reference vector grid , but could also be much finer ) and plotting at each grid point an ellipse with principal axes oriented according to the eigenvectors , with principal radii given by the square roots of
LATENTVARIABLEMODELS large
stretching
partial
in
The
is
the
region
of
between
males
corresponding
metric
of
the
separation
plot
given
in
stretching
of
Figure
15
,
local
and
Within
each
eigenvector
shows
"
cluster
there
decomposition
both
the
~
.
"
.
,
.
Plots
Figure
of
.
"
~
.
'
the
14
,
.
.
"
'
1
.
-
-
,
.
_
.
.
.
~
~
.
.
.
.
is
of
direction
a
and
the
magnitude
.
/
-
'
.
~
"
.
.
.
-
"
.
-
.
,
,
,
,
'
.
.
.
,
,
-
-
.
~
4 ~
-
,
.
.
18
.
.
~
.
.
stretching
the
,
.
1
~
.
.
8e
local
using
.
.
"
.
,
"
/
in
'
"
-
15
.
.
the
"
.
Figure
females
.
.
example
them
from
399
"
of
ellipse
.
the
latent
representation
space
discussed
,
corresponding
to
in
the
text
the
.
6. Temporal Models : GTM Through Time In all of the models we have considered so far, it has been assumedthat the data vectors are independent and identically distributed . One common situation in which this assumption is generally violated in where the data vectors are successivesamples in a time series, with neighbouring vectors having typically a high correlation. As our final example of the use of latent variables, we consider an extension of the GTM algorithm to deal with temporal data [6]. The key observation is that the hidden states of the GTM model are discrete, as a result of the choice of latent distribution p(x ), which allows the machinery of hidden Markov models to be combined with GTM to give a non-linear temporal latent variable model. The structure of the model is illustrated in Figure 16, in which the hidden states of the model at each time step n are labelled by the index in corresponding to the latent points {Xin} . We introduce a set of transition probabilities p(in+l \in ) corresponding to the probability of making a transition to state in+l given that the current state is in . The emission density for the hidden Markov model is then given by the GTM density model (45). It should be noted that both the transition probabilities p(in+l \in )
CHRISTOPHER M. BISHOP
400
p(i2Ii})
nil
p(i3Ii2 ) ..
.
p(t}li}) p(t2Ii2 ) p(t3Ii3 ) Figure 16 . The temporal version ofGTM consists ofahidden Markov model in which the hidden states are given by the latent points of the GTM model , and theemission probabilities aregoverned bytheGTM mixture distribution . Note thattheparameters oftheGTM model , MwellMthetransition probabilities b_etween states ,are tiedtocommon values across alltime steps .Forclarity wehave simplified the graph and not made the factorization property of the conditional distribution p(tli) explicit . and the parameters Wand a2 governing the G TM model are common to all time steps, so that the number of adaptive parameters in the model is independent of the length of the time series. We also introduce separate prior
probabilities
7ril on each of the latent
points at the first time step of
the algorithm . Again we can obtain an EM algorithm for maximizing the likelihood for the temporal
G TM
model . In the context
of hidden
Markov
models , the
EM algorithm is often called the Baum - Welch algorithm , and is reviewed
in [21]. The E-step involves the .evaluation of the posterior probabilities of the hidden states at each time step , and can be accomplished efficiently using a technique called the forward -backward algorithm since it involves two counter -directional propagations along the Markov chain . The M -step equations again take the form given in Section 5.1. As an illustration of the temporal GTM algorithm we consider a data set obtained from a series of helicopter test flights . The motivation behind this application is to determine the accumulated stress on the helicopter air frame . Different flight modes, and transitions between flight modes, cause different
levels
of stress , and at present
maintenance
intervals
are determi
-
ned using an assumed usage spectrum . The ultimate goal in this application would be to segment each flight into its distinct regimes, together with the transitions between those regimes, and hence evaluate the overall integrated stress with
greater accuracy .
The data used in this simulation was gathered from the flight recorder over four test flights , and consists of 9 variables (sampled every two seconds)
402
CHRISTOPHER M. BISHOP
are
the
the
observed
inference
and
the
of
the
integrating
of
over
all
in
time
,
The
of
the
classes
for
mult
.iple
may
grow
hidden
the
since
the
focus
of
the
research
communities
to
of
within
the
for
considered
at
motivate
anyone
the
,
which
this
-
in
-
states
.
models
graphical
to
hidden
variables
re
there
leads
of
such
-
general
in
models
with
develop
more
considered
hidden
deal
the
choices
active
configurations
number
to
)
we
be
many
of
approximations
extensive
computing
number
For
hidden
Gaussian
instance
be
for
.
.
For
can
,
or
variables
can
helps
.
However
with
controlled
also
states
.
exponentially
of
,
model
)
summing
variables
states
algorithm
continuous
and
hidden
given
EM
hidden
distribution
however
the
involves
the
( linear
hidden
variables
of
over
discrete
probabilistic
variables
algorithms
lopment
simple
the
hidden
step
integration
mixture
discrete
hidden
tractable
of
,
of
the
of
standard
viewpoint
new
presentations
are
case
one
to
graphical
ment
,
of
the
-
which
of
chapter
In
the
E
configurations
only rise
the
function
because
.
which
giving
likelihood
this
possible
structure
models
the
in
was
model
of
to
possible
considered
variables
distribution
( corresponding
evaluation
models
the
posterior
variables
The
is
modelling
deve
-
currently
and
neural
.
ACKNOWLEDGEMENTS The
author
the
would
work
like
reported
Svensen
,
in
Michael
to
thank
this
the
chapter
Tipping
:
and
following
for
Geoffrey
Chris
their
Hinton
Williams
,
contributions
lain
to
Strachan
,
Markus
.
References 1. Anderson York
, :
T
John
.
W
.
Wiley
( 1958
) .
An
Introduction
to
Multivariate
Statistical
Analysis
.
New
.
2. Anderson of
,
T
.
W
.
Mathematical
( 1963
)
.
Asymptotic
Statistics
34
theory
,
122
-
148
for
principal
component
analysis
.
Annals
.
3. Bartholomew
,
Charles
D
Griffin
.
&
J
.
Co
( 1987 .
) .
Ltd
Latent
Variable
Models
and
Factor
Analysis
.
London
:
.
4. Basilevsky
,
Wiley
A
.
( 1994
) .
M
.
( 1995
) .
M
. ,
Statistical
Factor
Analysis
and
Related
Methods
.
New
York
:
.
5. Bishop Press
,
C
.
,
C
.
Neural
Networks
for
Pattern
Recognition
.
Oxford
University
.
6. Bishop Proceedings
G
lEE
bridge
,
U
. K
,
C
. ,
.
E
.
Fifth pp
.
Hinton
,
and
I
International
111
-
116
.
G
.
D
.
Strachan
Conference
( 1997 on
) .
Artificial
GTM
through
Neural
time
Networks
,
.
In
Cam
-
dual
-
.
7 . Bishop energy
8.
in
.
and
G
Research
,
nerative tion
M
C
.
M
To : /
/
appear wwv
D
.
James
M
. ncrg
( 1993
and A
. ,
topographic .
.
densitometry
Physics
Bishop
http
.
gamma
.
327
,
580
Svensen
-
593
,
in
.
volume . ac
10 . uk
/
, .
Analysis
networks
of .
Nuclear
multiphase
Hows
using
Instruments
and
Methods
.
and
mapping
. aston
) .
neural
C
.
K
.
I .
Accepted
for
number
1 .
Williams
( 1997a
publication Available
) . in
as
NCRG
GTM
:
Neural
the
ge
Computa /
96
/
015
-
from
LATENTVARIABLE MODELS
403
9. Bishop , C. M., M. Svensen , andC. K. I. Williams(1997b ). Magmfication factorsfor
theGTMalgorithm . In Proceedings lEE FifthInternational Conference onArtificial NeuralNetworks , Cambridge , U.K., pp. 64-69. 10. Bishop , C. M. andM. E. Tipping(1996 ). A hierarchical latentvariablemodel for datavisualization . Technical ReportNCRG /96/028,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Accepted forpublication in IEEEPAMI. 11. Dempster , A. P., N. M. Laird,andD. B. Rubin(1977 ). Maximum likelihood fromincomplete dataviatheEMalgorithm . JournaloftheRoyalStatistical Society , B 39(1), 1-38. 12. Hinton , G. E., P. Dayan , andM. Revow (1997 ). Modeling themanifolds of images of handwritten digits. IEEETransactions onNeuralNetworks 8(1), 65-74. 13. Hinton,G. E., C. K. I. Williams , andM. D. Revow (1992 ). Adaptive elasticmodels for hand -printedcharacter recognition . In J. E. Moody , S. J. Hanson , andR. P. Lippmann (Eds.), Advances in Neural Information Processing Systems , Volume 4, pp. 512 - 519.Morgan Kauffmann . 14. Hotelling , H. (1933 ). Analysis of a complex of statistical variables intoprincipal components . Journalof Educational Psychology 24, 417 - 441. 15. Hull, J. J. (1994 ). A database for handwritten text recognition research . IEEE Transactions onPatternAnalysis andMachine Intelligence 16, 550 - 554. 16. Jordan , M. I. andR. A. Jacobs (1994 ). Hierarchical mixtures of expertsandthe EMalgorithm . NeuralComputation 6(2), 181 - 214 . 17. Kohonen , T. (1982 ). Self -organized formation oftopologically correct feature maps . Biological Cybernetics 43, 59-69. 18. Kohonen , T. (1995 ). Self-Organizing Maps . Berlin:Springer -Verlag . 19. Krzanowski , W. J. andF. H. C. Marriott(1994 ). Multivariate Analysis Part I: Distributions , Ordination andInference . London : Edward Arnold . 20. Pearson , K. (1901 ). Onlinesandplanes of closest fit to systems ofpointsin space . TheLondon , Edinburgh andDublinPhilosophical Magazine andJournalof Science , SixthSeries2, 559 - 572. 21. Rabiner , L. R. (1989 ). A tutorialonhidden Markov models andselected applications in speech recognition . Proceedings of theIEEE77(2), 257 - 285 . 22. Rao,C. R. (1955 ). Estimation andtestsof significance in factoranalysis . Psycho metrika20, 93- 111 . 23. Rubin,D. B. andD. T. Thayer(1982 ). EM algorithms for ML factoranalysis . Psychometrika 47(1), 69-76. 24. Tipping , M. E. andC. M. Bishop(1997a ). Mixturesof probabilistic principal component analysers . Technical ReportNCRG /97/003,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Submitted to NeuralComputation . 25. Tipping , M. E. andC. M. Bishop(1997b ). Probabilistic principal component analysis.Technical report,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Submitted to Journalof theRoyalStatistical Society , B.
STOCHASTIC ANAL
ALGORITHMS
YSIS
DATA
FOR
EXPLORATORY
DATA
:
CLUSTERING
JOACHIMM
AND
DATA
. BUHMANN
Institut Jur Informatik Rheinische
VISUALIZATION
Friedrich
III ,
- Wilhelms
- Universitiit
D - 53117 Bonn , Germany jb ~ informatik . uni - bonn . de http : ! ! www- dbv . informatik . uni - bonn . de
Abstract . Iterative , EM -type algorithms for data clustering and data vi sualization are derived on the basis of the maximum entropy principle . These algorithms allow the data analyst to detect structure in vectorial or relational data . Conceptually , the clustering and visualization procedures are formulated
as combinatorial
or continuous
optimization
problems which
are solved by stochastic optimization .
1.
INTRODUCTION
Exploratory Data Analysis addresses the question of how to discover and model structure hidden in a data set. Data clustering (JD88 ) and data visualization are important algorithmic tools in this quest for explanation of data relations . The structural
relationships
between
data points , e.g .,
pronounced similarity of groups of data vectors , have to be detected in an unsupervised fashion . This search for prototypes poses a delicate tradeoff : a sufficiently rich modelling approach should be able to capture the essential structure
in a data
set but
we should
restrain
ourselves
from
imposing
too much structure which is absent in the data . Analogously , visualization techniques should not create structure which is caused by the visualization methodology rather than by data properties . In the first part of this article we discuss the topic of grouping vectorial and relational data . Conceptually , there exist two different approaches to data clustering : 405
406
JOACHIM M. BUHMANN
-
Parameter
estimation
of mixture
models by parametric
statistics .
-
Vector quantization of a data set by combinatorial optimization .
Parametric statistics assumes that noisy data have been generated by an unknown number of qualitatively similar , stochastic processes. Each indi vidual process is characterized by a unimodal probability density . The density of the full data set is modelled by a parametrized mixture model, e.g., Gaussian mixtures
are used most frequently
(MB88 -) . This model -based an . -
a
. proach to data clustering requires estimating the mixture parameters , e.g., the mean and variance of four Gaussians for the data set depicted in Fig . la . Bayesian statistics provides a unique conceptual framework to compare
and validate
different
mixture
models
.
The second approach to data clustering which has been popularized as vector quantization in information and communication theory , aims at finding a partition of a data set according to an optimization principle . Clustering as a data partitioning problem arises in two different forms depending on the data format 1:
- Central clustering of vectorial data B = {Xi E n:td : 1 ~ i ~ N } ; - Pairwise clustering of proximity data V = { Vik E ill. : 1 :::; i , k :::; N } . The goal of data clustering is to determine a partitioning which either
minimizes
of a data set
the average distance of data points to their cluster
centers for central clustering or the average distance between data points of the same cluster for pairwise clustering . Note that the dissimilarities Vik do not necessarily
respect the requirements
of a distance
measure , e.g ., dissi -
milarities or confusion values of protein , genetic , psychometric or linguistic data frequently violate the triangle inequality and the self-dissimilarity Vii is not necessarily zero. For Sect. 3 we only assume symmetry Vik = Vki . The second main topic of this paper addresses the question
how rela -
tional data can be represented by points in a low-dimensional Euclidian space. A class of algorithms known as Multidimensional determine
coordinates
in a two - or three - dimensional
Scaling (Sect. 4)
Euclidian
space such
that pairwise distances IIXi - xkll match as close as possible the pairwise dissimilarities V . A combination of data visualization and data clustering is discussed in Sect. 5. This algorithm preserves the grouping structure of relational data during the visualization procedure . 2. Central
Clustering
The most widely used nonparametric technique to find data prototypes is central clustering or vector quantization . Given a set of d-dimensional data JIn principle , it is possible to consider triple or even more complicated data relations in an analogous fashion but the discussion in this paper is restricted to vectorial and relational
data
.
DATA
vectors
B
to
=
{ Xi
determine
: rn . d
:
CLUSTERING
E
an
1
::::;
v
specified
::::;
by
: n: td
AND
:
1
~
optimal
K
}
i
set
of
according
Boolean
~
N
=
}
,
an
central
clustering
)
E
.
and
{ O , l
}
the
vectors
criterion M
i = l ,...,N
poses
reference
optimality variables
( Miv
407
VISUALIZATION
d - dimensional
to
assignment
M
DATA
a
A
problem Y
data
configuration
NXK
=
{ y
v
E
partition
is
space
M
,
(1)
,
lI = l , . . . , K
(2) M
Miv
=:
1 ( 0 )
vector
y
rations
v .
Eq
.
The
quality
of
tion
which
favors
.
1i
( a
)
favor
~
set
of
,
( a
' )
is
+
specific
1i
we
point
constraint
( a
" ) .
Xi
is
as
the
CENTRAL
1 ,
)
Vi
}
.
assigned
set
of
unique
to
configu
assignment
=
1 ,
reference
admissible
Vi
of
-
data
to
.
CLUSTERING
vectors
is
assessed
with
a
this
number
' , a
"
design
of
.
objective
costs
always a
,
i .e . ,
favorable bias
function
-
H
be
spurious
cost
-
compact
clustering
The
func
cluster
should
principle
clusters
an
intra
of
clusters
Without
by
high
superadditivity
two
"
=
( not
1 Miv
solutions
" optimal
Miv
a
): : : ; ~ =
require
into
E
defined
reference
a
:
requires
assignment
cluster
1i
a
data
FOR
Furthermore
a
the space
the
FUNCTION
a
{ a , 1 } NXK
Admissibility by
COST
E
that
( 2 ) .
2 . 1 .
splitting
M
solution
expressed
ness
{
denotes The
in
clusters
=
could
for
central
clustering N
HCC
( M
)
=
L
L
i =
fulfills
V
these
( Xi
Yv
, Yv
)
. An
between
, Yv
The
)
: =
the
most
Ilxi
-
standard
( Mac67
A
sion
probability
code
vector
cation
setup
the
to
to
the
the
Y
.
cluster
index
'Y.
The
, Yv
( 3 )
average
)
distortion
the
on
squared
known
its as
cluster
( 3 )
has
a
and
following
been
' Y
make
it
.
and
reesti
-
convergence
.
discussed a
displays
vector algorithm
centers
with
diagram
distance
- means
until
channels
vector application
reference
K
assignments
Noisy
the
Euclidian
and
closest
data
error
reference
depends
vector
function
- coding
) .
corresponding
( Xi
data
objective
- channel
close
the
( 3 ) ,
to
, Yv
the
being
the
data
between Ya
and
minimizing
assigns
of
source
up
choice
for
( Xi
l
V
between
according
generalization of
Xi
common
centers
MivV
v =
measure
Yvl12
iteratively
these
context
vector
algorithm
) ,
mates
data
l
summing
distortion
with
( Xi
by
a
appropriate
domain
V
requirements
K
in
significant
preferable
to the
the
confu
place
communi
:
~~~ ! ; J .!.
{Xi} -+ IEncoder Xi -+ YaI ~
~ [:~:2!!:i~;!!!!~~!1 4 ~ _ecod
-
-t {Y'Y}
-
408 In
JOACHIMM. BUHMANN
addition
nel
to
the
distortion
quantization
due
function
for
to
the
error
index
total
V
( Xi
corruption
,
Ya
into
distortions
)
we
have
account
between
to
.
sender
An
and
take
the
chan
appropriate
-
cost
receiver
K
V
( Xi
,
Ya
)
=
L v
favors
the
topological
measure
v
Toll
due
to
noise
of
popular
as
Lut89
. 2
;
.
BSW97
this
)
paper
cost
advocate
1 {
cc
costs
uniqueness
( M
,
we
)
have
principle
)
,
,
states
( 3
.
to
,
e
a
. ,
"
the
)
closeness
a
- dimensional
-
with
,
dimensional
neural
( M
index
topological
grid
are
computation
"
very
( Koh84
;
)
)
stochastic
optimization
of
requires
data
many
clusters
this
by
suggested
for
are
The
distri
the
introducing
central
-
of
yield
an
principle
.
The
clustering
distributed
assign
probability
might
entropy
to
optimization
the
ambiguity
assignments
.
Stochastic
assignments
maximum
principle
to
.
determining
Since
the
originally
that
to
index
low
two
variables
break
. g
1lcC
random
M
constraint
in
assignments
be
assignments
expected
( RGF90
we
to
function
over
a
maps
and
considered
according
confusing
a
or
OF
,
( 4
.
centroids
are
entropy
with
chain
OPTIMIZATION
optimized
the
Distortions
a
Yvl12
vectors
of
topological
;
Throughout
bution
code
probability
defining
selforganizing
RMS92
ments
.
clusters
STOCHASTIC
find
the
-
l
of
specifies
transmission
arrangement
2
organization
which
Tavllxi =
unbiased
maximum
by
according
-
same
Rose
to
et
the
ale
Gibbs
distribution
pGibbS
( 1 {
CC
F
( M
)
( 1 - lCC
)
=
exp
)
=
-
(
T
-
( 1lcC
log
( M
L
)
-
:
exp
(
-
F
( 1icC
)
1icC
)
( M
jT
)
)
/
T
,
)
( 5
)
( 6
)
MEM
The
"
the
in
computational
expected
Eq
.
costs
( 6
function
exp
term
temperature
)
can
1lcc
(
-
1lcC
exp
.
( M
(
-
:
.
be
F
/
pointed
as
factor
T
( 1lcC
)
.
)
exp
( :
F
a
( 1lcC
Constraining
/
T
)
can
T
serves
out
interpreted
The
)
As
"
rewritten
( HB97
smoothed
)
the
be
in
as
/
T
a
)
,
Lagrange
the
free
version
)
the
by
for
energy
of
normalizes
assignments
parameter
the
F
orie
'-'
exponential
E
~
l
( 1 - lcC
: inal
)
cost
terms
Mill
=
1
the
as
(7)
(8)
DATACLUSTERING AND DATAVISUALIZATION
409
for predefinedreferencevectorsT = {y 1I} ' This Gibbsdistribution can also be interpreted as the completedata likelihood for mixture modelswith parametersT . Basically, the distribution (8) describesa mixture model with equalpriors for eachcomponentand equal, isotropiccovariances . The optimal referencevectors{y~} are derivedby maximizingthe entropy of the Gibbs distribution, keepingthe averagecosts(1lCC ) fixed, i.e., Y* = argmax - L pGibbs (1lcC(M )) logpGibbs (1lcC(M )) T MEM = argmax 2::: 1{cC(M )pGibbs (1{cC(M ))/ T Y MEM N K + 2::: log 2::: exp(- V (Xi, YJL )/ T) . i=l JL =l
(9)
pGibbs (M ) is the Gibbsdistribution of the assignments for a set 1 of fixed referencevectors. To determineclosedequationsfor the optimal reference vectorsy~ we differentiatethe argumentin Eq. (9) with the expectedcosts being kept constant. The resultingequation N a 0 = L (MiV)aV i=l Yv (Xi, Yv) with exp(- V (Xi, Yv)/ T) (Miv) = "L".,.. ,,,,JLexp(- "n( )/ T) Vv E { I , . . . ,K } v X1 ". YJL
(10) (11)
is known as the centroid equationin signalprocessingwhich are optimal in the senseof rate distortion theory (CT91). The angular bracketsdenote Gibbsexpectationvalues, i.e., (f (M )) := L:::MEMj (M )pGibbs (M ). The equations(10,11) are efficientlysolvedin an iterative fashionusingthe expectation maximization(EM) algorithm (DLR77): The EM algorithm alternatesan estimationstep to determinethe expectedassignments(M iv) with a maximizationstepto estimatemaximum likelihood valuesfor the clustercentersYv. Dempsteret ale(DLR77) have proventhat the likelihood increasesmonotonicallyunder this alternation scheme . The algorithm convergestowardsa local maximumof the likelihood function. The log-likelihoodis up to a factor (- T) equivalentto the freeenergyfor centralclustering. The function f (.) denotesan appropriate annealingschedule , e.g., f (T) == T / 2. The sizeK of the cluster set, i.e., the complexityof the clusteringsolution, has to be determinedby a problem-dependentcomplexitymeasure
410
JOACHIMM . BUHMANN EM
Algorithm
I for Centroid
Estimation
INITIALIZE y~o) E Rd randomlyand (Miv)(O) E (0, 1) arbitrarily; temperature
T +- To ;
WHILE T > T FIN AL t +- O., REPEAT
E-step: estimate(M iv) (t+1) as a function of y ~t) ; M-step: calculatey~t+l ) for given (Miv)(t+l ); t +-
t +
1;
UNTIL all { {Miv)(t), y~t)} satisfyEqs. (10,11) T +- f (T );
(BK93 ) which monotonically grows with the number of clusters . Simulta neous minimization of the distortion costs and the complexity costs yields an optimal number K * of clusters . Constant complexity costs per cluster
or logarithmiccomplexitycosts- log(E ~ l Miv/N) (Shannon information ) are utilized in various applications like signal processing , image compression or speech recognition . A clustering result with logarithmic complexity costs is shown in Fig . lc . It is important to emphasize that the vector quantiza tion approach to clustering optimizes a data partitioning and , therefore , is not suited to detect univariate components of a data distribution . The split -
ting of the four Gaussiansin Fig. la into 45 clusters (Fig. lc ) results from low complexity costs and is not a defect of the method . Note also, that the distortion costs as well as the complexity costs determine the position of the
prototypes { Yv} . For example, the density of clusters in Fig. lc is asymptotically independent of the density of data (in the area of non-vanishing data density ) , which is a consequence of the logarithmic complexity costs. 3 . Pairwise 3 .1 .
COST
Clustering
FUNCTION
FOR
PAIRWISE
CLUSTERING
The second important class of data are relational data which are encoded by a proximity or dissimilarity matrix . Clustering these non-metric data which are characterized by relations and not by explicit Euclidian coordinates is usually formulated as an optimization problem with quadratic assignment costs. A suggestion what measure we should use to evaluate a clustering solution for relational data is provided by the identity
N !2~ =lEk Mkvllxi -xkl12 (12) -Yvl12 .l,M 7 ,.vEf N iL=lMivllxi iL= =lMkv -
UALIZA TI0N DATACLUSTERING ANDDATAVIS
Figure
1 .
Clustering
estimated the
Gaussian .
Data connected
This . rIng
and ( c )
by
a
data
model stars
is (* )
shows
a
self
- organizing
are
data
( a )
depicted
generated in
cluster
centers
partitioning
using
chain
is
by
( b ) , the
shown
four
plus
. The
Gaussian
signs
( +
circles
a
logarithmic
in
( d ) ,
sources
)
are
denote
the the
The of
covariance
complexity neighboring
:
centers
measure clusters
.
being
.
with
tances
Figure
clustering
multivariate
mixture
sources
estimates
Ilxi
of
Gaussian
411
Yv -
= Xk is
identity
E 112
~ =
l
MkvXk
pairwise
identical
to strongly
/
E
clustering central
~ =
l
Mkv
.
with
For
squared
Euclidian
normalized
clustering
with
distances
average cluster
intra means
Vik - cluster
as
prototypes
INK 1lPC ({Miv })=2 ~~ i,L k=lvL=l~El =lMlv motivates
the
objective
function
for
pairwise
= dis
.
cluste
-
( 13
)
with Miv E { a, I } and E ~=l Miv = 1. Whenever two data items i and k are assignedto the same cluster v the costs are increasedby the amount Vik / E ~ l Mlv . The objective function has the important invariance property that a constant shift of all dissimilarities Vik - t Vik + Va does not effect the assignmentsof data to clusters (for a rigorous axiomatic frame-
412
JOACHIM M. BUHMANN Algorithm
INITIALIZE
Clustering
iv (O) and (Miv ) (O) E (0, 1) randomly ; temperature
WHILE
II for Pairwise T f-- To ;
T > T FIN AL t f -- O., REPEAT
E-like step: estimate (M iv) (t+ 1) as a function of tiv (t) ; M -like step : calculate t ~
iv (t+ l ) for given (Miv ) (t+ l )
t + 1;
UNTIL all { (M iv) (t) , [;iv (t)} satisfy (14) T f-- f (T );
work, see (Hof97)) which dispensesthe data analyst from estimating absolute dissimilarity values instead of (relative) dissimilarity differences.
3.2. MEANFIELDAPPROXIMATION OF PAIRWISECLUSTERING Minimization of the quadraticcost function (13) turns out to be algorithmically complicateddue to pairwise, potentially conflicting interactions betweenassignments . The deterministicannealingtechnique, which producesrobust reestimationequationsfor centralclusteringin the maximum entropy framework, is not directly applicableto pairwiseclusteringsince there is no analyticaltechniqueknownto capturecorrelationsbetweenassignmentsMill and Mkll in an exact fashion. The expectedassignments (Mill)' however , can be approximatedby calculatingthe averageinfluence ill exertedby all Mkll, k # i on the assignmentMill (pp. 29, (HKP91)), thereby neglectingpair-correlations((MillMkll) = (Mill) (Mkll))' A maximum entropyestimateof ill yieldsthe transcendentalequations (Mill) =
iLl({ (Mill)} IV)
-
Kexp(- ill/ T) }:::JL =l exp(- iJL / T) ,
(14)
-Ef Mjv )V) jv jk kt=l}:;N j= jv 1Vik 2}:=;~ j= #li(M)+kv # il(M (15)
The partial assignmentcosts [,iv of object i to cluster v measurethe average dissimilarity in cluster v weighted by its cluster size. The secondsummand in the bracket in (15) can be interpreted as a reactive term which takes the increased size of cluster v after assignment of object i into account. Equation (14) suggestsan algorithm for learning the optimized cluster assignments which resemblesthe EM algorithm : In the E- step, the assignments { Mill } are estimated for given { 'iv} . In the M-step the { 'iv} are
DATA
reestimated
CLUSTERING
on
gorithm
the
converges
data
basis
to
clustering
AND
of
a
new
(
estimates
solution
HB97
)
413
VISUALIZATION
assignment
consistent
problem
DATA
of
.
This
assignments
iterative
for
al
the
-
pairwise
.
3.3. TOPOLOGICALPAIRWISECLUSTERING Central
clustering
gical break
the
titioning in
with
clustering
, i .e . , to
pairwise
to same
break
clustering
clustering
for
criterion
yields
( 3 ) has
distortions
symmetry
between in
index
can
be
generalization
the
permutation
cost
function
generalized
which
and
space
favor
for
clusters
coincides
distances
a topolo
with
-
distortions a data
, e .g . , a linear
introduced of
to
generalized
clusters
symmetry
Euclidian
cost
been
( 4 ) . These
neighborhood
quadratic the
function
with
permutation according
Fig . Id . The
ring
cost
problem
pairwise . We
par -
chain
as
cluste
introduce
topological
as dissimilarities
. This
a
central design
function N
1ltPC ( { Miv
} )
==
~
K
L L MivMkv i ,k = lv = l
--- : ~ El = l Mlv
( 16 )
K Miv
=
L T J.tvMiJ .t8 J1 .= 1
( 17 )
-The
effective
ments
assignments
which
encode
maximum
entropy
clustering
we
and
transform
apply the
Mill
the
estimates the
can
of
the
inverse
solution
be
interpreted
neighborhood
of
assignments
transformation
of the
as
structure
resulting
for to
costs
transformed clusters
. To
topological
the
cost
( 13 ) which
function
assign find
-
the
pairwise ( 17 )
yields
-
(Mio) =
; xp(- ia~T) ,
LJ1.=l exp(- iJ1 ./ T )
(18)
K
io = L Tov iv({(Miv)}IV),
(19)
v = l
iv in -r .h .s. of replaced byMiv.
where
4 . Data
Visualization
(19) are the assignment costs (15) with Miv being
by Multidimensional
Scaling
Grouping data into clusters is an important concept in discovering struc ture . Apart from partitioning a data set, the data analyst often explores data by visual inspection to discover correlations and deviations from ran domness. The task of embedding given dissimilarity data V in ad -dimensional
415
DATACLUSTERING ANDDATAVISUALIZATION intermediate (l ) Wik -
normalization 1
N
which
(N
corresponds
error
( DH73 We
the
to
the
minimization
same
clustering
expected the
the
1 . w ~m ) = 'L.... """ "'~,k = l V tk ? ' ."k of
relative
1 D 'l.k . ,L.... """ lN,m = l V2 lm
, absolute
or
( 21 )
intermediate
) .
pursue
pairwise
of
. w ~g ) = I ) V ."k ? " "k
-
optimization
case
strategy
, i . e . , we
coordinates
derive
( Xi ) using
to
the
the
minimize
maximum
( 20 )
entropy
approximation
that
as
in
the
of
distribution
exp (exp -fi((-xfi)(/X T))i/T)i iII =l-00 J00 dXi
embedding
coordinates
pO(XIi)
-
X
factors
according
the
estimate
to
(22)
f i (Xi) = a?IIXi114 + IIXi112xThi + Tr -[xixTHi] + xThi. (23) Utilizing the symmetry Vik == Vki and neglecting constant terms an expansion of 1lmdsyields the expected costs (expectations w .r .t . 22)
N (1{mdS )
:::1Wik[ 2(llxi114 ) - 8(IIXi112Xi )T(Xk) + 2(IIXi112 )(11Xk 112 ) ?',2 ,k= +4Tr [(XiX[ )(XkX [ )] - 4Vik((llxiI12 ) - (Xi)T(Xk)) ] , (24)
Tr
[
A
]
denoting
The
the
statistics
2
} =: : :
k
=
1
wik
,
=
f
statistics
of
(
a
with
)
of
clearly
the
ported
} =: : :
8
=
226
(
I
k
=
XkX
,
hi
,
(
b
>
1
)
,
,
1
wik
[
)
(
+
)
(
.
.
.
,
< I >
N
)
used
)
information
,
A
detailed
,
1
hi
=
lxk
12
in
in
derivation
~
8
)
-
i
the
box
an
iterative
~
,
is
N
(
are
allows
c
)
intra
in
(
defined
KB97
as
the
(
or
the
data
.
with
d
)
a
?
.
-
)
)
(
to
.
12xk
)
)
-
compute
2
dark
gives
an
grey
by
weighting
levels
minimizing
.
with
IXk
dissimilarity
The
different
embed
-
accuracy
dissimilarities
.
Embedding
to
visualize
structure
rather
(
determinis
Fig
derived
data
visualized
-
the
family
(
)
III
globin
the
Xk
a
fashion
cluster
set
(
Algorithm
local
-
Vik
propose
experimentalist
that
in
wik
We
Starting
and
the
.
are
Clustering
guarantee
1
)
for
the
of
and
=
embeddings
structure
-
k
)
.
from
inter
} =: : :
Vik
practice
)
cluster
scaling
no
(
in
Pairwise
,
Xk
4I
intermediate
of
the
.
hi
sequences
the
Simultaneous
by
,
dissimilarities
reveal
however
Hi
sketched
protein
representation
,
?
be
small
Multidimensional
exists
a
A "
N
(
global
dings
.
8
(
might
to
)
-
Wik
MDS
correspond
5
1
< I >
how
matrix
in
=
=
algorithm
idea
20
(
matrix
N
hi
} =: : :
annealing
the
(
=
"
Hi
tic
of
< I > i
N
and
trace
than
data
is
being
indeed
generated
.
There
sup
-
by
416
JOACHIM M. BUHMANN Algorithm
INITIALIZE WHILE
III : MDS
by Deterministic
Annealing
the parameters <1> of pO(XI ) randomly.
T > TFINAL REPEAT
E -like step :
Calculate(Xi)(t+l ), (xixT )(t+l ), (1IxiI12Xi )(t+l ) w.r .t . p (t)(XI (t ) M -like step :
compute it+l), 1 ~ k ~ N, k # i t ~
UNTIL
t +
1;
convergence
T f - f (T );
the visualization process. We, therefore, have proposed a combination of pairwise clustering and visualization with an emphasis on preservation of the grouping statistics (HB97). The coordinates of data points in the embedding spaceare estimated in such a way that the statistics of the resulting cluster structure matches the statistics of the original pairwise clustering solution. The relation of this new principle for structure preserving data embedding to standard multidimensional scaling is summarized in the following diagram: { Dik } -- + .t 1lmds { llxi - Xk112 } -- +
1lpc(MI { Vik } )
-- +
1lcC(MI { Xi} )
- -t
pGibbS (1lpc(MI { Vik } )) .t I (1lCC II1lPc) pGibbS (1{ cC(MI { Xi} )).
Multidimensional scaling provides the left/ bottom path for structure detection, i .e., how to discover cluster structure in dissimilarity data. The dissimilarity data are first embedded in a Euclidian space and clusters are derived by a subsequent grouping procedure as K -means clustering. In contrast to this strategy for visualization, we advocate the top fright path , i .e., the pairwise clustering statistic is measured and afterwards, the points are positioned in the embedding space to match this statistic by minimizing the Kullback-Leibler divergence I (.) between the two Gibbs distributions pGibbs (1lcC(MI {Xi} )) and pGibbs (1lPC (MI { Vik } ))' This approach is motivated by the identity (12) which yields an exact solution (.T(pGibbs (1lCC ) IIPGibbS (1lPC )) = 0) for pairwise clustering instanceswith ' Dik= IIXi - Xk112 . Supposewe have found a stationary solution of the mean- field equations (14). For the clustering problem it sufficesto considerthe mean assignments (Mill ) with the parameters eill being auxiliary variables. The identity (12)
417
DATACLUSTERING AND DATA
VISUALIZATION
allows
us
to
centroid
interpret
these
under
scaling
problem
tentials
ill
definition
the
=
for
~
~
1
be
MivXk
squared
E
the
the
~
1
distance
data
are
of
/
embedding
the
Euclidian
Xi
to
E
the
1
KiXi
of
coordinates
restricted
Yll
equations
as
assumption
the
are
variables
In
the
unknown
form
Miv
.
-
then
Y
the
are
v
112
.
with
If
the
the
following
fulfilled
cluster
multidimensional
po
reestimation
K
}
:
11 =
( Miv
)
(
11Y1I112
-
t : iv
(
{
( Miv
)
}
IV
)
-
centroid
:
K
' 2
the
quantities
II Xi
,
coordinates
to
)
(
Yv
-
}
1
(25) :
( MiJ
J- L =
- L ) YJ
-L )
'
1
K
Ki
=
( yyT
)
i
-
( Y
) i
( Y
) ;
with
( Y
) i
=
} v
The
t : iv
dissimilarity
which
values
are
Appendix
C
iteratively
determine
defined
of
in
( HB97
solving
)
the
.
(
The
the
15
)
.
Algorithm
the
{
)
according
IV : Structure
) Yv
Xi
of
coordinates
( 25
( Miv
Xi
(26)
.
1
coordinates
Details
equations
: =
through
the
derivation
}
can
and
to
{
the
Preserving
YII
}
are
potentials
be
found
in
calculated
Algorithm
by
IV
.
MDS
INITIALIZEx~o) randomly and(Miv)(O) E (0,1) arbitrarily ; WHILE
temperature T > TFIN AL
T +- To ;
t t - O', REPEAT
E-likestep:estimate (Miv)(t+l) asa function of {x~t),y~t)} M - like step : REPEAT
calculate x~t+l) given(Miv)(t+l) andy~t) updatey~t+l ) to fulfill the centroidcondition tf UNTIL
UNTIL - t + 1
convergence
convergence
T f - f (T );
The derived system of transcendental equations given by (11) with quadratic distortions , by (25) and by the centroid condition explicitly reflects the dependencies between the clustering procedure and the Euclidian representation . Simultaneous solution of these equations leads to an efficient algorithm which interleaves the multidimensional scaling process and the clustering process, and which avoids an artificial separation into two uncor related data processing steps. The advantages of this algorithm are convin cingly demonstrated in the case of dimension reduction . 20-dimensional
418
JOACHIMM. BUHMANN 1
a)
I
M
L
LL
LL
L
:
E
iM
ME1u -.
E
EE ~
0
p
H
GKK
NEN ~tf .r ~ Q ~
C
N
2
f$G --q~
:
DC C~
"fr ~H
K
~
R
~H
K
I
~
_0 5
a ~i 0S :A :
p
0
.
0
s
I
S 0
i !
p
G~
-2
FF
0 0 O~ ~AA
A
"
I: " Y
At !
...
F
0 : !
~
C
F F ~r
~
:
4, 5 S
B ~ ~~ L. ~
0
P all!! s ~Ss R IftQ Q
C
5
M.
KK Q K K
D\ pQ !.. ~~ T .T ~! T ~RI ~r.iI 13 ; roLI
!
J
J JG_ ,JG
~
.
b)
J
G
,' PJI' I
' LL EEEiM E M N !
O. 5
4
G
,.
I
: ~
~
i A : A - 1 - 1
. 0
- 0 . 5
0 . 5
1
- 4
- 4
- 2
0
2
Figure 3. Embedding of 20-dimensional data into two dimensions: (a) projection of the data onto the first two principal components; (b) cluster preserving embedding with Algorithm IV . Only 10% of the data are shown.
data generated by 20 Gaussians located on the unit sphere are projected onto a two- dimensional plane in Fig . 3 either by Principal Component Analysis (a) or by Structure Preserving MDS (b) . Clearly the cluster structure is only well preserved in case (b ) .
6. Discussion Exploratory sets
data
. Standard
methods
setting
algorithms
. The
value
. Extensions ) or
conceptually
discovery this
rich
be which
straightforward
with and
for
by
shown
to
to
data
been
clustering
clustering
as yields
selecting
an well
- all
in
a
well
as
robust
appropriate on
real
clustering
published
data
algorithms
perform
- take
in
data
deriving
hierarchical
K - winner
have
structure
methodology
precision
algorithms
approaches
for
both
been
hidden
encompasses
annealing in
have
clustering
goal
discussed
tuned
of
framework
deterministic
can
clustering
the
achieve
been
and of
at
. A
has
which
temperature
HB96b
to
visualization
visualization
data
aims
methods
and
probabilistic data
analysis
- world ( HB95
rules
( HB96a
elsewhere
.
;
) are
References
J . Buhmann
and
H . Kuhnel
tions on Information C . M . Bishop - , M . Svensen self - organizing T . F . Cox Statistics
and
map
. Vector
. Neural
M .A .A . Cox . and
Applied
quantization
with
complexity
Theory , 39 ( 4 ) : 1133 - 1145 , July ,. and C . K . I . Williams . GTM Computation
. IEEE
Transac
alternative
to
-
the
, ( in press ) , 1997 .
Multidimensional
Probability
costs
1993 . : A principled . .
. Chapman
Scaling &
Hall
.
Number , London
59
in
, 1994 .
Monographs
on
DATA
T
.
M
.
Cover
New R
. O
A
.
and
York
P .
.
, the
- field
pages
197
Thomas
- 202
, A
Cie
,
.
Theory
and
,
.
John
Wiley
&
Sons
,
Scene
Analysis
.
Wiley
, New
York
,
Joachim
M
1996
,
. . .
.
Maximum Soc
.
likelihood Se1
.
B
,
Hierarchical
ICANN
Buhmann
39
from
: 1 - 38
pairwise ' 95
.
An
Proceedings
Springer .
.
Rubin Statist
of
In .
.
,
A
,
incomplete
1977
.
data
clustering
NEURONIMES
' 95
,
by
volume
II
,
.
M
1996
B
Buhmann
1995
.
York
annealed
of
" neural
ICANN
' 96
,
gas
pages
"
151
network - 156
for
,
Berlin
,
.
Buhmann
In
.
Infering
Proceedings
Portland
,
1997
hierarchical
of
, Redwood
the
City
clustering
Knowledge ,
CA
structures
Discovery
, USA
,
1996
.
and
AAAI
Data
Press
.
( in
, and
Wesley
,
Hofmann
.
,
Jain
1997
and ,
Hansj6rg
N
R
J
Lecture
.
and
.
.
,
E .
, T
281
metrika
Jr
Computers
.
A ,
Theory
of
Neural
Deterministic
determinis
-
Intelligence
,
Computation
Annealing
Mathematisch
.
Framework
- Naturwissenschaftliche
Bonn
Clustering
Buhmann
and
.
, D - 53117
Data
.
Bonn
Prentice
,
Springer
Verlag
, Fed
Hall
scaling
editors
, ,
,
.
lEE
Proceedings
and
. Rep
.
Englewood
Symposium
on
, ,
of
determinis
-
EMMCVPR
' 97
,
.
Springer
analysis
by
Proceedings
1997
Memory .
classification
1967
E .
Multidimensional
. Hancock
Associative
Berkeley
,
.
.
.R
quantizations for
5th
K
E
Science
Berlin
136
,
: 405
1984
- 413
multivariate
. ,
1989
.
observations
Mathematical
Statistics
.
and
Pro
-
.
Basford
and
and
42
G
.
.
Fox
Letters
,
, and , New .
A ,
An ,
M
- 297
,
Wesley
:
the
- Universitat
for
and
the
. Martinetz
Takane scaling
by
Machine
Mixture
Models
.
Marcel
Dekker
,
INC
,
New
York
,
.
. Sammon
Yoshio
:
thesis
Algorithms
vector
Gurewitz
Addison
.
. Pellilo
Recognition
. Ritter
clustering and
.
Computer
and
Pattern
to
Beyond
PhD
- Wilhelms
methods
, pages
Rose
Dubes 1988
M
of
1988
data Analysis
Introduction
and
Joachim
Some
McLachlan ,
Pairwise
Pattern
.
.
- organization
Proceedings
Basel
.
1991
Analysis
Hierarchical
Macqueen
,
Clustering
In In
Self .
bability
. ,
.
Luttrell
.
on
Friedrich
C
Notes
. Kohonen
Buhmann
. Palmer
York
Data
07632
Klock
.
.
annealing
on
. G
, Rheinische
Cliffs
In
R
New
Data
Germany
M
Transactions
.
Exploratory
tic
Joachim
IEEE
. Krogh
Fakultat
J . W
&
.
Royal
M
and
( 1 ) : 1 - 14
for
H
D J .
Proceedings
In
annealing
19
Thomas
.
.
Joachim
New
.
Addison
K
and
quantization
Hofmann
J .
Information
) .
J . Hertz
.
,
.
EC2
and
annealing
.
of
419
VISUALIZATION
Classification
Joachim
and
tic
. K
Laird
Conference
Thomas
G
.
deterministic
press
S .P .
M
and
,
Mining
Elements
Pattern
algorithm
Hofmann by
J .
.
vector
Thomas
.
.
annealing
Heidelberg
T
N
Hofmann robust
. Hart
EM
Hofmann mean
Thomas
DATA
3 .2 .
Dempster
Thomas
.
AND
.
P . E
Sect
via
A
1991
and .
data
A
J .
,
. Duda 1973
CLUSTERING
Forest
W
, .
least ,
March
A
.
1992
1969
- 594 Neural
annealing ,
1990
approach
to
clustering
.
.
Computation
and
Self
- organizing
Maps
.
. for
data
structure
analysis
.
IEEE
Transactions
.
Young
.
squares 1977
deterministic
) : 589
mapping
- 409
alternating ( 1 ) : 7 - 67
,
- linear
: 401
. ( 11
. Schulten
York
non
18
K
11
.
Nonmetric method
individul with
differences optimal
scaling
multidimensional features
.
Psycho
-
LEARNING BAYESIAN NETWORKS WITH LOCAL STRUCTURE
NIR FRIEDMAN Computer ScienceDivision, 387 SodaHall, University of California , Berkeley, CA 94720. nir @cs.berkeley.edu AND MOISESGOLDSZMIDT SRI International , 333 RavenswoodAvenue, EK329, Menlo Park, CA 94025. [email protected] .com Abstract . We examine a novel addition to the known methods for learning Bayesian networks from data that improves the quality of the learned networks . Our approach explicitly represents and learns the local structure in the condi tional probability distributions (CPDs ) that quantify these networks . This increases the space of possible models, enabling the representation of CPDs with a variable number of parameters . The resulting learning procedure induces models that better emulate the interactions present in the data . We descri be the theoretical foundations and practical aspects of learning local structures and provide an empirical evaluation of the proposed learning procedure. This evaluation indicates that learning curves characterizing this procedure converge faster , in the number of training instances , than those of the standard procedure , which ignores the local structure of the CPDs . Our results also show that networks learned with local structures tend to be more complex (in terms of arcs) , yet require fewer parameters .
1. Introduction

Bayesian networks are graphical representations of probability distributions; they are arguably the representation of choice for uncertainty in artificial intelligence. These networks provide a compact and natural representation, effective inference, and efficient learning. They have been successfully applied in expert systems, diagnostic engines, and optimal decision making systems.

[Figure 1. A simple network structure and the associated CPD for variable S (showing the probability values for S = 1): Pr(S | A, B, E) = 0.95, 0.95, 0.20, 0.05, 0.00, 0.00, 0.00, 0.00.]

A Bayesian network consists of two components. The first is a directed acyclic graph (DAG) in which each vertex corresponds to a random variable. This graph describes conditional independence properties of the represented distribution. It captures the structure of the probability distribution, and is exploited for efficient inference and decision making. Thus, while Bayesian networks can represent arbitrary probability distributions, they provide computational advantage to those distributions that can be represented with a sparse DAG. The second component is a collection of conditional probability distributions (CPDs) that describe the conditional probability of each variable given its parents in the graph. Together, these two components represent a unique probability distribution (Pearl, 1988).

In recent years there has been growing interest in learning Bayesian networks from data; see, for example, Cooper and Herskovits (1992); Buntine (1991b); Heckerman (1995); and Lam and Bacchus (1994). Most of this research has focused on learning the global structure of the network, that is, the edges of the DAG. Once this structure is fixed, the parameters in the CPDs quantifying the network are learned by estimating a locally exponential number of parameters from the data. In this article we introduce methods and algorithms for learning local structures to represent the CPDs as a part of the process of learning the network. Using these structures, we can model various degrees of complexity in the CPD representations. As we will show, this approach considerably improves the quality of the learned networks.

In its most naive form, a CPD is encoded by means of a tabular representation that is locally exponential in the number of parents of a variable X: for each possible assignment of values to the parents of X, we need to specify a distribution over the values X can take. For example, consider the simple network in Figure 1, where the variables A, B, E and S correspond to the events "alarm armed," "burglary," "earthquake," and "loud alarm sound," respectively. Assuming that all variables are binary, a tabular representation of the CPD for S requires eight parameters, one for each possible state of the parents. One possible quantification of this CPD is given in Figure 1. Note, however, that when the alarm is not armed (i.e., when A = 0) the probability of S = 1 is zero, regardless of the values of B and E. Thus, the interaction between S and its parents is simpler than the eight-way situation that is assumed in the tabular representation of the CPD.

The locally exponential size of the tabular representation of the CPDs is a major problem in learning Bayesian networks. As a general rule, learning many parameters is a liability, since a large number of parameters requires a large training set to be assessed reliably. Thus learning procedures generally encode a bias against structures that involve many parameters. For example, given a training set with instances sampled from the network in Figure 1, the learning procedure might choose a simpler network structure than that of the original network. When the tabular representation is used, the CPD for S requires eight parameters. However, a network with only two parents for S, say A and B, would require only four parameters. Thus, for a small training set, such a network may be preferred, even though it ignores the effect of E on S. This example illustrates that by taking into account the number of parameters, the learning procedure may penalize a large CPD, even if the interactions between the variable and its parents are relatively benign.

Our strategy is to address this problem by explicitly representing the local structure of the CPDs. This representation often requires fewer parameters to encode CPDs. This enables the learning procedure to weight each CPD according to the number of parameters it actually requires to capture the interaction between a variable and its parents, rather than the maximal number required by the tabular representation. In other words, this explicit representation of local structure in the network's CPDs allows us to adjust the penalty incurred by the network to reflect the real complexity of the interactions described by the network.

There are different types of local structures for CPDs; a prominent example is the noisy-or gate and its generalizations (Heckerman and Breese, 1994; Pearl, 1988; Srinivas, 1993). In this article, we focus on learning local structures that are motivated by properties of context-specific independence (CSI) (Boutilier et al., 1996). These independence statements imply that in some contexts, defined by an assignment to variables in the network, the conditional probability of variable X is independent of some of its parents. For example, in the network of Figure 1, when the alarm is not set
(i.e., the context defined by A = 0), the conditional probability does not depend on the value of B and E; P(S | A = 0, B = b, E = e) is the same for all values b and e of B and E.

[Figure 2. Two representations of a local CPD structure: (a) a default table, and (b) a decision tree.]

As we can see, CSI properties induce equality constraints among the conditional probabilities in the CPDs. In this article, we concentrate on two different representations for capturing the local structure that follows from such equality constraints. These representations, shown in Figure 2, in general require fewer parameters than a tabular representation. Figure 2(a) describes a default table, which is similar to the usual tabular representation, except that it does not list all of the possible values of S's parents. Instead, the table provides a default probability assignment to all the values of the parents that are not explicitly listed. In this example, the default table requires five parameters instead of the eight parameters required by the tabular representation. Figure 2(b) describes another possible representation based on decision trees (Quinlan and Rivest, 1989). Each leaf in the decision tree describes a probability for S, and the internal nodes and arcs encode the necessary information to decide how to choose among leaves, based on the values of S's parents. For example, in the tree of Figure 2(b) the probability of S = 1 is 0 when A = 0, regardless of the state of B and E; and the probability of S = 1 is 0.95 when A = 1 and B = 1, regardless of the state of E. In this example, the decision tree requires four parameters instead of eight.

Our main hypothesis is that incorporating local structure representations into the learning procedure leads to two important improvements in the quality of the induced models. First, the induced parameters are more reliable. Since these representations usually require fewer parameters, the frequency estimation for each parameter takes, on average, a larger number of samples into account and thus is more robust. Second, the global
structure of the induced network is a better approximation to the real (in)dependencies in the underlying distribution. The use of local structure enables the learning procedure to explore networks that would have incurred an exponential penalty (in terms of the number of parameters required) and thus would not have been taken into consideration. We cannot stress enough the importance of this last point. Finding better estimates of the parameters for a global structure that makes unrealistic independence assumptions will not overcome the deficiencies of the model. Thus, it is crucial to obtain a good approximation of the global structure. The experiments described in Section 5 confirm our main hypothesis. Moreover, the results in that section show that the use of local representations for the CPDs significantly affects the learning process itself: the learning procedures require fewer data samples in order to induce a network that better approximates the target distribution.
The main contributions of this article are: the derivation of the scoring functions and algorithms for learning the local representations; the formulation of the hypothesis introduced above, which uncovers the benefits of having an explicit local representation for CPDs; and the empirical investigation that validates this hypothesis.

CPDs with local structure have often been used and exploited in tasks of knowledge acquisition from experts; as we already mentioned above, the noisy-or gate and its generalizations are well known examples (Heckerman and Breese, 1994; Pearl, 1988; Srinivas, 1993). In the context of learning, several authors have noted that CPDs can be represented via logistic regression, noisy-or, and neural networks (Buntine, 1991b; Diez, 1993; Musick, 1994; Neal, 1992; Spiegelhalter and Lauritzen, 1990). With the exception of Buntine, these authors have focused on the case where the network structure is fixed in advance, and motivate the use of local structure for learning reliable parameters. The method proposed by Buntine (1991b) is not limited to the case of a fixed structure; he also points to the use of decision trees for representing CPDs. Yet, in that paper, he does not provide empirical or theoretical evidence for the benefits of using local structured representations with regards to a more accurate induction of the global structure of the network. To the best of our knowledge, the benefits that relate to that, as well as to the convergence speed of the learning procedure (in terms of the number of training instances), have been unknown in the literature prior to our work.
The remainder of this article is organized as follows: In Section 2 we review the definition of Bayesian networks, and the scores used for learning these networks. In Section 3 we describe the two forms of local structured CPDs we consider in this article. In Section 4 we formally derive the score for learning networks with CPDs represented as default tables and decision trees, and describe the procedures for learning these structures. In Section 5 we describe the experimental results. We present our conclusions in Section 6.
2. Learning Bayesian Networks

Consider a finite set U = {X_1, ..., X_n} of discrete random variables, where each variable X_i may take on values from a finite domain. We use capital letters, such as X, Y, Z, for variable names, and lowercase letters, such as x, y, z, to denote specific values taken by those variables. The set of values a variable X can attain is denoted Val(X), and the cardinality of this set is denoted ||X|| = |Val(X)|. Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z (we use Val(X) and ||X|| in the obvious way). Let P be a joint probability distribution over the variables in U, and let X, Y, Z be subsets of U. X and Y are conditionally independent given Z if, for all x ∈ Val(X), y ∈ Val(Y), and z ∈ Val(Z), we have P(x | z, y) = P(x | z) whenever P(y, z) > 0.

A Bayesian network is an annotated DAG that encodes a joint probability distribution over a set of random variables. Formally, a Bayesian network for U is a pair B = (G, Θ). The first component, G, is a DAG whose nodes correspond to the random variables X_1, ..., X_n, and whose edges represent direct dependencies between the variables. The graph structure G encodes the following set of independence statements: each variable X_i is independent of its nondescendants, given its parents in G (Pearl, 1988). The second component, Θ, represents the set of parameters that quantifies the network. A Bayesian network B encodes a joint probability distribution over U that can be factored as

P(X_1, ..., X_n) = ∏_i P(X_i | Pa_i),    (1)

where Pa_i denotes the set of parents of X_i in G; the set consisting of X_i and its parents is usually referred to as the family of X_i. Note that the right-hand side of Equation 1 is composed of exactly one conditional probability of the form P(X_i | Pa_i) for each variable X_i. It follows immediately that to completely specify the distribution, we need only provide the conditional probabilities appearing in Equation 1, namely the CPDs P(X_i | Pa_i). When we deal with discrete variables, as we do in this article, these CPDs are usually represented as tables, such as the one shown in Figure 1. These tables contain a parameter θ_{x_i|pa_i} for each possible value x_i ∈ Val(X_i) and pa_i ∈ Val(Pa_i).
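As a concrete illustration of the factorization in Equation 1, the following minimal Python sketch (the dictionary layout, variable names, and numerical values are illustrative only, not taken from the article) stores tabular CPDs keyed by parent assignments and evaluates the joint probability of a complete instance by multiplying the local conditional probabilities.

```python
# Minimal sketch of Equation 1: P(X1,...,Xn) = prod_i P(Xi | Pa_i).
# CPDs are plain tables: cpds[var] maps a tuple of parent values to a
# distribution over the variable's own values.

parents = {
    "A": (), "B": (), "E": (),
    "S": ("A", "B", "E"),
}

cpds = {
    "A": {(): {0: 0.1, 1: 0.9}},
    "B": {(): {0: 0.99, 1: 0.01}},
    "E": {(): {0: 0.999, 1: 0.001}},
    # P(S | A, B, E): zero probability of S = 1 whenever the alarm is not armed (A = 0).
    "S": {
        (1, 1, 1): {0: 0.05, 1: 0.95}, (1, 1, 0): {0: 0.05, 1: 0.95},
        (1, 0, 1): {0: 0.80, 1: 0.20}, (1, 0, 0): {0: 0.95, 1: 0.05},
        (0, 1, 1): {0: 1.0, 1: 0.0},   (0, 1, 0): {0: 1.0, 1: 0.0},
        (0, 0, 1): {0: 1.0, 1: 0.0},   (0, 0, 0): {0: 1.0, 1: 0.0},
    },
}

def joint_probability(instance):
    """Multiply the local conditional probabilities (Equation 1)."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(instance[y] for y in pa)
        p *= cpds[var][pa_values][instance[var]]
    return p

print(joint_probability({"A": 1, "B": 0, "E": 0, "S": 1}))  # 0.9 * 0.99 * 0.999 * 0.05
```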
The problem of learning a Bayesian network can be stated as follows: Given a training set D = {u_1, ..., u_N} of instances of U, find a network B that best matches D.¹ To formalize the notion of the degree to which a candidate network fits the data, we introduce a scoring function; the learning problem then becomes an optimization problem, namely, to find the network that has the best value of the scoring function with respect to D. Several different scoring functions and search techniques to solve this optimization problem have been proposed in the literature (e.g., Buntine, 1991b; Heckerman et al., 1995a; Lam and Bacchus, 1994); most of them rely on heuristic search. In this article we focus our attention on two scoring functions that are frequently used in the literature: the MDL score and the BDe score, which we now describe in detail.

[¹ Throughout this article we will assume that the training data is complete, i.e., that each u_i assigns values to all variables in U. Existing solutions to the problem of missing values apply to the approaches we discuss below; see Heckerman (1995).]

2.1. THE MDL SCORE

The MDL score is motivated by the Minimal Description Length principle (Rissanen, 1989) and its use in universal coding. Suppose that we are given a set D of instances, which we would like to store. Naturally, we would like to conserve space and save a compressed version of D. One way of compressing the data is to find a suitable model for D that the encoder can use to produce a compact version of D. Moreover, to be able to recover D, we must also store the model used by the encoder to compress it. The total description length is the sum of the length of the compressed version of the data and the length of the description of the model. The MDL principle dictates that the optimal model is the one that minimizes this total description length.

In the context of learning Bayesian networks, the model is a network B. Such a network describes a probability distribution P_B over the instances appearing in the data. Using this distribution, we can build an encoding scheme (e.g., a Huffman code or Shannon coding; see Cover and Thomas (1991)) that assigns shorter code words to more probable instances. According to the MDL principle, we should choose the network B such that the combined length of the network description and of the data encoded with respect to P_B is minimized. This implies that the learning procedure balances the complexity of the induced network against the degree of accuracy with which the network represents the frequencies in D.

The MDL score of a candidate network is defined as the total description length: the length required to store the network itself, and the length of the data encoded using that network.
To describe a network B = (G, Θ), we need to describe U, G, and Θ.

To describe U, we store the number of variables, n, and the cardinality of each variable X_i. Since U is the same for all candidate networks, we can ignore the description length of U in the comparisons between networks.

To describe the DAG G, it is sufficient to store for each variable X_i a description of Pa_i (namely, its parents in G). This description consists of the number of parents, k, followed by the index of the set Pa_i in some (agreed upon) enumeration of all (n choose k) sets of this cardinality. Since we can encode the number k using log n bits, and we can encode the index using log (n choose k) bits, the description length of the graph structure is²

DL_graph(G) = Σ_i ( log n + log (n choose |Pa_i|) ).

[² Since description lengths are measured in terms of bits, we use logarithms of base 2 throughout this article.]

To describe the CPDs, we must store the parameters in each conditional probability table. For the table associated with X_i, we need to store ||Pa_i|| (||X_i|| - 1) parameters. The encoding length depends on the number of bits we use for each numeric parameter; the usual choice in the literature is 1/2 log N bits per parameter (see Friedman and Yakhini (1996) for a thorough discussion of this point). Thus, the encoding length of X_i's CPD is

DL_tab(X_i, Pa_i) = 1/2 ||Pa_i|| (||X_i|| - 1) log N.

To encode the training data, we use the probability measure P_B defined by the network B to construct a code. The optimal encoding length for each instance u is approximated by -log P_B(u) (there is no known closed-form description of the exact optimal code; it can, however, be approximated, for example, by a Huffman code defined using P_B; see Cover and Thomas (1991)). Thus, the description length of the data is

DL_data(D | B) = - Σ_{i=1}^{N} log P_B(u_i).
We can rewrite this expression in a more convenient form. We start by introducing some notation. Let P̂_D be the empirical probability measure induced by the data set D. More precisely, we define
P̂_D(A) = (1/N) Σ_{i=1}^{N} 1_A(u_i),  where 1_A(u) = 1 if u ∈ A and 0 if u ∉ A,

for all events of interest, i.e., A ⊆ Val(U). Let N_D(x) be the number of instances in D where X = x (from now on, we omit the subscript from P̂_D, and the superscript and the subscript from N_D, whenever they are clear from the context). Clearly, N(x) = N · P̂(x). We use Equation 1 to rewrite the representation length of the data as

DL_data(D | B) = - Σ_{j=1}^{N} log P_B(u_j)
               = - N Σ_u P̂(u) log ∏_i P(x_i | pa_i)
               = - Σ_i Σ_{x_i, pa_i} N(x_i, pa_i) log P(x_i | pa_i).    (2)

Thus, the encoding of the data can be decomposed as a sum of terms that are "local" to each CPD: these terms depend only on the counts N(x_i, pa_i). Standard arguments show the following.

Proposition 2.1: If P(X_i | Pa_i) is represented as a table, then the parameter values that minimize DL_data(D | B) are θ_{x_i|pa_i} = P̂(x_i | pa_i).

Thus, given a fixed network structure G, learning the parameters that minimize the description length is straightforward: we simply compute the appropriate long-run fractions from the data.

Assuming that we assign parameters in the manner prescribed by this proposition, we can rewrite DL_data(D | B) in a more convenient way in terms of conditional entropy: N Σ_i H_P̂(X_i | Pa_i), where H(X | Y) = - Σ_{x,y} P̂(x, y) log P̂(x | y) is the conditional entropy of X, given Y. This formula provides an information-theoretic interpretation to the representation of the data: it measures how many bits are necessary to encode the value of X_i, once we know the value of Pa_i.

Finally, the MDL score of a candidate network structure G, assuming that we choose parameters as prescribed above, is defined as the total description length

DL(G, D) = DL_graph(G) + Σ_i DL_tab(X_i, Pa_i) + N Σ_i H_P̂(X_i | Pa_i).    (3)
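As a worked illustration of Equation 3, the sketch below (our own code, not the authors'; it assumes complete discrete data given as a list of dictionaries) computes the three components of the description length, namely DL_graph, DL_tab, and the N · H(X_i | Pa_i) term, directly from empirical counts.

```python
import math
from collections import Counter

def mdl_score(data, parents, domains):
    """Total description length of Equation 3 (smaller is better)."""
    n, N = len(parents), len(data)
    total = 0.0
    for x, pa in parents.items():
        # DL_graph: log n bits for |Pa_i| plus the index of the chosen parent set.
        total += math.log2(n) + math.log2(math.comb(n, len(pa)))
        # DL_tab: 1/2 * ||Pa_i|| * (||X_i|| - 1) * log N bits for the parameters.
        pa_card = math.prod(len(domains[y]) for y in pa)
        total += 0.5 * pa_card * (len(domains[x]) - 1) * math.log2(max(N, 2))
        # N * H(X_i | Pa_i), estimated from the empirical counts N(x_i, pa_i).
        joint = Counter((tuple(u[y] for y in pa), u[x]) for u in data)
        marg = Counter(tuple(u[y] for y in pa) for u in data)
        for (pa_val, _), c in joint.items():
            total += -c * math.log2(c / marg[pa_val])
    return total

# Toy usage: two binary variables with B depending on A.
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
print(mdl_score(data, {"A": (), "B": ("A",)}, {"A": [0, 1], "B": [0, 1]}))
```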
According to the MDL principle, we should strive to find the network structure that minimizes this description length. In practice, this is usually done by searching over the space of possible networks.

2.2. THE BDE SCORE
Scores for learning Bayesian networks can also be derived from methods of Bayesian statistics. A prime example of such scores is the BDe score, proposed by Heckerman et al. (1995a). This score is based on earlier work by Cooper and Herskovits (1992) and Buntine (1991b). The BDe score is (proportional to) the posterior probability of each network structure, given the data. Learning amounts to searching for the network(s) that maximize this probability.

Let G^h denote the hypothesis that the underlying distribution satisfies the independencies encoded in G (see Heckerman et al. (1995a) for a more elaborate discussion of this hypothesis). For a given structure G, let Θ_G represent the vector of parameters for the CPDs quantifying G. The posterior probability we are interested in is Pr(G^h | D). Using Bayes' rule we write this term as

Pr(G^h | D) = α Pr(D | G^h) Pr(G^h),    (4)

where α is a normalization constant that does not depend on the choice of G. The term Pr(G^h) is the prior probability of the network structure, and the term Pr(D | G^h) is the probability of the data, given that the network structure is G.

There are several ways of choosing a prior over network structures. Heckerman et al. suggest choosing a prior Pr(G^h) ∝ κ^{Δ(G,G')}, where Δ(G, G') is the difference in edges between G and a prior network structure G', and 0 < κ < 1 is a penalty for each such edge. In this article, we use a prior based on the MDL encoding of G. We let Pr(G^h) ∝ 2^{-DL_graph(G)}.

To evaluate Pr(D | G^h) we must consider all possible parameter assignments to G. Thus,

Pr(D | G^h) = ∫ Pr(D | Θ_G, G^h) Pr(Θ_G | G^h) dΘ_G,    (5)

where Pr(D | Θ_G, G^h) is defined by Equation 1, and Pr(Θ_G | G^h) is the prior density over parameter assignments to G. Heckerman et al. (following Cooper and Herskovits (1992)) identify a set of assumptions that justify decomposing this integral. Roughly speaking, they assume that each distribution P(X_i | pa_i) can be learned independently of all other distributions. Using this assumption, they rewrite Pr(D | G^h) as

Pr(D | G^h) = ∏_i ∏_{pa_i} ∫ ∏_{x_i} θ_{x_i|pa_i}^{N(x_i, pa_i)} Pr(Θ_{X_i|pa_i} | G^h) dΘ_{X_i|pa_i}.    (6)

(This decomposition is analogous to the decomposition in Equation 2.) When the prior on each multinomial distribution Θ_{X_i|pa_i} is a Dirichlet prior, the integrals in Equation 6 have a closed-form solution (Heckerman, 1995).

We briefly review the properties of Dirichlet priors. For a more detailed description, we refer the reader to DeGroot (1970). A Dirichlet prior for a multinomial distribution of a variable X is specified by a set of hyperparameters {N'_x : x ∈ Val(X)}. We say that Pr(Θ_X) ~ Dirichlet({N'_x : x ∈ Val(X)}) if

Pr(Θ_X) = α ∏_x θ_x^{N'_x - 1},

where α is a normalization constant. If the prior is a Dirichlet prior, the probability of observing a sequence of values of X with counts N(x) is

∫ ∏_x θ_x^{N(x)} Pr(Θ_X | G^h) dΘ_X = Γ(Σ_x N'_x) / Γ(Σ_x (N'_x + N(x))) · ∏_x Γ(N'_x + N(x)) / Γ(N'_x),

where Γ(x) = ∫_0^∞ t^{x-1} e^{-t} dt is the Gamma function, which satisfies the properties Γ(1) = 1 and Γ(x + 1) = xΓ(x).

Returning to the BDe score, if we assign to each Θ_{X_i|pa_i} a Dirichlet prior with hyperparameters N'_{x_i|pa_i}, then

Pr(D | G^h) = ∏_i ∏_{pa_i} [ Γ(Σ_{x_i} N'_{x_i|pa_i}) / Γ(Σ_{x_i} N'_{x_i|pa_i} + N(pa_i)) · ∏_{x_i} Γ(N'_{x_i|pa_i} + N(x_i, pa_i)) / Γ(N'_{x_i|pa_i}) ].    (7)

There still remains a problem with the direct application of this method. For each possible network structure we would have to assign priors on the parameter values. This is clearly infeasible, since the number of possible structures is extremely large. Heckerman et al. propose a set of assumptions that justify a method by which, given a prior network B^P and an equivalent sample size N', we can assign prior probabilities to parameters in every possible network structure. The prior assigned to Θ_{X_i|pa_i} in a structure G is computed from the prior distribution represented in B^P. In this method, we assign N'_{x_i|pa_i} = N' · P_{B^P}(x_i, pa_i). (Note that Pa_i are the parents of
X_i in G, but the probability is computed from the prior distribution represented by B^P, where the parents of X_i might be a completely different set of variables.) According to this methodology, the hyperparameter N'_{x_i|pa_i} corresponds to the expected number of occurrences of the event X_i = x_i, Pa_i = pa_i in N' instances sampled from the prior network B^P. Thus, the equivalent sample size N' expresses the magnitude of our confidence in the prior network.

Once again using the Dirichlet priors and the independence assumptions stated above, we can compute the probability that the candidate network assigns to a new instance u, given the data. This predictive distribution decomposes as Pr(u | D, G^h) = ∏_i Pr(x_i | pa_i, D, G^h), where

Pr(x_i | pa_i, D, G^h) = ∫ θ_{x_i|pa_i} Pr(Θ_{X_i|pa_i} | D, G^h) dΘ_{X_i|pa_i} = (N'_{x_i|pa_i} + N(x_i, pa_i)) / (Σ_{x_i} N'_{x_i|pa_i} + N(pa_i)).

Similarly, predictions about sets of observable variables average over all possible parameter settings, weighted by their posterior probability.

Finally, we consider the relation between the BDe score and the MDL score. Using an asymptotic analysis of the integrals in Equation 7, similar to the one done by Schwarz (1978), one can show that if the priors satisfy some regularity constraints, then for large N

log Pr(D | G^h) ≈ log Pr(D | Θ̂_G, G^h) - (d/2) log N,    (8)

where Θ̂_G is the maximum-likelihood setting of the parameters (which, as in Proposition 2.1, can be computed in closed form), and d is the number of free parameters, or dimension, of the structure G. Bouckaert (1994) shows a similar result using this methodology. Note that the right-hand side of Equation 8 is essentially the negative of the MDL score of Equation 3, when we ignore the prior term Pr(G^h) (which does not depend on N). Thus, a score that attempts to maximize the logarithm of Equation 7 and one that attempts to minimize the description length of Equation 3 are asymptotically equivalent.
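Before turning to local structure, it is worth noting that the closed form in Equation 7 is a product of Gamma-function ratios, which is best evaluated in log space. The hypothetical helper below (our sketch, not the authors' code) scores a single family using scipy.special.gammaln, taking the Dirichlet hyperparameters N'_{x_i|pa_i} and the data counts N(x_i, pa_i) as inputs.

```python
from scipy.special import gammaln

def log_bde_family(hyper, counts):
    """log of the Equation 7 term for a single family X_i with parents Pa_i.

    hyper[pa][x]  -- Dirichlet hyperparameter N'_{x|pa}
    counts[pa][x] -- data count N(x, pa); both dicts are keyed by parent tuples.
    """
    score = 0.0
    for pa, prior in hyper.items():
        obs = counts.get(pa, {})
        n_prior = sum(prior.values())
        n_obs = sum(obs.values())
        score += gammaln(n_prior) - gammaln(n_prior + n_obs)
        for x, a in prior.items():
            score += gammaln(a + obs.get(x, 0)) - gammaln(a)
    return score

# Toy usage: binary X with one binary parent and equivalent sample size 2.
hyper = {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.5, 1: 0.5}}
counts = {(0,): {0: 3, 1: 1}, (1,): {0: 0, 1: 4}}
print(log_bde_family(hyper, counts))
```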
3. Local Structure

In the discussion above, we have assumed the standard tabular representation of the CPDs quantifying the networks. This representation requires that for each variable X_i we encode a locally exponential number, ||Pa_i|| (||X_i|| - 1), of parameters. In practice, however, the interaction between X_i and its parents Pa_i can be more benign, and some regularities can be exploited to represent the same information with fewer parameters. In the example of Figure 1, the CPD for S can be encoded with four parameters, by means of the decision tree of Figure 2(b), in contrast to the eight parameters required by the tabular representation.

A formal foundation for representing and reasoning with such regularities is provided in the notion of context-specific independence (CSI) (Boutilier et al., 1996). Formally, we say that X and Y are contextually independent, given Z and the context c ∈ Val(C), if

P(X | Z, c, Y) = P(X | Z, c)  whenever P(Y, Z, c) > 0.    (9)

CSI statements are more specific than the conditional independence statements captured by the Bayesian network structure. CSI implies the independence of X and Y, given a specific value of the context variable(s), while conditional independence applies for all value assignments to the conditioning variable. As shown by Boutilier et al. (1996), the representation of CSI leads to several benefits in knowledge elicitation, compact representation, and computational efficiency. As we show here, CSI is also beneficial to learning, since models can be quantified with fewer parameters.

As we can see from Equation 9, CSI statements force equivalence relations between certain conditional probabilities. If X_i is contextually independent of Y given Z and c ∈ Val(C), then P(X_i | Z, c, y) = P(X_i | Z, c, y') for y, y' ∈ Val(Y). Thus, if the parent set Pa_i of X_i is equal to Y ∪ Z ∪ C, such CSI statements will induce equality constraints among the conditional probabilities of X_i given its parents.

This observation suggests an alternative way of thinking of local structure in terms of the partitions it induces on the possible values of the parents of each variable X_i. We note that while CSI properties imply such partitions, not all partitions can be characterized by CSI properties. These partitions impose a structure over the CPDs for each X_i. In this article we are interested in representations that explicitly capture this structure and thereby reduce the number of parameters to be estimated by the learning procedure. We focus on two representations that are relatively straightforward to learn. The first one, called default tables, represents a set of singleton partitions with one additional partition that can contain several values of Pa_i. Thus, the savings it will introduce depends on how many values in
Val(Pa_i) can be grouped together. The second representation is based on decision trees and can represent more complex partitions. Consequently it can reduce the number of parameters even further. Yet the induction algorithm for decision trees is somewhat more complex than that of default tables.

We introduce a notation that simplifies the presentation below. Let L be a representation for the CPD for X_i. We capture the partition structure represented by L using a characteristic random variable Υ_L. This random variable maps each value of Pa_i to the partition that contains it. Formally, Υ_L(pa_i) = Υ_L(pa_i') for two values pa_i and pa_i' of Pa_i, if and only if these two values are in the same partition in L. It is easy to see that, from the definition of Υ_L, we get P(X_i | Υ_L) = P(X_i | Pa_i), since if pa_i and pa_i' are in the same partition, it must be that P(X_i | pa_i) = P(X_i | pa_i'). This means that we can describe the parameterization of the structure L in terms of the characteristic random variable Υ_L as follows: Θ_L = { θ_{x_i|v} : x_i ∈ Val(X_i), v ∈ Val(Υ_L) }.

As an example, consider the tabular CPD representation, in which no CSI properties are taken into consideration. This implies that the corresponding partitions contain exactly one value of Pa_i. Thus, in this case, Val(Υ_L) is isomorphic to Val(Pa_i). CPD representations that specify CSI relations will have fewer partitions, and thus will require fewer parameters. In the sections below we formally describe default tables and decision trees, and the partition structures they represent.

3.1. DEFAULT TABLES
A default table is similar to a standard tabular representation of a CPD, except that only a subset of the possible values of the parents of a variable are explicitly represented as rows in the table. The values of the parents that are not explicitly represented as individual rows are mapped to a special row called the default row. The underlying idea is that the probability of a node X is the same for all the values of the parents that are mapped to the default row; therefore, there is no need to represent these values separately in several entries. Consequently, the number of parameters explicitly represented in a default table can be smaller than the number of parameters in a tabular representation of a CPD. In the example shown in Figure 2(a), all the values of the parents of S where A = 0 (the alarm is not armed) are mapped to the default row in the table, since the probability of S = 1 is the same in all of these situations, regardless of the values of B and E.

Formally, a default table is an object V = (S_V, Θ_V). S_V describes the structure of the table, namely, which rows are represented explicitly, and which are represented via the default row. We define Rows(V) ⊆ Val(Pa_i)
to be the set of values of Pa_i that are explicitly represented as rows in the default table; all the values that are not in Rows(V) are represented by the default row. The structure S_V defines the following partition of Val(Pa_i): if pa_i ∈ Rows(V), then the partition that contains pa_i is the singleton {pa_i}; if pa_i ∉ Rows(V), then the partition that contains pa_i is Val(Pa_i) - Rows(V), the set of values corresponding to the default row. Thus, the characteristic random variable Υ_V induced by a default table is defined by these two cases. Finally, the parameterization Θ_V contains a parameter θ_{x_i|v} for each explicitly represented row and one for the default row; to determine P(x_i | pa_i), we need only know whether pa_i corresponds to an explicit row, and if so to which one, and then retrieve the appropriate parameter (e.g., as in Figure 2(a)).

3.2. DECISION TREES

A decision tree represents the partitions of Val(Pa_i) by means of a tree structure. Each internal node of the tree is a composite node that is annotated with a parent variable Y ∈ Pa_i and has outgoing edges, one for each value y ∈ Val(Y), each leading to a subtree. Each leaf of the tree is annotated with a distribution over the values X_i can take. To determine the conditional probability P(X_i | pa_i), we traverse the tree, starting at the root: at each composite node we test the variable that annotates that node and follow the outgoing edge annotated with the value that pa_i assigns to it, until we reach a leaf. The distribution annotating that leaf is P(X_i | pa_i).

For example, consider the tree shown in Figure 2(b), which represents the CPD of S. The root of this tree tests the variable A. To retrieve the probability of S given the assignment A = 1, B = 0, E = 1, we start at the root and follow the edge annotated with A = 1; we then reach a node that tests B and follow the edge annotated with B = 0; finally, we test E and follow the edge annotated with E = 1 until we reach a leaf, which is annotated with the value 0.20. If, instead, the assignment is consistent with A = 0, we immediately reach a leaf annotated with the value 0, since once we know that A = 0 there is no need to test the values of B and E.

Formally, a decision tree is an object T = (S_T, Θ_T). The structure S_T is defined recursively: it is either a leaf, denoted by a special symbol ℓ, or a composite tree of the form (Y, {ST_y : y ∈ Val(Y)}), where Y is the variable in Pa_i that annotates the root, and each ST_y is the structure of the subtree reached by the outgoing edge annotated with the value y. We denote by Label(T) the variable that annotates the root of T, and by SubTree(T, y) the subtree at the edge annotated with y. A value pa_i ∈ Val(Pa_i) is consistent with a path in the tree if, at every composite node along the path, the path follows the edge annotated with the value that pa_i assigns to the variable tested at that node. It is easy to verify that each pa_i is consistent with exactly one path from the root to a leaf.

The partitions induced by a decision tree correspond to the set of paths in the tree, where the partition that corresponds to a particular path p consists of all the value assignments that p is consistent with. Again, we define the set of parameters, Θ_T, to contain parameters θ_{x_i|v} for each x_i ∈ Val(X_i) and v ∈ Val(Υ_T). That is, we associate with each (realizable) path in the tree a distribution over X_i. To determine P(X_i | pa_i) from this representation, we simply choose θ_{x_i|v}, where v = {pa_i' : pa_i' is consistent with p}, and p is the (unique) path that is consistent with pa_i.
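To illustrate the two local representations and the partitions they induce, here is a small Python sketch (an illustration of ours; the class names are not from the article) in which both a default table and a decision tree map a full parent assignment to the partition, and hence the parameter, that Section 3 associates with it.

```python
# A default table: explicit rows plus a shared default row.
class DefaultTable:
    def __init__(self, rows, default):
        self.rows = rows          # dict: parent tuple -> P(X = 1 | parents), for binary X
        self.default = default    # parameter shared by all remaining parent tuples

    def lookup(self, pa):
        return self.rows.get(pa, self.default)

# A decision tree node: either a leaf (a parameter) or a test with children.
class TreeNode:
    def __init__(self, leaf=None, test=None, children=None):
        self.leaf, self.test, self.children = leaf, test, children

    def lookup(self, pa):
        node = self
        while node.leaf is None:            # follow edges until a leaf is reached
            node = node.children[pa[node.test]]
        return node.leaf

# The CPD of S from Figure 2, with parents ordered (A, B, E).
default_cpd = DefaultTable(
    rows={(1, 1, 1): 0.95, (1, 1, 0): 0.95, (1, 0, 1): 0.20, (1, 0, 0): 0.05},
    default=0.00)

tree_cpd = TreeNode(test=0, children={              # test A
    0: TreeNode(leaf=0.00),
    1: TreeNode(test=1, children={                  # test B
        1: TreeNode(leaf=0.95),
        0: TreeNode(test=2, children={              # test E
            1: TreeNode(leaf=0.20), 0: TreeNode(leaf=0.05)})})})

for pa in [(0, 1, 1), (1, 0, 1)]:
    print(pa, default_cpd.lookup(pa), tree_cpd.lookup(pa))
```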
4. Learning Local Structure

We start this section by deriving the scoring functions for learning Bayesian networks with local structured representations of the CPDs: we derive a generalized MDL score, and then describe the changes necessary for the BDe score. We then describe the procedures that search for high-scoring default tables and decision trees. (See Boutilier et al. (1996) for a discussion of such structured representations of CPDs.)

4.1. SCORING FUNCTIONS

We introduce some notation that allows us to produce both generalizations easily. Let L_i denote a local representation of the CPD P(X_i | Pa_i) for the parent set Pa_i in G; L_i can be a (complete) table, a default table, or a decision tree. Any such representation L consists of two components: the structure S_L, which represents a partition of Val(Pa_i) (as encoded by the characteristic random variable Υ_L), and the parameterization Θ_L = {θ_{x_i|v} : x_i ∈ Val(X_i), v ∈ Val(Υ_L)}. A network is now described by the DAG G and the local representations L_1, ..., L_n of the CPDs, where P(X_i | pa_i) = θ_{x_i|Υ_{L_i}(pa_i)}.

4.1.1. MDL Score for Local Structure
The derivation of the MDL score remains the same as in Section 2.1, except for the changes that occur in the encoding of the CPDs; in particular, the encoding of the DAG G is unchanged. Additionally, the encoding of the parameters and of the data now depends on the choice of local representation. We now have to describe, for each variable X_i, the local structure S_{L_i} and the parameters Θ_{L_i}.
First, we describe the encoding of S_L for both default table and tree representations. When L is a default table V, we need to describe the set of rows that are represented explicitly in the table, that is, Rows(V). We start by encoding the number k = |Rows(V)|; then we describe Rows(V) by encoding its index in some (agreed upon) enumeration of all (||Pa_i|| choose k) sets of this cardinality. Thus, the description length of the structure V is

DL_local-struct(V) = log ||Pa_i|| + log (||Pa_i|| choose k).
L
the
all
the
be
any
in
'
tree
In
the
the
node
In
,
from
the
if
the
(
k
tree
)
.
,
since
along
are
The
the
the
(
T
,
k
)
=
node
in
by
the
+
log
(
k
)
+
Ei
DLr
(
1i
,
k
1
r
if
is
this
Next
formula
,
we
,
we
encode
define
the
(
DLlocal
11Xill
-
-
1
)
struct
liT
(
11
T
)
=
tested
then
we
leaf
a
(
parameters
T
,
tree
for
'
,
IPail
L
:
composite
subtrees
DLr
need
description
formula
a
is
with
Using
,
test
variable
total
T
a
been
recurring
)
,
can
of
each
tree
following
-
variable
not
The
leaf
depends
test
the
.
is
a
choice
have
variable
leaf
description
test
the
we
if
1
the
that
and
composite
variable
,
1
DLr
,
path
test
a
a
from
the
test
subtree
single
current
described
of
and
variables
:
it
root
a
a
k
the
encoding
the
,
by
follows
differentiate
variable
At
at
the
is
.
to
tree
proposed
as
of
tree
describe
structure
,
encoding
the
to
to
O
1
the
encoding
recursively
to
The
of
the
test
there
root
bits
encoded
contraEt
,
structure
use
associated
in
.
We
value
.
general
log
is
.
equal
to
restricted
.
path
of
the
the
tree
value
set
parents
more
only
length
S
encode
the
A
of
of
the
. 3
subtrees
once
store
to
in
with
bit
description
Xi
is
most
need
)
bit
a
position
of
we
1989
immediate
variable
yet
(
single
a
the
,
nodes
with
by
on
to
a
starts
followed
T
Rivest
by
tree
tree
internal
and
encoded
at
a
of
Quinlan
of
is
labeling
)
Tl
,
.
.
.
,
T
m
.
.
with
description
length -1
1
DLparam
Finally
,
given
as
the
We
we
model
now
a
,
.
.
.
,
=
2
2
Equation
network
4
1
)
n
.
2
,
:
then
If
P
we
(
IIXill
1
I
can
P
~
rewrite
,
.
CPDs
Xi
-
we
)
IIYLlllogN
.
describe
1
to
the
describe
are
)
l
encoding
of
the
data
.
2
when
1
(
.
Proposition
Proposition
=
L
Section
using
for
i
in
generalize
rameters
for
did
(
is
the
optimal
represented
using
represented
D
Ldata
by
(
D
I
B
local
)
choice
local
representation
of
pa
structure
-
.
Li
,
as
DLdata (DIB) = -NL L L P(Xi,lLi =v)log (Jxilv8 i vE Val (TLi ) Xi 3Wallace and Patrick ( 1993) note that this encoding is inefficient, in the sense that the number of legal tree structures that can be described by n-bit strings, is significantly smaller than 2n. Their encoding, which is more efficient, can be easily incorporated into our MDL encoding. For clarity of presentation, we use the Quinlan and Rivest encoding in this article .
Moreover, the parameter values for L that minimize DL_data(D | B) are θ_{x_i|Υ_{L_i}=v} = P̂(x_i | Υ_{L_i} = v).

As in the case of the tabular CPD representation, DL_data is minimized when the parameters correspond to the appropriate frequencies in the training data. As a consequence of this result, we find that for a fixed local structure L, the minimal representation length of the data is simply N · H_P̂(X_i | Υ_L). Thus, once again we derive an information-theoretic interpretation of DL_data(Θ_L, D). This interpretation shows that the encoding of X_i depends only on the values of Υ_L. From the data processing inequality (Cover and Thomas, 1991) it follows that H(X_i | Υ_L) ≥ H(X_i | Pa_i). This implies that a local structure cannot fit the data better than a tabular CPD. Nevertheless, as our experiments confirm, the reduction in the number of parameters can compensate for the potential loss in information.

To summarize, the MDL score for a graph structure augmented with a local structure L_i for each X_i is

DL(G, L_1, ..., L_n, D) = DL_graph(G) + Σ_i ( DL_local-struct(L_i) + DL_param(L_i) ) + N Σ_i H_P̂(X_i | Υ_{L_i}).
ture. Given the hypothesisGh, we denoteby ~ the hypothesisthat the underlying distribution
satisfies the constraints of a set of local structures
.c == { Li : 1 ~ i ~ n} , where Li is a local structure for the CPD of Xi in G. Using Bayes' rule , it follows that
Pr(Gh, L~ I D) cxPr(D I L~,Gh) Pr(L~ I Gh) Pr(Gh). The specification of priors on local structures presents no additional compli cations other than the specification of priors for the structure of the network
Gh. Buntine (1991a, 1993), for example, suggestsseveral possible priors on decision trees . A natural prior over local structures is defined via the MDL
descriptionlength, by setting Pr(.c~ I Gh) cx : 2- Ei DLlocal -struct (Li). For the term Pr(D I L~, Gh), we makean assumptionof parameterindependence, similar to the one made by Heckerman et ale (1995a) and by Buntine (1991b): the parameter valuesfor eachpossiblevalue of the characteristic variable Y Li are independent of each other . Thus , each multinomial
LEARNINGBAYESIANNETWORKS WITH LOCALSTRUCTURE439 sample is independent of the others, and we can derive the analogue of Equation 6:
Pr(D I ~, Gh) = U II J II (}~ f~i'V) Pr(8 Xilv I Lih, Gh)d8 Xilv t vEVal (TLi) Xi (10) (This decompositionis analogousto the onedescribedin Proposition4.1.) As before, we assumethat the priors Pr(8 Xilv I ~, Gh) are Dirichlet, and thus we get a closed -form solutionfor Equation10,
Pr
( D
I
. c ~
, Gh
)
=
II
II i
Once
more
priors
,
of
global
is
to
that
is and
set
BP
Recall
First
,
instances
that
the
Pr
from
N
N
xilv
~ ilv +
)
Xilv
I
.
Our
objective
prior
~
II
N
( v
problem
( 8
) )
of
, Gh
)
for ,
r
( N
; ilv
+
r
( N
Xi
specifying
each
ag
a
the
N
( Xi
~ ilv
)
, V ) ) .
multitude
possible
in
distribution
of
combination
cage
tabular
represented
on
, Z
the
that
and to
prior
that
CPDs
by
tests
then Y
=
, Z
Y =
a
. by
example on
.
Our
z ,
be
variable ) .
value
grouped
first
on y
for
For
( Pai
We
a
, Y
and
the
,
specific
this
then
this
on
same
.
Z
the
,
the
trees
and
another that
prior
of
of
possible
the in
both
-
set
value
two
-
vari on
particular
requires the
par
assump
partition
only
consider
the
two
characteristic
depends
assumption assigned
by
of It
are
make
generated
structure
are .
Val
groupings
local
variable
one
random over
the
the
on parents
CPD
characteristic
structure and
that
depend the
the
local
priors
assume not
of
the
random
correspond
( l:::~ i
( l : : : xi
the
a
values
the
of
first
r
with
structures
by
characteristic same
)
.
we
does
( iLj
faced
priors
regarding
able
r
Val
specifying
imposed
tions
are
local
that
titions
tests
,
these
network
the
we
' lIE
for that leaves trees
.
Second, we assume that the vector of Dirichlet hyperparameters assigned to an element of the partition that corresponds to a union of several smaller partitions in another local structure is simply the sum of the vectors of Dirichlet hyperparameters assigned to these smaller partitions. Again, consider two trees, one that consists of a single leaf, and another that has one test at the root. This assumption requires that for each x_i ∈ Val(X_i), the Dirichlet hyperparameter N'_{x_i|v}, where v is the root in the first tree, is the sum of the N'_{x_i|v'} over all the leaves v' in the second tree. It is straightforward to show that if a prior distribution over structures, local structures, and parameters satisfies these assumptions and the assumptions of Heckerman et al. (1995a), then there must be a distribution P' and a positive real N' such that for any structure G and any choice of
local structure ℒ for G,

Pr(Θ_{X_i|v} | ℒ^h, G^h) ~ Dirichlet({N' · P'(x_i, Υ_{L_i} = v) : x_i ∈ Val(X_i)}).    (11)

This result allows us to represent the prior using a Bayesian network B' that specifies the distribution P' and a positive real N'. From these two, we can compute the Dirichlet hyperparameters we need for every prior distribution we use during learning. Finally, we note that Schwarz's (1978) result can be used to show that the MDL and BDe scores for learning local structure are asymptotically equivalent.
4.2. LEARNING PROCEDURES

Once we define the appropriate score, the learning task reduces to finding the network that maximizes the score, given the data. Unfortunately, this is an intractable problem. Chickering (1996) shows that finding the network (quantified with tabular CPDs) that maximizes the BDe score is NP-hard. Similar arguments also apply to learning with the MDL score. Moreover, there are indications that finding the optimal decision tree for a given family also is an NP-hard problem; see Quinlan and Rivest (1989). Thus, we suspect that finding a graph G and a set of local structures {L_1, ..., L_n} that jointly maximize the MDL or BDe score is also an intractable problem.

A standard approach to dealing with hard optimization problems is heuristic search. Many search strategies can be applied. For clarity, we focus here on one of the simplest, namely greedy hillclimbing. In this strategy, we initialize the search with some network (e.g., the empty network) and repeatedly apply to the "current" candidate the local change (e.g., adding and removing edges) that leads to the largest improvement in the score. This "upward" step is repeated until a local maximum is reached, that is, until no modification of the current candidate improves the score. Heckerman et al. (1995a) compare this greedy procedure with several more sophisticated search procedures. Their results indicate that greedy hillclimbing can be quite effective for learning Bayesian networks in practice. The greedy hillclimbing procedure for learning network structure can be summarized as follows.

procedure LearnNetwork(G_0)
  Let G_current ← G_0
  do
    Generate all successors S = {G_1, ..., G_n} of G_current
    ΔScore = max_{G ∈ S} Score(G) - Score(G_current)
    If ΔScore > 0 then
      Let G_current ← arg max_{G ∈ S} Score(G)
  while (ΔScore > 0)
  return G_current

The successors of G_current are generated by either adding an arc, removing an arc, or reversing the direction of an arc. (We consider only legal successors, i.e., modifications that do not introduce a cycle.) Since each of these operations modifies only one or two parent sets, and since the scores we use are decomposable, we do not need to recompute the score of the whole network for each successor: we need to reevaluate only the terms that correspond to the modified parent sets. Moreover, during the search we can cache the scores of previously considered parent sets, which makes this greedy loop particularly efficient; see the discussion by Buntine (1991b) and Bouckaert (1994) of these and related approximations.

When learning networks with local structured representations of the CPDs, we fix the choice of representation (i.e., default tables or decision trees), and the evaluation of each successor invokes a procedure that learns a local structure L_i for each modified CPD. That is, each arc operation requires us to find the best local structure for the CPD of the variable whose parent set was modified, and the score of the successor is computed using that local structure. Both learning procedures described below rely on the decomposability properties of the scoring functions. More precisely, both the MDL score and the logarithm of Pr(D | G^h, ℒ^h) in the BDe score can be written as a sum of terms, one for each variable X_i; and the term for X_i further decomposes into a sum over the values v ∈ Val(Υ_{L_i}) of the characteristic random variable, where each of these terms depends only on the counts N(x_i, v). This implies that when we modify a local structure by refining one of its partitions (that is, replacing a partition by several subpartitions whose union corresponds to the partition we replace), we need to recompute only the terms that correspond to the new subpartitions. Both procedures use this underlying decomposition.

The procedure for learning default tables uses a greedy strategy. It starts with the trivial default table, in which all the values of Pa_i are mapped to the default row, and iteratively refines it. At each iteration, the procedure considers each value v ∈ Val(Pa_i) that is currently in the default row, and evaluates the refinement in which v is explicitly represented as a single row; it then applies the refinement that leads to the biggest improvement in the score. Since the score decomposes, the additional computations involve only the terms that correspond to the new explicit row and
the new default row. This greedy expansion is repeated until no improvement in the score can be gained by adding another row. The procedure is summarized as follows.

procedure LearnDefault()
  Let Rows(V) ← ∅
  do
    Let r = arg max_{r ∈ Val(Pa_i) - Rows(V)} Score(Rows(V) ∪ {r})
    if Score(Rows(V) ∪ {r}) < Score(Rows(V)) then
      return Rows(V)
    Rows(V) ← Rows(V) ∪ {r}
  end
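A runnable rendering of this greedy loop (our sketch, not the authors' code): the caller supplies a score function to maximize, for instance the negation of the default-table MDL term above; a toy score stands in here.

```python
def learn_default(parent_vals, score):
    """Greedy construction of Rows(V); `score(rows)` returns a value to maximize."""
    rows = set()
    while True:
        candidates = [r for r in parent_vals if r not in rows]
        if not candidates:
            return rows
        best = max(candidates, key=lambda r: score(rows | {r}))
        if score(rows | {best}) < score(rows):
            return rows            # no refinement improves the score
        rows = rows | {best}

# Toy score: reward separating parent value (1, 1), penalize table size.
toy_score = lambda rows: (2.0 if (1, 1) in rows else 0.0) - 0.5 * len(rows)
print(learn_default({(0, 0), (0, 1), (1, 0), (1, 1)}, toy_score))  # {(1, 1)}
```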
For inducing decision trees, we adopt the approach outlined by Quinlan and Rivest (1989). The common wisdom in the decision-tree learning literature (e.g., Quinlan (1993)) is that greedy search for decision trees tends to become stuck at bad local minima. The approach of Quinlan and Rivest attempts to circumvent this problem using a two-phased approach. In the first phase we "grow" the tree in a top-down fashion. We start with the trivial tree consisting of one leaf, and add branches to it in a greedy fashion, until a maximal tree is learned. Note that in some stages of this growing phase, adding branches can lower the score; the rationale is that if we continue to grow these branches, we might improve the score. In the second phase, we remove harmful branches by "trimming" the tree in a bottom-up fashion.

We now describe the two phases in more detail. In the first phase we grow a tree in a top-down fashion. We repeatedly replace a leaf with a subtree that has as its root some parent of X, say Y, and whose children are leaves, one for each value of Y. In order to decide on which parent Y we should split the tree, we compute the score of the tree associated with each parent, and select the parent that induces the best scoring tree. Since the scores we use are decomposable, we can compute the split in a local fashion, by evaluating it on the training instances that are compatible with the path from the root of the tree to the node that is being split. This recursive growing of the tree stops when the node has no training instances associated with it, when the value of X is constant in the associated training set, or when all the parents of X have been tested along the path leading to that node.

In the second phase, we trim the tree in a bottom-up manner. At each node we consider whether the score obtained by replacing the subtree rooted at that node with a leaf is at least as good as the score of the subtree itself; if so, the subtree is trimmed and replaced with a leaf.

These two phases can be implemented by a simple recursive procedure, LearnTree, that receives a set of instances and returns the "best" tree for
this set of instances.

procedure SimpleTree(Y)
  For y ∈ Val(Y), let l_y ← ℓ  (i.e., a leaf)
  return (Y, {l_y : y ∈ Val(Y)})
end

procedure LearnTree(D)
  if D = ∅ or X_i is homogeneous in D then return ℓ
  // Growing phase
  Let Y_split = arg max_{Y ∈ Pa_i} Score(SimpleTree(Y) | D)
  for y ∈ Val(Y_split)
    Let D_y = {u_i ∈ D : Y_split = y in u_i}
    Let T_y = ExpandTree(ℓ, D_y)
  let T = (Y_split, {T_y : y ∈ Val(Y_split)})
  // Trimming phase
  if Score(ℓ | D) > Score(T | D) then return ℓ
  else return T
end
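A compact Python rendering of the grow-then-trim idea under our own simplifying assumptions (a leaf is scored by its log-likelihood minus a one-parameter penalty, and the target is binary); this is an illustration of the two phases, not the authors' implementation.

```python
import math

def leaf_score(targets):
    """Score of one leaf over 0/1 targets (higher is better)."""
    if not targets:
        return 0.0
    n1 = sum(targets)
    ll = sum(c * math.log2(c / len(targets)) for c in (n1, len(targets) - n1) if c)
    return ll - 0.5 * math.log2(len(targets))   # log-likelihood minus parameter cost

def learn_tree(rows, targets, attrs):
    """Grow a tree greedily, then trim bottom-up; returns (tree, score)."""
    as_leaf = (("leaf", targets), leaf_score(targets))
    if not rows or len(set(targets)) <= 1 or not attrs:
        return as_leaf
    # Growing phase: choose the attribute whose one-level split scores best.
    def one_level(a):
        return sum(leaf_score([t for r, t in zip(rows, targets) if r[a] == v])
                   for v in {r[a] for r in rows})
    best = max(attrs, key=one_level)
    children, subtree_score = {}, -1.0   # -1 bit pays for storing the test
    for v in {r[best] for r in rows}:
        sub_rows = [r for r in rows if r[best] == v]
        sub_targets = [t for r, t in zip(rows, targets) if r[best] == v]
        child, s = learn_tree(sub_rows, sub_targets, [a for a in attrs if a != best])
        children[v], subtree_score = child, subtree_score + s
    # Trimming phase: keep the subtree only if it beats a single leaf.
    return as_leaf if as_leaf[1] >= subtree_score else (("split", best, children), subtree_score)

rows = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1) for _ in range(5)]
targets = [1 if r["A"] == 1 and r["B"] == 1 else 0 for r in rows]
print(learn_tree(rows, targets, ["A", "B"])[0])
```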
5. Experimental Results

The main purpose of our experiments is to confirm and quantify the hypothesis stated in the introduction: a learning procedure that learns local structures for the CPDs will induce more accurate models for two reasons: 1) fewer parameters will lead to a more reliable estimation, and 2) a flexible penalty for larger families will result in network structures that are better approximations to the real (in)dependencies in the underlying distribution.

The experiments compared networks induced with table-based, tree-based, and default-based procedures, where an X-based procedure learns networks with X as the representation of CPDs. We ran experiments using both the MDL score and the BDe score. When using the BDe score, we also needed to provide a prior distribution and an equivalent sample size. In all of our experiments, we used a uniform prior distribution, and examined several settings of the equivalent sample size N'. All learning procedures were based on the same search method discussed in Section 4.2. We ran experiments with several variations, including different settings of the BDe prior equivalent size and different initialization points for the search procedures. These experiments involved learning approximately 15,000 networks. The results are summarized below.
TABLE 1. Description of the networks used in the experiments.

Alarm (n = 37, ||U|| = 253.95, |Θ| = 509): A network by medical experts for monitoring patients in intensive care (Beinlich et al., 1989).
Hailfinder (n = 56, ||U|| = 2106.56, |Θ| = 2656): A network for modeling summer hail in northeastern Colorado (http://www.lis.pitt.edu/~dsl/hailfinder).
Insurance (n = 27, ||U|| = 244.57, |Θ| = 1008): A network for classifying insurance applications (Russell et al., 1995).
al ., 1995 ) .
5 .1.
METHODOLOGY
The
data
sets
networks these
, 16000
them
have
. The
the
results of
By the
virtue
the
. In
same
having
and
local
the
only
data model
could
in
model
on
the
Bayesian each
of
the
learning
data
sets , and
increase
10
the
did
not
accuracy
( independently
, the
,
procedures
of
sampled
methods
we
)
compared
. each
precisely
original
the
to
experiments
three 1 . From
250 , 500 , 1000 , 2000 , 4000 ran
order with
from
in Table
-
and
experiment of the
a golden
structures
-
. In
training
, we
ofsizes
received
network
all
experiment
quantify . We
the
were
parameter
also
, represented error able
estimation
to
and
by
between
the
quantify
the
the
structure
.
MEASURES
the
procedures
networks
of the
selection
As
of
sets
sampled
described
instances
generating
the
were are
training 32000
data
models
effect
experiments
repeated
as input
induced
5 .2 .
the
, we
original
, and
learning
to
training
received
the
characteristics
we sampled , 24000
access
sets
in
main
networks
8000 on
used
whose
main
as Kullbak tribution bu tion
P
OF
ERROR
measurement
- Leibler to
the
to
an
of error
divergence induced
we use
and
relative
distribution
approximation
the
entropy
. The
Q
entropy
distance
) from
entropy
is defined
the
( also
known
generating
distance
from
dis -
a distri
-
as P (x )
D ( PIIQ
This the tropy
quantity distri
is
bu tion
distance
is
D ( Q IIP ) . Another that
D ( PIIQ
a measure is Q not
) =
of
when
the
the
symmetric
important
) ~ 0 , where
L :x : P ( x ) log inefficiency
real
distri
property holds
incurred
bu tian
, i .e . , D ( PIIQ
equality
Q( ; ) .
of
the
if and
) is
is P . not
entropy only
by
assuming
N ate equal
in
distance if P =
that
Q .
that the
en -
general
to
function
is
There are several axiomatic justifications for using the entropy distance as a measure of the error of an approximating distribution (e.g., Shore and Johnson, 1980; Cover and Thomas, 1991). On the more motivating side, several examples show that the entropy distance measures both the expected additional compression loss (in bits) and the expected gambling (monetary) losses incurred by using the approximation Q instead of the true distribution P. We refer the interested reader to Cover and Thomas (1991) for a detailed discussion of these properties.

Measuring the entropy distance from the generating distribution P* to the distributions of the induced networks allows us to compare the error of the different learning procedures. In addition, we define error measures that separate the error incurred by the selection of the network structure from the error incurred by the estimation of the parameters. Let P* be the target distribution and let G be a network structure. We define the inherent error of G with respect to P* as

Dstruct(P* || G) = min_Θ D(P* || (G, Θ)).

Intuitively, the inherent error is the error of the "best" possible choice of parameters (CPDs) for G: no network with structure G can have a smaller entropy distance to P*. It measures the error that we cannot hope to avoid, even with perfect parameter estimation, as long as we use the structure G. As it turns out, this error can be evaluated by means of a closed-form equation.

Proposition 5.1: Dstruct(P* || G) = D(P* || (G, Θ*)), where Θ* is such that P_{Θ*}(Xi | Pai) = P*(Xi | Pai) for each i.

An alternative way of thinking about this error measure is as an attempt to quantify the degree to which the conditional independence assumptions encoded in the network structure G are violated in P*. To do so, we need a way to measure the strength of the dependencies between sets of variables. One reasonable measure is the conditional mutual information. Let X, Y, and Z be sets of variables, and define

I_{P*}(X; Y | Z) = H_{P*}(X | Z) − H_{P*}(X | Y, Z),

which measures how much information Y provides about X when we already know Z. It is well known that I_{P*}(X; Y | Z) ≥ 0, where equality holds if and only if X is conditionally independent of Y given Z (Cover and Thomas, 1991).
Using the mutual information as a quantitative measure of the strength of dependencies, we can measure the extent to which the independence assumptions represented in G are violated in the real distribution. This suggests that we evaluate this measure for all conditional independencies represented by G. However, many of these independence assumptions "overlap" in the sense that they imply each other. Thus, we need to find a minimal set of independencies that implies all the other independencies represented by G. Pearl (1988) shows how to construct such a minimal set of independencies. Assume that the variable ordering X1, ..., Xn is consistent with the arc direction in G (i.e., if Xi is a parent of Xj, then i < j). If, for every i, Xi is independent of {X1, ..., X_{i-1}} − Pai given Pai, then using the chain rule we find that P can be factored as in Equation 1. As a consequence, we find that this set of independence assumptions implies all the independence assumptions that are represented by G. Starting with different consistent orderings, we get different minimal sets of assumptions. However, the next proposition shows that evaluating the error of the model with respect to any of these sets leads to the same answer.

Proposition 5.2: Let G be a network structure, X1, ..., Xn be a variable ordering consistent with the arc direction in G, and P* be a distribution. Then
Dstruct(P* || G) = Σ_i I_{P*}(Xi; {X1, ..., X_{i-1}} − Pai | Pai).

This proposition shows that Dstruct(P* || G) = 0 if and only if G is an I-map of P*; that is, all the independence statements encoded in G are also true of P*. Small values of Dstruct(P* || G) indicate that while G is not an I-map of P*, the dependencies not captured by G are "weak." We note that Dstruct(P* || G) is a one-sided error measure, in the sense that it penalizes structures for representing wrong independence statements, but does not penalize structures for representing redundant dependence statements. In particular, complete network structures (i.e., ones to which we cannot add edges without introducing cycles) have no inherent error, since they do not represent any conditional independencies.

We can postulate now that the difference between the overall error (as measured by the entropy distance) and the inherent error is due to errors introduced in the estimation of the CPDs. Note that when we learn a local structure, some of this additional error may be due to the induction of an inappropriate local structure, such as a local structure that makes assumptions of context-specific independencies that do not hold in the target distribution. As with the global structure, we can measure the inherent error in the local structure learned. Let G be a network structure, and let S_L1, ..., S_Ln be structures for the CPDs of G. The inherent local error of
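The terms summed in Proposition 5.2 are conditional mutual informations, which can be computed directly from a joint probability table. The sketch below is our own illustration (not the authors' code); it uses the identity I_P(X; Y | Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z), with variable sets given as tuples of axis indices of the table.

import numpy as np

def _marginal_entropy(joint, axes_keep):
    # Entropy (in bits) of the marginal over the variables indexed by axes_keep.
    if not axes_keep:
        return 0.0
    axes_drop = tuple(a for a in range(joint.ndim) if a not in axes_keep)
    m = joint.sum(axis=axes_drop) if axes_drop else joint
    m = m[m > 0]
    return -float(np.sum(m * np.log2(m)))

def cond_mutual_information(joint, X, Y, Z=()):
    # I_P(X; Y | Z) >= 0, with equality iff X is independent of Y given Z.
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    return (_marginal_entropy(joint, X + Z) + _marginal_entropy(joint, Y + Z)
            - _marginal_entropy(joint, X + Y + Z) - _marginal_entropy(joint, Z))

Summing one such term per variable Xi, with Y the predecessors of Xi that are not in Pai and Z = Pai, gives Dstruct(P* || G) as in the proposition.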
G and S_L1, ..., S_Ln is

Dlocal(P* || G, {S_L1, ..., S_Ln}) = min_L D(P* || (G, L)),

where the minimization is over parameterizations L that are consistent with the local structures S_L1, ..., S_Ln. From the above, we get the following expected generalization of Proposition 5.1.

Proposition 5.3: Let G be a network structure, let S_L1, ..., S_Ln be structures for the CPDs of G, and let P* be a distribution. Then Dlocal(P* || G, {S_L1, ..., S_Ln}) = D(P* || (G, L*)), where L* is such that P(Xi | Y_Li) = P*(Xi | Y_Li) for all i.

From the definitions of inherent error above it follows that for any network B = (G, Θ),

D(P* || P_B) ≥ Dlocal(P* || G, {S_L1, ..., S_Ln}) ≥ Dstruct(P* || G).

Using these measures in the evaluation of our experiments, we can measure the "quality" of the global independence assumptions made by a network structure (Dstruct), the quality of the local and global independence assumptions made by a network structure and a local structure (Dlocal), and the total error, which also includes the quality of the parameters.

5.3. RESULTS

We want to characterize the error in the induced models as a function of the number of samples used by the different learning algorithms for the induction. Thus, we plot learning curves where the x-axis displays the number of training instances N, and the y-axis displays the error of the learned model. In general, these curves exhibit exponential decrease in the error. This makes visual comparisons between different learning procedures hard, since the differences in the large-sample range (N ≥ 8000) are obscured, and, when a logarithmic scale is used for the x-axis, the differences at the small-sample range are hard to visualize. See for example Figure 3(a) and (b).

To address this problem, we propose a normalization of these curves, motivated by the theoretical results of Friedman and Yakhini (1996). They show that learning curves for Bayesian networks generally behave as a linear function of log N / N. Thus, we plot the error scaled by N / log N. Figure 5(a) shows the result of applying this normalization to the curves in Figure 3. Observe that the resulting curves are roughly constant. The thin dotted diagonal lines in Figure 5(a) correspond to the lines of constant error in Figure 3. We plot these lines for entropy distances of 1/2^i for i = 0, ..., 6.
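As a small numeric illustration of this normalization (a sketch, not from the paper; the values are the table-based MDL entropy distances for the Alarm domain from Table 2, and the natural logarithm is our own choice of base):

import numpy as np

N = np.array([250, 500, 1000, 2000, 4000, 8000, 16000, 24000, 32000])
err = np.array([5.7347, 3.5690, 1.9787, 1.0466, 0.6044,
                0.3328, 0.1787, 0.1160, 0.0762])

# If err behaves like c * log(N) / N, the scaled curve is roughly the constant c.
print(np.round(err * N / np.log(N), 1))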
Figure 3. (a) Error curves showing the entropy distance achieved by procedures using the MDL score in the Alarm domain. The x-axis displays the number of training instances N, and the y-axis displays the entropy distance of the induced network. (b) The same curves with a logarithmic y-axis. Each point is the average of learning from 10 independent data sets.
Figures 4 and 5 display the entropy distance of the networks learned via the BDe and MDL scores (Table 2 summarizes these values). In all the experiments, the learning curves appear to converge to the target distribution: eventually they would intersect the dotted line of ε entropy distance for any ε > 0. Moreover, all of them appear to (roughly) conform to the behavior specified by the results of Friedman and Yakhini.

With respect to the entropy distance to the target distribution, tree-based procedures performed better in all our experiments than table-based procedures. With few exceptions, the default-based procedures also performed better than the table-based methods in the Alarm and Insurance domains. The default-based procedures performed poorly in the Hailfinder domain.
Figure 4. Normalized error curves showing the entropy distance achieved by procedures using the BDe score (with N' = 1) in the (a) Alarm domain, (b) Hailfinder domain, and (c) Insurance domain. The x-axis displays the number of training instances N, and the y-axis displays the normalized entropy distance of the induced network (see Section 5.2).
Figure 5. Normalized error curves showing the entropy distance achieved by procedures using the MDL score in the (a) Alarm domain, (b) Hailfinder domain, and (c) Insurance domain.
As a general rule, we see a constant gap in the curves corresponding to different representations. Thus, for a fixed N, the error of the procedure representing local structure is a constant fraction of the error of the corresponding procedure that does not represent local structure (i.e., one that learns tabular CPDs). For example, in Figure 4(a) we see that in the large-sample region (e.g., N ≥ 8000), the errors of the procedures that use trees and default tables are approximately 70% and 85% (respectively) of the error of the table-based procedure. In Figure 5(c) the corresponding ratios are 50% and 70%.
Another way of interpreting these results is obtained by looking at the number of instances needed to reach a particular error rate . For example ,
in Figure 4(a), the tree-based procedure reaches the error level of 1/32 with approximately 23,000 instances. On the other hand, the table-based procedure barely reaches that error level with 32,000 instances. Thus, if we want to ensure this level of performance, we would need to supply the table-based procedure with 9,000 additional instances. This number of instances might be unavailable in practice.

We continued our investigation by examining the network structures learned by the different procedures. We evaluated the inherent error, Dstruct, of the structures learned by the different procedures. In all of our experiments, the inherent error of the network structures learned via tree-based and default-based procedures is smaller than the inherent error of the networks learned by the corresponding table-based procedure; for example, examine the Dstruct column in Tables 3 and 4. From these results, we conclude that the network structures learned by procedures using local representations make fewer mistaken assumptions of global independence, as predicted by our main hypothesis.

Our hypothesis also predicts that procedures that learn local representations are able to assess fewer parameters by making local assumptions of independence in the CPDs. To illustrate this, we measured the inherent local error, Dlocal, and the number of parameters needed to quantify these networks. As we can see in Tables 3 and 4, the networks learned by these procedures exhibit smaller inherent error, Dstruct; they also require fewer parameters, and their inherent local error, Dlocal, is roughly the same as that of the networks learned by the table-based procedures. Hence, instead of making global assumptions of independence, the local representation procedures make local assumptions of independence that better capture the regularities in the target distribution and require fewer parameters. As a consequence, the parameter estimation for these procedures is more accurate.
Finally, we investigated how our conclusions depend on the particular choices we made in the experiments. As we will see, the use of local structure leads to improvements regardless of these choices. We examined two aspects of the learning process: the choice of the parameters for the priors and the choice of the search procedure.
We start by looking at the effect of changing the equivalent sample size N'. Heckerman et al. (1995a) show that the choice of N' can have drastic effects on the quality of the learned networks. On the basis of their experiments in the Alarm domain, Heckerman et al. report that N' = 5 achieves the best results. Table 5 shows the effect of changing N' from 1 to 5 in our experiments. We see that the choice of N' influences the magnitude of the errors in the learned networks and the sizes of the error gaps between the different methods. Yet these influences do not suggest any changes in the benefits of local structures.
Unlike the BDe score, the MDL score does not involve an explicit choice of priors. Nonetheless, we can use Bayesian averaging to select the parameters for the structures that have been learned by the MDL score, as opposed to using maximum likelihood estimates. In Table 6 we compare the error between the maximum likelihood estimates and Bayesian averaging with N' = 1. As expected, averaging leads to smaller errors in the parameter estimation, especially for small sample sizes. However, with the exception of the Alarm domain, Bayesian averaging does not improve the score for large samples (e.g., N = 32,000). We conclude that even though changing the parameter estimation technique may improve the score in some instances, it does not change our basic conclusions.
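A minimal sketch of the two estimation methods being compared (our own illustration; the way N' is split across the states of one parent configuration is a simplification of the BDe prior, not the authors' exact choice):

import numpy as np

def ml_estimate(counts):
    # Maximum likelihood estimate of one CPD column from the counts N_ijk.
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def bayesian_estimate(counts, equiv_sample_size=1.0):
    # Posterior-mean estimate under a Dirichlet prior whose hyperparameters
    # sum to the equivalent sample size N' (split evenly across the states).
    counts = np.asarray(counts, dtype=float)
    alpha = equiv_sample_size / counts.size
    return (counts + alpha) / (counts.sum() + equiv_sample_size)

counts = [3, 0, 1]                      # a small-sample parent configuration
print(ml_estimate(counts))              # [0.75 0.   0.25] -- hard zeros
print(bayesian_estimate(counts, 1.0))   # smoothed toward the uniform distribution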
Finally, another aspect of the learning process that needs further investigation is the heuristic search procedure. A better search technique can lead to better induced models, as illustrated in the experiments of Heckerman et al. (1995a). In our experiments we modified the search by initializing the greedy search procedure with a more informed starting point. Following Heckerman et al. (1995a), we used a maximal branching as the starting state for the search. A maximal branching network is a highest-scoring network among those in which |Pai| ≤ 1 for all i; a maximal branching can be found efficiently (e.g., in low-order polynomial time) (Heckerman et al., 1995a). Table 7 reports the results of this experiment. In the Alarm domain, the use of the maximal branching as an initial point led to improvements in all the learning procedures. On the other hand, in the Insurance domain, this choice of starting point led to a worse error. Still, we observe that the conclusions described above regarding the use of local structure held for these runs as well.
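The sketch below shows one way such a starting point can be assembled from family scores. It is a hedged illustration: score_family is a stand-in for the MDL or BDe family score, and it assumes networkx's Edmonds-based maximum_branching; this is not the authors' implementation.

import networkx as nx

def maximal_branching_start(variables, score_family):
    # score_family(child, parents) returns the score of the family {child, parents};
    # each candidate edge is weighted by the gain of adding that single parent.
    g = nx.DiGraph()
    g.add_nodes_from(variables)
    for child in variables:
        base = score_family(child, ())
        for parent in variables:
            if parent == child:
                continue
            gain = score_family(child, (parent,)) - base
            if gain > 0:                      # keep only beneficial single parents
                g.add_edge(parent, child, weight=gain)
    # Edmonds' algorithm: a maximum-weight acyclic edge set in which every node
    # keeps at most one incoming edge (a branching).
    return nx.algorithms.tree.branchings.maximum_branching(g)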
6. Conclusion

The main contribution of this article is the introduction of structured representations of the CPDs in the learning process, the identification of the
benefits of using these representations, and the empirical validation of our hypothesis. As we mentioned in the introduction (Section 1), we are not the first to consider efficient representations for the CPDs in the context of learning. However, to the best of our knowledge, we are the first to consider and demonstrate the effects that these representations may have on the learning of the global structure of the network.

In this paper we have focused on the investigation of two fairly simple structured representations of CPDs: trees and default tables. There are certainly many other possible representations of CPDs, based, for example, on decision graphs, rules, and CNF formulas; see Boutilier et al. (1996). Our choice was mainly due to the availability of efficient computational tools for learning the representations we use. The refinement of the methods studied in this paper to incorporate these representations deserves further attention. In the machine learning literature, there are various approaches to learning trees, all of which can easily be incorporated in the learning procedures for Bayesian networks. In addition, certain interactions among the search procedures for global and local structures can be exploited to reduce the computational cost of the learning process. We leave these issues for future research.
It is important to distinguish between the local representations we examine in this paper and the noisy-or and logistic regression models that have been examined in the literature. Both noisy-or and logistic regression (as applied in the Bayesian network literature) attempt to estimate the CPD with a fixed number of parameters; this number is usually linear in the number of parents in the CPD. In cases where the target distribution does not satisfy the assumptions embodied by these models, the estimates of CPDs produced by these methods can diverge arbitrarily from the target distribution. On the other hand, our local representations involve learning the structure of the CPD, which can range from a lean structure with few parameters to a complex structure with an exponential number of parameters. Thus, our representations can scale up to accommodate the complexity of the training data. This ensures that, in theory, they are asymptotically correct: given enough samples, they will construct a close approximation of the target distribution.
In conclusion, we have shown that the induction of local structured representations for CPDs significantly improves the performance of procedures for learning Bayesian networks. In essence, this improvement is due to the fact that we have changed the bias of the learning procedure to reflect the nature of the distribution in the data more accurately.
TABLE 2. Summary of entropy distance for networks learned by the procedures using the MDL score and the BDe score with N' = 1.

                                MDL Score                       BDe Score
Domain       Size       Table    Tree     Default      Table    Tree     Default
            (x 1,000)
Alarm         0.25      5.7347   5.5148   5.1832       1.6215   1.6692   1.7898
              0.50      3.5690   3.2925   2.8215       0.9701   1.0077   1.0244
              1.00      1.9787   1.6333   1.2542       0.4941   0.4922   0.5320
              2.00      1.0466   0.8621   0.6782       0.2957   0.2679   0.3040
              4.00      0.6044   0.4777   0.3921       0.1710   0.1697   0.1766
              8.00      0.3328   0.2054   0.2034       0.0960   0.0947   0.1118
             16.00      0.1787   0.1199   0.1117       0.0601   0.0425   0.0512
             24.00      0.1160   0.0599   0.0720       0.0411   0.0288   0.0349
             32.00      0.0762   0.0430   0.0630       0.0323   0.0206   0.0268
Hailfinder    0.25      9.5852   9.5513   8.7451       6.6357   6.8950   6.1947
              0.50      4.9078   4.8749   4.7475       3.6197   3.7072   3.4746
              1.00      2.3200   2.3599   2.3754       1.8462   1.8222   1.9538
              2.00      1.3032   1.2702   1.2617       1.1631   1.1198   1.1230
              4.00      0.6784   0.6306   0.6671       0.5483   0.5841   0.6181
              8.00      0.3312   0.2912   0.3614       0.3329   0.3117   0.3855
             16.00      0.1666   0.1662   0.2009       0.1684   0.1615   0.1904
             24.00      0.1441   0.1362   0.1419       0.1470   0.1279   0.1517
             32.00      0.1111   0.1042   0.1152       0.1081   0.0989   0.1223
Insurance     0.25      4.3750   4.1940   4.0745       2.0324   1.9117   2.1436
              0.50      2.7909   2.5933   2.3581       1.1798   1.0784   1.1734
              1.00      1.6841   1.1725   1.1196       0.6453   0.5799   0.6335
              2.00      1.0343   0.5344   0.6635       0.4300   0.3316   0.3942
              4.00      0.5058   0.2706   0.3339       0.2432   0.1652   0.2153
              8.00      0.3156   0.1463   0.2037       0.1720   0.1113   0.1598
             16.00      0.1341   0.0704   0.1025       0.0671   0.0480   0.0774
             24.00      0.1087   0.0506   0.0780       0.0567   0.0323   0.0458
             32.00      0.0644   0.0431   0.0570       0.0479   0.0311   0.0430
TABLE 3. Summary of inherent error, inherent local error, and number of parameters for the networks learned by the table-based and the tree-based procedures using the BDe score with N' = 1.

                              Table                                Tree
Domain       Size      D        Dstruct/Dlocal   Param      D        Dlocal    Dstruct   Param
            (x 1,000)
Alarm           1      0.4941   0.1319            570       0.4922   0.1736    0.0862     383
                4      0.1710   0.0404            653       0.1697   0.0570    0.0282     453
               16      0.0601   0.0237            702       0.0425   0.0154    0.0049     496
               32      0.0323   0.0095           1026       0.0206   0.0070    0.0024     497
Hailfinder      1      1.8462   1.2166           2066       1.8222   1.1851    1.0429    1032
                4      0.5483   0.3434           2350       0.5841   0.3937    0.2632    1309
               16      0.1684   0.1121           2785       0.1615   0.1081    0.0758    1599
               32      0.1081   0.0770           2904       0.0989   0.0701    0.0404    1715
Insurance       1      0.6453   0.3977            487       0.5799   0.3501    0.2752     375
                4      0.2432   0.1498            724       0.1652   0.0961    0.0654     461
               16      0.0671   0.0377            938       0.0480   0.0287    0.0146     525
               32      0.0479   0.0323            968       0.0311   0.0200    0.0085     576
TABLE 4. Summary of inherent error, inherent local error, and number of parameters for the networks learned by the table-based and tree-based procedures using the MDL score.

                              Table                                Tree
Domain       Size      D        Dstruct/Dlocal   Param      D        Dlocal    Dstruct   Param
            (x 1,000)
Alarm           1      1.9787   0.5923            361       1.6333   0.4766    0.3260     289
                4      0.6044   0.2188            457       0.4777   0.1436    0.0574     382
               16      0.1787   0.0767            639       0.1199   0.0471    0.0189     457
               32      0.0762   0.0248            722       0.0430   0.0135    0.0053     461
Hailfinder      1      2.3200   1.0647           1092       2.3599   1.1343    0.9356    1045
                4      0.6784   0.4026           1363       0.6306   0.3663    0.2165    1322
               16      0.1666   0.1043           1718       0.1662   0.1107    0.0621    1583
               32      0.1111   0.0743           1864       0.1042   0.0722    0.0446    1739
Insurance       1      1.6841   1.0798            335       1.1725   0.5642    0.4219     329
                4      0.5058   0.3360            518       0.2706   0.1169    0.0740     425
               16      0.1341   0.0794            723       0.0704   0.0353    0.0187     497
               32      0.0644   0.0355            833       0.0431   0.0266    0.0140     544
TABLE 5. Summary of entropy distance for procedures that use the BDe score with N' = 1 and N' = 5.

                              N' = 1                         N' = 5
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      0.4941   0.4922   0.5320      0.3721   0.3501   0.3463
                4      0.1710   0.1697   0.1766      0.1433   0.1187   0.1308
               16      0.0601   0.0425   0.0512      0.0414   0.0352   0.0435
               32      0.0323   0.0206   0.0268      0.0254   0.0175   0.0238
Hailfinder      1      1.8462   1.8222   1.9538      1.4981   1.5518   1.6004
                4      0.5483   0.5841   0.6181      0.4574   0.4859   0.5255
               16      0.1684   0.1615   0.1904      0.1536   0.1530   0.1601
               32      0.1081   0.0989   0.1223      0.0996   0.0891   0.0999
Insurance       1      0.6453   0.5799   0.6335      0.5568   0.5187   0.5447
                4      0.2432   0.1652   0.2153      0.1793   0.1323   0.1921
               16      0.0671   0.0480   0.0774      0.0734   0.0515   0.0629
               32      0.0479   0.0311   0.0430      0.0365   0.0284   0.0398
TABLE 6. Summary of entropy distance for procedures that use the MDL score for learning the structure and local structure, combined with two methods for parameter estimation.

                              Maximum Likelihood             Bayesian, N' = 1
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      1.9787   1.6333   1.2542      0.8848   0.7495   0.6015
                4      0.6044   0.4777   0.3921      0.3251   0.2319   0.2229
               16      0.1787   0.1199   0.1117      0.1027   0.0730   0.0779
               32      0.0762   0.0430   0.0630      0.0458   0.0267   0.0475
Hailfinder      1      2.3200   2.3599   2.3754      1.7261   1.7683   1.8047
                4      0.6784   0.6306   0.6671      0.5982   0.5528   0.6091
               16      0.1666   0.1662   0.2009      0.1668   0.1586   0.1861
               32      0.1111   0.1042   0.1152      0.1133   0.0964   0.1120
Insurance       1      1.6841   1.1725   1.1196      1.1862   0.7539   0.8082
                4      0.5058   0.2706   0.3339      0.3757   0.1910   0.2560
               16      0.1341   0.0704   0.1025      0.1116   0.0539   0.0814
               32      0.0644   0.0431   0.0570      0.0548   0.0368   0.0572
TABLE 7. Summary of entropy distance for two methods for initializing the search, using the BDe score with N' = 1.

                              Empty Network                  Maximal Branching Network
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      0.4941   0.4922   0.5320      0.4804   0.5170   0.4674
                4      0.1710   0.1697   0.1766      0.1453   0.1546   0.1454
               16      0.0601   0.0425   0.0512      0.0341   0.0350   0.0307
               32      0.0323   0.0206   0.0268      0.0235   0.0191   0.0183
Hailfinder      1      1.8462   1.8222   1.9538      1.7995   1.7914   1.9972
                4      0.5483   0.5841   0.6181      0.6220   0.6173   0.6633
               16      0.1684   0.1615   0.1904      0.1782   0.1883   0.1953
               32      0.1081   0.0989   0.1223      0.1102   0.1047   0.1162
Insurance       1      0.6453   0.5799   0.6335      0.6428   0.6350   0.6502
                4      0.2432   0.1652   0.2153      0.2586   0.2379   0.2242
               16      0.0671   0.0480   0.0774      0.1305   0.0914   0.1112
               32      0.0479   0.0311   0.0430      0.0979   0.0538   0.0856
Acknowledgments

The authors are grateful to an anonymous reviewer and to Wray Buntine and David Heckerman for their comments on previous versions of this paper and for useful discussions relating to this work.

Part of this research was done while both authors were at the Rockwell Science Center,4 Palo Alto Laboratory. Nir Friedman was also at Stanford University at the time. The support provided by Rockwell and Stanford University is gratefully acknowledged. In addition, Nir Friedman was supported in part by an IBM graduate fellowship and NSF Grant IRI-95-03109. A preliminary version of this article appeared in the Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996.

4 All products and company names mentioned in this article are the trademarks of their respective holders.

References

I. Beinlich, G. Suermondt, R. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. 2nd European Conf. on AI and Medicine. Springer-Verlag, Berlin, 1989.
R. R. Bouckaert. Properties of Bayesian network learning algorithms. In R. Lopez de Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence (UAI '94), pages 102-109. Morgan Kaufmann, San Francisco, CA, 1994.
C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In E. Horvitz and F. Jensen, editors, Proc. Twelfth Conference on Uncertainty in Artificial Intelligence (UAI '96), pages 115-123. Morgan Kaufmann, San Francisco, CA, 1996.
W. Buntine. A theory of learning classification rules. PhD thesis, University of Technology, Sydney, Australia, 1991.
W. Buntine. Theory refinement on Bayesian networks. In B. D. D'Ambrosio, P. Smets, and P. P. Bonissone, editors, Proc. Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI '91), pages 52-60. Morgan Kaufmann, San Francisco, CA, 1991.
W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence Frontiers in Statistics, number III in AI and Statistics. Chapman & Hall, London, 1993.
D. M. Chickering. Learning Bayesian networks is NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996.
G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
F. J. Diez. Parameter adjustment in Bayes networks: The generalized noisy or-gate. In D. Heckerman and A. Mamdani, editors, Proc. Ninth Conference on Uncertainty in Artificial Intelligence (UAI '93), pages 99-105. Morgan Kaufmann, San Francisco, CA, 1993.
N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In E. Horvitz and F. Jensen, editors, Proc. Twelfth Conference on Uncertainty in Artificial Intelligence (UAI '96). Morgan Kaufmann, San Francisco, CA, 1996.
D. Heckerman and J. S. Breese. A new look at causal independence. In R. Lopez de Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence (UAI '94), pages 286-292. Morgan Kaufmann, San Francisco, CA, 1994.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
D. Heckerman. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.
W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10:269-293, 1994.
R. Musick. Belief Network Induction. PhD thesis, University of California, Berkeley, CA, 1994.
R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA, 1988.
J. R. Quinlan and R. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248, 1989.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, River Edge, NJ, 1989.
S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In Proc. Fourteenth International Joint Conference on Artificial Intelligence (IJCAI '95), pages 1146-1152. Morgan Kaufmann, San Francisco, CA, 1995.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
J. E. Shore and R. W. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, IT-26(1):26-37, 1980.
D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605, 1990.
S. Srinivas. A generalization of the noisy-or model. In D. Heckerman and A. Mamdani, editors, Proc. Ninth Conference on Uncertainty in Artificial Intelligence (UAI '93), pages 208-215. Morgan Kaufmann, San Francisco, CA, 1993.
C. Wallace and J. Patrick. Coding decision trees. Machine Learning, 11:7-22, 1993.
ASYMPTOTIC MODEL SELECTION FOR DIRECTED NETWORKS WITH HIDDEN VARIABLES

DAN GEIGER
Computer Science Department
Technion, Haifa 32000, Israel
dang@cs.technion.ac.il

DAVID HECKERMAN
Microsoft Research, Bldg 98
Redmond WA, 98052-6399
heckerma@microsoft.com

AND

CHRISTOPHER MEEK
Carnegie-Mellon University
Department of Philosophy
meek@cmu.edu

Abstract. We extend the Bayesian Information Criterion (BIC), an asymptotic approximation for the marginal likelihood, to Bayesian networks with hidden variables. This approximation can be used to select models given large samples of data. The standard BIC as well as our extension punishes the complexity of a model according to the dimension of its parameters. We argue that the dimension of a Bayesian network with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. We compute the dimensions of several networks including the naive Bayes model with a hidden root node. This manuscript was previously published in The Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996, Morgan Kaufmann.
1. Introduction
Learning Bayesian networks from data extends their applicability to situations where data is easily obtained and expert knowledge is expensive. Consequently, it has been the subject of much research in recent years (e.g., Heckerman, 1995a; Buntine, 1996). Researchers have pursued two types of approaches for learning Bayesian networks: one that uses independence tests to direct a search among valid models, and another that uses a score to search for the best-scored network, a procedure known as model selection. Scores based on exact Bayesian computations have been developed by (e.g.) Cooper and Herskovits (1992), Spiegelhalter et al. (1993), Buntine (1994), and Heckerman et al. (1995), and scores based on minimum description length (MDL) have been developed in Lam and Bacchus (1993) and Suzuki (1993).

We consider a Bayesian approach to model selection. Suppose we have a set {X_1, ..., X_n} = X of discrete variables and a set {x_1, ..., x_N} = D of cases, where each case is an instance of some or of all the variables in X. Let (S, Θ_S) be a Bayesian network, where S is the network structure of the Bayesian network, a directed acyclic graph such that each node X_i of S is associated with a random variable X_i, and Θ_S is a set of parameters associated with the network structure. Let S^h be the hypothesis that precisely the independence assertions implied by S hold in the true or objective joint distribution of X. Then, a Bayesian measure of the goodness-of-fit of network structure S to D is p(S^h | D) ∝ p(S^h) p(D | S^h), where p(D | S^h) is known as the marginal likelihood of D given S^h.

The problem of model selection among Bayesian networks with hidden variables, that is, networks with variables whose values are not observed, is more difficult than model selection among networks without hidden variables. First, the space of possible networks becomes infinite, and second, scoring each network is computationally harder because one must account for all possible values of the missing variables (Cooper and Herskovits, 1992). Our goal is to develop a Bayesian scoring approach for networks that include hidden variables. Obtaining such a score that is computationally effective and conceptually simple will allow us to select a model from among a set of competing models.

Our approach is to use an asymptotic approximation of the marginal likelihood. This asymptotic approximation is known as the Bayesian Information Criterion (BIC) (Schwarz, 1978; Haughton, 1988), and is equivalent to Rissanen's (1987) minimum description length (MDL). Such an asymptotic approximation has been carried out for Bayesian networks by Herskovits (1991) and Bouckaert (1995) when no hidden variables are present. Bouckaert (1995) shows that the marginal likelihood of data D given a
network structure S is given by

log p(D | S^h) = H(S, D) N − (1/2) dim(S) log(N) + O(1),    (1)

where N is the sample size of the data, H(S, D) is the entropy of the probability distribution obtained by projecting the frequencies of observed cases into the conditional probability tables of the Bayesian network S, and dim(S) is the number of parameters in S. Eq. 1 reveals the qualitative preferences made by the Bayesian approach. First, with sufficient data, a network structure that is an I-map of the true distribution is more likely than a network structure that is not an I-map of the true distribution. Second, among all network structures that are I-maps of the true distribution, the one with the minimum number of parameters is more likely.

Eq. 1 was derived from an explicit formula for the probability of a network given data by letting the sample size N run to infinity and using a Dirichlet prior for its parameters. Nonetheless, Eq. 1 does not depend on the selected prior. In Section 3, we use Laplace's method to rederive Eq. 1 without assuming a Dirichlet prior. Our derivation is a standard application of asymptotic Bayesian analysis. This derivation is useful for gaining intuition for the hidden-variable case. In Section 4, we provide an approximation to the marginal likelihood for Bayesian networks with hidden variables, and give a heuristic argument for this approximation using Laplace's method. We obtain the following equation:

log p(D | S^h) ≈ log p(D | θ̂_S, S^h) − (1/2) dim(S, θ̂_S) log(N),    (2)

where θ̂_S is the maximum likelihood (ML) value for the parameters of the network and dim(S, θ̂_S) is the dimension of S at the ML value for Θ_S. The dimension of a model can be interpreted in two equivalent ways. First, it is the number of free parameters needed to represent the parameter space near the maximum likelihood value. Second, it is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable (non-hidden) variables. In any case, the dimension depends on the value of θ̂_S, in contrast to Eq. 1, where the dimension is fixed throughout the parameter space.

In Section 5, we compute the dimensions of several network structures, including the naive Bayes model with a hidden class node. In Section 6, we demonstrate that the scoring function used in AutoClass sometimes diverges from p(S | D) asymptotically. In Sections 7 and 8, we describe how our heuristic approach can be extended to Gaussian and sigmoid networks.
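To make the two terms of Eq. 2 concrete for the fully observed case, here is a small sketch (our own illustration; the names are not from the paper) that computes the maximized log-likelihood of one multinomial family and the penalized score:

import numpy as np

def family_loglik(counts):
    # Maximized log-likelihood of one CPD column: sum_k N_k * log(N_k / N),
    # with 0 * log 0 treated as 0.
    counts = np.asarray(counts, dtype=float)
    nz = counts[counts > 0]
    return float(np.sum(nz * np.log(nz / counts.sum())))

def bic(loglik_at_mle, dim, n_samples):
    # Eq. 2-style score: log p(D | theta_hat, S) - (dim / 2) * log N.
    return loglik_at_mle - 0.5 * dim * np.log(n_samples)

# Toy example: one binary variable observed 60/40 in N = 100 cases.
print(bic(family_loglik([60, 40]), dim=1, n_samples=100))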
2. Background

We introduce the following notation. Let S be a Bayesian network structure over the discrete variables X = {X_1, ..., X_n}, and let r_i be the number of states of X_i. Let Pa_i denote the set of variables that are the parents of node X_i in S, and let pa_i^j denote the j-th instance of these parents, that is, the j-th assignment of states to the variables in Pa_i; the number of such instances is q_i = Π_{X_l ∈ Pa_i} r_l. We use θ_ijk to denote the parameter associated with the probability P(X_i = x_i^k | Pa_i = pa_i^j); note that Σ_k θ_ijk = 1, so that one parameter in each such set is redundant. When the intended node is unambiguous, we write θ_ijk = P(x_i^k | pa_i^j). In addition, we use θ_ij = ∪_k {θ_ijk}, θ_i = ∪_j θ_ij, and θ_S = ∪_i θ_i to denote the parameters associated with a family, with a node, and with the entire network structure S, respectively.

To compute the marginal likelihood p(D | S^h) in closed form, several assumptions are usually made (Cooper and Herskovits, 1992; Spiegelhalter and Lauritzen, 1990; Heckerman et al., 1995). First, each case in the data D is complete, that is, every variable is observed in every case. Second, the parameter sets θ_ij are mutually independent (global and local parameter independence). Third, each parameter set θ_ij has a Dirichlet distribution with exponents α_ijk > 0. Fourth, parameter modularity holds: if a node has the same parents in two network structures S_1 and S_2, then the prior distributions of the parameters associated with that family are identical in both. Fifth, likelihood equivalence holds: two network structures that represent the same sets of independence assumptions receive the same marginal likelihood for every data set.

Under the first three assumptions, Cooper and Herskovits (1992) obtained the following closed-form expression:

p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(α_ij) / Γ(α_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ],

where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j, N_ij = Σ_k N_ijk, and α_ij = Σ_k α_ijk. We call the resulting scoring function the CH (Cooper-Herskovits) scoring function.

The last two assumptions relate the priors assigned to different network structures. Note that even when two network structures S_1 and S_2 are equivalent, that is, they represent the same sets of independence assumptions, the events S_1^h and S_2^h remain distinct hypotheses. This treatment was also adopted by Heckerman et al. (1995), where arcs are interpreted as causal influences. Heckerman et al. (1995) characterize the priors for which likelihood equivalence holds together with the other assumptions. When N is finite the choice of exponents matters; as N grows large, however, the counts N_ijk dominate the exponents and the contribution of the prior washes away.
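The CH score above is easy to evaluate from the counts N_ijk. The following sketch computes its logarithm for complete discrete data (our own illustration: the helper names are hypothetical, and a single constant exponent alpha_ijk = alpha is assumed, so that alpha_ij = r_i * alpha):

from math import lgamma
from collections import Counter

def log_ch_score(data, structure, states, alpha=1.0):
    # data: list of dicts mapping variable name -> state.
    # structure: dict mapping variable name -> tuple of parent names.
    # states: dict mapping variable name -> list of its states.
    total = 0.0
    for x, parents in structure.items():
        r = len(states[x])
        # Counts N_ijk grouped by parent configuration j and child state k.
        n_ijk = Counter((tuple(case[p] for p in parents), case[x]) for case in data)
        n_ij = Counter()
        for (j, _), n in n_ijk.items():
            n_ij[j] += n
        for j, nj in n_ij.items():
            total += lgamma(r * alpha) - lgamma(r * alpha + nj)
            for k in states[x]:
                n = n_ijk.get((j, k), 0)
                total += lgamma(alpha + n) - lgamma(alpha)
    return total

Parent configurations that never occur in the data contribute factors of one and can safely be skipped, as done above.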
3. Asymptotics Without Hidden Variables

We shall now rederive Eq. 1 using Laplace's method, a standard technique of asymptotic Bayesian analysis. Our derivation bypasses the explicit closed-form expression for p(D_N | S^h) used by Herskovits (1991) and Bouckaert (1995), which is analyzed by expanding the Gamma functions Γ(·) with Stirling's approximation and letting N grow to infinity. Instead, we do not assume a Dirichlet prior at all; we only assume that the prior p(θ | S^h) is positive in a neighborhood of the maximum likelihood value. Intuitively, the likelihood grows with the sample size N while the prior does not, so the contribution of the prior, which encodes the user's initial confidence, washes away as N grows, and the qualitative result of Eq. 1 does not depend on the choice of prior. Keeping the assumptions of the derivation explicit is useful because the same argument is extended to networks with hidden variables in the next section.

We begin by defining f(θ) = log p(D_N | θ, S^h), where D_N denotes a data set of sample size N and θ are the parameters of the network. To approximate the marginal likelihood p(D_N | S^h), we expand f around its maximum likelihood value and use the normalization constant of a multivariate-normal distribution. That is,
p(D_N | S^h) = ∫ p(D_N | θ, S^h) p(θ | S^h) dθ = ∫ exp{ f(θ) } p(θ | S^h) dθ.    (3)

Assuming f(θ) has a maximum, the ML value θ̂, we have f'(θ̂) = 0. Using a Taylor-series expansion of f(θ) around the ML value, we get

f(θ) ≈ f(θ̂) + (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂),    (4)
where f''(θ) is the Hessian of f, the square matrix of second derivatives with respect to every pair of variables {θ_ijk, θ_i'j'k'}. Consequently, from Eqs. 3 and 4,

log p(D | S^h) ≈ f(θ̂) + log ∫ exp{ (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂) } p(θ | S^h) dθ.    (5)

We assume that −f''(θ̂) is positive definite, and that, as N grows to infinity, the peak in a neighborhood around the maximum becomes sharper. Consequently, if we ignore the prior, we get a normal distribution around the peak. Furthermore, if we assume that the prior p(θ | S^h) is not zero around θ̂, then as N grows it can be assumed constant and so removed from the integral in Eq. 5. The remaining integral is approximated by the formula for multivariate-normal distributions:

∫ exp{ (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂) } dθ ≈ sqrt( (2π)^d / det[ −f''(θ̂) ] ),    (6)

where d is the number of parameters in θ, d = Σ_{i=1}^{n} (r_i − 1) q_i. As N grows to infinity, the above approximation becomes more precise because the entire mass becomes concentrated around the peak. Plugging Eq. 6 into Eq. 5 and noting that det[−f''(θ̂)] is proportional to N yields the BIC:

log p(D_N | S^h) ≈ log p(D_N | θ̂, S^h) − (d/2) log(N).    (7)

A careful derivation in this spirit shows that, under certain conditions, the relative error in this approximation is O_p(1) (Schwarz, 1978; Haughton, 1988). For Bayesian networks, the function f(θ) is known, so all the assumptions about this function can be verified. First, we note that f''(θ̂) is a block-diagonal matrix where each block A_ij corresponds to a variable X_i and a particular instance j of Pa_i, and is of size (r_i − 1) x (r_i − 1). Let us examine one such A_ij. To simplify notation, assume that X_i has three states, and let w_1, w_2, and w_3 denote θ_ijk for k = 1, 2, 3, where i and j are fixed. We consider only those cases in D_N where Pa_i = pa_i^j, and examine only the observations of X_i. Let D'_N denote the set of N values of X_i obtained in this process. With each observation we associate two indicator functions x_i and y_i: the function x_i is one if X_i gets its first value in case i and is zero otherwise; similarly, y_i is one if X_i gets its second value in case i and is zero otherwise.
The log-likelihood function of D_N is given by

λ(w_1, w_2) = log Π_{i=1}^{N} w_1^{x_i} w_2^{y_i} (1 − w_1 − w_2)^{1 − x_i − y_i}.    (8)

To find the maximum, we set the first derivatives of this function to zero. The resulting equations are called the maximum likelihood equations:

λ_{w_1}(w_1, w_2) = Σ_{i=1}^{N} [ x_i / w_1 − (1 − x_i − y_i) / (1 − w_1 − w_2) ] = 0
λ_{w_2}(w_1, w_2) = Σ_{i=1}^{N} [ y_i / w_2 − (1 − x_i − y_i) / (1 − w_1 − w_2) ] = 0.

The only solution to these equations is w_1 = x̄ = Σ_i x_i / N, w_2 = ȳ = Σ_i y_i / N, which is the maximum likelihood value. The Hessian of λ(w_1, w_2) at the ML value is given by

λ''(w_1, w_2) = [ λ''_{w_1 w_1}  λ''_{w_1 w_2} ; λ''_{w_2 w_1}  λ''_{w_2 w_2} ]
             = −N [ 1/x̄ + 1/(1 − x̄ − ȳ)      1/(1 − x̄ − ȳ)
                    1/(1 − x̄ − ȳ)             1/ȳ + 1/(1 − x̄ − ȳ) ].    (9)

The bracketed matrix decomposes into the sum of two matrices. One is a diagonal matrix with the positive numbers 1/x̄ and 1/ȳ on the diagonal. The second is a constant matrix in which all elements equal the positive number 1/(1 − x̄ − ȳ). Because these two matrices are positive definite and non-negative definite, respectively, the Hessian is positive definite. This argument also holds when X_i has more than three values. Because the maximum likelihood equations have a single solution, the Hessian is positive definite, and the peak becomes sharper as N increases (Eq. 9), all the conditions for the general derivation of the BIC are met. Plugging the maximum likelihood value into Eq. 7, which is correct to O(1), yields Eq. 1.

4. Asymptotics With Hidden Variables

Let us now consider the situation where S contains hidden variables. In this case, we cannot use the derivation in the previous section, because the log-likelihood function log p(D_N | S^h, θ) does not necessarily tend toward a peak as the sample size increases. Instead, the log-likelihood function can tend toward a ridge. Consider, for example, a network with one arc
H → X, where H is hidden, X is observed, and both variables are binary. Assume that the probability that X (unconditionally) attains its first value is w. The likelihood function for a data set D_N is then w^{Σ_i x_i} (1 − w)^{N − Σ_i x_i}, where x_i is an indicator function that is one when X attains its first value in case i and zero otherwise. In terms of the network parameters, w = θ_h θ_{x|h} + (1 − θ_h) θ_{x|h̄}, and any value of θ that solves the equation

θ_h θ_{x|h} + (1 − θ_h) θ_{x|h̄} = Σ_i x_i / N

will maximize the likelihood. The maximum is therefore attained not at a unique point but along a ridge of solutions.

This example suggests how the approximation of the previous section must be modified. Given a network structure S with hidden variables, let O ⊆ X be the observable variables and let W = {w_1, ..., w_m} be a set of non-redundant parameters describing the joint distribution of O. Corresponding to every value of the network parameters θ there is a value of W; that is, the network defines a smooth map W = g(θ) from the space of network parameters into R^m. The image of this map is, in general, a curved manifold M embedded in the Euclidean space R^m.[1] Around a regular point, g behaves locally like a linear transformation given by its Jacobian matrix J(θ) = [∂w_l/∂θ_j]: when rank J(θ) = k throughout a small region around θ, the image of a small ball around θ resembles a linear space of dimension k, and we can identify a set A = {φ_1, ..., φ_d} of local coordinates for M around the image point, where d is the rank of the Jacobian matrix.

Now consider the log-likelihood log p(D_N | θ, S^h). Because the data consist of observations of the variables in O only, it can be written as log p(D_N | g(θ), S^h); that is, the data influence the likelihood only through the observable parameters W = g(θ). As the sample size N increases, the log-likelihood becomes peaked around the maximum likelihood value of W, but it remains constant along directions of θ that do not change g(θ). Consequently, if we express the likelihood in the local coordinates A around the ML value and apply the argument of the previous section to these d non-redundant coordinates, only d parameters are penalized. This heuristic argument yields the approximation

log p(D_N | S^h) ≈ log p(D_N | θ̂, S^h) − (d/2) log N,    (10)

where θ̂ is the maximum likelihood value of the network parameters and d = dim(S, θ̂) is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables, evaluated at θ̂.

[1] For terminology and basic facts in differential geometry, see Spivak (1979).
Returning to our problem, the mapping from θ to W is a polynomial function of θ. Thus, as the next theorem shows, the rank of the Jacobian matrix [∂w_l/∂θ_j] is almost everywhere some fixed constant d, which we call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of S, that is, the dimension of S.

Theorem 1 Let θ be the parameters of a network S for variables X with observable variables O ⊆ X. Let W be the parameters of the true joint distribution of the observable variables. If each parameter in W is a polynomial function of θ, then rank[∂W/∂θ (θ)] = d almost everywhere, where d is a constant.

Proof: Because the mapping from θ to W is polynomial, each entry in the matrix J(θ) = [∂w_l/∂θ_j (θ)] is a polynomial in θ. When diagonalizing J, the leading elements of the first d lines remain polynomials in θ, whereas all other lines, which are dependent given every value of θ, become identically zero. The rank of J(θ) falls below d only for values of θ that are roots of some of the polynomials in the diagonalized matrix. The set of all such roots has measure zero. □

Our heuristic argument for Eq. 10 does not provide us with the error term. Researchers have shown that O_p(1) relative errors are attainable for a variety of statistical models (e.g., Schwarz, 1978, and Haughton, 1988). Although the arguments of these researchers do not directly apply to our case, it may be possible to extend their methods to prove our conjecture.

5. Computations of the Rank

We have argued that the second term of the BIC for Bayesian networks with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. In this section, we explain how to compute this rank, and demonstrate the approach with several examples.

Theorem 1 suggests a random algorithm for calculating the rank. Compute the Jacobian matrix J(θ) symbolically from the equation W = g(θ); this computation is possible since g is a vector of polynomials in θ. Then, assign a random value to θ and diagonalize the numeric matrix J(θ). Theorem 1 guarantees that, with probability 1, the resulting rank is the regular rank of J. For every network, select, say, ten values for θ, and determine r to be the maximum of the resulting ranks. In all our experiments, none of the randomly chosen values for θ accidentally reduced the rank.

We now demonstrate the computation of the needed rank for a naive Bayes model with one hidden variable H and two feature variables X_1 and X_2. Assume all three variables are binary. The set of parameters W = g(θ)
is given by

w_{x1 x2}  = θ_h θ_{x1|h} θ_{x2|h} + (1 − θ_h) θ_{x1|h̄} θ_{x2|h̄}
w_{x̄1 x2} = θ_h (1 − θ_{x1|h}) θ_{x2|h} + (1 − θ_h) (1 − θ_{x1|h̄}) θ_{x2|h̄}
w_{x1 x̄2} = θ_h θ_{x1|h} (1 − θ_{x2|h}) + (1 − θ_h) θ_{x1|h̄} (1 − θ_{x2|h̄}).

The 3 x 5 Jacobian matrix for this transformation is given in Figure 1, where θ_{x̄i|h} denotes 1 − θ_{xi|h} (i = 1, 2), and similarly for h̄. The columns correspond to differentiation with respect to θ_{x1|h}, θ_{x2|h}, θ_{x1|h̄}, θ_{x2|h̄}, and θ_h, respectively.

Figure 1. The Jacobian matrix for a naive Bayesian network with two binary feature nodes:

[  θ_h θ_{x2|h}    θ_h θ_{x1|h}    (1−θ_h) θ_{x2|h̄}    (1−θ_h) θ_{x1|h̄}    θ_{x1|h} θ_{x2|h} − θ_{x1|h̄} θ_{x2|h̄} ]
[ −θ_h θ_{x2|h}    θ_h θ_{x̄1|h}   −(1−θ_h) θ_{x2|h̄}    (1−θ_h) θ_{x̄1|h̄}   θ_{x̄1|h} θ_{x2|h} − θ_{x̄1|h̄} θ_{x2|h̄} ]
[  θ_h θ_{x̄2|h}  −θ_h θ_{x1|h}     (1−θ_h) θ_{x̄2|h̄}  −(1−θ_h) θ_{x1|h̄}     θ_{x1|h} θ_{x̄2|h} − θ_{x1|h̄} θ_{x̄2|h̄} ]

A symbolic computation of the rank of this matrix can be carried out, and it shows that the regular rank is equal to the dimension of the matrix, namely 3. Nonetheless, as we have argued, in order to compute the regular rank one can simply choose random values for θ and diagonalize the resulting numerical matrix. We have done so for naive Bayes models with one binary hidden root node and n ≤ 7 binary observable non-root nodes. The size of the associated matrices is (1 + 2n) x (2^n − 1). The regular rank for n = 3, ..., 7 was found to be 1 + 2n. We conjecture that 1 + 2n is the regular rank for all n > 2. For n = 1, 2, the rank is 1 and 3, respectively, which is the size of the full parameter space over one and two binary variables. The rank cannot be greater than 1 + 2n because this is the maximum possible dimension of the Jacobian matrix. In fact, we have proven a lower bound of 2n as well.

Theorem 2 Let S be a naive Bayes model with one binary hidden root node and n > 2 binary observable non-root nodes. Then 2n ≤ r ≤ 2n + 1, where r is the regular rank of the Jacobian matrix between the parameters of the network and the parameters of the feature variables.
The proof is obtained by diagonalizing the Jacobian matrix symbolically and showing that there are at least 2n independent lines. The computation for 3 ≤ n ≤ 7 shows that, for naive Bayes models with a binary hidden root node, there are no redundant parameters. Therefore, the best way to represent a probability distribution that is representable by such a model is to use the network representation explicitly.
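The random-point procedure described above is also easy to reproduce numerically. The sketch below (our own illustration, using a finite-difference Jacobian rather than symbolic differentiation) recovers the regular rank 3 for the two-feature example:

import numpy as np

def observable_params(theta):
    # Map network parameters to W = (w_{x1 x2}, w_{~x1 x2}, w_{x1 ~x2}).
    t_x1h, t_x2h, t_x1hb, t_x2hb, t_h = theta
    w11 = t_h * t_x1h * t_x2h + (1 - t_h) * t_x1hb * t_x2hb
    w01 = t_h * (1 - t_x1h) * t_x2h + (1 - t_h) * (1 - t_x1hb) * t_x2hb
    w10 = t_h * t_x1h * (1 - t_x2h) + (1 - t_h) * t_x1hb * (1 - t_x2hb)
    return np.array([w11, w01, w10])

def jacobian_rank(f, theta, eps=1e-6):
    # Numeric rank of the Jacobian of f at the point theta.
    base = f(theta)
    cols = []
    for i in range(len(theta)):
        t = np.array(theta, dtype=float)
        t[i] += eps
        cols.append((f(t) - base) / eps)
    return np.linalg.matrix_rank(np.column_stack(cols), tol=1e-4)

rng = np.random.default_rng(0)
theta = rng.uniform(0.05, 0.95, size=5)
print(jacobian_rank(observable_params, theta))   # expected regular rank: 3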
Nonetheless, this result does not hold for all models. For example, consider the following W structure:

A → C ← H → D ← B
where H is hidden . Assuming all five variables are binary , the space over the observables is representable by 15 parameters , and the number of parameters of the network is 11. In this example , we could not compute the rank symbolically . Instead , we used the following Mathematica code.
There are 16 functions (only 15 are independent) defined by W = g(θ). In the Mathematica code, we use fijkl for the true joint probability w_{a=i,b=j,c=k,d=l}, cij for the true conditional probability θ_{c=0|a=i,h=j}, dij for θ_{d=0|b=i,h=j}, a for θ_{a=0}, b for θ_{b=0}, and h0 for θ_{h=0}.
The first function is given by
f0000[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
  a * b * (h0 * c00 * d00 + (1 - h0) * c01 * d01)

and the other functions are similarly written. The Jacobian matrix is computed by the command Outer, which has three arguments. The first is D, which stands for the differentiation operator, the second is a set of functions, and the third is a set of variables.
J[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
  Outer[D, {f0000[a, b, h0, c00, c01, ..., d11],
            f0001[a, b, h0, c00, ..., c11, d00, ..., d11],
            ...,
            f1111[a, b, h0, c00, ..., c11, d00, ..., d11]},
           {a, b, h0, c00, c01, c10, c11, d00, d01, d10, d11}]

The next command produces a diagonalized matrix at a random point with a precision of 30 decimal digits. This precision was selected so that matrix elements equal to zero would be correctly identified as such.
N[RowReduce[J[a, b, h0, c00, ..., c11, d00, ..., d11] /.
  {a -> Random[Integer, {1, 999}]/1000,
   b -> Random[Integer, {1, 999}]/1000,
   ...,
   d11 -> Random[Integer, {1, 999}]/1000}], 30]

The result of this Mathematica program was a diagonalized matrix with 9 non-zero rows and 7 rows containing all zeros. The same counts were obtained in ten runs of the program. Hence, the regular rank of this Jacobian matrix is 9 with probability 1.
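As a cross-check of the Mathematica computation, the same rank can be obtained numerically; the sketch below (ours) uses central differences, which are exact here up to rounding because each joint probability is affine in every single network parameter, together with an explicit singular-value tolerance to separate the genuinely zero directions.

import numpy as np

rng = np.random.default_rng(1)

def joint(params):
    """Joint distribution over (A, B, C, D) for the W structure
    A -> C <- H -> D <- B with all variables binary and H hidden.
    params = [a0, b0, h0, c00..c11, d00..d11], with the same meaning
    as in the Mathematica code above."""
    a0, b0, h0 = params[0], params[1], params[2]
    c = params[3:7].reshape(2, 2)    # c[i, j] = P(C=0 | A=i, H=j)
    d = params[7:11].reshape(2, 2)   # d[i, j] = P(D=0 | B=i, H=j)
    pa, pb, ph = [a0, 1 - a0], [b0, 1 - b0], [h0, 1 - h0]
    w = np.zeros((2, 2, 2, 2))
    for i in range(2):
        for j in range(2):
            for k in range(2):
                for l in range(2):
                    w[i, j, k, l] = pa[i] * pb[j] * sum(
                        ph[h]
                        * (c[i, h] if k == 0 else 1 - c[i, h])
                        * (d[j, h] if l == 0 else 1 - d[j, h])
                        for h in range(2))
    return w.ravel()[:-1]            # 15 independent observable parameters

theta = rng.uniform(0.1, 0.9, size=11)
h = 1e-3
J = np.empty((15, 11))
for p in range(11):
    e = np.zeros(11); e[p] = h
    # affine dependence on each single parameter makes this exact
    J[:, p] = (joint(theta + e) - joint(theta - e)) / (2 * h)

print(np.round(np.linalg.svd(J, compute_uv=False), 6))   # two values collapse to ~0
print("rank:", np.linalg.matrix_rank(J, tol=1e-8))        # expected: 9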
The interpretation of this result is that, around almost every value of θ, one can locally represent the hidden W structure with only 9 parameters. In contrast, if we encode the distribution using the network parameters (θ) of the W structure, then we must use 11 parameters. Thus, two of the network parameters are locally redundant. The BIC approximation punishes this W structure according to its most efficient representation, which uses 9 parameters, and not according to the representation given by the W structure, which requires 11 parameters. It is interesting to note that the dimension of the W structure is 10 if H has three or four states, and 11 if H has 5 states. We do not know how to predict when the dimension changes as a result of increasing the number of hidden states without computing the dimension explicitly. Nonetheless, the dimension cannot increase beyond 12, because we can average out the hidden variable in the W structure (e.g., using arc reversals) to obtain another network structure that has only 12 parameters.
6. AutoClass The AutoClass clustering algorithm developed by Cheeseman and Stutz ( 1995) uses a naive Bayes model .2 Each state of the hidden root node H represents a cluster or class; and each observable node represents a measurable feature . The number of classes k is unknown a priori . AutoClass computes an approximation of the marginal likelihood of a naive Bayes model given the data using increasing values of k . When this probability reaches a peak for a specific k , that k is selected as the number of classes. Cheeseman and Stutz (1995) use the following formula to approximate the marginal likelihood :
log p(D|S) ≈ log p(D_c|S) + log p(D|S, θ̂_s) − log p(D_c|S, θ̂_s)
where D_c is a database consistent with the expected sufficient statistics computed by the EM algorithm. Although Cheeseman and Stutz suggested this approximation in the context of simple AutoClass models, it can be used to score any Bayesian network with discrete variables as well as other models (Chickering and Heckerman, 1996). We call this approximation the CS scoring function. Using the BIC approximation for p(D_c|S), we obtain
log p(D|S) ≈ log p(D|S, θ̂_s) − d′/2 log N

²The algorithm can handle conditional dependencies among continuous variables.
where d′ is the number of parameters of the network. (Given a naive Bayes model with k classes and n observable variables each with b states, d′ = nk(b − 1) + k − 1.) Therefore, the CS scoring function will converge asymptotically to the BIC and hence to p(D|S) whenever d′ is equal to the regular rank d of S. Given our conjecture in the previous section, we believe that the CS scoring function will converge to p(D|S) when the number of classes is two. Nonetheless, d′ is not always equal to d. For example, when b = 2, k = 3 and n = 4, the number of parameters is 14, but the regular rank of the Jacobian matrix is 13. We computed this rank using Mathematica as described in the previous section. Consequently, the CS scoring function will not always converge to p(D|S). This example is the only one that we have found so far, and we believe that incorrect results are obtained only for rare combinations of b, k and n. Nonetheless, a simple modification to the CS scoring function yields an approximation that will asymptotically converge to p(D|S):
log p(D|S) ≈ log p(D_c|S) + log p(D|S, θ̂_s) − log p(D_c|S, θ̂_s) − d/2 log N + d′/2 log N

Chickering and Heckerman (1996) show that this scoring function is often a better approximation for p(D|S) than is the BIC.

7. Gaussian Networks

In this section, we consider the case where each of the variables X = {X1, ..., Xn} is continuous. As before, let (S, θ_s) be a Bayesian network structure and the associated set of network parameters. A Gaussian network is one in which the joint distribution of the variables is multivariate Gaussian and the joint likelihood is the product of local likelihoods, each of which is the linear regression model

p(x_i | pa_i, θ_i, S) = N(m_i + Σ_{X_j ∈ Pa_i} b_ji x_j, v_i)

where N(μ, v), v > 0, is a normal (Gaussian) distribution with mean μ and variance v, m_i is a conditional mean of X_i, b_ji is a coefficient that represents the strength of the relationship between variable X_j and X_i, v_i is a variance,³ and θ_i is the set of parameters consisting of m_i, v_i, and the b_ji. The parameters θ_s of a Gaussian network with structure S is the set of all θ_i.

³m_i is the mean of X_i conditional on all parents being zero, b_ji corresponds to the partial regression coefficient of X_i on X_j given the other parents of X_i, and v_i corresponds to the residual variance of X_i given the parents of X_i.
To apply the techniques developed in this paper, we also need to specify the parameters of the observable variables. Given that the joint distribution is multivariate-normal and that multivariate-normal distributions are closed under marginalization, we only need to specify a vector of means for the observed variables and a covariance matrix over the observed variables. In addition, we need to specify how to transform the parameters of the network to the observable parameters. The transformation of the means and the transformation to obtain the observable covariance matrix can be accomplished via the trek-sum rule (for a discussion, see Glymour et al. 1987). Using the trek-sum rule, it is easy to show that the observable parameters are all sums of products of the network parameters. Given that the mapping W from θ_s to the observable parameters is a polynomial function of θ_s, it follows from Theorem 1 that the rank of the Jacobian matrix [∂W/∂θ_s] is almost everywhere some fixed constant d, which we again call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of S, that is, the dimension of S.

Let us consider two Gaussian models. We use Mathematica code similar to the code in Section 5 to compute their dimensions, because we can
not perform the computation symbolically. As in the previous experiments, none of the randomly chosen values of θ_s accidentally reduces the rank. Our first example is the naive Bayes model
H → X1, X2, X3, X4
in which H is the hidden variable and the X_i are observed. There are 14 network parameters: 5 conditional variances, 5 conditional means, and 4 linear parameters. The marginal distribution for the observed variables also has 14 parameters: 4 means, 4 variances, and 6 covariances. Nonetheless, the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This follows from the fact that this model imposes tetrad constraints (see Glymour et al. 1987). In this model the three tetrad constraints that hold in the distribution over the observed variables are

cov(X1, X2) cov(X3, X4) − cov(X1, X3) cov(X2, X4) = 0
cov(X1, X4) cov(X2, X3) − cov(X1, X3) cov(X2, X4) = 0
cov(X1, X4) cov(X2, X3) − cov(X1, X2) cov(X3, X4) = 0

two of which are independent. These two independent tetrad constraints lead to the reduction of dimensionality.
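The tetrad constraints are easy to verify numerically from the covariance matrix implied by the model (by the trek-sum rule, cov(X_i, X_j) = b_i b_j v_H for i ≠ j, where v_H is the variance of H and the b_i are the linear parameters). The following small sketch, with arbitrary parameter values of our choosing, checks that all three tetrad differences vanish.

import numpy as np

rng = np.random.default_rng(0)
vh = rng.uniform(0.5, 2.0)                   # variance of the hidden H
b = rng.uniform(-2.0, 2.0, size=4)           # linear parameters from H to X_i
v = rng.uniform(0.2, 1.0, size=4)            # conditional (residual) variances

# Covariance matrix over the observed X_1..X_4 implied by the model.
cov = np.outer(b, b) * vh + np.diag(v)

def c(i, j):
    return cov[i - 1, j - 1]

print(c(1, 2) * c(3, 4) - c(1, 3) * c(2, 4))   # ~0
print(c(1, 4) * c(2, 3) - c(1, 3) * c(2, 4))   # ~0
print(c(1, 4) * c(2, 3) - c(1, 2) * c(3, 4))   # ~0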
Our second example is the W structure described in Section 5 where each of the variables is continuous. There are 14 network parameters: 5 conditional means, 5 conditional variances, and 4 linear parameters. The marginal distribution for the observed variables has 14 parameters, whereas the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This coincides with the intuition that many values for the variance of H and the linear parameters for C ← H and H → D produce the same model for the observable variables, but once any two of these parameters are appropriately set, then the third parameter is uniquely determined by the marginal distribution for the observable variables.
8. Sigmoid Networks

Finally, let us consider the case where each of the variables {X1, ..., Xn} = X is binary (discrete), and each local likelihood is the generalized linear model
p(x_i | pa_i, θ_i, S) = sig(a_i + Σ_{X_j ∈ Pa_i} b_ji x_j)

where sig(x) is the sigmoid function sig(x) = 1/(1 + e^(−x)). These models, which we call sigmoid networks, are useful for learning relationships among discrete variables, because these models capture non-linear relationships among variables yet employ only a small number of parameters (Neal, 1992; Saul
et al., 1996) . Using techniques similar to those in Section 5, we can compute the
rank of the Jacobian matrix [∂W/∂θ_s]. We cannot apply Theorem 1 to conclude
that this rank is almost everywhere some fixed constant, because the local likelihoods are non-polynomial sigmoid functions. Nonetheless, the claim of Theorem 1 holds also for analytic transformations, hence a regular rank exists
for sigmoid networks as well (as confirmed by our experiments) . Our experiments show expected reductions in rank for several sigmoid networks . For example , consider the two -level network
H1, H2 → X1, X2, X3, X4 (each observed unit receives connections from both hidden units)
This network has 14 parameters. In each of 10 trials, we found the rank of the Jacobian matrix to be 14, indicating that this model has dimension 14. In contrast, consider the three-level network.
H3 → H1, H2;  H1, H2 → X1, X2, X3, X4
This network has 17 parameters, whereas the dimension we compute is 15. This reduction is expected, because we could encode the dependency between the two variables in the middle level by removing the variable in the top layer and adding an arc between these two variables, producing a network with 15 parameters.
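The rank computations for sigmoid networks can also be reproduced numerically. The sketch below (ours) assumes, consistent with the parameter counts quoted above, full connectivity between adjacent layers; it enumerates the observable distribution by summing over hidden configurations and uses complex-step differentiation (our choice, since the map is analytic rather than polynomial) to obtain an essentially exact Jacobian at a random point.

import itertools
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def obs_dist(params, layers):
    """Distribution over the bottom (visible) layer of a layered sigmoid
    belief net with full connectivity between adjacent layers.
    `layers` lists layer sizes top to bottom, e.g. [2, 4] or [1, 2, 4];
    each non-top unit has a bias plus one weight per parent."""
    params = np.asarray(params)
    idx, biases, weights = 0, [], []
    for li, size in enumerate(layers):
        n_par = 0 if li == 0 else layers[li - 1]
        biases.append(params[idx:idx + size]); idx += size
        weights.append(params[idx:idx + size * n_par].reshape(size, n_par)); idx += size * n_par

    n_vis = layers[-1]
    p_vis = np.zeros(2 ** n_vis, dtype=params.dtype)
    hidden_sizes = layers[:-1]
    for hconf in itertools.product(*[range(2)] * sum(hidden_sizes)):
        states, pos = [], 0
        for size in hidden_sizes:
            states.append(np.array(hconf[pos:pos + size])); pos += size
        prob = 1.0
        for li, s in enumerate(states):
            act = biases[li] if li == 0 else biases[li] + weights[li] @ states[li - 1]
            p1 = sig(act)
            prob = prob * np.prod(np.where(s == 1, p1, 1 - p1))
        p1 = sig(biases[-1] + weights[-1] @ states[-1])
        for k, vconf in enumerate(itertools.product(range(2), repeat=n_vis)):
            v = np.array(vconf)
            p_vis[k] += prob * np.prod(np.where(v == 1, p1, 1 - p1))
    return p_vis[:-1]            # 2**n_vis - 1 independent entries

def rank_at_random_point(layers, seed=0):
    n_params = sum(s * (0 if i == 0 else layers[i - 1]) + s
                   for i, s in enumerate(layers))
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_params)
    h = 1e-20
    J = np.empty((2 ** layers[-1] - 1, n_params))
    for p in range(n_params):
        t = theta.astype(complex); t[p] += 1j * h
        # complex-step derivative: accurate to machine precision
        J[:, p] = obs_dist(t, layers).imag / h
    return np.linalg.matrix_rank(J, tol=1e-10)

print(rank_at_random_point([2, 4]))     # two-level net: 14 parameters, rank 14
print(rank_at_random_point([1, 2, 4]))  # three-level net: 17 parameters, rank 15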
References

Bouckaert, R. (1995). Bayesian belief networks: From construction to inference. PhD thesis, University Utrecht.
Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225.
Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8:195-210.
Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatesky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI Press, Menlo Park, CA.
Chickering, D. and Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence, Portland, OR, pages 158-168. Morgan Kaufmann.
Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.
Geiger, D. and Heckerman, D. (1995). A characterization of the Dirichlet distribution with application to learning Bayesian networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 196-207. Morgan Kaufmann. See also Technical Report TR-95-16, Microsoft Research, Redmond, WA, February 1995.
Glymour, C., Scheines, R., Spirtes, P., and Kelly, K. (1987). Discovering Causal Structure. Academic Press.
Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16:342-355.
Heckerman, D. (1995a). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA. Revised November, 1996.
Heckerman, D. (1995b). A Bayesian approach for learning causal networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 285-295. Morgan Kaufmann.
Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.
Herskovits, E. (1991). Computer-based probabilistic network construction. PhD thesis, Medical Information Sciences, Stanford University, Stanford, CA.
Lam, W. and Bacchus, F. (1993). Using causal information and local measures to learn Bayesian networks. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 243-250. Morgan Kaufmann.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.
Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49:223-239 and 253-265.
Saul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461-464.
Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219-282.
Spiegelhalter, D. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605.
Spivak, M. (1979). A Comprehensive Introduction to Differential Geometry 1, 2nd edition. Publish or Perish, Berkeley, CA.
Suzuki, J. (1993). A construction of Bayesian networks from databases based on an MDL scheme. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 266-273. Morgan Kaufmann.
A HIERARCHICAL COMMUNITY OF EXPERTS

GEOFFREY E. HINTON, BRIAN SALLANS AND ZOUBIN GHAHRAMANI
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada M5S 3H5
{hinton, sallans, zoubin}@cs.toronto.edu
Abstract . We describe a directed acyclic graphical model that contains a hierarchy of linear units and a mechanism for dynamically selecting an appropriate subset of these units to model each observation . The non-linear selection mechanism is a hierarchy of binary units each of which gates the output of one of the linear units . There are no connections from linear units to binary units , so the generative model can be viewed as a logistic belief net (Neal 1992) which selects a skeleton linear model from among the available linear units . We show that Gibbs sampling can be used to learn the parameters of the linear and binary units even when the sampling is so brief that the Markov chain is far from equilibrium .
1. Multilayer networks of linear-Gaussian units
We consider hierarchical generative models that consist of multiple layers of simple, stochastic processing units connected to form a directed acyclic graph. Each unit receives incoming, weighted connections from units in the layer above and it also has a bias (see figure 1). The weights on the connections and the biases are adjusted to maximize the likelihood that the layers of "hidden" units would produce some observed data vectors in the bottom layer of "visible" units. The simplest kind of unit we consider is a linear-Gaussian unit. Following the usual Bayesian network formalism, the joint probability of the
Figure 1. Units in a belief network.

states of all the units in the network is the product of the local probabilities of each unit given the states of its parents. The units in the top layer of the network have a Gaussian distribution with a learned mean and a learned variance. Given the states, y_k, of the units in the layer above, each unit, j, in the next layer down computes its top-down input, ŷ_j:

ŷ_j = b_j + Σ_k w_kj y_k    (1)

where b_j is the bias of unit j and w_kj is the learned weight on the connection from unit k in the layer above to unit j. The state of unit j is then Gaussian distributed with mean ŷ_j and variance σ_j², which is also learned.

Linear-Gaussian generative models of this kind, which include factor analysis (Everitt, 1984), have two important advantages: given the visible data, it is straightforward to compute the posterior distribution over the states of the unobserved units, and it is easy to update all of the parameters with a tractable EM algorithm. Unfortunately, because they are linear, they ignore the higher-order statistical structure in the data that is crucial for tasks like vision. One sensible way to extend linear-Gaussian models so that they can capture this higher-order structure is to use a mixture of M such models (Ghahramani and Hinton, 1996; Hinton et al., 1997). This retains
tractability because the full posterior distribution can be found by computing the posterior across each of the M models and then normalizing. However, a mixture of linear models is not flexible enough to represent the kind of data that is typically found in images. If an image can have several different objects in it, the pixel intensities cannot be accurately modelled by a mixture unless there is a separate linear model for each possible combination of objects. Clearly, the efficient way to represent an image that contains n objects is to use a "distributed" representation that contains n separate parts, but this cannot be achieved using a mixture because the non-linear selection process in a mixture consists of picking one of the linear models. What we need is a non-linear selection process that can pick arbitrary subsets of the available linear-Gaussian units so that some units can be used for modelling one part of an image, other units can be used for modelling other parts, and higher level units can be used for modelling the redundancies between the different parts.

2. Multilayer networks of binary-logistic units
Multilayer networks of binary-logistic units in which the connections form a directed acyclic graph were investigated by Neal (1992). We call them logistic belief nets or LBN's. In the generative model, each unit computes its top-down input, ŝ_j, in the same way as a linear-Gaussian unit, but instead of using this top-down input as the mean of a Gaussian distribution it uses it to determine the probability of adopting each of the two states 1 and 0:
ŝ_j = b_j + Σ_k w_kj s_k    (2)

p(s_j = 1 | {s_k : k ∈ pa_j}) = σ(ŝ_j) = 1/(1 + e^(−ŝ_j))    (3)
where pa_j is the set of units that send generative connections to unit j (the "parents" of j), and σ(·) is the logistic function. A binary-logistic unit does not need a separate variance parameter because the single statistic ŝ_j is sufficient to define a Bernoulli distribution. Unfortunately, it is exponentially expensive to compute the exact posterior distribution over the hidden units of an LBN when given a data point, so Neal used Gibbs sampling: With a particular data point clamped on the visible units, the hidden units are visited one at a time. Each time hidden unit u is visited, its state is stochastically selected to be 1 or 0 in proportion to two probabilities. The first, P^(α\s_u=1) = p(s_u = 1, {s_k : k ≠ u}), is the joint probability of generating the states of all the units in the network (including u) if u has state 1 and all the others have the state defined by the current configuration of states, α. The second, P^(α\s_u=0), is the same
quantity computed with u having state 0 and all the other units having the states defined by α. When calculating these probabilities, the states of all the other units are held constant. It can be shown that repeated application of this stochastic decision rule eventually leads to hidden configurations being selected according to their posterior probabilities. Because the LBN is acyclic, it is easy to compute the joint probability P^α of a configuration α of states of all the units:

P^α = Π_i p(s_i^α | {s_k^α : k ∈ pa_i})    (4)

where s_i^α is the binary state of unit i in configuration α.

It is convenient to work in the domain of negative log probabilities, which are called energies by analogy with statistical physics. We define E^α to be −ln P^α:

E^α = −Σ_u ( s_u^α ln ŝ_u^α + (1 − s_u^α) ln(1 − ŝ_u^α) )    (5)

where s_u^α is the binary state of unit u in configuration α, ŝ_u^α is the top-down expectation generated by the layer above, and u is an index over all the units in the net.

The rule for stochastically picking a new state for u requires the ratio of two probabilities, and hence the difference of two energies:

ΔE_u^α = E^(α\s_u=0) − E^(α\s_u=1)    (6)

p(s_u = 1 | {s_k : k ≠ u}) = σ(ΔE_u^α)    (7)
All the contributions to the energy of configuration α that do not depend on s_j can be ignored when computing ΔE_j^α. This leaves a contribution that depends on the top-down expectation ŝ_j generated by the units in the layer above (see Eq. 3) and a contribution that depends on both the states, s_i, and the top-down expectations, ŝ_i, of units in the layer below (see figure 1):
ΔE_j^α = ln ŝ_j − ln(1 − ŝ_j) + Σ_i [ s_i^α ln ŝ_i^(α\s_j=1) + (1 − s_i^α) ln(1 − ŝ_i^(α\s_j=1))
         − s_i^α ln ŝ_i^(α\s_j=0) − (1 − s_i^α) ln(1 − ŝ_i^(α\s_j=0)) ]    (8)
Given samples from the posterior distribution, the generative weights of a LBN can be learned by using the online delta rule which performs gradient ascent in the log likelihood of the data:
Δw_ji = ε s_j (s_i − ŝ_i)    (9)
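A minimal sketch of Eqs. (2)-(9) for a two-layer logistic belief net is given below (our own illustration, not the authors' code); the network sizes, learning rate, and toy data are arbitrary, and the top-layer term in the energy gap reduces to the bias b_j because ln ŝ_j − ln(1 − ŝ_j) = b_j for a top-layer unit.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

n_hid, n_vis, eps = 4, 6, 0.05
b = np.zeros(n_hid)                 # biases of the hidden (top-layer) units
c = np.zeros(n_vis)                 # biases of the visible units
W = 0.1 * rng.normal(size=(n_hid, n_vis))   # generative weights, hidden -> visible

def gibbs_step(h, v):
    """One sweep of Gibbs sampling over the hidden units with v clamped,
    using the energy gap of Eq. (8) specialised to a two-layer net."""
    for j in range(n_hid):
        h1, h0 = h.copy(), h.copy()
        h1[j], h0[j] = 1, 0
        s1 = sig(c + h1 @ W)        # top-down expectations of the visibles if s_j = 1
        s0 = sig(c + h0 @ W)        # ... and if s_j = 0
        dE = (b[j]
              + np.sum(v * np.log(s1) + (1 - v) * np.log(1 - s1)
                       - v * np.log(s0) - (1 - v) * np.log(1 - s0)))
        h[j] = rng.random() < sig(dE)
    return h

# toy training data: random binary vectors standing in for observations
data = (rng.random((20, n_vis)) < 0.5).astype(float)

for epoch in range(100):
    for v in data:
        h = (rng.random(n_hid) < 0.5).astype(float)
        for _ in range(5):                    # brief Gibbs sampling
            h = gibbs_step(h, v)
        # online delta rule, Eq. (9)
        s_hat_vis = sig(c + h @ W)            # top-down expectations of visibles
        s_hat_hid = sig(b)                    # top-down expectations of hiddens
        W += eps * np.outer(h, v - s_hat_vis)
        c += eps * (v - s_hat_vis)
        b += eps * (h - s_hat_hid)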
Figure 2. Units in a community of experts , a network of paired binary and linear units . Binary units (solid squares ) gate the outputs of corresponding linear units (dashed circles ) and also send generative connections to the binary units in the layer below . Linear units send generative connections to linear units in the layer below (dashed arrows ) .
3. Using binary units to gate linear units
It is very wasteful to use highly non-linear binary units to model data that is generated from continuous physical processes that behave linearly over small ranges. So rather than using a multilayer binary network to generate data directly , we use it to synthesize an appropriate linear model by selecting from a large set of available linear units . We pair a binary unit with each hidden linear unit (figure 2) and we use the same subscript for both units within a pair . We use y for the real-valued state of the linear unit and s for the state of the binary unit . The binary unit gates the output of the linear unit so Eq . 1 becomes:
ŷ_j = b_j + Σ_k w_kj y_k s_k    (10)
It is straightforward to include weighted connections from binary units to linear units in the layer below , but this was not implemented in the examples we describe later . To make Gibbs sampling feasible (see below ) we prohibit connections from linear units to binary units , so in the generative model the states of the binary units are unaffected by the linear units and are chosen using Eq . 2 and Eq . 3. Of course, during the inference process the states of the linear units do affect the states of the binary units . Given a data vector on the visible units , it is intractable to compute the posterior distribution over the hidden linear and binary units , so an
approximate inference method must be used. This raises the question of whether the learning will be adversely affected by the approximation errors that occur during inference . For example , if we use Gibbs sampling for inference and the sampling is too brief for the samples to come from the equilibrium distribution , will the learning fail to converge? We show in section 6 that it is not necessary for the brief Gibbs sampling to approach equilibrium . The only property we really require of the sampling is that it get us closer to equilibrium . Given this property we can expect the learning to improve a bound on the log probability of the data .
3.1. PERFORMING GIBBS SAMPLING

The obvious way to perform Gibbs sampling is to visit units one at a time and to stochastically pick a new state for each unit from its posterior distribution given the current states of all the other units. For a binary unit we need to compute the energy of the network with the unit on or off. For a linear unit we need to compute the quadratic function that determines how the energy of the net depends on the state of the unit. This obvious method has a significant disadvantage. If a linear unit, j, is gated out by its binary unit (i.e., s_j = 0) it cannot influence the units below it in the net, but it still affects the Gibbs sampling of linear units like k that send inputs to it because these units attempt to minimize (y_j − ŷ_j)²/2σ_j². So long as s_j = 0 there should be no net effect of y_j on the units in the layer above. These units completely determine the distribution of y_j, so sampling from y_j would provide no information about their distributions. The effect of y_j on the units in the layer above during inference is unfortunate because we hope that most of the linear units will be gated out most of the time and we do not want the teeming masses of unemployed linear units to disturb the delicate deliberations in the layer above. We can avoid this noise by integrating out the states of linear units that are gated out. Fortunately, the correct way to integrate out y_j is to simply ignore the energy contribution (y_j − ŷ_j)²/2σ_j².

A second disadvantage of the obvious sampling method is that the decision about whether or not to turn on a binary unit depends on the particular value of its linear unit. Sampling converges to equilibrium faster if we integrate over all possible values of y_j when deciding how to set s_j. This integration is feasible because, given all other units, y_j has one Gaussian posterior distribution when s_j = 1 and another Gaussian distribution when s_j = 0. During Gibbs sampling, we therefore visit the binary unit in a pair first and integrate out the linear unit in deciding the state of the binary unit. If the binary unit gets turned on, we then pick a state for the linear unit from the relevant Gaussian posterior. If the binary unit is turned off
it is unnecessary to pick a value for the linear unit. For any given configuration of the binary units, it is tractable to compute the full posterior distribution over all the selected linear units. So one interesting possibility is to use Gibbs sampling to stochastically pick states for the binary units, but to integrate out all of the linear units when making these discrete decisions. To integrate out the states of the selected linear units we need to compute the exact log probability of the observed data using the selected linear units. The change in this log probability when one of the linear units is included or excluded is then used in computing the energy gap for deciding whether or not to select that linear unit. We have not implemented this method because it is not clear that it is worth the computational effort of integrating out all of the selected linear units at the beginning of the inference process when the states of some of the binary units are obviously inappropriate and can be improved easily by only integrating out one of the linear units. Given samples from the posterior distribution, the incoming connection weights of both the binary and the linear units can be learned by using the online delta rule which performs gradient ascent in the log likelihood of the data. For the binary units the learning rule is Eq. 9. For linear units the rule
is :
Δw_ji = ε y_j s_j (y_i − ŷ_i) s_i / σ_i²    (11)
The learning rule for the biases is obtained by treating a bias as a weight coming from a unit with a state of 1.¹ The variance of the local noise in each linear unit, σ_j², can be learned by the online rule:

Δσ_j² = ε s_j [(y_j − ŷ_j)² − σ_j²]    (12)
Alternatively, σ_j² can be fixed at 1 for all hidden units and the effective local noise level can be controlled by scaling the incoming and outgoing weights.

4. Results on the bars task
The noisy bars task is a toy problem that demonstrates the need for sparse distributed representations (Hinton et al., 1995; Hinton and Ghahramani, 1997). There are four stages in generating each K × K image. First a global orientation is chosen, either horizontal or vertical, with both cases being equally probable. Given this choice, each of the K bars of the appropriate orientation is turned on independently with probability 0.4. Next, each active bar is given an intensity, chosen from a uniform distribution. Finally,

¹We have used w_ji to denote both the weights from binary units to binary units and from linear units to linear units; the intended meaning should be inferred from the context.
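The first three generation stages can be sketched directly in code; the final stage is cut off in our copy of the text, so the sketch below (ours) assumes it adds independent Gaussian pixel noise, and the image size K and noise level are arbitrary choices.

import numpy as np

def noisy_bars(K=6, p_on=0.4, noise_std=0.1, rng=np.random.default_rng(0)):
    """Generate one K x K 'noisy bars' image following the stages described
    above; the noise stage is an assumption (hypothetical noise_std)."""
    img = np.zeros((K, K))
    horizontal = rng.random() < 0.5            # global orientation, 50/50
    for k in range(K):
        if rng.random() < p_on:                # each bar on independently
            intensity = rng.uniform()          # uniform intensity per active bar
            if horizontal:
                img[k, :] += intensity
            else:
                img[:, k] += intensity
    return img + noise_std * rng.normal(size=(K, K))   # assumed noise stage

batch = np.stack([noisy_bars() for _ in range(16)])
print(batch.shape)          # (16, 6, 6)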
5. Results on handwritten digits

We trained a three-layer network on handwritten twos and threes from the CEDAR CDROM database (Hull, 1994). The 256-gray-scale digits were scaled down to an 8 x 8 grid, and the pixel values were rescaled to lie within [0, 1]. The data were divided into a training set of 2000 digits and a test set of 600 digits, with twos and threes being equally represented in both sets. A subset of the training data is shown in figure 5(a).

Figure 5. a) A small subset of the handwritten twos and threes used to train the network. b) Images generated by the trained network.

The network was similar to the one used for the bars problem: 64 visible units, 24 pairs of units in the first hidden layer, and a single pair of units in the top layer. It was trained with a learning rate of 0.01, followed by 0.02, and a small amount of weight decay; during learning, a few Gibbs sampling iterations were performed for each training case, with the first iterations discarded. The features learned by the linear-Gaussian units in the first hidden layer are shown in figure 6.

Figure 6. The weights from the linear-Gaussian units in the first hidden layer to the visible units. Some of the features are global, while others are
highly localized. The top binary unit is selecting the linear units in the first hidden layer that correspond to features found predominantly in threes, by exciting the corresponding binary units. Features that are exclusively used in twos are being gated out by the top binary unit, while features that can be shared between digits are being only slightly excited or inhibited. When the top binary unit is off, the features found in threes are inhibited by strong negative biases, while features used in twos are gated in by positive biases on the corresponding binary units. Examples of data generated by the trained network are shown in figure 5(b). The trained network was shown 600 test images, and 10 Gibbs sampling iterations were performed for each image. The top level binary unit was found to be off for 94% of twos, and on for 84% of threes. We then tried to improve classification by using prolonged Gibbs sampling. In this case, the first 300 Gibbs sampling iterations were discarded, and the activity of the top binary unit was averaged over the next 300 iterations. If the average activity of the top binary unit was above a threshold of 0.32, the digit was classified as a three; otherwise, it was classified as a two. The threshold was found by calculating the optimal threshold needed to classify 10 of the training samples under the same prolonged Gibbs sampling scheme. With prolonged Gibbs sampling, the average activity of the top binary unit was found to be below threshold for 96.7% of twos, and above threshold for 95.3% of threes, yielding an overall successful classification rate of 96% (with no rejections allowed). Histograms of the average activity of the top level binary unit are shown in figure 7.
Figure 7. Histograms of the average activity of the top level binary unit , after prolonged Gibbs sampling , when shown novel handwritten twos and threes . a) Average activity for twos in the test set . b ) Average activity for threes in the test set .
F = Σ_α Q_α E_α − ( −Σ_α Q_α ln Q_α )    (13)
If Q is the posterior distribution over hidden configurations given the visible configuration, then F is equal to the negative log probability of the visible configuration under the model defined by E. Otherwise, F exceeds the negative log probability of the visible configuration by the Kullback-Leibler divergence between Q and P:

F = −ln p(visible) + Σ_α Q_α ln(Q_α / P_α)    (14)

The EM algorithm (Neal and Hinton, 1993) consists of a full M step, which minimizes F with respect to the parameters that determine the energy function E given the distribution Q, and a full E step, which minimizes F with respect to Q given E and is achieved by setting Q to the exact posterior distribution over the hidden configurations. Viewing the algorithm as coordinate descent in F justifies partial M steps and partial E steps, which merely improve F with respect to the parameters or with respect to Q without fully minimizing it; in particular, a few sweeps of Gibbs sampling can be used as a partial E step. To eliminate the sampling noise, imagine that we have an infinite ensemble of identical networks, so that we can compute the exact distribution Q produced by a few sweeps of Gibbs sampling. We define Q^t and E^t to be the distribution and the energy function used during the partial E step at time t. Provided we start each partial E step from the distribution reached at the end of the previous one, F^(t+1) ≤ F^t, because the partial M step ensures that:

Σ_α Q_α^t E_α^(t+1) ≤ Σ_α Q_α^t E_α^t    (15)
while Gibbs sampling, however brief, ensures that:
Σ_α ( Q_α^(t+1) E_α^(t+1) + Q_α^(t+1) ln Q_α^(t+1) ) ≤ Σ_α ( Q_α^t E_α^(t+1) + Q_α^t ln Q_α^t )    (16)
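The relationship between Eqs. (13) and (14) is easy to check numerically on a toy model in which the joint probabilities P_α of the hidden configurations (together with a fixed visible configuration) are written down explicitly; the construction below is ours.

import numpy as np

rng = np.random.default_rng(0)

# Toy model: P_alpha = p(alpha, visible) for 8 hidden configurations; the
# remaining probability mass belongs to other visible configurations.
P = rng.dirichlet(np.ones(8)) * 0.3          # p(visible) = P.sum() = 0.3
E = -np.log(P)                               # energies, as in the text
posterior = P / P.sum()

def free_energy(Q):
    return np.sum(Q * E) + np.sum(Q * np.log(Q))    # Eq. (13)

Q = rng.dirichlet(np.ones(8))                # an arbitrary distribution over alpha
kl = np.sum(Q * np.log(Q / posterior))

print(free_energy(Q), -np.log(P.sum()) + kl)        # equal: Eq. (14)
print(free_energy(posterior), -np.log(P.sum()))     # equal when Q is the posterior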
In practice, we try to approximate an infinite ensemble by using a very small learning rate in a single network so that many successive partial E steps are performed using very similar energy functions. But it is still nice to know that with a sufficiently large ensemble it is possible for a simple learning algorithm to improve a bound on the log probability of the visible configurations even when the Gibbs sampling is far from equilibrium. Changing the parameters can move the equilibrium distribution further from the current distribution of the Gibbs sampler. The E step ensures that the Gibbs sampler will chase this shifting equilibrium distribution. One worrisome consequence of this is that the equilibrium distribution may end up
very far from the initial distribution of the Gibbs sampler. Therefore, when presented with a new data point for which we don't have a previous remembered Gibbs sample, inference can take a very long time since the Gibbs sampler will have to reach equilibrium from its initial distribution. There are at least three ways in which this problem can be finessed:

1. Explicitly learn a bottom-up initialization model. At each iteration t, the initialization model is used for a fast bottom-up recognition pass. The Gibbs sampler is initialized with the activities produced by this pass and proceeds from there. The bottom-up model is trained using the difference between the next sample produced by the Gibbs sampler and the activities it produced bottom-up.

2. Force inference to recapitulate learning. Assume that we store the sequence of weights during learning, from which we can obtain the sequence of corresponding energy functions. During inference, the Gibbs sampler is run using this sequence of energy functions. Since energy functions tend to get peakier during learning, this procedure should have an effect similar to annealing the temperature during sampling. Storing the entire sequence of weights may be impractical, but this procedure suggests a potentially interesting relationship between inference and learning.

3. Always start from the same distribution and sample briefly. The Gibbs sampler is initialized with the same distribution of hidden activities at each time step of learning and run for only a few iterations. This has the effect of penalizing models with an equilibrium distribution that is far from the distributions that the Gibbs sampler can reach in a few samples starting from its initial distribution.² We used this procedure in our simulations.

²The free energy, F, can be interpreted as a penalized negative log likelihood, where the penalty term is the Kullback-Leibler divergence between the approximating distribution Q_α and the equilibrium distribution (Eq. 14). During learning, the free energy can be decreased either by increasing the log likelihood of the model, or by decreasing this KL divergence. The latter regularizes the model towards the approximation.
7. Conclusion
We have described a probabilistic generative model consisting of a hierarchical network of binary units that select a corresponding network of linear units. Like the mixture of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), the binary units gate the linear units, thereby choosing an appropriate set of linear units to model nonlinear data. However, unlike the mixture of experts, each linear unit is its own expert, and any subset of experts can be selected at once, so we call this network a hierarchical community of experts.

Acknowledgements

We thank Peter Dayan, Michael Jordan, Radford Neal and Michael Revow for many helpful discussions. This research was funded by NSERC and the Ontario Information Technology Research Centre. GEH is the Nesbitt Burns Fellow of the Canadian Institute for Advanced Research.

References
Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall, London.
Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1 [ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz], Department of Computer Science, University of Toronto.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158-1161.
Hinton, G. E., Dayan, P., and Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8(1):65-74.
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Roy. Soc. London B, 352:1177-1190.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3:79-87.
Jordan, M. I. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.
Neal, R. M. and Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript [ftp://ftp.cs.utoronto.ca/pub/radford/em.ps.z], Department of Computer Science, University of Toronto.
AN INFORMATION-THEORETIC ANALYSIS OF HARD AND SOFT ASSIGNMENT METHODS FOR CLUSTERING

MICHAEL KEARNS
AT&T Labs - Research
Florham Park, New Jersey

YISHAY MANSOUR
Tel Aviv University
Tel Aviv, Israel

AND

ANDREW Y. NG
Massachusetts Institute of Technology
Cambridge, Massachusetts

Abstract. Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, that is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
1. Introduction

Algorithms for density estimation, clustering and unsupervised learning are an important
tool in machine learning . Two classical algorithms are the K -
means algorithm (MacQueen, 1967; Cover and Thomas, 1991; Duda and Hart, 1973) and the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). These algorithms have been applied in a wide variety of settings, including parameter estimation in hidden Markov models for speech
recognition (Rabiner and Juang, 1993), estimation of conditional probability tables in belief networks for probabilistic inference (Lauritzen , 1995) , and various clustering problems (Duda and Hart , 1973) . At a high level , K -means and EM appear rather similar : both perform a two -step iterative optimization , performed repeatedly until convergence. The first step is an assignment of data points to "clusters " or density mod els, and the second step is a reestimation of the clusters or density models based on the current assignments . The K -means and EM algorithms differ
only in the manner in which they assign data points (the first step). Loosely speaking, in the case of two clusters¹, if P0 and P1 are density models for the two clusters, then K-means assigns z to P0 if and only if P0(z) ≥ P1(z); otherwise z is assigned to P1. We call this hard or Winner-Take-All (WTA) assignment. In contrast, EM assigns z fractionally, assigning z to P0 with weight P0(z)/(P0(z) + P1(z)), and assigning the "rest" of z to P1. We call this soft or fractional
assignment . A third natural alternative would be to
again assign z to only one of Po and Pl (as in K -means) , but to randomly assign it , assigning to Po with probability Po(z)/ (Po(z) + Pl (z)) . We call this posterior assignment . Each of these three assignment methods can be interpreted as classifying
points as belonging to one (or more) of two distinct populations, solely on the basis of probabilistic models (densities) for these two populations. An alternative interpretation
is that we have three different ways of inferring
the value of a "hidden" (unobserved) variable, whose value would indicate which
of two sources had generated
an observed
data point . How these
assignment methods differ in the context of unsupervised learning is the subject of this paper. In the context of unsupervised learning, EM is typically viewed as an algorithm for mixture density estimation. In classical density estimation, a finite training set of unlabeled data is used to derive a hypothesis density. The goal is for the hypothesis density to model the "true" sampling density as accurately as possible, typically as measured by the Kullback-Leibler
densities
.
(KL) divergence. The EM algorithm can be used to find a mixture density model of the form α0 P0 + (1 − α0) P1. It is known that the mixture model found by EM will be a local minimum of the log-loss (Dempster et al., 1977) (which is equivalent to a local maximum of the likelihood), the empirical analogue of the KL divergence. The K-means algorithm is often viewed as a vector quantization algo-
rithm (and is sometimes referred to as the Lloyd-Max algorithm in the vector quantization literature). It is known that K-means will find a local minimum of the distortion or quantization error on the data (MacQueen, 1967), which we will discuss at some length. Thus, for both the fractional and WTA assignment methods, there is a
natural and widely used iterative optimization heuristic (EM and K -means, respectively) , and it is known what loss function is (locally) minimized by each algorithm (log-loss and distortion , respectively) . However, relatively little seems to be known about the precise relationship between the two loss functions and their attendant heuristics . The structural similarity of EM and K - means often leads to their
being considered
closely related
or
even roughly equivalent. Indeed, Duda and Hart (Duda and Hart , 1973) go as far as saying that K -means can be viewed as " an approximate way to obtain maximum likelihood estimates for the means" , which is the goal of density estimation in general and EM in particular . Furthermore , K -means is formally equivalent to EM using a mixture of Gaussians with covariance
matrices εI (where I is the identity matrix) in the limit ε → 0. In practice, there is often some conflation of the two algorithms: K-means is sometimes used in density estimation applications due to its more rapid convergence, or at least used to obtain "good" initial parameter values for a subsequent execution of EM.
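For concreteness, the three assignment rules can be written down in a few lines; the sketch below (ours, not the authors') takes the two density values at a point z and returns the WTA, fractional, and posterior assignments.

import numpy as np

rng = np.random.default_rng(0)

def assignments(p0, p1, rng=rng):
    """The three assignment methods discussed above for a point z with
    density values p0 = P0(z) and p1 = P1(z)."""
    wta = 0 if p0 >= p1 else 1                        # hard / Winner-Take-All (K-means)
    frac = p0 / (p0 + p1)                             # soft: weight given to P0 (EM)
    post = 0 if rng.random() < p0 / (p0 + p1) else 1  # posterior: random hard assignment
    return wta, frac, post

print(assignments(p0=0.20, p1=0.05))   # e.g. (0, 0.8, usually 0)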
But there are also simple examples in which K-means and EM converge to rather different solutions, so the preceding remarks cannot tell the entire story. What quantitative statements can be made about the systematic differences between these algorithms and loss functions? In this work, we answer this question by giving a new interpretation of the classical distortion that is locally minimized by the K-means algorithm. We give a simple information-theoretic decomposition of the expected distortion that shows that K-means (and any other algorithm seeking to minimize the distortion) must manage a trade-off between how well the data are balanced or distributed among the clusters by the hard assignments, and the accuracy of the density models found for the two sides of this assignment. The degree to which the data are balanced among the clusters is measured by the entropy of the partition defined by the assignments. We refer to this trade-off as the information-modeling trade-off.

The information-modeling trade-off identifies two significant ways in
which K-means and EM differ. First, EM is explicitly concerned with modeling the sampling density Q: it seeks a mixture α0 P0 + (1 − α0) P1 that is a good model of Q. K-means, in contrast, is concerned with identifying distinct subpopulations of Q, and with finding density models P0 and P1 that separately model each of these subpopulations well; the sampling density itself may be modeled poorly by the P0 and P1 that K-means finds. Second, the entropy of the partition defined by the hard assignments has a strong influence on K-means, but is entirely absent from, and has little or no influence on, EM. The intuitive similarity between the two algorithms is apparent, but the decomposition identifies these less obvious mathematical differences, as we shall see.

In addition to letting us predict and explain the differing behavior of K-means and EM on simple, specific examples, the decomposition allows us to derive a rather general bias of K-means: it will tend to find density models P0 and P1 that have less "overlap" with each other than those found by EM. In certain settings, this bias gives K-means an incentive to output unequal weightings of the subpopulations, and it allows us to study the effect of the weighting on the density models that are actually found. We also study variants of the loss functions mentioned above and the iterative optimization heuristics they determine, and we analyze the posterior assignment method, showing that despite its algebraic similarity to the soft assignments of EM, it differs rather dramatically from both K-means and EM. Our results should be of some interest for essentially all of the unsupervised learning problems in which such assignment methods are used.

2. A Loss Decomposition for Hard Assignments

Suppose that we are given a pair of densities P0 and P1 over X, and a (possibly randomized) mapping F that maps each point z ∈ X to either 0 or 1; we think of F as "assigning" points to exactly one of P0 and P1, and F may flip coins to determine the assignment. In other words, F always makes a "hard" assignment for each z. We refer to the triple (F, {P0, P1}) as a partitioned density. In all of the hard assignment methods of interest to us, F will actually be determined by P0 and P1 (and perhaps some additional parameters),
but we will suppress the dependency of F on these quantities for notational brevity . As simple examples of such hard assignment methods , we have the
two methods discussed in the introduction: WTA assignment (used by K-means), in which z is assigned to P0 if and only if P0(z) ≥ P1(z), and what we call posterior assignment, in which z is assigned to P_b with probability P_b(z)/(P0(z) + P1(z)). The soft or fractional assignment method used by EM does not fall into this framework, since z is fractionally
assigned to
both Po and Pl .
Throughout the development , we will assume that unclassified data is drawn according to some fixed , unknown density or distribution Q over X that we will call the sampling density . Now given a partitioned density
(F, {P0, P1}), what is a reasonable way to measure how well the partitioned density "models" the sampling density Q? As far as the P_b are concerned, as we have mentioned, we might ask that the density P_b be a good model
of the sampling density Q conditioned on the event F (z) = b. In other words , we imagine that F partitions Q into two distinct subpopulations , and demand that Po and PI separately model these subpopulations . It is
not immediately clear what criteria (if any) we should ask F to meet; let us defer this question for a moment .
Fix any partitioned density (F, {P0, P1}), and define for any z ∈ X the partition loss

χ(z) = E[ −log(P_{F(z)}(z)) ]    (1)
where the expectation is only over the (possible) randomization in F. We have suppressed the dependence of χ on the partitioned density under consideration for notational brevity, and the logarithm is base 2. If we ask that the partition loss be minimized, we capture the informal measure of goodness proposed above: we first use the assignment method F to assign z to either P0 or P1; and we then "penalize" only the assigned density P_b
by the log loss −log(P_b(z)). We can define the training partition loss on a finite set of points S, and the expected partition loss with respect to Q, in the natural
ways .
Let us digress briefly here to show that in the special case that Po and
P1 are multivariate Gaussian (normal) densities with means μ0 and μ1, and identity covariance matrices, and the partition F is the WTA assignment method, then the partition loss on a set of points is equivalent to the well-known distortion or quantization error of μ0 and μ1 on that set of points (modulo some additive and multiplicative constants). The distortion of z with respect to μ0 and μ1 is simply (1/2) min(||z − μ0||², ||z − μ1||²) = (1/2)||z − μ_{F(z)}||², where F(z) assigns z to the nearer of μ0 and μ1 according to Euclidean distance (WTA assignment). Now for any z, if P_b is the d-dimensional Gaussian (1/(2π)^(d/2)) e^(−(1/2)||z − μ_b||²) and F is WTA assignment
with respect to the P_b, then the partition loss on z is
−log(P_{F(z)}(z)) = log( (2π)^(d/2) e^((1/2)||z − μ_{F(z)}||²) )    (2)
                  = (1/2)||z − μ_{F(z)}||² log(e) + (d/2) log 2π.    (3)
The first term in Equation (3) is the distortion times a constant, and the second term is an additive constant that does not depend on z, P0 or P1. Thus, minimization of the partition loss is equivalent to minimization of the distortion. More generally, if z and μ are equal dimensioned real vectors, and if we measure distortion using any distance metric d(z, μ) that can be expressed as a function of z − μ (that is, the distortion on z is the smaller of the two distances d(z, μ0) and d(z, μ1)), then again this distortion
is the special case of the partition
loss in which the density
P_b is P_b(z) = (1/Z) e^(−d(z, μ_b)), and F is WTA assignment. The property that d(z, μ) is a function of z − μ is a sufficient condition to ensure that the normalization factor Z is independent of μ; if Z depends on μ, then the partition loss will include an additional μ-dependent term besides the distortion, and we cannot guarantee in general that the two minimizations are equivalent.
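The equivalence asserted in Eqs. (2)-(3) is easy to confirm numerically for the identity-covariance Gaussian case; the check below (ours) uses an arbitrary point and means, and base-2 logarithms, as in the text.

import numpy as np

rng = np.random.default_rng(0)
d = 3
z, mu0, mu1 = rng.normal(size=(3, d))

def gauss(z, mu):
    # identity-covariance Gaussian density, as in the text
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum((z - mu) ** 2))

# WTA assignment: pick the density (equivalently, the nearer mean)
mu_F = mu0 if gauss(z, mu0) >= gauss(z, mu1) else mu1

partition_loss = -np.log2(gauss(z, mu_F))
distortion = 0.5 * min(np.sum((z - mu0) ** 2), np.sum((z - mu1) ** 2))

print(partition_loss)
print(distortion * np.log2(np.e) + (d / 2) * np.log2(2 * np.pi))   # equal, Eq. (3)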
Returning to the development , it turns out that the expectation of the partition loss with respect to the sampling density Q has an interesting decomposition and interpretation . For this step we shall require some basic but important definitions . For any fixed mapping F and any value b E
{ O, I } , let us define Wb== PrzEQ[F (z) == b], so Wo+ WI = 1. Then we define Qb by
Qb(Z) == Q(z) . Pr [F (z) == b]/ Wb
(4)
where here the probability is taken only over any randomization of the mapping F . Thus , Qb is simply the distribution Q conditioned on the event
F (z) == b, so F "splits" Q into Qo and QI : that is, Q(z) == woQo(z) + WIQI (z) for all z . Note that the definitions of Wb and Qb depend on the partition F (and therefore on the Pb, when F is determined by the Pb). Q:
Now we can write
the expectation
of the partition
loss with
respect to
EzEQ[X(Z)] WOEzoEQo [- log(Po(zo))] + wIEztEQt [- log(PI (ZI))]
(5)
wOEzoEQo [log ~ -IOg (Qo (zo )] +WlEZ1 EQI[log-Ql(Zl) 1~(-;;-~)- - log(Q1(z1))] -
woKL (QoIIPo) + wIKL (QIIIP1) + wo1l(Qo) + wl1l (QI ) woKL(QoIIPo) + wIKL (Ql //P1) + 1l (Q/F ).
(6) (7) (8)
HARD AND SOFT ASSIGNMENTSFOR CLUSTERING
501
Here KL (Qbll~ ) denotes the Kullback-Leibler divergencefrom Qb to Pb, and 1l (QIF ) denotes 1l (zIF (z)) , the entropy of the random variable z , distributed according to Q, when we are given its (possibly randomized) assignment F (z) . This decomposition will form the cornerstone of all of our subsequent arguments, so let us take a moment to examine and interpret it in some detail . First , let us remember that every term in Equation (8) depends on all of F , Po and Pl , since F and the Pb are themselvescoupled in a way that depends on the assignmentmethod. With that caveat, note that the quantity KL (QbIIPb) is the natural measure of how well Pb models its respective side of the partition defined by F , as discussedinformally above. Furthermore, the weighting of these terms in Equation (8) is the natural one. For instance, as Woapproaches0 (and thus, Wl approaches1) , it becomeslessimportant to make KL (QoIIPo) small: if the partition F assigns only a negligible fraction of the population to category 0, it is not important to model that category especially well, but very important to accurately model the dominant category 1. In isolation, the terms woKL(QoIIPo) + wlKL (QlIIPI ) encourage us to choose Pb such that the two sides of the split of Q defined by Po and PI (that is, by F ) are in fact modeled well by Po and Pl . But these terms are not in isolation. The term 1l (QIF ) in Equation (8) measuresthe informativeness of the partition F defined by Po and PI , that is, how much it reducesthe entropy of Q. More precisely, by appealing to the symmetry of the mutual information I (z , F (z)) , we may write (where z is distributed according to Q) : 1l (QIF ) = = = =
1l (zIF (z)) 1 (z) - I (z , F (z)) 1 (z) - (1 (F (z)) - 1 (F (z) lz)) 1 (z) - (1l2(wo) - 1 (F (z) lz))
(9) (10) (11) (12)
where 1l2(p) = - plog (p) - (l - p) log(l - p) is the binary entropy function . The term 1 (z) = 1 (Q) is independent of the partition F . Thus, we see from Equation (12) that F reducesthe uncertainty about z by the amount 1l2(WO ) - 1 (F (z) lz) . Note that if F is a deterministic mapping (as in WTA assignment) , then 1l (F (z) lz) = 0, and a good F is simply one that maximizes 1l (WO ). In particular , any deterministic F such that wo = 1/ 2 is optimal in this respect, regardlessof the resulting Qo and Ql . In the general case, 1l (F (z) Iz) is a measureof the randomnessin F , and a good F must trade off between the competing quantities 1l2(WO ) (which, for example, is maximized by the F that flips a coin on every z) and - 1l (F (z) lz) (which is always minimized by this same F ) . Perhaps most important , we expect that there may be competition between the modeling terms woKL (QoIIPo) + wlKL (Q11IP1 ) and the partition
502
MICHAELKEARNSET AL.
information term 1 (Q IF ). If Po and PI are chosenfrom some parametric class P of densities of limited complexity (for instance, multivariate Gaussian distributions ), then the demand that the KL (QbIIPb) be small can be interpreted as a demand that the partition F yield Qb that are "simple" (by virtue of their being well -approximated , in the KL divergence sense, by den-
sities lying in P ). This demand may be in tension with the demand that F be informative , and Equation (8) is a prescription for how to manage this competition , which we refer to in the sequel as the information -modeling trade - off .
Thus, if we view Po and PI as implicitly defining a hard partition (as in the case of WTA assignment) , then the partition loss provides us with one particular way of evaluating the goodness of Po and P1 as models of the sampling density Q . Of course, there are other ways of evaluating the
Pb, one of them being to evaluate the mixture (1/ 2)Po + (1/ 2)Pl via the KL divergence KL (QII(1/ 2)Po+ (1/ 2)Pl ) (we will discussthe more general case of nonequal mixture coefficients shortly ) . This is the expressionthat is (locally ) minimized by standard density estimation approachessuch as EM , and we would particularly
like to call attention to the ways in which
Equation (8) differs from this expression. Not only does Equation (8) differ by incorporating the penalty 1 (Q IF ) for the partition F , but instead of asking that the mixture (1/ 2)Po + (1/ 2)Pl model the entire population Q , each Pb is only asked to - and only given credit for - modeling its respective Qb. We will return to these differences in considerably more detail in Section
4.
We close this section by observing that if Po and Pl are chosen from a class P of densities , and we constrain F to be the WTA assignment method for the Pb, there is a simple and perhaps familiar iterative optimization algorithm for locally minimizing the partition loss on a set of points S over all choices 'of the Pb from P - we simply repeat the following two steps until
convergence
:
- (WTA Assignment) Set 80 to be the set of points z E 8 such that Po(z) ~ P1(z) , and set 81 to be S - So.
- (Reestimation ) ReplaceeachPbwith argminpEP { - EzESblog(P(z))} . As we have already noted , in the case that the ~
are restricted to be
Gaussian densities with identity covariance matrices (and thus, only the means are parameters) , this algorithm reduces to the classical K -means algorithm . Here we have given a natural extension for estimating Po and PI from a general parametric class, so we may have more parameters than just the means . With some abuse of terminology , we will simply refer to our generalized version as K -means. The reader familiar with the EM algorithm for choosing Po and PI from P will also recognize this algorithm as simply
HARD
a
" hard
"
or
mixture
AND
WTA
SOFT
assignment
coefficients It
is
easy
partition
to
us
fact
K
of
unweighted
will
result
EM
503
( that
is , where
the
) .
- means
chosen
from
special
P
case
that
K
trade
means
loss
- off
at
Equation
of
using
the
case
in
the
a
local
WTA
partition
minimum
of
assignment
loss
in
does
not
be
the
, for
, for
Finally
K
the
method
- means
each
, that
, Qb
that
we
can
can
,
.
loss
for
the
terms
also
change
( QblIPb
with
)
each
generalize
K
-
-
- means but
this
nonincreasing
iteration
Equation
is
litera
them are
in
this
by
to
K
terms
where
estimated
assigned
-
the
the
quantization
means
KL
of
examples
vector
, combined information
increase
each
see
the
, the points
loss the
not
that
will in
the
easily
will
mean
we
iteration of
- means manage
- means
not
observed
means
instance
example
, note
K
does indeed
at
true
K
implicitly
although
often
) that the
imply
( because
been
the
must
, this
increMej
h ~
, 1982
fact
that
not
It
minimizes
- means
iteration
will
.
( Gersho
must
locally K
. Note
any
( 8 ) the
- means
( 8 ) , implies
modeling
case
Pb
this
Equation
ture
equal
CLUSTERING
.
The
not
be that
over
rename .
convenience
with
verify
FOR
variant
must
loss
Let
ASSIGNMENTS
) .
( 8 )
to
the
K
- cluster
: K EQ
[X ( Z ) ]
== L
wiKL
( QiIIPi
)
+
1l
( QIF
) .
( 13
( 1l
( z ) ) -
)
i = l
Note
that
where
, as
z
now
is
an
3 .
O
in
Equation
( log
we
EM or
have
IlK
of
the
' P Po
ization
to
algorithm
aoPo
case
E
P
case
, of
( z )
( Reestimation
-
( Reweighting
:2::
( z ) that
for
( F general
densities
1l
K
)
S
a
) P1
Replace Replace
of
of
( F
, 1l
Set
So
to
( z ) , and each ao
set Pb
with
0
E
the
steps
be
the S1
with ISol
a
( z ) lz
( F
) ) ,
( z ) )
is
a
to
be
natural
space
X
and
outputs
gener
weighted
K
a ,
the
pair
' P
.)
and
ao
,
of
general
straightforward E
-
variant
,
( Again
1 / 2 ,
The
and
then
such
that
:
set to
be
argminpEP / ISI
Pb
unweighted
- assignment
[0 , 1 ] . is
the
forced
also
hard
points
weights for
three
a
over
K
choices
is
as
data
weight and
following
of are
) . There
densities
set
as
random
ao )
of
variant
coefficients
thought
' P
well
) -
densities
a
K
- assignment
K be
as
the
hard
mixture
class
with
( 1
a
of
input
Assignment
-
= = tl
, and
the
can
any as
executes
( WTA
is
that
. For
begins
repeatedly
- means
takes
, Pl
the
)
Q
.
is , where
- means
EM over
densities
-
, K
general
K
weighted
means
quantity
( that
of
( QIF to
- Means
noted
in
alization
) )
K
algorithm
) , 1l
according
( K
Weighted
As
( 11
distributed
.
of S
points -
So { -
z
E
S
. EZESb
log
( P
( z ) ) } .
504
MICHAELKEARNSET AL.
Now we can again ask the question: what loss function is this algorithm (locally ) minimizing ? Let us fix F to be the weighted WTA partition , given by F (z) = 0 if and only if aoPo(z) ~ (1 - aO)Pl (z) . Note that F is deterministic , and also that in general, ao (which is an adjustable parameter of the weighted K -means algorithm ) is not necessarily the same as Wo (which is defined by the current weighted WTA partition , and dependson Q) . It turns out that weighted K -meanswill not find Po and PI that give a local minimum of the unweighted K -means loss, but of a slightly different loss function whose expectation differs from that of the unweighted K means loss in an interesting way. Let us define the weightedK -means loss of Po and PIon z by
- log(a~-F(z)(l - ao)F(Z)PF(z)(Z))
(14)
where again , F is the weighted WTA partition determined by Po, PI and Qo. For any data set S , define Sb = { z E S : F (z ) = b} . We now show that weighted K -means will in fact not increase the weighted K -means loss on S with each iteration . Thus2
-zES L log (a.~-F(z)(1- ao )F(z)PF (z)(z)) - zESo L log (aoPo (z))- zESt L log ((1- aO )Pl(z)) - zESo L log (Po (Z))- zESt L log (P1 (Z)) -ISollog (ao )- ISlllog (l - ao ).
(15)
-
(16)
Now
- ISollog (ao) - IS111og (1- ao) -
ISol (ao) + TSl1og IS11 (1- ao)) - ISI ( TSl1og
(17)
which is an entropic expression minimized by the choice 0 = 1801/ 181. But this is exactly the new value of 00 computed by weighted K -means from the current assignments 80, 81. Furthermore , the two summations in Equation (16) are clearly reduced by reestimating Po from So and P1 from Sl to obtain the densities P~ and P{ that minimize the log-loss over So and 81 respectively , and these are again exactly the new densities computed 2Weare grateful to Nir Friedmanfor pointing out this derivationto us.
HARD
by
AND
weighted
K
means
loss
(
justifying
-
a
where
the
[
Wb
side
is
give
the
wo
)
=
-
[
PrzEx
(
ao
,
1
ao
estimate
ao
-
F
(
)
(
z
.
-
)
z
)
is
the
=
(
=
b
-
z
)
}
ao
(
z
)
)
as
)
=
)
in
( z
the
)
-
,
PI
}
the
)
on
S
weighted
at
K
each
-
iteration
,
PF
(
,
z
)
log
.
(
(
z
K
)
,
{
)
-
first
(
Po
F
,
{
wllog
(
term
Po
,
Pi
,
Pi
}
)
,
)
.
and
ISI
(
ISol
/
means
loss
for
this
ISI
large
=
=
Wo
,
)
-
two
Wi
)
=
is
(
we
know
that
K
simply
the
expect
)
hand
wo
much
we
18
terms
=
weighted
we
(
right
not
,
how
samples
the
wo
means
is
aD
last
is
-
-
The
there
K
1
on
}
(
weighted
/
-
]
distributions
F
of
)
ao
The
of
ISol
limit
weighted
have
Wo
for
=
iteration
,
F
]
(
but
ao
Po
We
binary
fixed
;
505
decreases
{
expected
before
the
a
means
,
?
loss
For
-
F
Q
1
PF
have
Thus
(
what
(
entropy
each
WOe
(
log
must
at
of
,
partition
-
of
CLUSTERING
.
between
we
)
loss
PI
~
F
cross
convergence
reassigns
a
entropy
this
K
14
density
-
[
-
weighted
this
expected
cross
about
(
FOR
(
and
log
the
,
sampling
EzEQ
just
and
say
=
=
Thus
Equation
Po
to
EzEQ
ASSIGNMENTS
of
fixed
respect
=
.
by
naming
for
with
means
given
our
Now
SOFT
-
,
1
-
can
at
means
empirical
Wo
-
+
wo
,
and
thus
-
Combining
Equation
sition
for
found
by
weighted
K
[
=
-
=
=
~
since
1 .
(
the
of
the
tion
mative
weighted
log
means
(
has
does
weight
K
-
means
wo
(
)
Z
)
wo
(
)
1
+
-
the
)
-
+
18
1l2
)
(
wo
and
gives
(
(
)
.
(
our
that
+
(
general
for
19
)
decompo
the
Po
,
Pi
-
and
ao
WI
QIIIPl
)
Pb
Q
(
,
1
will
)
1l
]
(
1l
no
+
QIF
(
2
Q
.
)
)
-
1l2
Q
ao
But
(
wo
,
we
may
finding
(
.
,
This
all
{
,
the
,
,
this
,
}
(
21
)
(
22
)
the
)
that
the
of
and
P1
thus
(
,
has
unweighted
the
introduc
finding
-
an
modeling
trade
the
)
introduction
Po
-
PI
from
F
fixed
20
of
Po
differs
towards
minimize
think
F
(
)
.
beyond
bias
to
)
partition
information
try
(
of
the
for
)
1l
)
First
the
is
)
QIIIPI
even
/
removed
there
Z
+
as
.
1
=
+
(
or
means
of
=
)
)
ways
ao
Z
)
wIKL
two
(
the
-
+
and
also
PF
)
definition
Qo
,
)
QIIIPl
(
)
Z
)
on
QoIIPo
fixing
algorithm
Wl
wIKL
in
Thus
(
QIIIP1
(
K
means
F
(
depend
has
;
)
)
wllog
)
of
.
ao
wIKL
our
Qo
8
wIKL
+
-
changed
F
-
wllog
)
woKL
to
partition
F
weighted
definition
the
"
-
not
corresponds
of
Wi
(
(
)
of
-
(
Equation
QoIIPo
goal
wllog
Equation
QoIIPo
(
-
with
QoIIPo
(
)
~
(
K
00
a
wolog
sum
the
-
(
)
,
(
unweighted
weight
changed
K
Q
)
goal
log
Wo
)
wo
means
woKL
generalization
minimizes
-
woKL
-
(
in
woKL
-
,
19
loss
=
Thus
log
(
partition
EZEQ
(
Wo
modeling
"
-
infor
off
-
for
terms
506
MICHAELKEARNSET AL.
woKL(QoIIPo) + wIKL (QIIIP1) only. Note, however, that this is still quite different from the mizture KL divergenceminimized by EM . 4. K -Means vs . EM : Examples In this section, we consider severaldifferent sampling densities Q, and compare the solutions found by K -means (both unweighted and weighted) and EM . In eachexample, there will be significant differencesbetweenthe error surfaces defined over the parameter space by the K -means lossesand the KL divergence. Our main tool for understanding these differenceswill be the loss decompositionsgiven for the unweighted K -meansloss by Equation (8) and for the weighted K -means loss by Equation (22) . It is important to remember that the solutions found by one of the algorithms should not be considered "better" than those found by the other algorithms: we simply have different loss functions, eachjustifiable on its own terms, and the choice of which loss function to minimize (that is, which algorithm to use) determines which solution we will find . Throughout the following examples, the instance spaceX is simply ~ . We compare the solutions found by (unweighted and weighted) EM and (unweighted and weighted) K -means when the output is a pair { Po, PI } of Gaussians over ~ - thus Po == N (ILo, O'o) and PI == N (JLI, 0' 1) , where JLo , 0'0, JLI, 0' 1 E ~ are the parameters to be adjusted by the algorithms . (The weighted versionsof both algorithms also output the weight parameter ao E [0, 1] .) In the caseof EM , the output is interpreted as representing a mixture distribution , which is evaluated by its KL divergencefrom the sampling density. In the case of (unweighted or weighted) K -means, the output is interpreted as a partitioned density, which is evaluated by the expected (unweighted or weighted) K -meanslosswith respect to the sampling density . Note that the generalization here over the classical vector quantization case is simply in allowing the Gaussiansto have non-unit variance. In each exampIe, the various algorithms were run on IO thousand examples from the sampling density; for these I -dimensional problems, this sample size is sufficient to ensure that the observed behavior is close to what it would be running directly on the sampling density. Example (A ) . Let the sampling density Q be the symmetric Gaussian mixt ure Q = o.5N (- 2, 1.5) + O.5N (2, 1.5) .
(23)
See Figure 1. Supposewe initialized the parameters for the algorithms as JLo== - 2, III == 2, and 0'0 == 0'1 == 1.5. Thus, eachalgorithm begins its search from the "true" parameter values of the sampling density. The behavior of unweighted EM is clear: we are starting EM at the global minimum of its expected loss function , the KL divergence; by staying where it begins, EM
HARD
can
enjoy
KL
divergence
a
of
The
term
= =
1
/
2
the
event
.
I F
)
(
z
)
Qo
the
O
.
)
of
best
choice
of
F
will
improve
the
same
us
the
of
provided
by
move
on
more
subtle
. 130
. 500
)
some
. 5
)
0 '
are
these
)
term
the
the
K
- means
,
2
added
of
. 5
as
,
the
We
)
,
.
)
the
the
Symmet
-
as
WTA
is
,
and
long
it
the
par
-
possible
to
degrad
make
optimal
. 5
mass
without
.
1
results
2
the
thus
)
,
towards
1
then
,
back
that
-
,
( QIF
as
-
0
( QoIIPo
conditions
1i
)
of
,
(
= =
moved
Furthermore
initial
N
moved
value
-
term
. 5
woKL
symmetric
)
on
reflected
1
.
:
( QIIIP1
is
0
,
)
yields
z
value
movements
from
for
.
it
has
initial
)
1 ,
2
terms
which
reflection
initial
the
-
tail
the
,
below
= =
(
tail
the
)
z
N
the
( QIIIP1
and
. 5
.
conditioned
,
1
since
( 8
wIKL
Q
,
parameters
0
and
Rather
,
on
only
~
is
above
( since
than
.
( 2
the
. 5
to
wIKL
the
= =
As
1
. 338
-
essentially
performance
is
K
the
means
- means
. 131
.
-
we
find
The
that
after
8
coarse
. 301
the
is
KL
that
the
the
various
of
the
easy
to
decomposi
that
to
of
that
been
behavior
approximation
behavior
out
.
have
is
)
divergence
to
superior
would
this
pushed
inferior
loss
it
( 24
been
,
point
of
1
have
means
justification
= =
is
,
.
1
Naturally
example
a
0 '
model
K
the
,
means
mixture
as
where
2
the
directly
a
K
= =
,
simple
provides
examples
III
expected
this
-
,
,
solution
reduced
to
its
in
sample
the
predicted
variances
while
finite
to
0 ' 0
Q
regarding
to
1
tail
than
0 ' 0
,
.
of
( 8
,
( since
1
on
the
behavior
2
converged
that
Equation
-
)
which
Q
the
of
z
,
Equation
choice
ofN
its
smaller
density
remark
2
in
if
is
or
of
of
given
( QoIIPo
-
-
choice
each
only
operation
weighted
and
,
(
with
this
initial
( that
presence
.
2
0
parameters
predict
Example
,
sampling
Let
directly
= =
)
by
has
-
for
and
tail
respect
experiment
= =
origin
the
starting
tion
. 5
the
N
the
smaller
( QbIIPb
means
Wo
the
from
0
not
of
value
the
Ilo
from
. 5
left
and
. 5
loss
if
the
here
examine
Qo
reflection
term
,
0
us
0
Q
:
irrelevant
woKL
below
the
for
yields
1
= =
that
is
,
with
wbKL
Performing
which
2
Qo
Ill
= =
with
unchanged
= =
0
terms
-
be
to
00
-
-
)
Notice
but
tail
optimal
K
0
,
be
prediction
,
(
should
terms
for
iterations
N
,
and
initially
achieved
~
0
( z
The
variance
remain
the
the
z
should
ILo
.
.
= =
the
Thus
0 ' 0
of
ing
.
Ilo
Let
EM
essentially
by
F
1
of
apply
movements
tition
,
mean
mean
choice
remarks
is
the
best
ric
or
it
reduces
final
= =
z
the
and
,
?
density
weighted
partition
simply
)
Clearly
moves
,
0
is
minimized
is
above
0
507
CLUSTERING
sampling
of
is
- means
expected
story
,
= =
the
already
( wo
= =
"
the
true
parameter
different
Equivalently
)
js
1l2
a
also
K
F
off
z
left
( Q
F
below
in
this
of
are
chopped
on
for
and
,
is
parameter
partition
however
models
same
unweighted
1
WTA
FOR
perfectly
The
decomposition
the
"
.
value
about
the
Wo
)
weighting
optimal
in
ASSIGNMENTS
that
the
What
SOFT
solution
0
absence
the
AND
-
cannot
EM
.
be
We
now
algorithms
is
.
(
competes
B
)
.
We
now
with
examine
the
KL
an
divergences
example
in
.
Let
which
the
the
sampling
term
density
1l
( QIF
Q
)
be
508
MICHAELKEARNSET AL.
the single unit -variance Gaussian Q(z) = N (O, 1); see Figure 2. Consider the initial
choice of parameters
/.Lo == 0 , 0' 0 == 1, and Pl at some very distant
location , say /.Lo= 100, 0'0 = 1. We first examine the behavior of un weighted
K -means. The WTA partition F defined by these settings is F ( z) = 0 if and only if z < 50. Since Q has so little mass above z = 50, we have Wo ~
1, and thus 1l (QIF ) ~ 1l (Q) : the partition is not informative . The term wlKL (QlIIPl ) in Equation (8) is negligible, since Wl ~ o. Furthermore, Qo ~ N (O, 1) becauseeven though the tail reflection described in Example (A) occurs again here, the tail ofN (O, 1) abovez == 50 is a negligible part of the density. Thus woKL(QoIIPo) ~ 0, so woKL(QoIIPo)+ WlKL (QlIIPl ) ~ o. In other words , if all we cared about were the KL divergence terms , these settings would be near-optimal . But the information -modeling trade -off is at work here: by moving Pl closer to the origin , our KL divergences may degrade , but we obtain a more informative partition . Indeed , after 32 iterations unweighted K -means converges
to
ILo== - 0.768, 0'0 == 0.602, III == 0.821, 0' 1 == 0.601
(25)
which yields Wo == 0 .509 .
The information -modeling tradeoff is illustrated nicely by Figure 3, where we simultaneously plot the unweighted K -means loss and the terms
woKL (QoIIPo) + wIKL (QIIIP1) and 1 2(Wo) as a function of the number of iterations during the run . The plot clearly shows the increase in 1 2(wo) (meaning a decreasein 1 (QIF )) , with the number of iterations , and an increase in woKL (QoIIPo) + wIKL (QIIIP1) . The fact that the gain in partition information is worth the increase in KL divergences is shown by the resulting decrease in the unweighted K -means loss. Note that it would be especially difficult to justify the solution found by unweighted K -means from the viewpoint of density estimation .
As might be predicted from Equation (22) , the behavior of weighted K -means is dramatically different for this Q , since this algorithm has no incentive to find an informative partition , and is only concerned with the KL divergence terms . We find that after 8 iterations it has converged to
ILo== 0.011, 0'0 == 0.994, ILl == 3.273, 0' 1 = 0.033 with
(26)
0 = Wo = 1.000. Thus , as expected , weighted K -means has chosen a
completely uninformative partition , in exchangefor making WbKL(QbIIPb) ~ o. The values of III and 0"1 simply reflect the fact that at convergence, P1 is assigned only the few rightmost points of the 10 thousand examples . Note that the behavior of both K -means algorithms is rather different
from that of EM , which will prefer Po = P1 = N (0, 1) resulting in the mixture (1/ 2)Po+ (1/ 2)PI = N (O, 1) . However, the solution found by weighted
HARDAND SOFTASSIGNMENTS FORCLUSTERING
509
K -means is "closer" to that of EM , in the sense that weighted K -means effectively eliminates one of its densities and fits the sampling density with a single Gaussian . Example ( C ) . A slight modification to the sampling distribution of Ex ample (B ) results in some interesting and subtle difference of behavior for our algorithms . Let Q be given by
Q == 0.9SN(0, 1) + 0.OSN(5, 0.1).
(27)
Thus, Q is essentially as in Example (B) , but with addition of a small distant "spike" of density; seeFigure 4. Starting unweighted K -meansfrom the initial conditions J1 ,0 = 0, 0"0 = 1, ILl == 0, 0"1 == 5 (which has Wo== 0.886, 1l (wo) == 0.513and woKL(QoIIPo)+ w1KL(Q11IP1 ) == 2.601) , we obtain convergenceto the solution ILo== - 0.219, 0'0 == 0.470, ILl == 0.906, 0' 1 == 1.979
(28)
which is shown in Figure 5 (and has Wo == 0.564, 1l (wo) == 0.988, and woKL (QolfPo) + wIKL (QIIIP1) == 2.850) . Thus, as in Example (B) , unweighted K -means starts with a solution that is better for the KL divergences, and worse for the partition information , and elects to degrade the former in exchangefor improvement in the latter . However, it is interesting to note that 1 (wo) == 1 (0.564) == 0.988 is still bounded significantly away from 1; presumably this is becauseany further improvement to the partition information would not be worth the degradation of the KL divergences. In other words, this solution found is a minimum of the K -means loss where there is truly a balanceof the two terms: movement of the parameters in one direction causesthe loss to increasedue to a decreasein the partition information , while movementof the parameters in another direction causes the loss to increasedue to an increasein the modeling error. Unlike Example (B) , there is also another (local) minimum of the unweighted K -means loss for this sampling density, at Po = 0.018, 0'0 = 0.997, III = 4.992, 0'1 = 0.097
(29)
with the suboptimal unweighted K -means loss of 1.872. This is clearly a local minimum where the KL divergenceterms are being minimized, at the expense of an uninformative partition (wo == 0.949) . It is also essentially the same as the solution chosenby weighted K -means (regardlessof the initial conditions) , which is easily predicted from Equation (22) . Not surprisingly, in this example weighted K -means convergesto a solution close to that of Equation (29) . Example (D ) . Let us examine a case in which the sampling density is a mixture of three Gaussians: Q = O.25N (- 10, 1) + o.5N (O, 1) + O.25N (10, 1).
(30)
510
MICHAELKEARNSET AL.
See Figure 6. Thus , there are three rather distinct subpopulations of the sampling density . If we run unweighted K -means on 10 thousand examples
from Q from the initial conditions J.Lo= - 5, J.Ll = 5, 0'0 = 0"1 = 1, (which has Wo= 0.5) we obtain convergenceto ILo= - 3.262, 0'0 = 4.789, ILl = 10.006, 0' 1 == 0.977
(31)
which has Wo == 0.751. Thus , unweighted K -means sacrifices the initial optimally informative partition in exchange for better KL divergences .
(Weighted K -means convergesto approximately the same solution, as we might have predicted from the fact that even the unweighted algorithm did
not chooseto maximize the partition information .) Furthermore, note that it has modeled two of the subpopulations of Q (N (- 10, 1) and N (O, 1)) using Po and modeled the other (N (10, 1)) using Pl . This is natural "clustering " behavior -
the algorithm prefers to group the middle subpopulation
N (O, 1) with either the left or right subpopulation, rather than "splitting " it . In contrast , unweighted EM from the same initial conditions converges to the approximately symmetric solution
ILo== - 4.599, iTo== 5.361, III == 4.689, iT1== 5.376.
(32)
Thus , unweighted EM chooses to split the middle population between Po and Pl . The difference between K -means and unweighted EM in this example is a simple illustration
of the difference
between the two quantities
woKL (QoIIPo) + wIKL (QlIIP1) and KL (Q//ooPo+ (1 - oo)P1) , and shows a natural case in which the behavior of K -means is perhaps preferable from the clustering point of view . Interestingly , in this example the solution found by weighted EM is again quite close to that of K -means.
5. K -Means ForcesDifferent Populations The partition lossdecomposition givenby Equation(8) hasgivenus a betterunderstanding of thelossfunctionbeingminimized by K -means , and allowedusto explainsomeof the differences between K -meansandEM on specific , simpleexamples . Arethereanygeneraldifferences wecanidentify? In this sectionwegivea derivationthat stronglysuggests a biasinherentin theK -meansalgorithm:namely , a biastowardsfindingcomponent densities that areas "different " aspossible , in a senseto bemadeprecise . Let V(Po, PI) denotethe variationdistance 3 between the densities Po andPI: IPo(z) - PI(z) Idz. (33)
V(Po ,PI)=1
3The ensuing argument actually holds for any distance metric on densities .
HARDANDSOFTASSIGNMENTS FORCLUSTERING
511
Note that V(Po, PI) ~ 2 always. Noticethat due to the triangleinequality, for any partitioned density (F, { Po, PI}), V(Qo, Qi ) .s V(Qo, Po) + V(PO,Pi) + V(Ql , PI).
(34)
Let us assumewithout lossof generalitythat Wo== Pr .-z:EQ[F (z) == 0] ~ 1/ 2. Now in the caseof unweightedor weightedK -means(or indeed, any other casewherea deterministicpartition F is chosen ) , V(Qo, Ql ) == 2, so from Equation (34) we may write V(Po, PI) ~ 2 - V(Qo, Po) - V (QI, PI) (35) == 2 - 2(woV(Qo, Po) + WIV(QI , PI) + ((1/ 2) - Wo) V(Qo, Po) + ((1/ 2) - WI) V(QI , PI)) (36) ~ 2 - 2(woV(Qo, Po) + WIV(QI , PI)) - 2((1/ 2) - WO ) V(Qo, Po137 ) ~ 2 - 2(woV(Qo, Po) + WIV(QI , PI)) - 2(1 - 2wo). (38) Let us examine Equation (38) in somedetail . First , let us assumeWo== 1/ 2, in which case2(1 - 2wo) == o. Then Equation (38) lower bounds V (Po, PI ) by a quantity that approachesthe maximum value of 2 as V (Qo, Po) + V (Ql ' PI ) approachesO. Thus, to the extent that Po and Pl succeedin approximating Qo and Ql , Po and Pl must differ from each other . But the
partition lossdecomposition of Equation (8) includes the terms KL (QbIIPb) , which are directly encouraging Po and Pl to approximate Qo and Ql . It is true that we are conflating two different technical senses of approximation
(variation distance KL divergence) . But more rigorously, since V (P, Q) ~
2 In2vKL (PIIQj holdsfor anyP andQ, andfor all z wehaveJZ ~ z+ 1/ 4, we may
write
V (PO, Pi ) ~ 2 - 4ln2 (woKL(QoIIPo) + wiKL (QiIIPl ) + 1/ 4) - 2(1 - 2woX39) == 2 - ln2 - 4ln2 (woKL(QoIIPo) + wiKL (QiIIPl )) - 2(1 - 2woX.40) Since the expression woKL (QoIIPo) + wIKL (QIIIP1) directly appears in Equation (8) , we see that K -means is attempting to minimize a loss function that encourages V (Po, Pi ) to be large, at least in the case that the algorithm finds roughly equal weight clusters (wo ~ 1/ 2) - which one might expect to be the case, at least for unweighted K -means, since there
is the entropic term - 1i2(wo) in Equation (12) . For weighted K -means, this entropic
term is eliminated
.
In Figure 7, we show the results of a simple experiment supporting the suggestion that K -means tends to find densities with less overlap than EM
512
MICHAEL KEARNS ET AL.
does
.
In
the
experiment
dimensional
,
means
(
the
between
tion
distance
solid
lines
)
(
middle
6
.
the
top
perhaps
Pb
with
true
)
nice
ment
is
PI
the
call
same
. "
)
=
F
,
a
Qb
,
if
Po
and
derivation
pected
think
density
Q
=
=
(
partition
1
/
2
line
grey
next
section
top
grey
line
)
)
)
.
.
Po
+
(
,
1
/
2
(
under
PI
(
,
the
Pb
prior
of
course
,
assign
from
that
Pb
Po
,
occurred
were
,
Q
one
=
(
1
we
can
/
2
)
"
Qo
+
-
the
but
tail
-
and
when
Gaussian
with
and
as
when
components
this
-
WTA
even
Gaussian
,
,
WTA
resulting
that
each
to
the
which
to
,
Recall
fixed
z
.
"
namely
any
assign
Pb
)
mixture
.
we
compared
-
on
method
assign
Thus
by
)
sampling
zero
assignments
partition
partition
Qb
=
Pb
.
(
(
(
=
given
z
Z
the
use
)
We
[
+
F
(
Ql
z
)
(
of
WTA
reflected
(
1
/
2
Z
)
b
)
]
jwb
Qo -
(
(
z
Qb )
+
(
Z Ql ~
(
Z
)
)
o
.
.
Q
=
(
Thus
,
1
/
2
the
)
Po
the
+
KL
Equation
(
of
F
posterior
(
see
/
2
)
PI
8
)
.
encourage
example
this
by
the
)
QI
'
to
However
lead
,
competing
a
moment
it
the
)
(
42
)
(
43
)
(
44
)
ex
-
tempt
closer
-
to
situation
is
for
.
41
the
is
us
constraint
the
model
,
(
above
in
reason
will
in
then
us
For
.
the
'
terms
partition
of
an
1
divergence
assignments
because
will
)
=
)
WTA
again
=
Z
definition
will
,
Pr
by
this
.
)
that
)
this
Z
Qo
=
than
than
grey
assignment
randomly
z
truncation
)
were
(
such
estimation
subtle
=
under
that
hard
generated
(
mixture
=
QbIIPb
loss
density
A
true
that
(
-
then
)
are
(
posterior
"
posterior
QI
Z
(
partition
to
(
PI
wbKL
sampling
=
PI
was
the
Gaussian
varia
three
the
(
hard
we
+
is
)
EM
means
making
partition
QbIIPb
Qb
the
is
PI
)
z
(
the
(
in
if
Qo
z
posterior
as
was
-
But
Po
(
potential
KL
of
natural
that
Po
Example
form
terms
(
F
the
the
resulted
and
this
in
density
back
informative
We
avoids
the
assignment
/
that
of
it
make
sampling
more
.
way
Suppose
)
the
top
in
-
the
and
the
discussed
-
distance
by
of
K
another
density
property
that
have
Z
sampling
mentioned
not
.
(
probability
be
signment
is
natural
the
found
is
one
between
,
lowest
two
variation
reference
weighted
one
there
a
(
which
and
of
Partition
is
But
,
as
means
Posterior
Pb
that
One
The
probability
posterior
not
.
more
with
the
may
ing
Pl
even
to
)
)
mixture
the
solutions
-
,
lines
:
and
the
K
grey
shows
line
for
method
Po
assumption
So
three
assignment
of
dark
descent
Algorithm
WTA
Pl
a
distance
axis
unweighted
was
varying
vertical
(
gradient
Q
with
The
and
for
loss
basis
-
and
posterior
.
Gaussians
Po
,
density
Gaussians
)
target
)
New
sampling
variance
between
,
the
axis
line
A
The
-
two
of
the
z
unit
horizontal
the
near
,
an
FORCLUSTERING HARDANDSOFTASSIGNMENTS Now
on
,
a
under
fixed
E
the
point
[ X
(
z
)
posterior
z
]
partition
here
we
will
= =
E
= =
-
the
call
[ -
log
Po
- side
on
of
(
)
( z
( z
)
A
a
)
(
z
partition
loss
( z
)
of
S
)
is
.
log
Po
( z
)
-
over
the
Po
z
Recall
E
that
P1
( z
and
that
K
from
if
we
= =
the
o
. 5N
(
-
This
is
F
2
,
1
(
2:
)
I2 : )
now
)
possible
at
the
F
the
. 5
)
(
z
)
+
1l
)
the
log
P1
( z
)
}
)
( 45
of
loss
summation
of
)
F
.
the
density
. 5N
( 2
2
. 55
. 03
,
divergence
initial
= =
as
,
,
1
(
. 5
;
The
right
-
in
Example
)
( 46
be
2
conditions
)
,
of
Po
. 140
,
0 " 0
/
is
)
2
z
.
)
is
)
the
1
2
ILl
= =
to
( 2
,
1
( z
-
)
(
1
( 1
2
)
= =
,
,
2
( WO
)
1
= =
the
holds
is
at
,
least
reducing
the
away
1l2
(
1
/
2
the
)
= =
1
while
stated
training
initial
posterior
arising
-
of
still
it
by
from
)
posterior
symmetrically
( wo
!
choice
the
/
the
can
means
Thus
,
improved
-
Under
)
F
Under
partition
K
2
starting
. 5
away
be
the
instance
on
the
.
1
means
:
in
in
loss
finding
a
local
solution
2
. 129
all
,
four
has
initial
the
)
In
issues
the
)
= =
N
means
partition
.
1
solution
to
)
for
case
respect
Pl
(
( wo
preserve
,
= =
improved
probabilistic
-
results
QI
the
.
descent
for
2
zIF
their
. 256
= =
move
cannot
Q
algorithmic
This
. 64
1
(
to
the
1
PI
deterministic
1
to
= =
,
of
1
moving
)
)
will
divergences
was
gradient
.
2
(
informativeness
F
with
+
KL
the
F
the
. 5
informative
the
indeed
loss
2
as
able
is
1
are
for
by
value
/
,
,
maximally
solution
or
This
to
1
F
better
,
of
-
2
divergences
because
steps
opposed
on
0
gradients
an
conditions
sampling
0 '
1
= =
. 233
( 47
parameters
are
expected
Of
course
has
)
smaller
posterior
.
density
1
loss
,
the
increased
KL
from
.
algorithm
loss
53
.
absolute
of
What
)
a
KL
conditions
0
a
-
according
may
)
the
in
#
(
)
expression
was
posterior
point
0
)
discussion
ILo
which
)
)
PI
I z
,
a
of
than
)
we
( z
for
minimum
lz
and
,
( F
below
: 1
is
Po
values
( see
terior
)
posterior
sampling
o
N
since
but
initial
lz
= =
distributed
)
there
parameter
the
PI
.
unweighted
the
of
stated
F
origin
reducing
Qo
-
is
( z
the
(
of
from
2:
(
that
variances
and
gener8
( here
1l
,
conditions
1
,
but
of
so
our
term
partition
at
,
-
and
definition
initial
because
the
= =
,
doing
partition
these
Po
weighted
symmetrically
by
posterior
at
( both
origin
from
(
start
- means
preserved
,
Po
is
then
F
{
randomization
the
-
1l
,
)
+
the
the
S
( z
)
loss
simply
all
( z
only
partition
then
over
Revisited
Q
is
( F
]
taken
of
sample
)
Pl
P1
is
( 45
)
the
)
+
case
Equation
Example
( A
( z
special
loss
hand
PF
expectation
this
posterior
,
is
Po
where
F
513
should
a
sample
one
?
Here
use
it
seems
in
order
worth
to
minimize
commenting
the
expected
on
the
pos
algebraic
-
514
MICHAELKEARNSET AL.
similarity by
between
EM
. In
and
sample
and
Pi
Equation
( unweighted data
( 45 ) and
) EM
S , then
, if we
our
the
have
next
solution
L zES
(
Po ( z ) PO ( Z ) + PI
While
the
summand
appear
quite
the
log
prefactors -
log
( Pt
we
must
use
way
of
log
their must
no
obvious
to
posterior An
us Po
109 =
to
to
gradient
+
{ 1 / 2 ) P { , where
,
P6
-
more
two PI log
and
get
is
minimize
let
' P be
in
the
a smoothly on
the
, we
loss )
the
Pt
( 45 ) ,
. An
use
our
current
the
~ together ( P6 , Pi
. Thus
, there
posterior class
and
of
PI
loss
densities
to
,
posteriors
evaluate
well
Pt
informal
the of
, to
as
parameterized Po
posterior log - losses
solution
( using
expected
of
( 48 )
Equation
the
log - losses
( PJ , Pi
parameters
fiz
can
~
to
the
Equation
. In
by
posterior
minimize
Equation
weighted
guess
each the
of
a potential
EM
for
according to
next
evaluate
that
then
algorithm
Pb
determined
labels
. For
. In
resulting
our to
- side
Pb ( Z ) / ( PO ( z ) + P1 ( z ) )
guesses the
, giving
random
P ~ , Pi
difference
current
order
( 48 )
- hand
prefactors
posteriors
labels
descent
mixture
and
Pt
: in
.
) is
. An , and
minimize
the
.
even
fix
the
right
crucial
minimize
difference
the
a
the
posterior
we
( z ) ) ) , and
iterative
loss
standard Let
Pl
labels
is to
to
generate
generate
alternative resort
{ 1 / 2 ) P6
(P { (z )))
( :z:-) log
( z ) ) : our
then
present
the
)
is
the
( Pt
- losses
explaining
we
log
is
Pb ( ~ ) / ( Po ( Z ) + with
-
respect
the
( Po , Pl
, there
z , and
( z ) ) with decoupling
guess
performed
{ 1 / 2 ) Po + { 1 / 2 ) Pl
I ( Po ( z ) )
( 48 ) and
between
each
such
Equation
similar
- losses
for
no
of
in
is a decoupling
and
is
( Z ) log
+ PO ( Z PI ) ) +( ZPI
( 45 )
minimization solution
minimize
-
there
iterative
a current
intriguing
difference
log - loss densities
as
Po
representing
can
be
and
revealed
Plover
the
between
mixture
by X
, and
the
posterior
examining a point
( 1 / 2 ) Po
+
( 1 / 2 ) PI
( Z ) ) to
be
the
their z
EX
( 1 / 2 ) PI
8L ' o g 1 -1 8Po (z) In (2)Po (z)+Pl (z).
( ( 1 / 2 ) Po ( z ) +
loss
mixture
and
the
derivatives . If
' and
log - loss
we we on
. think
define z , then
(49)
This derivative has the expected behavior . First , it is always negative , meaning that the mixture log-loss on z is always decreased by increasing Po(z ) , as this will give more weight to z under the mixture as well . Second, as Po(z ) + P1(z ) --+ 0, the derivative goes to - 00. In contrast , if we define the posterior loss on z LPO6t
-PO )(Z)10 )(Z)log (ZPo )+(zP1 gPO (Z)- PO (ZPl )+(ZPl Pl(Z)(50)
HARDANDSOFTASSIGNMENTS FORCLUSTERING
515
thenwe0btain 8Lpolt 8Po(z)
~;)~~(;)[-log Po (z)+P ;;(~~~(;ylog Po (z) (51) +Po )(z)1og P1() (zPI )+(ZP1 Z-~1].
-
This
derivative
shows
further
loss and the posterior of the derivative
curious
loss . Notice
is determined
differences
that
since
between
by the bracketed
expression
If we define Ro ( z ) == Po ( z ) / ( Po ( z ) + P1 ( z ) ) , then can be rewritten as
which
is a function
Equation
of Ro ( z ) only . Figure
( 52 ) , with
the value
8LpO6t / 8Po ( z ) can actually
can
a repulsive
force
0 .218 ) . The
explanation
have Equation it small
probability
It
is interesting
have
explicit
the literature From for
the likely
to
note
repulsive
that effects
on K - means preceding to lead
than , say , classical the
fact
that
K - means , this
phenomenon
the
Po and
PI
can be shown
value
ratio
centroids
, it might
posterior
once
as poorly
be natural
' P. This
data
manner
.
points
proposed
in
et al . , 1991 ) . class
" from
that , as ' P would
one another
intuition
in the sense given
general
as possibly
to expect
a density
we
is , gives
as possible
been
( Hertz
are " different
over
each other in a fairly
have
z
Ro ( z ) ==
( that
in which
maps
loss over
PI that
the plot
( approximately
poorly
z be modeled
on distant
estimation repel
the
of z to PI as deterministic algorithms
to Po and
when
in
, the point
is straightforward
clustering
discussion
axis . From namely
critical
and self - organizing
density
of the expression
-
z somewhat
that
the assignment
K - means , minimizing
be more
this
as Po models
) , it is preferable
by Po , so as to make
occurs
a certain
expression
( 52 )
a plot
be positive
below
for
( 8 ) : as long
bracketed
( 51 ) .
~ In ( 2 )
8 shows
on Po . This
Po ( z ) / ( Po ( z ) + Pi ( z ) ) falls
log -
in Equation
of Ro ( z ) as the horizontal
we see that exhibit
this
1 - Ro ( z ) Ro ( z ) -
( 1 - Ro ( z ) ) log
the mixture
l / ( Po ( z ) + Pl ( Z ) ) ~ 0 , the sign
derives
from
above . As for
( details
omitted
).
516
MICHAELKEARNSET AL. ~ ~ .
0
C\ I ~
0
0 ~
0 00 0 0
<0 0 0
~ a a
C\ I 0 0
0 0
-6
-4
-2
0
2
4
6
Figure 1: The sampling density for Example (A).
~ 0
M .
0
C' ! 0
or
-
0
0 0
-4
-2
0
2
Figure 2: The sampling density for Example (B) .
4
HARDANDSOFTASSIGNMENTS FORCLUSTERING
517
Figure 3: Evolution of the K -means loss (top plot ) and its decomposition for Example (B) : KL divergenceswoKL(QoIIPo) + wIKL (Ql \\P1) (bottom plot ) and partition information gain 1 2( wo) (middle plot ) , as a function of the iteration of unweighted K -means running on 10 thousand examplesfrom Q == N (0, 1) .
~ 0
C\ I 0
.
.
-
ci
0 0
-6
-4
-2
0
2
4
6
Figure 4: Plot of the sampling mixture density Q = O.95N (O, 1) + 0.OSN(5, 0.1) for Example (C) .
518
MICHAEL KEARNS ET AL.
aJ a
<0 .
a
-. : t
0
~ a
0 0
-6
-4
-2
0
2
4
6
Figure 5: Po and PI found by unweighted K -means for the pIing density of Example (C).
0
C\ ! 0
&l ) ~
0
0 ~
0
&! ) 0
a
0
ci
- 10
-5
0
5
Figure 6: The sampling density for Example (D).
10
sam -
HARDAND SOFTASSIGNMENTS FORCLUSTERING
0
N
.................................................................................
' . . . ' .
"
,
..
it ' )
.
.
......
-
. .
Q) CJ C
,.-
"
, "
..,
" ! h "
.."....
. .
. . ... .'
U)
C 0
"
.'
. .. '
.....
m ~
" C
. . , . . . . . . .
"
..
..'
.... .'
0
.." ,
. '
.
..-
.. .'
.. .'
-
tU ": : tU >
.....................,.. it ) .
0
0 ,
a
0
1
2 distance
between
3
4
means
Figure 7: Variation distance V (Po, Pi ) as a funlction of the distance betweenthe sampling meansfor EM (bottom grey line), unweighted K -means (lowest of top three grey lines) , posterior loss gradient descent (middle to top three grey lines), and weighted K -means (top grey line) . The dark line plots V (Qo, Ql ) . M
N
y -
o
.. .. . . . ... .. .. . . . .. . . . . .
. .. .. . .. .. .. . .. . .. .. . ... . .. ... .. . . . .. .. .. . ... .. .. .. . .. .. . . . . .. . . .. .. .. . . . . .. .. .. . .. ... . . . ... . .. .. . . .. . . . . .. . . . .. ... . . .. . . . . . .. . . . . . ... . ..
y I
0 .0
0 .2
0 .4
r
0 .6
0 .8
1 .0
Figure 8: Plot of Equation (52) (vertical axis) as a function of Ro = Ro(z ) (horizontal axis) . The line y = 0 is also plotted as a reference.
519
520
MICHAELKEARNSET AL.
References T .M . Cover and J .A . Thomas .
Element . 0/ In / ormation
Theory . Wiley - Interscience ,
1991 .
A .P. Dempster , N .M . Laird , and D .B . Rubin . Maximum -likelihood from incomplete data via the em algorithm . Journal 0/ the Royal Stati , tical Society B , 39:1- 39, 1977. R .O . Duda and P .E . Hart . Pattern Cla , ..ification and Scene Analy . i . . John Wiley and Sons , 1973 .
A . Gersho . On the structure
of vector quantizers . IEEE
Tran , action . on In / ormation
Theory, 28(2):157- 166, 1982. J . Hertz , A . Krogh , and R .G . Palmer . Introduction to the Theor 'JIof Neural Computation . Addison - Wesley , 1991. S. L . Lauritzen . The EM algorithm for graphical association models with missing data . Computational Stati ..tic . and Data Analy . i . , 19:191- 201, 1995. J . MacQueen . Some methods for classification and analysis of multivariate observations . In Proceeding . of the Fifth Berkeley Sympo . ium on Mathematic . , Stati . tic . and Prob ability , volume 1, pages 281- 296, 1967. L . Rabiner and B . Juang . Fundamentall of Speech Recognition . Prentice Hall , 1993.
LEARNING HYBRID BAYESIAN NETWORKS FROM DATA
STEFANO
MONTI
Intelligent Systems Program University of Pittsburgh 901M CL, Pittsburgh , PA - 15260 AND GREGORY
F . COOPER
Center for Biomedical Informatics University of Pittsburgh 8084 Forbes Tower ,
Pittsburgh, PA - 15261
Abstract . We illustrate two different methodologies for learning Hybrid Bayesian networks , that is, Bayesian networks containing both continuous and discrete variables , from data . The two methodologies differ in the way of handling continuous data when learning the Bayesian network structure . The first methodology uses discretized data to learn the Bayesian network structure , and the original non-discretized data for the parameterization of the learned structure . The second methodology uses non-discretized data both to learn the Bayesian network structure and its parameterization . For the direct handling of continuous data , we propose the use of artificial neural networks as probability estimators , to be used as an integral part of the scoring metric defined to search the space of Bayesian network structures . With both methodologies , we assume the availability of a complete dataset , with no missing values or hidden variables . We report experimental results aimed at comparing the two method ologies. These results provide evidence that learning with discretized data presents advantages both in terms of efficiency and in terms of accuracy of the learned models over the alternative approach of using non-discretized data .
521
522 1.
STEFANO MONTIANDGREGORY F. COOPER
Introduction
Bayesian belief networks (BN s) , sometimes referred to as probabilistic net works , provide a powerful formalism for representing and reasoning under uncertainty
. The construction
of BNs with domain
experts often is a diffi -
cult and time consuming task [16]. Knowledge acquisition from experts is difficult because the experts have problems in making their knowledge explicit . Furthermore , it is time consuming because the information needs to be collected manually . On the other hand , databases are becoming increasingly abundant in many areas. By exploiting databases, the construction time of BN s may be considerably decreased. In most approaches to learning BN structures from data , simplifying assumptions are made to circumvent practical problems in the implementa tion of the theory . One common assumption is that all variables are discrete
[7, 12, 13, 23], or that all variables are continuous and normally distributed [20]. We are interested in the task of learning BNs containing both continuous and discrete variables , drawn from a wide variety of probability distri butions . We refer to these BNs as Hybrid Bayesian networks . The learning task consists of learning
the BN structure , as well as its parameterization
.
A straightforward solution to this task is to discretize the continuous variables , so as to be able to apply one of the well established techniques available for learning BNs containing discrete variables only . This approach has the appeal of being simple . However , discretization can in general generate spurious dependencies among the variables , especially if "local " dis-
cretization strategies (i .e., discretization strategies that do not consider the interaction between variables) are used1. The alternative to discretization is the direct modeling of the continuous data as such. The experiments described in this paper use several real and synthetic databases to investi gate whether the discretization of the data degrades structure learning and parameter
estimation
when using a Bayesian network
representation
.
The useof artificial neural networks (ANN s) as estimators of probability distributions presents a solution to the problem of modeling probabilistic relationships involving mixtures of continuous and discrete data . It is par ticularly attractive because it allows us to avoid making strong parametric assumptions about the nature of the probability distribution governing the relationships among the participating variables . They offer a very general semi-parametric technique for modeling both the probability mass of dis1Most discretization techniques have been devised with the classification task in mind , and at best they take into consideration the interaction between the class variable and the feature variables individually . "Global " discretization for Bayesian networks learning , that is , discretization taking into consideration the interaction between all dependent variables , is a promising and largely unexplored topic of research , recently addressed in
the work described in [19].
523
LEARNING HYBRID BAYESIAN NETWORKS FROM DATA
crete variables and the probability density of continuous variables . On the other hand , as it was shown in the experimental evaluation in [28) (where only discrete data was used) , and ~ it is confirmed by the evaluation reported in this paper , the main drawback of the use of ANN estimators is the computational cost associated with their training when used to learn the BN structure . In this paper we continue the work initiated in [28), and further explore the use of ANNs as probability distribution estimators , to be used as an integral part of the scoring metric defined to search the space of BN struc tures . We perform an experimental evaluation aimed at comparing the new learning method with the simpler alternative of learning the BN structure based on discretized data . The results show that discretization is an efficient and accurate method of model selection when dealing with mixtures of continuous and discrete data . The
rest
of
introduce
to
learn
and
s
In
probability
.
- based
Section
4 ,
results
procedure
,
and
of
as
network
In
Section
we
3 ,
the
describe
.
use
,
the with
a
and
with
some
our
2
artificial
in
Section
5
space
we
exper
for
-
learning based
on
paper
with
the
suggestions
BN as
present
proposed
conclude
,
of
networks
alternative
We
how
method
the
the
briefly of
learning
neural
of
we
basics
search
of
simple .
some
Section
to
efficacy
variables
In and
describe
Finally
evaluating it
.
used
the
continuous
results
we
metric
comparing the
follows
formalism
scoring
at
at of
discussion
organized
estimators aimed
discretization
further
the a
research
.
Background
A
Bayesian
is
a
belief
directed variables
}
of
probability 7I " x
example
to
of
the
a
links
a
Furthermore
the
simple
set
set
,
,
arcs
E
of
we
see can
tumor
( G X
=
=
{
the
of
x
in
and
by
metastatic cause
X
, Xl
cause
papilledema
O .
, giving
increase
G
)
I
Xi
.
, Xj
E
; a
derived
1 , in
( Xl in
X
,
part
causal )
Xi
and
X
we
#
0
P
)
is
is E
a X
give
,
an
from
[ 11
] .
interpretation is
a
total
( xs
, E
;
node
Figure
a
( X
representing
variables
Given
In
==
}
variables2
cancer an
) , where
, . . . , xn
domain
structure ,
can
( Xi
in
parents
that
, P
{ Xl
domain
instantiations
also
, 0
among of
structure
it
brain
triple
nodes
network
network
that
a
of of
the
Bayesian
the
and
set
dependencies
over
displayed ) ,
by
a
instantiations
denote
at
( X3
) '
with
possible
to
looking
tumor
and
distribution
use
defined
with
probabilistic
space
we
is
graph ,
representing
the
By
network
acyclic
domain X j
( X2
data
ANN
distribution
imental
.
is belief
from
the .
paper
Bayesian
BN
define
structures
2
the
the
cause
of
serum
) ,
and
brain
calcium
both
brain
2An instantiation w of all n variables in X is an n-uple of values { xi , . . . , x~} such that Xi = x~ for i = 1 . . . n .
524
STEFANO MONTIANDGREGORY F. COOPER -
P (XI ) P (x2IXl ) P (x21 Xl ) P (x31xI ) P (x31 Xl ) P (x4Ix2 , X3) P (X4\ X2, X3) P (x41 X2, X3) P (X41X2, X3) P (x51x3) P (x51 X3)
0.2 0.7 0.1 0.6 0.2 0.8 0.3 0.4 0.1 0.4 0.1
Xl : X2: X3: X4: X5:
tumor
and
an increase
a coma The
key feature
is usually
calcium
of each variable
is their
events
( domain
to as the Markov
the
Bayesian
its parents
1T ' i , with
conditional
network
8i is represented corresponding
entry
in the
table
P ( X~ 11T ' ~, 8i ) for a given probability the
probabilities
in
the
by means to the
instantiation belief
example
all the
of the variable
complete
probability
for the distribu
network
needed
, with
refer are dis -
of a lookup
table ,
probability
Xi and its parents of X . In
-
P ( Xi l7ri , 8i )
variables
conditional
,
. This
can then fact , it
7ri .
be com has
been
shown [29 , 35 ] that the joint probability of any particular instantiation all n variables in a belief network can be calculated as follows :
of
(1)
-
n
-
-
-
from
instantiation
of any
its parents
, and it allows
. For
1 , where
of
) . In particular
9i the set of parameters
with
puted
to lapse
representation
distributions
probability
of Figure
given
joint
conditional
crete , each set of parameters
The
a patient
variables
property
of the multivariate
of the univariate
Xi given
explicit
of its non - descendants
referred
characterize
each
papilledema
can cause
networks
among
representation
over X in terms
ence to the
coma
set of nodes x = { Xl , X2, Xa, X4, X5} , and parent { X2, X3} , 7rx5 = { X3} . All the nodes represent domain { True , False} . We use the notation Xi tables give the values of p (Xi l7rxi ) only , since
serum
of Bayesian
is independent
parsimonious
to fully
in total
independence
each variable
tion
brain tumor
(X4 ) .
conditional property
total serum calcium
X5
Figure 1. A simple belief network , with sets 7rX} = 0, 7rx2 = 7rx3 = { Xl } , 7rx4 = binary variables , taking values from the to denote (Xi = False ) . The probability p (Xi l7rxi ) = 1 - p (Xi l7rxi ) .
into
metastatic cancer
-
-
-
-
-
-
P ( x ~ , . . . , x ~ ) = II P (x ~ 17r~i ' 8i ) . i ==l
-
-
guide to the literature -
3For a comprehensive
[6].
on learning probabilistic
networks , see
LEARNINrGHYBRIDBAYESIAN NETWORKS FROMDATA 2 .1. LEARNING
BAYESIAN
BELIEF
525
NETWORKS3
In a Bayesian framework , ideally classification and prediction would be performed by taking a weighted average over the inferences of every possible BN containing the domain variables4 . Since this approach is usually computationally infeasible , due to the large number of possible Bayesian networks , often an attempt has been made to select a high scoring Bayesian network of this
for classification
. We will assume this approach
in the remainder
paper .
The basic idea of the Bayesian approach is to maximize the probability
P (Bs I V ) = P (Bs , V )j P (V ) of a network structure Bs given a database of casesV . Becausefor all network structures the term P (V ) is the same, for the purpose of model selection it suffices to calculate P (Bs , ' D) for all Bs .
So far , the Bayesian metrics studied in detail typically rely on the fol -
lowing assumptions: 1) given a BN structure, all cases in V are drawn independently from the same distribution (random sample assumption); 2) there are no caseswith missing values (complete databaseassumption; some more recent studies have relaxed this assumption [1, 8, 10, 21, 37]); 3) the parameters of the conditional probability distribution of each variable
are independent (global parameter independenceassumption); and 4) for discrete variables
the parameters
associated
with each instantiation
of the
parents of a variable are independent (local parameter independence assumption ) . The last two assumptions can be restated more formally as
follows. Let 8Bs == { 8l , . . . , 8n} be the complete set of parameters for the BN structure Bs , with each of the 8i 's being the set of parameters that
fully characterize the conditional probability P (Xi l7ri). Also, when all the variables in 7ri are discrete, let 8i = { Oil' . . . ' Oiqi} ' where Oij is the set of parameters defining a distribution that corresponds to the j -th of the qi possible instantiations of the parents 7ri. From Assumption 3 it follows that P (8Bs I Bs ) = IIi P (8i I Bs ), and from assumption 4 it follows that
P(8i I Bs) = IIj P (8ij I Bs) [36]. The application of these assumptions allows for the following factoriza -
tion of the probability P (Bs , V ): n
P(Bs,V) = P(Bs)P(V IBs) = P(Bs) II S(Xi,7ri,V) ,
(2)
i= l
where each S(Xi, 7ri, V ) is a term measuring the contribution of Xi and its parents 7ri to the overall score of the network
structure
Es . The exact form
of the terms S(Xi 7ri , V ) slightly differs in the Bayesian scoring metrics de4Seethe work described in [24, 25] for interesting applications of the Bayesian model averaging approach .
526
STEFANO MONTIANDGREGORY F. COOPER
fined so far , and for the details we refer the interested reader to the relevant literature [7, 13, 23] . To date , closed-form expressions for S(Xi 7ri, V ) have been worked out for the cases when both Xi and 7ri are discrete variables , or when both Xi and 7ri are continuous (sets of ) variables normally distributed ; little work has been done in applying BN learning methods to domains not satisfying these constraints . Here , we only describe the metric for the discrete case defined by Cooper and Herskovits in [13], since it is the one we use in the experiments . Given a Bayesian network Bs for a domain X , let ri be the number of states of variable Xi , and let qi = I1XsE7rirs be the number of possible instantiations of 7ri . Let (}ijk denote the multinomial parameter correspond ing to the conditional probability P (Xi = k l7ri = j ), where j is used to index the instantiations of 7ri, with (}ijk > 0, and Ek (}ijk = 1. Also , given the database V , let Nijk be the number of cases in the database where Xi = k and 7ri = j , and let N ij = Ek N ilk be the number of cases in the database where 7ri = j , irrespective of the state of Xi . Given the assumptions described above, and provided all the variables in X are discrete , the probability P (V , Bs ) for a given Bayesian network structure Bs is given by
nqi r(ri) ri P(V,Bs )=P(Bs )gjllr(Nij +Ti )Elr(Nijk ),
(3)
where Γ is the gamma function.5 Once a scoring metric is defined, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 9]. Various heuristics have been proposed to find network structures with a high score. One such heuristic is known as K2 [13], and it implements a greedy forward-stepping search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes a non-informative prior over parameters and structures. In particular, the prior probability distribution over the network structures is assumed to be uniform, and thus it can be ignored in comparing network structures.
As previously stated, the Bayesian scoring metrics developed so far assume either discrete variables [7, 13, 23] or normally distributed continuous variables [20]. In the next section, we propose a generalization that allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions.
5 Cooper and Herskovits [13] defined Equation (3) using factorials, although the generalization to gamma functions is straightforward.
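As an illustration of how Equation (3) and the K2-style search fit together, the following is a minimal Python sketch. It is not the authors' implementation; the function and variable names are ours, it computes the log of the per-variable term in Equation (3), and it assumes `data` is an integer array whose columns are the discrete variables and `candidates` are the variables preceding X_i in the assumed ordering.

    import numpy as np
    from math import lgamma
    from itertools import product

    def local_score(data, i, parents, arity):
        """Log of the per-variable CH/K2 term of Equation (3),
        computed with log-gamma for numerical stability."""
        r_i = arity[i]
        score = 0.0
        # enumerate every joint instantiation j of the parents
        for j in product(*(range(arity[p]) for p in parents)):
            mask = np.ones(len(data), dtype=bool)
            for p, v in zip(parents, j):
                mask &= data[:, p] == v
            n_ijk = np.bincount(data[mask, i], minlength=r_i)
            n_ij = int(n_ijk.sum())
            score += lgamma(r_i) - lgamma(n_ij + r_i)
            score += sum(lgamma(n + 1) for n in n_ijk)
        return score

    def k2_parents(data, i, candidates, arity, max_parents=3):
        """Greedy forward-stepping (K2-style) selection of parents for X_i."""
        parents, best = [], local_score(data, i, [], arity)
        improved = True
        while improved and len(parents) < max_parents:
            improved = False
            gains = [(local_score(data, i, parents + [c], arity), c)
                     for c in candidates if c not in parents]
            if gains:
                s, c = max(gains)
                if s > best:
                    parents, best, improved = parents + [c], s, True
        return parents, best

Because the metric is decomposable, the search is run independently for each variable, and the network score is the sum of the selected local scores.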
3. An ANN-based scoring metric
In this section we describe in detail the use of artificial neural networks as probability distribution estimators, to be used in the definition of a decomposable scoring metric for which no restrictive assumptions need to be made on the functional form of the class, or classes, of the probability distributions of the participating variables. The first three of the four assumptions described in the previous section are still needed. However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the different instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data. Furthermore, the use of ANNs allows for the seamless representation of probability functions containing both continuous and discrete variables.
Let us denote with D_l = {C_1, ..., C_{l-1}} the set of the first l-1 cases in the database, and with x_i^(l) and π_i^(l) the instantiations of X_i and π_i in the l-th case, respectively. The joint probability P(B_S, D) can be written as
    P(B_S, D) = P(B_S) P(D | B_S) = P(B_S) \prod_{l=1}^{m} P(C_l | D_l, B_S) = P(B_S) \prod_{l=1}^{m} \prod_{i=1}^{n} P(x_i^{(l)} | \pi_i^{(l)}, D_l, B_S) .    (4)

If we assume uninformative priors over structures, that is, a uniform prior P(B_S) over the network structures, the prior can be neglected in comparing network structures. The probability P(B_S, D) in Equation (4) is clearly decomposable. In fact, we can interchange the two products in Equation (4) so as to obtain

    P(B_S, D) = P(B_S) \prod_{i=1}^{n} \Bigl[ \prod_{l=1}^{m} P(x_i^{(l)} | \pi_i^{(l)}, D_l, B_S) \Bigr] = P(B_S) \prod_{i=1}^{n} S(X_i, \pi_i, D) ,    (5)

where each S(X_i, π_i, D) is the term in square brackets in Equation (5). The form of Equation (5) suggests an interpretation of the score in terms of the prequential analysis discussed by Dawid [14, 15]. Each term S(X_i, π_i, D) can be interpreted as a measure of how successful the network structure B_S is at predicting the value of X_i given the value of its parents π_i, where the prediction for each case C_l is carried out sequentially, based only on the cases already seen (i.e., on D_l). The name prequential (predictive sequential) derives from this interpretation; the prequential approach is theoretically sound, and corresponds to a form of cross-validation of the model as the data become available.
From a Bayesian perspective, each of the P(x_i | π_i, D_l, B_S) terms should be computed as follows:

    P(x_i | \pi_i, D_l, B_S) = \int P(x_i | \pi_i, \theta_i, B_S) \, P(\theta_i | D_l, B_S) \, d\theta_i .
In most cases this integral does not have a closed-form solution; the following MAP approximation can be used instead:

    P(x_i | \pi_i, D_l, B_S) = P(x_i | \pi_i, \tilde{\theta}_i, B_S) ,    (6)

with θ~_i the posterior mode of θ_i, i.e., θ~_i = argmax_{θ_i} { P(θ_i | D_l, B_S) }. As a further approximation, we use the maximum likelihood (ML) estimator θ^_i instead of the posterior mode θ~_i. The two quantities are actually equivalent if we assume a uniform prior probability for θ_i, and are asymptotically equivalent for any choice of positive prior. The approximation of Equation (6) corresponds to the application of the plug-in prequential approach discussed by Dawid [14].
Artificial neural networks can be designed to estimate θ^_i in both the discrete and the continuous case. Several schemes are available for training a neural network to approximate a given probability distribution, or density. In the next section, we describe the softmax model for discrete variables [5], and the mixture density network model introduced by Bishop in [2] for modeling conditional probability densities.
Notice that even if we adopt the ML approximation, the number of terms to be evaluated to calculate P(D | B_S) is still very large (mn terms, where m is the number of cases, or records, in the database, and n is the number of variables in X), in most cases prohibitively so. The computational cost can be reduced by introducing a further approximation. Let θ^_i(l) be the ML estimator of θ_i with respect to the dataset D_l. Instead of estimating a distinct θ^_i(l) for each l = 1, ..., m, we can group consecutive cases in batches of cardinality t, and estimate a new θ^_i(l) for each addition of a new batch to the dataset D_l, rather than for each addition of a new case. Therefore, the same θ^_i(l), estimated with respect to the dataset D_l, is used to compute each of the t terms P(x_i^(l) | π_i^(l), θ^_i(l), B_S), ..., P(x_i^(l+t-1) | π_i^(l+t-1), θ^_i(l), B_S). With this approximation we implicitly make the assumption that, given our present belief about the value of each θ_i, at least t new cases are needed to revise this belief. We thus achieve a t-fold reduction in the computation needed, since we now need to estimate only m/t θ^_i's for each X_i, instead of the original m. In fact, application of this approximation to the computation of a given S(X_i, π_i, D) yields:
    S(X_i, \pi_i, D) = \prod_{l=1}^{m} P(x_i^{(l)} | \pi_i^{(l)}, \hat{\theta}_i(l), B_S) \approx \prod_{k=0}^{m/t - 1} \prod_{l=tk+1}^{t(k+1)} P(x_i^{(l)} | \pi_i^{(l)}, \hat{\theta}_i(tk+1), B_S) .    (7)

With regard to the selection of the batch cardinality t, we can choose a constant value, but the estimate θ^_i would then become increasingly insensitive to the addition of new cases as the training set D_l grows. A scheme in which t is a function of the number of cases already seen seems preferable. For example, assuming we set the increment to t = ceil(0.5 |D_l|), i.e., each new batch corresponds to a 50% increase of the training data already seen, the estimate θ^_i would be updated after cases 1, 2, 3, 5, 8, 12, 18, 27, 41, and so on. With this incremental scheme the estimate is very sensitive to new data when the training set is small, while an increasingly larger number of additional cases is needed to make a significant difference when the training set is large; for example, when the data set already contains 10,000 cases, it seems unlikely that the addition of a few new cases will make a significant difference, and a doubling of the data set would be required before the estimate is revised.

4. ANN estimators of probability distributions

In this section we describe the ANN estimators for the conditional probability distributions P(x_i | π_i): the softmax model for discrete variables [5], and the mixture density network model introduced by Bishop in [2], for modeling the conditional probability densities of continuous variables.

4.1. SOFTMAX MODEL FOR DISCRETE VARIABLES

Let X_i be a discrete variable taking one of the values v_k, k = 1, ..., r_i, and let π_i be its set of parents, with π_i^d denoting the discrete variables in π_i. The softmax model represents the conditional probability distribution P(X_i | π_i) with a neural network whose input units correspond to the parents π_i and whose r_i output units correspond to the r_i values of X_i. It is common practice to use an indicator-variable representation for the discrete parents, whereby a discrete parent X_j taking r_j values is encoded by r_j - 1 input units; the number of input units needed for the discrete parents is thus r_{π_i} = Σ_{X_j ∈ π_i^d} (r_j - 1), while each continuous parent requires a single input unit. The conditional probability of the k-th value of X_i is then defined as

    P(X_i = v_k | \pi_i) = \frac{ e^{f_k(\pi_i)} }{ \sum_{j=1}^{r_i} e^{f_j(\pi_i)} } ,    (8)

where f_k(π_i) is the value of the k-th output unit of the network when π_i is presented as input. The output of the k-th unit can thus be interpreted as the conditional probability of class membership P(X_i = v_k | π_i). It has been proved that a neural network trained by minimizing a sum-of-squares or cross-entropy
error function leads to network outputs that estimate the Bayesian a posteriori probabilities of class membership [3, 32].
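The following Python sketch illustrates the softmax parameterization of Equation (8). It is only an illustration under our own assumptions: the single tanh hidden layer is not prescribed by the text, and all names (one_hot, softmax_conditional, W1, b1, W2, b2) are hypothetical.

    import numpy as np

    def one_hot(value, arity):
        """Indicator encoding of a discrete parent value (arity - 1 inputs)."""
        v = np.zeros(arity - 1)
        if value < arity - 1:
            v[value] = 1.0
        return v

    def softmax_conditional(parent_values, parent_arities, W1, b1, W2, b2):
        """P(X_i = v_k | pi_i) for k = 1..r_i, as in Equation (8): a feed-forward
        network whose r_i outputs f_k are passed through a softmax."""
        x = np.concatenate([one_hot(v, r) for v, r in zip(parent_values, parent_arities)])
        h = np.tanh(W1 @ x + b1)        # hidden layer (an assumption of this sketch)
        f = W2 @ h + b2                 # one output f_k per state of X_i
        e = np.exp(f - f.max())         # numerically stable softmax
        return e / e.sum()

Given trained weights, the returned vector sums to one and its k-th entry is the estimated conditional probability of the k-th state.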
4.2. MIXTURE DENSITY NETWORKS FOR CONTINUOUS VARIABLES

The approximation of probability distributions by means of finite mixture models is a well established technique, widely studied in the statistics literature [17, 38]. Bishop [2] describes a class of network models that combine a conventional neural network with a finite mixture model, so as to obtain a general tool for the representation of conditional probability distributions. The probability P(x_i | π_i, D_l, B_S) can be approximated by a finite mixture of normals, as illustrated in the following equation (where we drop the conditioning on D_l and B_S for brevity):

    P(x_i | \pi_i) = \sum_{k=1}^{K} \alpha_k(\pi_i) \, \phi_k(x_i | \pi_i) ,    (9)

where K is the number of mixture components, 0 <= α_k <= 1 and Σ_k α_k = 1, and where each of the kernel functions φ_k(x_i | π_i), k = 1, ..., K, is a normal density of the form

    \phi_k(x_i | \pi_i) = c \, \exp\Bigl\{ - \frac{ (x_i - \mu_k(\pi_i))^2 }{ 2 \sigma_k(\pi_i)^2 } \Bigr\} ,    (10)

with c the normalizing constant, and μ_k(π_i) and σ_k(π_i)^2 the conditional mean and variance, respectively. The parameters α_k(π_i), μ_k(π_i), and σ_k(π_i)^2 can be considered as continuous functions of π_i. They can therefore be estimated by a properly configured neural network. Such a neural network will have three outputs for each of the K kernel functions in the mixture model, for a total of 3K outputs. The set of input units corresponds to the variables in π_i. It can be shown that a Gaussian mixture model such as the one given in Equation (9) can, with an adequate choice for K, approximate to an arbitrary level of accuracy any probability distribution. Therefore, the representation given by Equations (9) and (10) is completely general, and allows us to model arbitrary conditional distributions. More details on the mixture density network model can be found in [2].
Notice that the mixture density network model assumes a given number K of kernel components. In our case, this number is not given, and needs to be determined. The determination of the number of components of a mixture model is probably the most difficult step, and a completely general solution strategy is not available. Several strategies are proposed in [31, 33, 38]. However, most of these techniques are computationally expensive, and given our use of mixture models, minimizing the computational cost of the selection process becomes of paramount importance.
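A minimal sketch of the mixture density network output of Equations (9) and (10) is given below. The softmax link for the mixing coefficients and the exponential link for the variances follow Bishop's usual MDN construction, but are assumptions here, as are all the names used.

    import numpy as np

    def mdn_density(x_i, parent_vec, W1, b1, W2, b2, K):
        """Conditional density of Equations (9)-(10): a network with 3K outputs
        giving mixing weights alpha_k, means mu_k and variances sigma_k^2."""
        h = np.tanh(W1 @ parent_vec + b1)
        out = W2 @ h + b2                      # shape (3K,)
        a = np.exp(out[:K] - out[:K].max())
        alpha = a / a.sum()                    # mixing coefficients, sum to 1
        mu = out[K:2 * K]                      # conditional means
        sigma2 = np.exp(out[2 * K:])           # positive conditional variances
        phi = np.exp(-(x_i - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
        return float(alpha @ phi)

Training adjusts the weights so that the negative log of this density, summed over the training cases, is minimized; the same network is shared by all instantiations of the parents.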
Given a set of alternative model orders K = {1, ..., K_max}, we consider two alternative strategies for the selection of the best model order. The first strategy is based on a test on a held-out dataset ("hold-out"): the training data D is split into a training set D^train and a test set D^test; for each K in K, a mixture density network model M_K is trained on D^train, and the model order K that maximizes P(D^test | θ^_K, M_K) is selected, where θ^_K is the ML estimator of the parameters of M_K. The second strategy is based on the Bayesian Information Criterion (BIC) [30, 34], which provides the following asymptotic approximation to the marginal likelihood of the model M_K:

    \log P(D | M_K) \approx \log P(D | \hat{\theta}_K, M_K) - \frac{d}{2} \log N ,    (11)

where d is the number of parameters of the model (in our case, the number of weights of the ANN), and N is the size of the dataset. The model order K that maximizes Equation (11) is selected.

4.3. ANN TRAINING

Given a network structure B_S, the computation of each score S(X_i, π_i, D) requires the estimation of the ANN parameters θ^_i(l) (the network weights and, through them, the outputs α_k(π_i), μ_k(π_i), and σ_k(π_i) of the mixture density network) for each prequential term P(x_i^(l) | π_i^(l), θ^_i(l), B_S). Since training a new ANN from scratch for each prequential term would be computationally too costly, the weights of the ANN trained for one prequential term are used to initialize the ANN for the subsequent term; the weights of the first ANN are randomly initialized with values in the interval (-.5, .5). Since only a relatively small number of new cases is added to the training set from one term to the next, this initialization usually results in much faster convergence of the optimization. The number of hidden units of each ANN is fixed in advance, and the optimization of the network weights is carried out with a conjugate-gradient backpropagation algorithm [27]. Currently, we do not use any regularization technique or other explicit control of over-fitting in the training of the ANNs. For each variable X_i, the ANN is re-estimated only when a new batch of cases is added to D_l (following the batch-
updating scheme described at the end of Section 3). This strategy will be particularly beneficial for a large-sized D_l, where the addition of a new case (or a few new cases) will not change significantly the estimated probability.

5. Experimental evaluation

In this section we describe the experimental evaluation we conducted to test the viability of the ANN-based scoring metric. We first describe the experimental design. We then present the results and discuss them.
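Algorithm A1, described in Section 5.1 below, relies on discretizing each continuous variable into bins that hold roughly the same number of data points. A minimal sketch of such an "equal-density" discretizer, under our own naming and using quantile cut points, is:

    import numpy as np

    def equal_density_bins(values, n_bins):
        """Assign each continuous value to one of n_bins bins chosen so that
        every bin holds approximately the same number of data points."""
        cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.searchsorted(cuts, values, side="right")

The returned integer codes (0, ..., n_bins - 1) replace the continuous values before the discrete scoring metric of Equation (3) is applied.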
5.1. EXPERIMENTAL DESIGN

The experimental evaluation is aimed at determining whether the use of the new ANN-based scoring metric, which is applicable to both discrete and continuous data, offers any advantage over the simpler approach of discretizing the continuous variables and applying a scoring metric for discrete variables. To this end, we compare two learning algorithms. Algorithm A1 first discretizes the continuous variables, and then searches for the BN structure with the highest score using the metric of Equation (3). Algorithm A2 searches for the BN structure with the highest score using the ANN-based scoring metric defined by Equations (7), (8), and (9), which is applied to the original, untransformed data.
With regard to the discretization, we used a simple "equal-density" discretization technique, whereby the range of each continuous variable is partitioned into a given number of contiguous bins, each bin containing an approximately equal number of data points, and each value is assigned the discrete value corresponding to the bin it belongs to. We would expect algorithm A1 to be faster than algorithm A2, due to the closed form of Equation (3) and to the cost of training the ANN estimators, but possibly less accurate, due to the information loss caused by the discretization.
The evaluation is aimed at two goals: testing the predictive accuracy of the models learned by the two algorithms, and testing their capability of discovering structure, in particular the set of parents of a given response variable. With regard to the first goal, the use of real data is the most appropriate. With regard to the second goal, the use of simulated data, generated from BNs whose structure and parameterization are fully known, is more appro-
priate, since the generating BN represents the gold standard with which we can compare the model(s) selected by the learning procedure. To assess the predictive accuracy of the two algorithms, we measured the mean square error (MSE) and the log-score (LS) with respect to the class variable y on a test set distinct from the training set. The mean square error is computed with the formula
    MSE = \frac{1}{L} \sum_{D^{test}} \bigl[ y^{(l)} - \hat{y}(\pi_y^{(l)}) \bigr]^2 ,    (12)
where D^test is a test set of cardinality L, y^(l) is the value of y in the l-th case of D^test, and y^(π_y^(l)) is the value of y predicted by the learned BN for the given instantiation π_y^(l) of y's parents. More specifically, y^(π_y^(l)) is the expectation of y with respect to the conditional probability P(y | π_y^(l)). Similarly, the log-score LS is computed with the formula
    LS = - \log p(D^{test}) = - \sum_{D^{test}} \log p\bigl( y^{(l)} \mid \pi_y^{(l)} \bigr) ,    (13)
where p(y^(l) | π_y^(l)) is the conditional probability of the observed y^(l) in the learned BN.
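A minimal sketch of these two test-set measures, under our own naming (the arrays passed in are assumptions, not part of the original text):

    import numpy as np

    def mse_and_log_score(y_true, y_pred, p_true):
        """MSE (Equation (12)) and log-score LS (Equation (13)) over a test set.
        y_pred[l] is the expectation of y under P(y | pi_y) for the l-th case;
        p_true[l] is the probability the learned BN assigns to the observed y."""
        y_true, y_pred, p_true = map(np.asarray, (y_true, y_pred, p_true))
        mse = float(np.mean((y_true - y_pred) ** 2))
        ls = float(-np.sum(np.log(p_true)))
        return mse, ls

Lower values of both measures indicate a better model.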
both For
MSE
the
and LS the lower
evaluation
with
real
the score the better databases
, we
used
the model . databases
from
the
data repository at UC Irvine [26]. In particular , we used the databases AUTO - MPG and
variable
ABALONE
can be treated
the class variable
. These
databases
were
as a continuous
variable
is miles - per - gallon
selected
because
. In the database
their
class
AUTO - MPG
. In the database ABALONE the class
variable is an integer proportional
to the age of the mollusk , and can thus
be treated
. The
as a continuous
variable
database
AUTO - MPG has a total
of
392 cases over eight variables , of which two variables are discrete , and six are
continuous
. The
variables , of which were
normalized
database
ABALONE
only one variable
has
a total
of 4177
cases
is discrete . All continuous
over
nine
variables
.
Since we are only interested
in selecting
the set of parents
of the re -
sponse variable , the only relevant ordering of the variables needed for the search algorithm is the partial ordering that has the response variable as the successor
of all the other
variables
.
All the statistics reported are computed over ten simulations. In each simulation, 10% of the cases are randomly selected as the test set, and the learning algorithms use the remaining 90% of the cases for training. Notice that in each simulation the test set is the same for the two algorithms.
For the evaluation with simulated databases, we designed the experiments with the goal of assessing the capability of the scoring metrics to correctly identify the set of parents of a given variable. To this purpose,
Figure 2. General structure of the synthetic BNs used in the experiments, with π_y denoting the set of parents of the response variable y, and X' the set of remaining variables.

we used randomly generated synthetic BNs, structured as follows. Let X = {X_1, ..., X_n} denote the set of variables in the domain, and let y denote the designated response variable. A set π_y of n_π variables is randomly selected from X \ {y} as the set of parents of y, and the remaining variables X' = X \ ({y} U π_y) are used as parents of the variables in π_y; that is, the variables in X' can influence y only indirectly, through π_y, and y is conditionally independent of X' given π_y. Figure 2 shows the prototypical structure of the synthetic BNs used in these experiments. All the variables are continuous, and the conditional probability of each variable given its parents is modeled as a finite mixture of linear models, as follows:

    P(x_i | \pi_i) = \sum_{k=1}^{K} \alpha_k \, N\bigl( \mu_k(\pi_i), \sigma_k \bigr) , \qquad \mu_k(\pi_i) = \beta_{0k} + \sum_{X_j \in \pi_i} \beta_{jk} x_j ,    (14)

where N(μ, σ) denotes a Normal distribution with mean μ and standard deviation σ. The standard deviations σ_k and the regression parameters β_jk are real numbers randomly drawn from uniform distributions. The interval from which the regression parameters are drawn deserves some care: with regression parameters drawn from a fixed interval, the resulting conditional distribution would be unlikely to depart significantly from a singly peaked curve as the number of mixture components
increases. Therefore, we choose to increase the magnitude of the regression parameters with the number of mixture components, in an attempt to obtain a multimodal shape for the corresponding conditional probability function. The α_k are real numbers randomly drawn from a uniform distribution over the interval (0, 1], then normalized to sum to 1.
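As an illustration of this generative process, the following Python sketch draws values of a child variable from the mixture of linear-Gaussian models of Equation (14). The exact intervals from which the β and σ parameters are drawn are partly unreadable in our copy of the text, so the ranges below are illustrative assumptions, as are all the names used.

    import numpy as np

    def sample_mixture_of_linear_models(parents, K, n_cases, rng):
        """Draw n_cases values of a child whose conditional density is a
        K-component mixture of linear-Gaussian models (Equation (14)).
        `parents` is an (n_cases, p) array of already-sampled parent values."""
        p = parents.shape[1]
        alpha = rng.uniform(0.0, 1.0, size=K)
        alpha /= alpha.sum()                              # mixing weights, sum to 1
        beta = rng.uniform(1.0, K + 1.0, size=(K, p + 1)) # illustrative range only
        sigma = rng.uniform(0.1, 1.0, size=K)             # illustrative range only
        comp = rng.choice(K, size=n_cases, p=alpha)
        mu = beta[comp, 0] + np.einsum('ij,ij->i', parents, beta[comp, 1:])
        return mu + sigma[comp] * rng.standard_normal(n_cases)

    # usage: rng = np.random.default_rng(0); x = sample_mixture_of_linear_models(parent_data, 3, 300, rng)

Increasing the spread of the β coefficients with K is what produces the multimodal conditional shapes described above.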
Several simulations
were run for different
combinations
of parameter
settings. In particular : i) the number n1r of parents of y in the generating synthetic network , was varied from 1 to 4; ii ) the number K of mixture
components used in Equation (14) was varied from 3 to 7 (Note: this is the number
of linear models included
in the mixtures
used to generate the
database ) ; and iii ) the number of bins used in the discretization was either 2 or 3. Furthermore , in the algorithm A2 , the strategy of order selection for
the mixture density network (MDN ) model was either hold-out or BIG (see Section 4) . Finally , we chose the maximum admissible MDN model order
(referred to as Kmax in Section 4) to be 5. That is, the best model order was selected from the range 1, . . . , 5. Finally , we ran simulations
with
datasets
of different cardinality . In particular , we used datasets of cardinality 600 , and
300,
900 .
For each parameter setting , both algorithms were run five times , and in each run 20% of the database cases were randomly selected as test set, and the remaining 80% of the database cases were used for training . We ran several simulations
, whereby a simulation
consists of the follow -
ing steps :
- a synthetic Bayesian network B_S is generated as described above;
- a database D of cases is generated from B_S by Markov simulation;
- the two algorithms A1 and A2 are applied to the database D, and relevant statistics on the algorithms' performance are collected.
The collected statistics for the two algorithms were then compared by means of standard statistical tests. In particular, for the continuous statistics, namely the MSE and the LS, we used a simple t-test, and the Welch-modified two-sample t-test for samples with unequal variance. For the discrete statistics, namely the number of arcs added and omitted, we used a median test (the Wilcoxon test).

5.2.
RESULTS
Figure 3 and Figure 4 summarize the results of the simulations with real and synthetic databases, respectively . As a general guideline , for each discrete measure (such as the number of arcs added or omitted ) we report the 3-tuple
(min, median, max) . For each continuous measure (such as the log-score) we report
mean and standard
deviation .
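A small helper that produces exactly these two kinds of summaries over a set of simulation runs (the function name and interface are ours):

    import numpy as np

    def summarize_runs(values, discrete=False):
        """(min, median, max) for discrete measures such as arc counts;
        (mean, standard deviation) for continuous measures such as MSE or LS."""
        v = np.asarray(values, dtype=float)
        if discrete:
            return float(v.min()), float(np.median(v)), float(v.max())
        return float(v.mean()), float(v.std(ddof=1))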
    DB         P1           P2            +            -
    auto-mpg   (2, 2, 2)    (2, 2.5, 4)   (0, 1, 2)    (0, 0, 1)
    abalone    (3, 4, 4)    (0, 3, 4)     (0, 1, 2)    (1, 2, 4)

    DB         MSE1          MSE2          LS1           LS2           time-ratio
    auto-mpg   0.13 (0.01)   0.14 (0.02)   0.25 (0.04)   0.28 (0.07)   40 (5)
    abalone    0.92 (0.06)   1.09 (0.3)    1.31 (0.06)   1.41 (0.14)   42 (12)
Figure 3. Comparison of algorithms A1 and A2 on real databases. The first table reports statistics about the structural differences of the models learned by the two algorithms. In particular, P1 and P2 are the numbers of parents of the response variable discovered by the two algorithms A1 and A2, respectively. The numbers of arcs added (+) and omitted (-) by A2 with respect to A1 are also shown. For each measure, the 3-tuple (min, median, max) is shown. The second table reports mean and standard deviation of the mean square error (MSE), of the log-score (LS), and of the ratio of the computational time of A2 to the computational time of A1 (time-ratio).
With regard to the performance of the two algorithms when coupled with the two alternative MDN model selection criteria (namely, the BIC-based model selection and the "hold-out" model selection), neither of the two criteria is significantly superior. Therefore, we report the results corresponding to the use of the BIC-based model selection only. In Figure 3 we present the results of the comparison of the two learning algorithms based on the real databases. In particular we report: the number of parents P1 and P2 discovered by the two algorithms A1 and A2; the corresponding mean square error MSE and log-score LS; and the ratio of the computational time for A2 to the computational time for A1, denoted with time-ratio.
In Figure 4 we present the results of the comparison of the two learning algorithms based on the simulated databases. In particular , the first table
of Figure 4 reports the number of arcs added (+ ) and the number of arcs omitted (- ) by each algorithm with respect to the gold standard Bayesian network GS (i .e., the synthetic BN used to generate the database ). It also reports the number of arcs added (+ ) and omitted (- ) by algorithm A2 with respect to algorithm AI . The second table of Figure 4 reports the measures MSE and LS for the two algorithms Al and A2 , and the time -ratio . Notice that the statistics shown in Figure 4 are computed over different
    # of cases   GS vs A1: +   GS vs A1: -   GS vs A2: +   GS vs A2: -   A1 vs A2: +   A1 vs A2: -
    300          (0, 0, 2)     (0, 1, 4)     (0, 0, 2)     (0, 1, 3)     (0, 0, 2)     (0, 0, 2)
    600          (0, 0, 1)     (0, 0.5, 3)   (0, 0, 3)     (0, 1, 3)     (0, 0, 3)     (0, 0, 2)
    900          (0, 0, 1)     (0, 0, 3)     (0, 1, 3)     (0, 0, 3)     (0, 1, 3)     (0, 0, 2)

    # of cases   MSE1          MSE2          LS1           LS2           time-ratio
    300          0.72 (0.33)   0.73 (0.33)   0.93 (0.35)   0.96 (0.33)   29 (7)
    600          0.73 (0.32)   0.74 (0.33)   0.77 (0.34)   0.80 (0.37)   36 (10)
    900          0.78 (0.30)   0.78 (0.32)   0.91 (0.27)   0.92 (0.28)   35 (8)
Figure 4. Comparison of algorithms A1 and A2 on simulated databases. In the first table, the comparison is in terms of the structural differences of the discovered networks; each entry reports the 3-tuple (min, median, max). In the second table, the comparison is in terms of predictive accuracy; each entry reports mean and standard deviation of the quantities MSE and LS. It also reports the time-ratio, given by the ratio of the computational time of A2 to the computational time of A1.
settings of the parameters determined by the experimental design described in the previous section.
The statistical analysis of the results reported in Figure 3 and Figure 4 shows that the difference between the two algorithms in terms of prediction accuracy, measured by both the mean square error and the log-score, is not statistically significant (p > .01), either for the real databases or for the simulated databases, while the difference in terms of computational time, measured by the ratio of the computational time of A2 to the computational time of A1, is statistically significant (p < .01). With regard to the structure of the discovered BNs, the two algorithms select a comparable number of parents for the response variable, and on the simulated databases both algorithms recover networks close to the gold standard (remember that the gold standard is the BN used to generate the simulated databases), with algorithm A2 showing a larger variability in the discovered structures than algorithm A1. An unexpected result is that the prediction accuracy of both algorithms decreased when using the datasets of 900 cases with respect to the datasets of 600 cases. The analysis of the experimental design and of the collected statistics suggests that
this is due to an anomaly from sampling. However, the point warrants further verification by testing the algorithms on larger datasets.
5.3. DISCUSSION

The results shown in Figures 3 and 4 support the hypothesis that discretization of continuous variables does not decrease the accuracy of recovering the structure of BNs from data. They also show that using discretized continuous variables to construct a BN structure (algorithm A1) is significantly faster (by a factor ranging from about 30 to 40) than using untransformed continuous variables (algorithm A2). Also, the predictions based on A1 are at least as accurate as (and often more accurate than) the predictions based on A2.
Another important aspect differentiating the two learning methods is the relative variability of the results for algorithm A2 compared with the results for algorithm A1, especially with regard to the structure of the learned models. In Figures 3 and 4 the number of parents of the class variable discovered by algorithm A1 over multiple simulations remains basically constant (e.g., 2 parents in the database AUTO-MPG, 4 parents in the database ABALONE). This is not true for algorithm A2, where the difference between the minimum and maximum number of arcs discovered is quite high (e.g., when applied to the database ABALONE, A2 discovers a minimum of 0 parents and a maximum of 4 parents). These results suggest that the estimates based on the ANN-based scoring metric are not very stable, probably due to the tendency of ANN-based search to get stuck in local maxima in the search space.

6. Conclusions
In this paper, we presented a method for learning hybrid BNs, defined as BNs containing both continuous and discrete variables. The method is based on the definition of a scoring metric that makes use of artificial neural networks as probability estimators. The use of the ANN-based scoring metric allows us to search the space of BN structures without the need for discretizing the continuous variables. We compared this method to the alternative of learning the BN structure based on discretized data. The main purpose of this work was to test whether discretization would or would not degrade the accuracy of the discovered BN structure and of the parameter estimation. The experimental results presented in this paper suggest that discretization of variables permits the rapid construction of relatively high fidelity Bayesian networks when compared to a much slower method that uses continuous variables. These results do not of course rule out the possibility that we can develop faster and more accurate continu-
ous variable learning methods than the one investigated here. However, the results do lend support to discretization as a viable method for addressing the problem of learning hybrid BNs.

Acknowledgments
We thank Chris Bishop and Moises Goldszmidt for their useful comments on a preliminary version of this manuscript. This work was funded by grant IRI-9509792 from the National Science Foundation.
References
1. J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 1997. To appear.
2. C. Bishop. Mixture density networks. Technical Report NCRG/4288, Neural Computing Research Group, Department of Computer Science, Aston University, Birmingham B4 7ET, U.K., February 1994.
3. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
4. R. Bouckaert. Properties of Bayesian belief network learning algorithms. In Proceedings of the 10th Conference on Uncertainty in AI, pages 102-109, San Francisco, California, 1994. Morgan Kaufmann.
5. J. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neuro-computing: Algorithms, Architectures and Applications. Springer-Verlag, 1989.
6. W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the 7th Conference on Uncertainty in AI, pages 52-60, 1991.
7. W. L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(3), 1996.
8. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
9. D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. In Proceedings of the 5th Workshop on Artificial Intelligence and Statistics, pages 112-128, January 1995.
10. D. M. Chickering and D. Heckerman. Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the 12th Conference on Uncertainty in AI, 1996.
11. G. F. Cooper. NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge. Technical Report HPP-84-48, Dept. of Computer Science, Stanford University, Stanford, California, 1984.
12. G. F. Cooper and E. Herskovits. A Bayesian method for constructing Bayesian belief networks from databases. In Proceedings of the 7th Conference of Uncertainty in AI, pages 86-94, Los Angeles, CA, 1991.
13. G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
14. A. Dawid. Present position and potential developments: Some personal views. Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society A, 147:278-292, 1984.
15. A. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In J. M. Bernardo et al., editors, Bayesian Statistics 4, pages 109-125. Oxford University Press, 1992.
16. M. Druzdzel, L. C. van der Gaag, M. Henrion, and F. Jensen, editors. Building probabilistic networks: where do the numbers come from?, IJCAI-95 Workshop, Montreal, Quebec, 1995.
17. B. Everitt and D. Hand. Finite Mixture Distributions. Chapman and Hall, 1981.
18. U. Fayyad and R. Uthurusamy, editors. Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Quebec, 1995. AAAI Press.
19. N. Friedman and M. Goldszmidt. Discretization of continuous attributes while learning Bayesian networks. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning, pages 157-165, 1996.
20. D. Geiger and D. Heckerman. Learning Gaussian networks. In R. Lopez de Mantaras and D. Poole, editors, Proceedings of the 10th Conference of Uncertainty in AI, San Francisco, California, 1994. Morgan Kaufmann.
21. D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed networks with hidden variables. Technical Report MSR-TR-96-07, Microsoft Research, May 1996.
22. W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
23. D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
24. D. Madigan, S. A. Andersson, M. D. Perlman, and C. T. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Communications in Statistics - Theory and Methods, 25, 1996.
25. D. Madigan, A. E. Raftery, C. T. Volinsky, and J. A. Hoeting. Bayesian model averaging. In AAAI Workshop on Integrating Multiple Learned Models, 1996.
26. C. Merz and P. Murphy. Machine learning repository. University of California, Irvine, Department of Information and Computer Science, 1996. http://www.ics.uci.edu/mlearn/MLRepository.html.
27. M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993.
28. S. Monti and G. F. Cooper. Learning Bayesian belief networks with neural network estimators. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9: Proceedings of the 1996 Conference, 1997.
29. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1988.
30. A. E. Raftery. Bayesian model selection in social research (with discussion). Sociological Methodology, pages 111-196, 1995.
31. A. E. Raftery. Hypothesis testing and model selection. In Gilks et al. [22], chapter 10, pages 163-188.
32. M. Richard and R. Lippman. Neural network classifiers estimate Bayesian a-posteriori probabilities. Neural Computation, 3:461-483, 1991.
33. C. Robert. Mixtures of distributions: Inference and estimation. In Gilks et al. [22], chapter 24, pages 441-464.
34. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
35. R. Shachter. Intelligent probabilistic inference. In L. N. Kanal and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence 1, pages 371-382, Amsterdam, North-Holland, 1986.
36. D. Spiegelhalter, A. Dawid, S. Lauritzen, and R. Cowell. Bayesian analysis in expert systems. Statistical Science, 8(3):219-283, 1993.
37. B. Thiesson. Accelerated quantification of Bayesian networks with incomplete data. In Fayyad and Uthurusamy [18], pages 306-311.
38. D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
A MEAN FIELD LEARNING ALGORITHM FOR UNSUPERVISED NEURAL NETWORKS
LAWRENCE SAUL
AT&T Labs - Research, 180 Park Ave D-130, Florham Park, NJ 07932
AND
MICHAEL JORDAN
Massachusetts Institute of Technology, Center for Biological and Computational Learning, 79 Amherst Street, E10-034D, Cambridge, MA 02139

Abstract. We introduce a learning algorithm for unsupervised neural networks based on ideas from statistical mechanics. The algorithm is derived from a mean field approximation for large, layered sigmoid belief networks. We show how to (approximately) infer the statistics of these networks without resort to sampling. This is done by solving the mean field equations, which relate the statistics of each unit to those of its Markov blanket. Using these statistics as target values, the weights in the network are adapted by a local delta rule. We evaluate the strengths and weaknesses of these networks for problems in statistical pattern recognition.
1. Introduction

Multilayer neural networks trained by backpropagation provide a versatile framework for statistical pattern recognition. They are popular for many reasons, including the simplicity of the learning rule and the potential for discovering hidden, distributed representations of the problem space. Nevertheless, there are many issues that are difficult to address in this framework. These include the handling of missing data, the statistical interpretation
of hidden units , and the problem of unsupervised learning , where there are no explicit error signals . One way to handle these problems is to view these networks as probabilistic models . This leads one to consider the units in the network as random variables , whose statistics are encoded in a joint probability distribution . The learning problem , originally one of function approximation , now becomes one of density estimation under a latent variable model ; the objective function is the log-likelihood of the training data . The probabilis tic semantics in these networks allow one to infer target values for hidden units , even in an unsupervised setting . The Boltzmann machine [l ] was the first neural network to be endowed with probabilistic semantics . It has a simple Hebb-like learning rule and a fully probabilistic interpretation as a Markov random field . A serious problem for Boltzmann machines, however, is computing the statistics that appear in the learning rule . In general , one has to rely on approximate methods , such as Gibbs sampling or mean field theory [2] , to estimate these statistics ; exact calculations are not tractable for layered networks . Experience has shown , however , that sampling methods are too slow , and mean field approximations too impoverished [3] , to be used in this way. A different approach has been to recast neural networks as layered belief networks [4] . These networks have a fully probabilistic interpretation as directed graphical models [5, 6] . They can also be viewed as top-down generative models for the data that is encoded by the units in the bot tom layer [7, 8, 9] . Though it remains difficult to compute the statistics of the hidden units , the directionality of belief networks confers an impor tant advantage . In these networks one can derive a simple lower bound on the likelihood and develop learning rules based on maximizing this lower bound . The Helmholtz machine [7, 8] was the first neural network to put this idea into practice . It uses a fast , bottom -up recognition model to compute the statistics of the hidden units and a simple stochastic learning rule , known as wake-sleep, to adapt the weights . The tradeoff for this simplicity is that the recognition model cannot handle missing data or support certain types of reasoning , such as explaining away[5] , that rely on top -down and bottom -up processing. In this paper we consider an algorithm based on ideas from statistical mechanics . Our lower bound is derived from a mean field approximation for sigmoid belief networks [4] . The original derivation [lO] of this approx imation made no restrictions on the network architecture or the location of visible units . The purpose of the current paper is to tailor the approx imation to networks that represent hierarchical generative models. These are multilayer networks whose visible units occur in the bottom layer and
whose topmost layers contain large numbers of hidden units . The mean field approximation that emerges from this specialization is interesting in its own right . The mean field equations, derived by maxi mizing the lower bound on the log-likelihood , relate the statistics of each unit to those of its Markov blanket . Once estimated , these statistics are used to fill in target values for hidden units . The learning algorithm adapts the weights in the network by a local delta rule . Compact and intelligi ble, the approximation provides an attractive computational framework for probabilistic modeling in layered belief networks . It also represents a vi able alternative to sampling , which has been the dominant paradigm for inference and learning in large belief networks . While this paper builds on previous work , we have tried to keep it self-contained . The organization of the paper is as follows . In section 2, we examine the modeling problem for unsupervised networks and give a succinct statement of the learning algorithm . (A full derivation of the mean field approximation is given in the appendix .) In section 3, we aBsessthe strengths and weaknesses of these networks based on experiments with handwritten digits . Finally , in section 4, we present our conclusions , as well as some directions for future research. 2.
Generative models
Suppose we are given a large sample of binary (0/1) vectors, then asked to model the process by which these vectors were generated. A multilayer network (see Figure 1) can be used to parameterize a generative model of the data in the following way. Let S_i^l denote the ith unit in the lth layer of the network, h_i^l its bias, and J_ij^{l-1} the weights that feed into this unit from the layer above. We imagine that each unit represents a binary random variable whose probability of activation, in the data-generating process, is conditioned on the units in the layer above. Thus we have:
    P(S_i^l = 1 \mid S^{l-1}) = \sigma\Bigl( \sum_j J_{ij}^{l-1} S_j^{l-1} + h_i^l \Bigr) ,    (1)
where σ(z) = [1 + e^{-z}]^{-1} is the sigmoid function. We denote by σ_i^l the squashed sum of inputs that appears on the right hand side of eq. (1). The joint distribution over all the units in the network is given by:
    P(S) = \prod_{li} \bigl( \sigma_i^l \bigr)^{S_i^l} \bigl( 1 - \sigma_i^l \bigr)^{1 - S_i^l} .    (2)
A neural network, endowed with probabilistic semantics in this way, is known as a sigmoid belief network [4]. Layered belief networks were proposed as hierarchical generative models by Hinton et al [7].
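Because the model of eqs. (1)-(2) is a top-down generative process, sampling from it is straightforward. The following sketch is ours, not the authors'; it assumes `weights[l]` maps layer l to layer l+1 (the top layer is l = 0) and `biases[l]` matches layer l.

    import numpy as np

    def sample_sigmoid_belief_net(weights, biases, rng):
        """Top-down ancestral sampling from a layered sigmoid belief network."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        s = (rng.random(len(biases[0])) < sigmoid(biases[0])).astype(float)
        layers = [s]
        for W, b in zip(weights, biases[1:]):
            p = sigmoid(W @ layers[-1] + b)      # Equation (1)
            layers.append((rng.random(len(b)) < p).astype(float))
        return layers                            # last entry = visible (bottom) units

    # usage: rng = np.random.default_rng(0); sample = sample_sigmoid_belief_net(Js, hs, rng)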
Figure 1. A multilayer sigmoid belief network that parameterizes a generative model for the units in the bottom layer.
The goal of unsupervised learning is to model the data by the units in the bottom layer. We shall refer to these units as visible (V) units, since the data vectors provide them with explicit target values. For the other units in the network, the hidden (H) units, appropriate target values must be inferred from the probabilistic semantics encoded in eq. (2).
2.1. MAXIMUM LIKELIHOOD ESTIMATION

The problem of unsupervised learning, in this framework, is essentially one of density estimation. One can parameterize a family of distributions over the visible units by the weights J_ij^l and biases h_i^l of the network; the learning problem is to find weights and biases that make the marginal distribution over the visible units match the statistics of the training data.1 A simple approach to this problem is maximum likelihood estimation, in which one attempts to maximize the log-likelihood of a large sample of training vectors. The marginal likelihood of each data vector is obtained by summing the joint distribution over the hidden units:

    P(V) = \sum_{H} P(H, V) ,    (3)

where by definition P(H, V) = P(S), and H and V denote the hidden and visible units of the network. We can derive local learning rules by computing the gradients of the log-likelihood with respect to the weights and biases. For each data vector, this gives the on-line updates:

    \Delta J_{ij}^{l} \propto E\bigl[ (S_i^{l+1} - \sigma_i^{l+1}) \, S_j^{l} \bigr] ,    (4)
    \Delta h_i^{l} \propto E\bigl[ S_i^{l} - \sigma_i^{l} \bigr] ,    (5)
1 For simplicity of exposition, we do not consider forms of regularization (e.g., penalized likelihoods, cross-validation) that may be necessary to prevent overfitting.
where E[...] denotes an expectation with respect to the conditional distribution P(H|V). Note that the updates take the form of a delta rule, with unit activations σ_i^l being matched to target values S_i^l. Many authors [4, 18] have noted the associative, error-correcting nature of gradient-based learning rules in belief networks.
2.2. MEAN FIELD LEARNING

In general, it is intractable [11, 12] to calculate the likelihood in eq. (3) or the statistics in eqs. (4-5). It is also time-consuming to estimate them by sampling from P(H|V). One way to proceed is based on the following idea [7]. Suppose we have an approximate distribution, Q(H|V) ~ P(H|V). Using Jensen's inequality, we can form a lower bound on the log-likelihood from:
    \ln P(V) \geq \sum_{H} Q(H|V) \, \ln\!\left[ \frac{P(H, V)}{Q(H|V)} \right] .    (6)
If this bound is eaEYto compute , then we can derive learning rules baEed on maximizing the bound . Though one cannot guarantee that such learn ing rules always increase the actual likelihood , they provide an efficient alternative to implementing the learning rules in eqs. (4- 5) . Our choice of Q (H IV ) is motivated by ideas from statistical mechanics. The mean field approximation [13] is a general method for estimating the statistics of large numbers of correlated variables . The starting point of the mean field approximation is to consider factorized distributions of the form :
    Q(H|V) = \prod_{li \in H} \bigl( \mu_i^l \bigr)^{S_i^l} \bigl( 1 - \mu_i^l \bigr)^{1 - S_i^l} .    (7)
The parameters 1.L1are the mean values of sf under the distribution Q (HIV ) , and they are chosen to maximize the lower bound in eq. (6) . A full derivation of the mean field theory for these networks , starting from eqs. (6) and (7) , is given in the appendix . Our goal in this section , however , is to give a succinct statement of the learning algorithm . In what follows , we therefore present only the main results , along with a number of useful intuitions . For these networks , the mean field approximation works by keeping track of two parameters , { 1.L1,~f } for each unit in the network . Roughly speaking , these parameters are stored as approximations to the true statis tics of the hidden units : 1.L1~ E[Sf ] approximates the mean of sf , while ~f ~ E [O'f] approximates the average value of the squashed sum of inputs . Though only the first of these appears explicitly in eq. (7) , it turns out that both are needed to compute a lower bound on the log-likelihood . The values of { J.Lf, ~f } depend on the states of the visible units , ~ well ~ the weights and biases of the network . They are computed by solving the mean
field equations :
    \mu_i^l = \sigma\Bigl[ \sum_j J_{ij}^{l-1} \mu_j^{l-1} + h_i^l + \sum_j J_{ji}^{l} \bigl( \mu_j^{l+1} - \xi_j^{l+1} \bigr) - \tfrac{1}{2} \bigl( 1 - 2\mu_i^l \bigr) \sum_j \bigl( J_{ji}^{l} \bigr)^2 \xi_j^{l+1} \bigl( 1 - \xi_j^{l+1} \bigr) \Bigr] ,    (8)

    \xi_i^l = \sigma\Bigl[ \sum_j J_{ij}^{l-1} \mu_j^{l-1} + h_i^l + \tfrac{1}{2} \bigl( 1 - 2\xi_i^l \bigr) \sum_j \bigl( J_{ij}^{l-1} \bigr)^2 \mu_j^{l-1} \bigl( 1 - \mu_j^{l-1} \bigr) \Bigr] .    (9)
These equations couple the parameters of each unit to those in adjacent layers . The terms inside the brackets can be viewed as effective influences (or " mean fields " ) on each unit in the network . The reader will note that sigmoid belief networks have twice as many mean field parameters as their undirected counterparts [2] . For this we can offer the following intuition . Whereas the parameters JL~ are determined by top -down and bottom -up influences, the parameters ~f are determined only by top -down influences . The distinction - essentially , one between parents and children - is only meaningful for directed graphical models. The procedure for solving these equations is fairly straightforward . Ini tial guesses for { JLf, ~f } are refined by alternating passes through the network , in which units are updated one layer at a time . We alternate these passes in the bottom -up and top -down directions so that information is propagated from the visible units to the hidden units , and vice versa. The visible units remain clamped to their target values throughout this process. Further details are given in the appendix . The learning rules for these networks are designed to maximize the bound in eq. (6) . An expression for this bound , in terms of the weights and biases of the network , is derived in the appendix ; see eq. (24) . Gradient ascent in Jfj and hf leads to the learning rules:
    \Delta J_{ij}^{l} \propto \bigl( \mu_i^{l+1} - \xi_i^{l+1} \bigr) \mu_j^{l} - J_{ij}^{l} \, \xi_i^{l+1} \bigl( 1 - \xi_i^{l+1} \bigr) \mu_j^{l} \bigl( 1 - \mu_j^{l} \bigr) ,    (10)
    \Delta h_i^{l} \propto \mu_i^{l} - \xi_i^{l} .    (11)
Comparing these learning rules to eqs. (4- 5) , we see that the mean field parameters fill in for the statistics of sf and o-f . This is, of course, what makes the learning algorithm tractable . Whereas the statistics of P (HIV ) cannot be efficiently computed , the parameters { J.Lf, ~f } can be found by solving the mean field equations . We obtain a simple on-line learning algorithm by solving the mean field equations for each data vector in the training set, then adjusting the weights by the learning rules , eqs. (10) and (11) . The reader may notice that the rightmost term of eq. ( 10) has no counterpart in eq. (4) . This term , a regularizer induced by the mean field approximation , causes Jfj to be decayed according to the mean-field statistics
of σ_i^{l+1} and S_j^l. In particular, the weight decay is suppressed if either ξ_i^{l+1} or μ_j^l is saturated near zero or one; in effect, weights between highly correlated units are burned in to their current values.
digits to evaluate the strengths
and weaknessesof these networks. The database[16] was constructed from NIST Special Databases 1 and 3. The examples in this database were deslanted , downsampled , and thresholded to create 10 X 10 binary images. There were a total of 60000 examples for training and 10000 for testing ; these were divided roughly equally among the ten digits ZEROto NINE. Our
experiments had several goals: (i) to evaluate the speed and performance of the mean field learning algorithm ; (ii) to assessthe quality of multilayer networks as generative models; (iii ) to seewhether classifiersbasedon generative models work in high dimensions; and (iv) to test the robustnessof these classifiers with respect to missing data . We used the mean field algorithm from the previous section to learn generative models for each digit . The generative models were parameterized by four -layer networks with 4 x 12 x 36 x 100 architectures . Each network was
trained by nine passes2through the training examples. Figure 2 shows a typical plot of how the scorecomputed from eq. (6) increasedduring train ing . To evaluate the discriminative
capabilities
of these models , we trained
ten networks , one for each digit , then used these networks to classify the images in the test set . The test images were labeled by whichever
network
assigned them the highest likelihood score, computed from eq. (6) . Each of these experiments required about nineteen CPO hours on an SGI R10000 , or roughly 0.12 seconds of processing time per image per network . We con-
ducted five such experiments; the error rates were 4.9%(x2 ), 5.1%(x2 ), and 5.2% . By comparison , the error rates3 of several k-nearest neighbor al-
goriths were: 6.3% (k == 1), 5.8% (k == 3), 5.5% (k == 5), 5.4% (k == 7) , and 5.5% (k == 9) . These results show that the networks have learned noisy but essentially accurate models of each digit class. This is confirmed by look ing at images sampled from the generative model of each network ; some of these are shown in figure 3. One advantage of generative models for classification is the seamless handling of missing data . Inference in this case is simply performed on the 2The first pass through the training examples was used to initialize the biases of the bottom layer ; the rest were used for learning . The learning rate followed a fixed schedule : 0.02 for four epochs and 0 .005 for four epochs . 3All the error rates in this paper apply to experiments with 10 x 10 binary images .
The best backpropagation networks(16) , which exploit prior knowledge and operate on 20 x 20 greyscale images , can obtain error rates less than one percent .
Figure 2. Plot of the lower bound on the log-likelihood , averagedover training patterns , versus the number of epochs, for a 4 x 12 x 36 x 100 network trained on the dig-it TWO. The score has been normalized
by 100 x ln 2.

Figure 3. Synthetic images sampled from each digit's generative model.
pruned network in which units corresponding to missing pixels have been removed (i .e., marginalized ) . We experimented by randomly labeling a certain fraction , f , of pixels in the test set as missing , then measuring the number of classification errors versus f . The solid line in figure 4 shows a plot of this curve for one of the mean field classifiers. The overall perfor mance degrades gradually from 5% error at f == a to 12% error at f == 0.5. One can also compare the mean field networks to other types of generative models . The simplest of these is a mixture model in which the pixel values (within each mixture component ) are conditionall .y distributed as independent binary random variables . Models of this type can be trained by an Expectation -Maximization (EM ) algorithm [19] for maximum likelihood estimation . Classification via mixture models was investigated in a separate set of experiments . Each experiment consisted of training ten mixture models , one for each digit , then using the mixture models to classify the
Figure 4. Plot of classification error rate versus fraction of missing pixels in the test set. The solid curve gives the results for the mean field classifier; the dashed curve, for the mixture model classifier.
digits in the test set. The mixture models had forty mixture components and were trained by ten iterations of EM. The classification error rates in five experiments were 5.9% (x3) and 6.2% (x2); the robustness of the best classifier to missing data is shown by the dashed line in figure 4. Note that while the mixture models had roughly the same number of free parameters as the layered networks, the error rates were generally higher. These results suggest that hierarchical generative models, though more difficult to train, may have representational advantages over mixture models.

4. Discussion
The trademarks of neural computation are simple learning rules , local message-passing, and hierarchical distributed representations of the prob lem space. The backpropagation algorithm for multilayer networks showed that many supervised learning problems were amenable to this style of computation . It remains a challenge to find an unsupervised learning algorithm with the same widespread potential . In this paper we have developed a mean field algorithm for unsuper vised neural networks . The algorithm captures many of the elements of neural computation in a sound , probabilistic framework . Information is propagated by local message-passing, and the learning rule- derived from a lower bound on the log-likelihood - combines delta -rule adaptation with weight decay and burn -in . All these features demonstrate the advantages of tailoring a mean field approximation to the properties of layered networks . It is worth comparing our approach to methods based on Gibbs sampling [4] . One advantage of the mean field approximation is that it enables one to compute a lower bound on the marginal likelihood , P (V ) . Estimat -
550
LAWRENCE SAULANDMICHAEL JORDAN
ing these likelihoods by sampling is not so straightforward ; indeed , it is considerably harder than estimating the statistics of individual units . In a recent study , Frey et al [15] reported that learning by Gibbs sampling Wag an extremely slow process for sigmoid belief networks . Mean field algorithms are evolving . The algorithm in this paper is considerably faster and easier to implement than our previous one[10, 15] . There are several important areas for future research. Currently , the overall computation time is dominated by the iterative solution of the mean field equations . It may be possible to reduce processing times by fine tun ing the number of mean field updates or by training a feed-forward network (i .e., a bottom -up recognition model [7, 8] ) to initialize the mean field parameters close to solutions of eqs. (8- 9) . In the current implementation , we found the processing times (per image) to scale linearly with the number of weights in the network . An interesting question is whether mean field algorithms can support massively parallel implementations . With better algorithms come better architectures . There are many possible elaborations on the use of layered belief TLetworksas hierarchical generative models . Continuous -valued units , as opposed to binary -valued ones, would help to smooth the output of the generative model . We have not exploited any sort of local connectivity between layers , although this structure is known to be helpful in supervised learning [16] . An important consider ation is how to incorporate prior knowledge about the data (e.g., transla tion / rotation invariance [20] ) into the network . Finally , the synthetic images in figure 3 reveal an inherent weakness of top -down generative models ; while these models require an element of stochasticity to model the variability in the data , they lack a feedback mechanism (i .e., relaxation [21]) to clean up noisy pixels . These extensions and others will be necessary to realize the full potential of unsupervised neural networks . ACKNOWLEDGEMENTS The authors acknowledge useful discussions with T . Jaakkola , H . Seung, P. Dayan , G . Hinton , and Y . LeCun and thank the anonymous reviewers for many helpful suggestions .
A . Mean field approximation In this appendix we derive the mean field approximation for large , layered sigmoid belief networks . Starting from the factorized distribution for Q (HIV ) , eq. (7) , our goal is to maximize the lower bound on the log-
A MEAN FIELD ALGORITHM FOR UNSUPERVISED LEARNING
551
likelihood , eq. (6) . This bound consists of the difference between two terms :
InP (V ) ~ [ - ~ Q (HIV ) lnQ (HIV )] - [ - ~
Q (HIV ) lnp (H , V )] . (12) The first term is simply the entropy of the mean field distribution . Because Q (HIV ) is fully factorized , the entropy is given by :
- LHQ(HIV )InQ(HIV ) = - ilEH L [JlfInJLf +(1- JLf )In(l - J1f )J
(13)
We identify the second term in eq. (12) as (minus ) the mean field energy ; the name arises from interpreting P (H , V ) == eln P(H,V) as a Boltzmann distribution . Unlike the entropy , the energy term in eq. (12) is not so straight forward . The difficulty in evaluating the energy stems from the form of the joint distribution , eq. (2) . To see this , let
z~ 1.==~ LI J~ 1.)~lS~ )-l + h~ 1. )
(14)
denote theweighted sumof inputsintounitsf. Fromeqs.(1) and(2), we canwritethejoint distribution in sigmoid beliefnetworks as: InP(S)
-
-
- Lfi {sf In[1+ e-zf] + (1- Sf) In[1+ ezf ]}
(15)
Lii {Sfz ! - In[1+ezf ]} .
(16)
The difficulty in evaluating the mean field energy is the logarithm on the right hand side of eq. (16) . This term makes it impossible to perform the averaging of In P (S) in closed form , even for the simple distribution , eq. (7) . Clearly , another approximation is needed to evaluate (In [1 + ezfJ) , averaged over the distribution , Q (HIV ) . We can make progress by studying the sum of inputs , zl , as a random variable in its own ri .ght. Under the distribu tion Q (HIV ) , the right hand side of eq. (14) is a weighted sum of indepen dent random variables with means J.L1- 1 and variances J.L1- 1(1 - J.L1- 1) . The number of terms in this sum is equal to the number of hidden units in the (f - l )th layer of the network . In large networks , we expect the statistics of this sum- or more precisely, the mean field distribution Q (zfIV )- to be governed by a central limit theorem . In other words , to a very good approx imation , Q (zf IV ) assumes a normal distribution with mean and variance :
(z~ '") == ~ L..., J.~ '"J~1,/J .,.J~-1+ h~ '", J
(17)
LAWRENCE SAUL AND MICHAEL JORDAN
552
(18)
((8zf)2) = ~J (Jfj-l)2J .L~-1(1- J.L~-1).
In what follows, we will use the approximation that Q(zfIV ) is Gaussian to simplify the mean field theory for sigmoid belief networks. The approximation is well suited to layered networks where each unit receivesa large number of inputs from the (hidden) units in the preceding layer. The asymptotic form of Q (zfIV ) and the logarithm term in eq. (16) motivate us to consider the following lemma. Let z denote a Gaussian random variable with mean (z) and variance (8z2) , and consider the expected value, (In[I + eZ]) . For any real number ~, we can form the upper bound[22]:
-~Z(l + eZ)]), (In[1 + eZ]) == (In[e~Ze
(19) (20) (21)
== ~(z) + (In[e-~Z+ e(l-~)Z]), ::; ~(Z) + In(e-~Z+ e(l-~)Z),
where the last line follows from Jensen's inequality . Since z is Gaussian , it is straightforward to perform the averages on the right hand side. This gives . us an upper bound on (In [1 + eZ]) expressed in terms of the mean and varIance :
( In [ l +
In
what
follows
( In [ l
+
that
appear
the
, we
will
ez1 ] ) . Recall
value
side
eZ ] ) ~
of
in of
~
the
~ ~ 2 ( 8z2
use
eq .
mean
field
that
makes
eq . ( 22 ) is
minimized
Eq
.
( 23 ) (z ) 0'
has
and
a
unique
( 8z2 ) ,
~
f -
in
eq . ( 22 ) . We
[(z ) +
! (1 can
it
; we
bound
2~)
5z2 ) ]
understand
as
~ (1 -
in solved is
approximate are
are tight
the the
exact
value
intractable
therefore as
( 22 )
motivated
possible
of
averages
. The
to right
find hand
:
[(z ) +
easily
to
e { Z } + ( 1 - 2 ~ ) { 8z2 } / 2 ] .
these
energy
solution is
[1 +
that
when
0'
In
bound
( 16 )
the
~ =
for
this
from
) +
the
2 ~ ) ( 8z2 ) ]
interval by
guaranteed eq . ( 23 ) as
~
iteration to a self
(23)
.
E ;
in
tighten - consistent
[ 0 , 1 ] . Given fact the
,
the
values iteration
upper
bound
approximation
for computing ~ ~ (a (z)) where z is a Gaussian random variable . To see this , consider the limiting behaviors : (a (z)) - + a ((z)) as (8z2) - + 0 and (a (z)) - + ! as (8z2) - + 00. Eq . (23) captures both these limits and interpo lates smoothly between them for finite (t5z2) . Equipped with the lemma , eq. (22) , we can proceed to deal with the intractable terms in the mean field energy. Unable to compute the average
A MEAN FIELD ALGORITHM FOR UNSUPERVISEDLEARNING 553 over Q(HIV ) exactly, we instead settle for the tightest possible bound. This is done by introducing a new mean field parameter, ~f , for each unit in the network, then substituting ~f and the statistics of zf into eq. (22). Note that these terms appear in eq. (6) with an overall minus sign; thus, to the extent that Q (zi IV ) is well approximated by a Gaussian distribution , the upper bound in eq. (22) translates4 into a lower bound on the log likelihood. Assembling all the terms in eq. (12) , we obtain an objective function for the mean field approximation:
InP(V) ~ +
-
l::: [J1 ,f In/.l1+(1- /.l7)In(1- /.l7)] + l::: ) ifEH ijfJfj-1/.l7/.l~-1(24 l::: /.l1- _ 2l::: 1 (~f)2(Jfj-l)2/.l~-l (1- /.l~-1) if h1 ijf Llin t+'"L.."",J[JtJ ,Ji-I+!.(1 2 -2~t~)(JtJ ,J~-1(1-J1 ,J1-1)]} ~if { 1+eh~ ~~1J1 ~~1)2J1
The mean field parameters are chosen to maximize eq. (24) for different settings of the visible units . Equating their gradients to zero gives the mean field equations , eqs. (8- 9) . Likewise , computing the gradients for J& and h1 gives the learning rules in eqs. (10- 11) . The mean field equations are solved by finding a local maximum of eq. (24) . This can be done in many ways. The strategy we chose was cyclic stepwise ascent- fixing all the parameters except one, then locating the value of that parameter that maximizes eq. (24) . This procedure for solving the mean field equations can be viewed as a sequence of local messagepassing operations that " match " the statistics of each hidden unit to those of its Markov blanket [5] . For the parameters ~f , new values can be found by iterating eq. (9) ; it is straightforward to show that this iteration always leads to an increase in the objective function . On the other hand , iterating eq. (8) for fl1 does not always lead to an increase in eq. (24) ; hence, the optimal values for flf cannot be found in this way. Instead , for each update , one must search the interval 1L1E [0, 1] using some sort of bracketing procedure [17] to find a local maximum in eq. (24) . This is necessary to ensure that the mean field parameters converge to a solution of eqs. (8- 9) .
4Our earlier work [lO] showed how to obtain a strict lower bound on the log likelihood ; i .e., the earlier work made no appeal to a Gaussian approrimation for Q(zfIV ) . For the networks considered here, however, we find the difference between the approrimate bound and the strict bound to be insignificant in practice. Moreover, the current algorithm has advantages in simplicity and interpretability .
554
LAWRENCE SAULAND MICHAELJORDAN
References 1.
D . Ackley , G . Hinton , and T . Sejnowski . A learning algorithm
for Boltzmann
ma -
chines. Cognitive Science 9: 147- 169 (1985). 2. 3.
C . Peterson and J . R . Anderson . A mean field theory learning algorithm for neural networks . Complex Systems 1:995- 1019 ( 1987) . C . Galland . The limitations of deterministic Boltzmann machine learning . Network 4 :355 - 379 .
4 .
learning of belief networks . Artificial Reasoning in Intelligent
Intelligence 56 :71- 113
Systems . Morgan
Kaufmann : San
Mateo , CA ( 1988). 6. S. Lauritzen . Graphical Models. Oxford University Press: Oxford (1996).
II _
II
5.
R . Neal . Connectionist ( 1992) . J . Pearl . Probabilistic
7.
. a N
8.
for unsuper -
vised neural networks. Science 268 :1158- 1161 (1995). P. Dayan , G . Hinton , R . Neal , and R . Zemel . The Helmholtz
machine . Neural Com -
putation 7:889- 904 ( 1995). M . Lewicki and T . Sejnowski . Bayesian unsupervised learning of higher order struc ture . In M . Mozer , M . Jordan , and T . Petsche , eds. Advances in Neural Information
Processing Systems 9: . MIT Press: Cambridge (1996).
10 .
L . Saul , T . Jaakkola , and M . Jordan . Mean field theory for sigmoid belief networks .
. t~
. '). 00 ~ 0 ~
9.
G . Hinton , P. Dayan , B . Frey , and R . Neal . The wake-sleep algorithm
G . Cooper . Computational complexity of probabilistic inference using Bayesian belief networks . Artificial Intelligence 42 :393-405 ( 1990) . P. Dagum and M . Luby . Approximately probabilistic reasoning in Bayesian belief
Journal of Artificial Intelligence Research4:61- 76 (1996). 11 .
. ~ ~
12 . 13 . 14 .
15.
networks is NP-hard. Artificial Intelligence 60 :141- 153 (1993) . G. Parisi. Statistical Field Theory. Addison-Wesley: Redwood City (1988) . J . Hertz , A . Krogh , and R .G . Palmer . Introduction
to the Theory of Neural Com -
putation . Addison-Wesley: Redwood City (1991). B . Frey , G . Hinton , and P. Dayan . Does the wake-sleep algorithm produce good density estimators ? In D . Touretzky , M . Mozer , and M . Hasselmo , eds. Advances in Neural Information Processing Systems 8 :661-667 . MIT Press : Cambridge , MA ( 1996) . Y . LeCun
, L . Jackel
, L . Bottou
, A . Brunot
, C . Cortes
, J . Denker
, H . Drucker
, I.
Guyon , U . Muller , E . Sackinger , P. Simard , and V . Vapnik . Comparison of learning algorithms for handwritten digit recognition . In Proceedings of / CANN '95. W . Press , B . Flannery , S. Teukolsky , and W . Vetterling . Numerical Recipes . Cam -
bridge University Press: Cambridge (1986). S. Russell , J . Binder , D . Koller , and K . Kanazawa . Local learning in probabilistic networks with hidden variables . In Proceedings of I J CA / - 95. A . Dempster , N . Laird , and D . Rubin . 1977. Maximum likelihood from incomplete data via the EM algorithm . Journal of the Royal Statistical Society B39 : 1- 38. P. Simard , Y . LeCun , and J . Denker . Efficient pattern recognition using a new transformation
distance
Neural Information
. In
S . Hanson
, J . Cowan
, and
C . Giles
Processing Systems 5 :50- 58. Morgan
, edge
Advances
in
Kaufmann : San Mateo ,
CA (1993) . S.
Geman
and
D . Geman
Bayesian restoration
. Stochastic
of images . IEEE
relaxation
Transactions
,
Gibbs
distributions
on Pattern
Analysis
,
and
the
and Ma -
chine Intelligence 6:721- 741 ( 1984). H . Seung . Annealed theories of learning . In J .-H . Dh , C . Kwon , and S. Cho , eds. Neural Networks : The Statistical Mechanics Perspective , Proceedings of the CTP -
PRSRI Joint Workshop on Theoretical Physics. World Scientific: Singapore (1995).
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS
PETER
W
. F . SMITH
Department The
of
Social
University
,
Statistics
Southampton
, ,
S09
5NH
UK
.
AND JOEWHITTAKER Department
of
University email
of .. joe
Mathematics Lancaster
. whittaker
and ,
LAl
@ lancaster
4 YF . ac
Statistics ,
UK
, .
. uk
Summary Testing that an edge can be excluded from a graphical Gaussian model is an important step in model fitting and the form of the generalised like lihood ratio test statistic for this hypothesis is well known . Herein the modified profile likelihood test statistic for this hypothesis is obtained in closed form and is shown to be a function of the sample partial correla tion . Related expressions are given for the Wald and the efficient score statistics . Asymptotic expansions of the exact distribution of this correla tion coefficient under the hypothesis of conditional independence are used to compare the adequacy of the chi-squared approximation of these and Fisher 's Z statistics . While no statistic is uniformly best approximated , it is found that the coefficient of the 0 (n - l ) term is invariant to the dimension of the multivariate Normal distribution in the case of the modified profile likelihood and Fisher 's Z but not for the other statistics . This underlines the importance of adjusting test statistics when there are large numbers of variables , and so nuisance parameters in the model . Similar comparisons are effected for the Normal approximation signed square-rooted versions of these statistics .
to the
Keywords .- Asymptotic expansions ; Conditional independence ; Edge exclusion ; Efficient score; Fisher 's Z ; Graphical Gaussian models ; Modified profile likelihood ; Signed square-root tests ; Wald statistic . 555
PETER W. F. SMITH ANDJOEWHITTAKER
556 1. Introduction
Dempster (1972) introduced graphical Gaussian models where the structure
of the
inverse
variance
matrix
, rather
than
the variance
matrix
itself ,
is modelled . The idea is to simplify the joint Normal distribution of p continuous random variables by testing if a particular element Wij of the p by p inverse variance matrix n can be set to zero. The remaining elements are nuisance parameters and the hypothesis is composite . Wermuth (1976) showed that fitting these models is equivalent to testing for conditional in dependence between the corresponding elements of the random vector X .
Speedand Kiiveri (1986) showedthat the test correspondsto testing if the edge connecting the vertices corresponding to Xi and X j in the conditional independence graph can be eliminated . Hence such tests are known as edge
exclusion tests. For an introduction to this material see Lauritzen (1989) or Whittaker
(1990) .
Many graphicalmodelselectionproceduresstart by makingthe (~) single edge exclusion tests , evaluating the (generalised) likelihood ratio statis tic and comparing
it to a chi -squared
distribution
. However , this is only
asymptotically correct , and may be poor , as is the case for models with discrete observations (Kreiner , 1987; Frydenburg and Jensen, 1989) . One
approach taken by Davison, Smith and Whittaker (1991) is to use the exact conditional
distribution
of a test statistic
, where
available
. However
, the
exact conditional test for edge exclusion for the graphical Gaussian case is equivalent
to the unconditional
test , and is based on the square of the sam -
ple partial correlation coefficient whose null distribution is a beta (Davison , Smith and Whittaker , 1991) . Thus in practice the exact test should be used. This statistic is the same as would be derived from a t- test for testing for a zero coefficient in multiple regression . Fisher 's Z transformation is also based on the sample partial correlation coefficient and allows Normal ta bles to be used to a reasonable degree of approximation . It is of interest to assess which
of several competing
best approximated
statistics
has an exact distribution
by the chi-squared over a varying number of nuisance
parameters .
To achieve this aim , explicit expressions for the modified profile likeli hood
ratio , Wald
and
efficient
inverting the information evant
submatrix
. These
score
statistics
are obtained
. This
involves
matrix and calculating the determinant of a reltest
statistics
turn
out
to be functions
of the sam -
ple partial correlation coefficient and it is natural to compare them with Fisher
' s Z transformation
In Section
.
2, after inverting
the information
matrix , the Wald and ef-
ficient score test statistics for excluding a single edge from a graphical Gaussian
model
are constructed
. In Section
3 a test
based
on the modified
558
PETERW. F. SMITHANDJOEWHITTAKER
where a = vec(E) is (a linear function of) the mean value parameter. The observed (and expected) information matrix is
(} 1 (} I = - 8wT U(w) = - -J a. 2 8wT
(3)
The inverse , K , of this information matrix is required to compute the test statistics
.
Consider more generally a linear exponential family model with pdimensional canonical parameter (), canonical statistics t and log-likelihood (}Tt (x ) - K((}) - h (x ) (Barndorff -Nielsen , 1988, p87). The information matrix for the canonical
parameter , Io , can be expressed as ()
If } = ""ijijf T(()) ,
(4)
wherer ==/eK ,(O) is themean -valuemapping . Asthemean -valuemapping is bijective (Barndorff-Nielsen, 1978, pI21),
8T.8TT 8f}=I' 80T I-I - aa(}T T 8--
the identity matrix . Hence the inverse of the information matrix can be computed from
This
(
result
1978
( 5
)
and
corollary
,
is
)
does
in
a
not
appear
well
different
- known
expression
To
( 3
find
)
to
The
the
partial
A
( IO
)
though
a
Amari
(
form
1982
appears
,
1985
in
,
Efron
p106
)
.
One
inverse
K
=
I
-
I
=
for
the
-
l
.
graphical
Gaussian
model
apply
( 6
)
( 5
)
give
K
trix
in
that
IT
to
,
,
(5)
derivatives
=
{
ail
}
of
with
=
the
respect
known (seeGraybill, 1983, as follows:
-
2
~
WTJ
elements
to
Corollary
the
of
-
a
elements
10
1
.
p -
dimensional
of
. 8
. 10
or
its
symmetric
inverse
McCullagh
B
,
ma
=
1987
{
)
brs
and
}
-
are
are
if r = s (airajs+ aisajr ) 1 .f r -.L ~ ={ -- airajs -;- s
.. t,J,r,s= l ,...,p. Hence theinverse information matrixofwcanbeobtained explicitly . So, in a sample ofsizen, thecovariance between themaximum likelihood estimates (mles ) ofanytwoelementsof the inverse variance matrix , asymptotically , is cov (GJijWrs) = .!. n (WirWjS + WisWjr ) .
(7)
EDGEEXCLUSION TESTSFORGRAPHICAL GAUSSIAN MODELS559 Cox
and
Wermuth
for graphical
( 1990 ) obtained
Gaussian
models
the
inverse
by another
of the
method
information
using
a result
matrix of Isserlis
( 1918 ) . Excluding model
the edge connecting
corresponds
alternative
vertices
accepting
the
is H A : W12 unspecified
are nuisance single
to
parameters
edge from
. The
a graphical
1 and 2 in a graphical
null
. The
hypothesis remaining
likelihood
ratio
Gaussian
distinct
test
model
Gaussian
H 0 : CJ )12 =
statistic
O. The
elements
of 0
for excluding
a
is
Tz = - n log ( 1 - ri2Irest ) ,
(8 )
where r121rest is the sample partial correlation of Xl the remainder X3 , . . . , Xp . The latter can be expressed of elements of the inverse variance matrix as
and X2 adjusted for in terms of the mles
-- ( W11W22 -- -- ) - 1/ 2 , r121rest = - W12 see for example efficient
score
2 .1. THE The
, Whittaker test
WALD
Wald
for the null
statistic
Tw
quadratic
approximation a single
asymptotic
( Cox
and
of the
edge from
variance
so leads
Hinkley
the Wald
and
.
of (;)12 from
to the closed
, 1974 , p314 , 323 ) based
likelihood
a graphical
varW12 and
are derived
hypothesis
TEST
excluding The
( 1990 , p189 ) . Below
statistics
(9 )
function Gaussian
equation
at
its
on
maximum
a
, for
model , is WI2 / var (W12) .
( 7 ) is
= ~ (WIIW22 + Wr2 ) , n
( 10 )
form
~
- 2 nw12 WIIW22 + Wf2
-
w
nr2 =
using
( 11 )
( 9 ) above .
2 .2 . THE The
l ~ rest 1 + r121rest
EFFICIENT
efficient
the conditional
score
SCORE test
TEST
Ts ( Cox
distribution
and
Hinkley
, 1974 , p315 , 324 ) is based
of the score statistic
for the interest
parameter
on
560
PETERW. F. SMITHANDJOEWHITTAKER
given the score statistic for the nuisance parameter, evaluated under the null hypothesis. From (2) the score is U(W12 ) = 0:12- 812, where . tilde denotes evaluation under the null hypothesis, with conditional varIance 1 2 1 -n (WIIW22+ W12 )- . (12) Evaluation of these requires estimates of W12 , Wll , W22and 0"12 under the null hypothesis. Under H 0 : W12= 0 the mles of n and ~ are W12 = 0, -- (1 - r121rest 2 ) Wii = Wii ~. = 1, 2 ail = 8ij i , j # 1, 2 -. - rl2lrest a12 = (-WIIW22 - )1/2(1 - r 2 ) + 812. 121rest
(13) (14) (15) (16)
Equation ( 13) restates the null hypothesis . Speed and Kiiveri (1986) showed that for all unconstrained Wij the files of the corresponding a ij are equal to Sij , hence (15). The other two equations are new and their proofs are given in Appendix C . The corollary of interest is the closed form expression for the score Ts
-
)2.W11W22 ... .... (1- r121rest 2 )2 n (-a12- 812 2 nr 12lrest '
(17)
using (16). 3.
The
Modified
Profile
Likelihood
Ratio
Test
The modified profile likelihood function (Barndorff-Nielsen, 1983, 1988), explicitly designed to take account of nuisance parameters , can be analyt ically expressed in the Gaussian context , and is an obvious candidate on which to base a test statistic . A number of authors have developed this work and for linear exponential
families
have calculated
the modified
pro -
file log-likelihood function : equations (10) of Cox and Reid (1987), (5.2) of Davison (1988) and (6) of Pierce and Peters (1992). Taking this as the starting point , the corresponding test statistic , in the canonical parameter isation
, is .
-
IIbbl Tm = Tl + log - . IIbbl
(18)
EDGE
The
EXCLUSION
term
Tt
is
information
parameter
{
12
}
)
and
log
been
used
is
rather
likelihood
and
;
Reid
The
(
the
)
matrix
and
to
Gaussian
model
prove
this
note
that
with
taking
III
The
last
from
p59
term
n
)
to
the
~
equals
The
ratio
on
and
test
is
for
this
-
:
,
the
,
-
p
+
-
P
~
l ) /
:
if
lp
+
,
+
l
.
l
)
I
= =
0
,
,
W
=
( wa
,
by
Wb
ordinary
)
have
modified
see
the
profile
comments
of
determinant
the
information
of
the
matrix
,
then
(
( 3
)
19
)
gives
ale
the
transformation
Theorem
~
3
. 7
;
the
,
result
modified
= =
Muirhead
the
the
where
a
.
mles
the
establishes
that
the
Here
.
is
~
1951
is
.
.
2IJII
This
Ibb
parameterisation
of
is
W12
1992
canonical
PI
( P +
aIkin
lp
so
,
of
section
Ho
( n
-
)
Jacobian
1~
this
Normal
)
the
and
of
statistic
=
2
value
dimensional
prove
( -
right
result
Tm
To
=
absolute
the
2
,
for
determinants
( Deemer
in
main
p -
the
(
18
and
Wb
the
form
(
=
b
the
Peters
evaluate
)
indexed
maximising
explicit
III
To
by
difficult
and
an
( 8
parameter
statistic
from
Pierce
at
561
parameter
interest
test
MODELS
nuisance
analytically
gives
required
graphical
the
is
lemma
information
the
obtained
latter
test
the
indexed
of
those
1987
following
a
into
version
than
directly
Cox
ratio
to
parameter
this
GAUSSIAN
- likelihood
partitioned
nuisance
in
,
GRAPHICAL
corresponding
the
that
FOR
ordinary
W
Note
for
the
submatrix
the
( =
TESTS
~
n
~
1982
,
.
profile
likelihood
underlying
distribution
is
is
1
following
)
log
(
1
-
r
identities
~ 2Irest
)
+
log
( Whittaker
(
1
1990
+
,
r
~ 2Irest
p149
,
)
169
.
( 20
)
are
)
used
..-..
lEI IIbbl
=
IIIIJCaal
and
2
~
=
1
-
r12lrest
,
lEI
where
Kaa
sponding
is
to
Equation
applying
the
the
(
lemma
submatrix
of
interest
18
)
the
inverse
parameter
is
evaluated
19
)
Wa
by
,
firstly
of
the
information
matrix
to
-
,
using
the
first
identity
. --lEI IKaal log ~ l'Ibbl -- (p+1)log~ +logjK:I -(
corre
,
followed
by
give
-
2 I ~ (p+1)log (1- T12lrest )+ ogIKaal
(21)
562
PETER W. F. SMITH AND JOE WHITTAKER
using the secondidentity and noting that the log 2Pterms cancel. The last term on the right is simplified by Kaa = WIIW22+ W?2 as already utilised in the expressions(10) and (12). So
.'"-. Kaa Kaa -
WllW22 + Wr2
-.)11(,1 -.)22 (1 - r 2 (,1 121rest)2
-
1+ r~2lrest (1 - r2121rest)28
(22)
Finally , combining (21) and (22) and substituting into (18) gives Tm
- n log(1 - r~2Irest ) + (p + 1) log(1 - r~2Irest ) + log
{ I+ri21rest } 2 2 (1 - r 12lrest)
== - (n - p - 1) log (1 - r12lrest ) + log { (1I -+ r121rest )2 } 2 r2~2lrest
(23)
== - (n - p + 1) log (1 - r ~2Irest) + log (1 + r ~2Irest). Note that (23) shows that the modified test statistic is the ordinary likelihood ratio test statistic where n has been replaced by n - p - 1, a multiplicative correction, plus another term which is the log of the ratio of the asymptotic variancesof the interest parameter evaluated under Ho and H A. The adjustment to the multiplicative term is not directly related to dimension of the nuisance parameters, although this might be expected if adjusting for the number of degreesof freedom used up by estimating these parameters. Instead this term is reduced by one more than the number of variables for which the samplepartial correction coefficient is adjusted. This is similar to the caseof linear regressionwhere the degreesof freedom are reduced: one for the mean and one for each variable included in the model. This modified test is a function of the sample partial correlation coefficient alone, which is a maximal invariant for the problem (see Davison, Smith and Whittaker , 1991); as are the tests derived in the previous section. 4. The Null Distributions Moran (1970) (seealso Hayakawa, 1975and Harris and Peers, 1980) showed that in general the likelihood ratio , Wald and efficient scorestatistics have the same asymptotic power. It has been shown here that for excluding a single edge from a graphical Gaussianmodel the four statistics: the generalised likelihood ratio , the Wald, the efficient scoreand the modified profile
EDGE
EXCLUSION
TESTS
FOR
GRAPHICAL
GAUSSIAN
MODELS
563
likelihood , are functions of the sample partial correlation coefficient , as is the test based on Fisher 's Z transformation . Consequently the tests have exactly the same power provided the null distributions are correct . Under the null hypothesis , the square of the sample partial correlation coefficient
from a sample of size n from a p- dimensional
Normal
distribution
hasa Beta(~, ~ ) distribution. Forexample , seeMuirhead(1982 , pI88). That
is 1
fu (u ) =
1 U- l / 2(1 - U)(n- p- 2)/ 2 B ( -2 ' ~.2=E) "
0 < U< 1
where u = ri21rest andB(.,.) is thebetafunction . By usingtherelevant transformation
, the
exact
null
distribution
of the
statistics
above
can
be
obtained . Following Barndorff -Nielsen and Cox (1989, Chapter 3) and expanding the log of the density functions in powers of n - l , the adequacy of the asymptotic chi-squared approximation can be assessed. Consider first the likelihood ratio test . With t == - n log (1 - u ) the density
Ii (t)
of Tl under the null hypothesis
is
1 -1/2( ( B(!2'1~.2 = .E .)u 1-U )n-p-2)/2.1-nU'
-
= nB(!2'!!.= (-t/n)}-1/2exp {-t(n- p)/2n }' 2p.){I - exp
The leadingterm of this expansioncorresponds to the XI distribution and so
1 fl (t) = 9x(t)[1+ 4(t - 1)(2p+ 1) n- lJ + O(n- 2),
t >
0,
where 9x (t ) is the density function of a XI random variable . Finally inte grating , from 0 to x , term by term , gives the expansion for the cumulative distribution function as Fl (X)
= = =
1 (X Gx (x ) + 4 (2p + 1) n - l Jo (t - l )gx (t )dt + O (n - 2) Gx (x ) - 21 (2p + 1) n - 1 ( 2X -;: ) 1/ 2 exp (- x / 2) + O (n - 2) 1 Gx (x ) - 2 (2p + 1) xgx (x ) n - 1 + O (n - 2) , x > 0,
564
PETER
W . F . SMITH
AND
JOE
WHITTAKER
TABLE 1. Coefficients of the O(n- l ) term in the asymptotic expansions of the density and distribution test
statistics
functions
of the five
.
a(t )
likelihood modified Wald score Fisher
b(x )
~(t - 1)(2p+ 1) ~(t - 1) i (t - 1)(2p+ 1) + 3t(3 - t) ~(t - 1)(2p+ 1) + t(3 - t) ~ (3 - 6t + t2)
-
~(2p+ 1) x !x ~(2p+ 1 - 3x) x ~(2p+ 1 - x) x i (x - 3) x
whereGx(x) is the distribution function of a xi randomvariable. These expressions
are of the form
fl (t ) = Fl (X) = with
9x(t ) + al(t ) 9x(t )n- l + O(n- 2) Gx(x) + bl(X) 9x(x) n- l + O(n- 2),
a and b appropriately defined . The approximations for the five test statistics
(24)
considered are of a similar
form and are displayed in Table 1. The expansions for the Wald and efficient score tests are similarly derived and details are included in Appendix A . In the case of the modified profile likelihood test , where 1
fm (t ) =
1
B(~, y )
it is not possible
u - l / 2(I - u )(n- p- 2)/ 2. to obtain
2
- u
n - p + 2+ (n - p)u"
a transformation
u in terms
of t
0< u < 1
(25) explicitly.
However, a function u*(t ), can be found such that u = u*(t ) + O(n- 3) and leads to the evaluation of the coefficients am and bm displayed in the table above; see Appendix B for details . Fisher
' s statistic
, 1
Zf == '2 log can
be
used
to
T121rest ( 11+ T12lrest ),
test for a zero partial
correlation
coefficient . Under
the
null hypothesis, it has expectation zero and variance 1/ [(n - p + 2) - 3] = l / (n - p - 1). SeeFisher (1970 , Ex. 30, p204) ,. -p160 and . . or Muirhead (1982 ~ Theorem 5.3.1) . Usually Z f is standardised and compared with a standard
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS 565 Normal distribution . However here, for comparison , the null distribution
of
T f = (n - p - 1) ZJ is considered and the terms in the asymptotic expansion are given . Details are included in Appendix A as the derivation is similar to that of the likelihood ratio above. The second term b. (x ) 9x (x ) n - 1 in the distribution function expansion gives F . (x ) - G x (x ) . If this coefficient is negative then the test will reject too few hypotheses (a conservative test ) whereas if the coefficient is positive too many will be rejected (a liberal test ). The striking feature of Table 1 is that to order n - 1 the distribution of the modified profile likelihood ratio test and Fisher 's Z f statistics do not depend on p ; hence to this accuracy their distributions do not depend on the number of nuisance parameters . The expansions of the modified profile likelihood and the likelihood ratio density functions are the same when p = 2. In general the actual sizes of the other three tests are increasing with p and hence when the coefficient of n - 1 becomes negative the adequacy of the chi-squared approximation decreases with p . When the number of nuisance parameters is large , that is large p , all the tests are on the conservative side. Inspection of Table 2 reveals that to order n - 1, for a 5% test , the like lihood ratio and efficient score tests are always conservative , whereas the Wald test rejects too few hypotheses for p less than 6 and too many for larger p . For a 1% test the likelihood ratio test is again always conservative and so is the efficient score test apart from when p = 2. The Wald test is liberal until p equals 10. As expected the modified profile statistic does well , but surprisingly so does Fisher 's statistic . 5.
Discussion
The asymptotic expansions of the null distribution functions given in Table 1 above (i ) allow a comparison of test accuracy among the five statistics considered here, (ii ) make explicit how nuisance parameters affect the tests to varying degrees, and (iii ) indicate the effect of sample size. The closed form expressions , of interest in their own right , enable the detailed calcula tion of these expansions . The main conclusion is that the modified profile likelihood test (and Fisher 's Z f test ) do not depend on the dimension p of the vector X . From Table 2 these two tests are in general more accurate than the others . For large p this superiority is uniform , since the accuracy of the others deteriorates . The implication is that it is important to modify the statistic when p is large and n is small . Interestingly the adjusting factor depends linearly on p rather than the number of nuisance parameters in the model which is 0 ( P2) . The signed square root version of the test statistics discussed in A p-
566
PETERW. F. SMITHAND JOEWHITTAKER
TABLE 2. Achievedsignificancelevelsfor the Wald, efficient Score, Likelihood, Modified profile likelihood and Fisher . 's test statistics, with varying samplesizenand dimenslonp. Nominal level 5% n
10
2 0.0126 0.0566 0.0786 0.0786 0.0516 6 0.0585 0.1025 0.1245 10 0.1043 0.1483 0.1703
20
2 0.0313 0.0533 0.0643 0.0643 0.0508 6 0.0542 0.0762 0.0872 10 0.0771 0.0991 0.1101
30
2 0.0375 0.0522 0.0595 0.0595 0.0505 6 0.0528 0.0675 0.0748 10 0.0681 0.0828 0.0901
50
2 6 10
200
0.0425 0.0513 0.0557 0.0517 0.0605 0.0649 0.0609 0.0697 0.0741
0.0557 0.0503
2 0.0481 0.0503 0.0514 0.0514 0.0501 6 0.0504 0.0526 0.0537 10 0.0527 0.0549 0.0560 Nominal level 1%
n
10
p 2
w
s
L
M
F
0.0178 0.0070 0.0193 0.0193 0.0123
6 0.0029 0.0219 0.0342 10 0.0120 0.0368 0.0491
20
2 0.0039 0.0085 0.0147 0.0147 0.0111 6 0.0036 0.0159 0.0221 10 0.0110 0.0234 0.0296
30
2 0.0007 0.0090 0.0131 0.0131 0.0108 6 0.0057 0.0140 0.0181 10 0.0107 0.0189 0.0230
50
2 0.0044 0.0094 0.0119 0.0119 0.0105 6 0.0074 0.0124 0.0148 10 0.0104 0.0154 0.0178
200 2 0.0086 6 0.0094 10 0.0101
0 .0098 0 .0106 0 .0113
0.0105 0.0105 0.0101 0.0112 0.0120
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS 567 pendix D lead to much the same conclusions . These conclusions will generalise to situations in the wider context of graphical model selection : in particular the cases of excluding several edges simultaneously , single edge exclusion from a non-saturated model , and of models involving discrete and mixed variables . All but Fisher 's test generalise conceptually , but it is hard to find closed form expressions. The results of the paper suggest that modifying the profile likelihood is most rewarding , and this can always be done numerically . A small simulation study for excluding two edges indicated this to be the case, Smith (1990) . In particular these show that the modified profile give the most accurate p-values and its accuracy is least affected by an increase in the number of nuisance parameters . Acknowledgements : We should like to thank Antony Davison for some extremely valuable comments on an earlier version of this paper . The work of the first author was supported by a SERa studentship .
6. References Amari , S-I . (1982). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika, 69, 1- 17. Amari , S-I . (1985). Differential -Geometric Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag: Heidelberg. Barndorff-Nielsen, a .E. (1978). Information and Exponential Families in Statistical Theory. Wiley : New York. Barndorff-Nielsen, a .E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70, 343- 365. Barndorff-Nielsen, a .E. (1986). Inference on full or partial parameters basedon the standardized signed log likelihood ratio . Biometrika, 73, 307- 322. Barndorff-Nielsen, a .E. (1988). Parametric Statistical Models and Likeli hood. Lecture Notes in Statistics 50, Springer-Verlag: Heidelberg. Barndorff-Nielsen, a .E. (1990a). A note on the standardised signed log likelihood ratio . Scand. J. Statist., 17 157- 160. Barndorff-Nielsen, a .E. (1990b). Approximate probabilities . J. R. Statist. Soc.B, 52, 485- 496. Barndorff-Nielsen, a .E. and Cox, D.R. (1989). Asymptotic Techniquesfor Use in Statistics. Chapman and Hall : London. Cox, D.R. and Hinkley, D. V . (1974). TheoreticalStatistics. Chapman and Hall : London. Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference, (with discussion). J. R. Statist. Soc. B, 49, 1- 39. Cox, D.R. and Wermuth, N. (1990). An approximation to maximum like-
568
PETERW. F. SMITHANDJOEWHITTAKER
lihood estimates in reduced models . Biometrika , 77 , 747- 761. Davison , A . C . (1988). Approximate conditional inference in generalised linear models . J . R . Statist . Soc. B -. 50 ~ . 445- 462. Davison , A . C ., Smith , P.W .F . and Whittaker , J . ( 1991) . An exact condi tional test for covariance selection . Austral . J . Statist ., 33 , 313- 318. Deemer , W .L . and aikin , O . (1951). The jacobians of certain matrix trans formations useful in multivariate analysis . Biometrika , 38 , 345- 367. Dempster , A .P. (1972) . Covariance selection . Biometrics , 28 , 157- 175. Efron , B . (1978) . The geometry of exponential families . Ann . Statist ., 6, 362- 376. Fisher , R .A . (1970) . Statistical Methods for Research Workers , 14th Edi tion . Hafner Press: New York . Fraser , D .A .S. (1991) . Statistical inference : likelihood to significance . J . Amer . Statist . Soc., 86 , 258- 265. Frydenburg , M . and Jensen, J .L . (1989). Is the 'improved likelihood ratio statistic ' really improved for the discrete case? Biometrika , 76 , 655661. Graybill , F .A . (1983). Matrices with Applications in Statistics . 2nd Edi tion . Wadsworth : California . Harris , P. and Peers, H .W . (1980) . The local power of the efficient score test statistic . Biometrika , 67 , 525- 529. Hayakawa , T . ( 1975). The likelihood ratio criterion for a composite hypothesis under a local alternative . Biometrika , 62 , 451- 460. Isserlis , L . (1918) . On a formula for the product -moment coefficient of anv ., order of a normal frequency distribution in any number of variables . Biometrika , 12 , 134- 139. Kreiner , S. (1987) . Analysis of multi -dimensional contingency tables by exact conditional tests : techniques and strategies . Scand. J. Stat ., 14 . Lauritzen , B.L . ( 1989) . Mixed graphical association models . Scand. J . Statist ., 16 , 273- 306. McCullagh , P. (1987) . Tensor Methods in Statistics . Chapman and Hall : London . Moran , P. A . P. (1970) . On asymptotically hypotheses . Biometrika , 57 , 45- 55. Muirhead , R .J . (1982) . Aspects of Multivariate New York .
optimal tests of composite Statistical Theory . Wiley :
Pierce , D .A . and Peters , D . (1992). Practical use of higher order asymp totics for multiparameter exponential families (with discussion ) . J. R . Statist . Soc. B , 54 , 701- 737. Smith , P.W .F . (1990) . Edge Exclusion Tests for Graphical Models. Unpub lished Ph .D . thesis . Lancaster University . Speed, T .P. and Kiiveri , H . (1986) . Gaussian Markov distributions over
EDGE EXCLUSIONTESTS FOR GRAPHICAL GAUSSIANMODELS 569 finite graphs. Ann . Statist., 14, 138- 150. Wermuth, N. (1976). Analogies between multiplicative models in contingency tables and covarianceselection. Biometrics, 32, 95- 108. Whittaker , J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley : Chichester.
Appendix A : Expansions Wald and Efficient Score
of the Density Functions for Test Statistics in Section 4
the
The Wald test statistic for excluding a single edge from a graphical Gaussian models is
nu, Tw ==1+u
where u = TI2lrest" Under the null hypothesis Ho : W12= 0 1
fr (u) =
1 U- 1/2(1 - U)(n- p- 2)/2 B (-2 ' !!.= ) , 2
0 < u < 1.
Putting t = n u/ (1 + u) gives the density function of Twas
1 -1/2( B(~,y)u 1-u)(n-p-2)/2.i~_.~n_~2:' n-t)(n-p-2)/2 tt)-1/2(1-~ B~:-1~ (~-=
O< u < l
fw(t) -
n
.
2E
(n - t)2'
0 < t < ~, since u = t / (n - i ). Using Stirling 's approximation, taking logs and expanding gives
logfl (t)=-~{t+log (27rt )}+~{(t- l)(2P - l)+3t(3-t)}n-l+0(n-2). The leadingterm of this expansioncorresponds to the XI distribution and so
1 n fw(t) = gx(t)[l + 4{ (t - 1)(2p+ 1) + 3t (3 - i )} n- l ) + O(n- 2), 0 < t < 2 ' wheregx(t) is the density function of a xi randomvariable. Finally integrating from 0 to x, term by term, givesthe expansionfor the cumulative
570
PETER W . F . SMITH AND JOE WHITTAKER
distribution
function
Fw (x )
1 (X Gx (x ) + 4n - l Jo { (t - 1) (2p + 1) + 3t (3 - t ) } gx (t )dt + O (n - 2)
=
M
=
Gx (x ) - 2n 1 - 1(2p + 1 - 3x ) ( 2X; ) 1/ 2 exp ( - x / 2) + O (n - 2) 1
=
Gx (x ) - 2 (2p + 1 - 3x ) xgx (x ) n - 1 + O (n - 2) ,
where Gx (x ) is the distribution
function
of a X~ random
x > 0,
variable .
The efficient score test statistic for excluding a single edge from a graph ical Gaussian model is
Ts = nu . Putting t = n u gives the density function of Ts under the null hypothesis as
f s(t)
-
-
1 -1/2 B(~,T)u (1-u)(n-p-2)/2..n!.' nB(!,1T)(;t:;:)-1/2(1-~)(n-p-2)/2
0 < u < 1,
0 < t < n,
since u = tin . . gIves
expanding
1 1 logf s(t) = - 2{ t + log(27rt)} + 4{ (t - 1)(2p+ 1) + t (3 - t)} n- l + O(n- 2). Now this expansionis identical to that for the Wald statistic, apart from 3 t (3 - t) is replacedby t (3 - f). Hence 1 f s(t) = 9x(t)[1 + 4{ (t - 1)(2p+ 1) + t (3 - t)} n- l ) + O(n- 2), 0 < t < n, 1 Fs(x) = Gx(x) - 2(2p+ 1 - x) X9x(x) n- l + O(n- 2), x > o.
Under the null hypothesis the density function of R = R121rest is 1
fR (r ) =
B (l2 ' !!=2E. )
(1 - r2)(n- p- 2)/2
,
- 1 < r < 1,
EDGEEXCLUSION TESTSFORGRAPHICAL GAUSSIAN MODELS571 Muirhead (1982, pI88 ). So the density function of Tf is
1 2(n-p-2}/2 2(1- r2) B(~~ )(l - r ) . (n- p- 1) log(~ )
-
ff (t )
2(1 - r2)(n- p)/2
1
-
B(~~-~ (n- p- 1) log(~ ) ' where
[2{tj(n- p- 1)}1/2] - 1 r - exp - exp [2{tj(n - p- 1)}1/2] + 1. Expanding thelogoff f(t) asbefore gives 1
Hence
f
f
( t )
=
gx
( t )
+
12
( 3
-
6t
+
t2
) n
-
) n
-
l
+
O
( n
-
2 ) .
-
log
ff (t) Ff(x)
Appendix
1
=
gx
( t ) [ l
+
12
( 3
-
6t
+
t2
l )
+
O
( n
-
2 ) ,
t
l
+
O
( n
-
2 ) ,
x
>
0 ,
1
=
Gx
( x
)
-
6
( x
-
B : The Derivation
3 )
xgx
( x
)
n
-
>
O .
of u*(t ) in Section 4
Recallt = - (n - p + 1) log (1 - u) + log (1 + u), whereu = r ~2Irest . Put
Ul - (1 - Ul) log(1 + ul )n- l
U2
U+ ~n- 2(1 - u) log (1 + u) ( 2P+ log (1 + u) - ~
) + O(n- 3).
So U = U2 + 0 (n - 2) . Finally put U3
-
12
(
U2- -2n- (1 - U2) log(1 + U2) 2p+ log (1 + U2) -
4)
1 + U2
U + O (n - 3) .
Recursively substituting for U2 and then Ul gives U3 as a function of t which is a equal to u to order n - 2 and hence is the required u* (t ) . Then am and bm are obtained by substituting u* (t ) in (25) and expanding as for the other test statistics .
572
PETER
W . F . SMITH
AND
JOE WHITTAKER
Appendix C : Proofs of ( 14) and (16) Sincen = ~- 1 (and files are invariant under continuous transformations) -
(. Vii
--JEiiJ IEJ '
where l~ iiJ is the of O'ii and so does not contain 0' 12 when i = 1 -- co-factor .... or 2. By (15) , IEiil = IEiil , for i = 1, 2, and hence ..-J L ' ii J I L /.1.. -\.AJ - 'f 1.1.
lEI lEI =
Wii( 1 - ri2 /rest) ,
which proves (14). For (16) note that -0 = (;;12(= (;;21) = ~ lEI
==: >
JE211= 0,
(26)
...where I~ 211is the cofactor of a21. Expanding 1~ 211about the first row of -L' 21 gives p IE211= - L alkl [E.21] lkl , k=2
(27)
where I[~ 21] 1kI is the cofactor of O'--lk from the submatrix of E without the first column and second row...... So I[E21] lkl does not contain 0: 12 (= 0:21) and hence, by (15) , is equal to I[~ 21] 1kI for all k . Combining (26) and (27) gives
0 = - 1~211 -
since alk = alk , k # 2
-
-
By rearranging,
a12 -
_ ~3-]I1-21 1+21 a12 I[JE
EDGE
EXCLUSION
TESTS
FOR
GRAPHICAL
.-
I~ I -
I~ I
=
Substituting
r121rest
for
MODELS
573
.-
1~ 211 .-
=
GAUSSIAN
.0' 12
+
1[ ~ 21 ] 121 (,1) 12
..(,1)11 (,1)22 -
.-
- 2 + (,1) 12
a12 .
- (; ) 12 ( (; ) 11 (;) 22 ) - 1/ 2 and
812
for
0: 12
gives
the
result
( 16 ) .
Appendix
D
There
is recent
with
null
1992
part
applied
signed ratio
( 11 )
that
,
deriving
,
1990a
, one
- root
of by
199Gb
;
- side
hypotheses
versions are
modification
of
given
Fraser
and
square ( 1992
Z
,
signed
square
- root
tests
the
Normal
distribution
1991
;
and
Pierce
have
Wald
=
proposed
modifying
Peters
the
by
statistic
directly
and
modifications
especial
Peters
,
relevance
.
( 17 ) . The
by
Tests
approximated
because
statistics
and
Pierce
1986 is
square
test
calculated
of
in
- root
better
, this
studies
The
(8 ) ,
Square
interest
- Nielsen
) ; in
hood
Signed
distributions
( Barndorff
in
:
the
( r12Irest
by
Barndorff
square
- rooting
do
, efficient
sgn
- root
not
score
and
likeli
) T1 / 2 together - Nielsen
test
( 1986
statistic
commute
. ) From
,
-
with
Zl .
)
is
( Note
equation
(8 )
) .1
Zm
=
Zl
+
log 2Zl
The
relevant
values
( 22 ) ; giving partial
a closed
correlation Asymptotic
score
and
for
can
and
efficient
be
to given
the
by be
score required
are
that
for
have
the by
lower
order
, so
to
from
those
. The
bounds a comparison
is
test
Section the
not for
in
sample
Wald
, efficient
. The
resulting
4 with
the
argu
modified
to
that invert statistic
-
density
distribution mind
possible the
the
chi - square
cumulative
bearing . It
the
's Z f in
of
and
.
compare
correspondingly
Normal
( 8 ) , ( 11 ) , ( 21 )
here
Fisher
integration
( 28 )
is a function
write to
with
. Zw
again to
similar , and
Z l log
obtained
calculated
along
squares
obtained
Zl
which
be
1
func the
-
Wald
Zm
, not
cannot
.
Numerical reveals
can
their
ILbb I
complicated
tests
densities
by
then
the
too
-
are
expression is
ratio
replaced
tions
even
form
expansions
replaced
function
substitution
, but
likelihood
expansion ments
for
ILbb I "' " -
nothing
results new
are .
given
in
Table
3 , but
comparison
with
Table
2
HEPATITIS B:
A
CASE
STUDY
IN
MCMC
D. J. SPIEGELHALTER MRCBiostatisticsUnit Instituteof PublicHealth Cambridge CB22SR UK N . G . BEST
Dept Epidemiology and Public H faith Imperial College School of Medicine at St Mary 's London
W2
1PG
UK W . R . GILKS
M RC
Biostatistics
Unit
Institute of Public Health Cambridge
CB2 2SR
UK AND
H . INSKIP
MRC Environmental Epidemiology Unit Southampton General Hospital Southampton
5016
6YD
575
DAVIDJ. SPIEGELHALTER ET AL.
576
1.
Introduction
This chapter features a worked example using Bayesian graphical modelling and the most basic of MCMC techniques , the Gibbs samplei', and serves to introduce ideas that are developed more fully in other chapters . This case study first appeared in Gilks , Richardson and Spiegelhalter ( 1996), and frequent reference is made to other chapters in that book . Our data for this exercise are serial antibody -titre measurements , obtained from Gambian infants after hepatitis B immunization . We begin our analysis with an initial statistical model , and describe the use of the Gibbs sampler to obtain inferences from it , briefly touching upon issues of convergence, presentation of results , model checking and model criticism . We then step through some elaborations of the initial model , emphasizing the comparative ease of adding realistic complerity to the traditional , rather simplistic , statistical assumptions ; in particular , \ve illustrate the accommodation of covariate measurernellt error . The Appendix conta,ins some details of a freely available software package (BUGS, Spiegelhalter et al ., 1994) , within which all the analyses in this chapter were carried out . We emphasize that the analyses presented here cannot be considered the definitive approach to this or any other da.taset , but merely illustrate some of the possibilities afforded by computer -intensive MCMC methods . Further details are provided in other chapters in this volume . 2. 2 .1.
Hepatitis
B immunization
BACKGROUND
Hepatitis B ( HB ) is endemic in many parts of the world . In highly endemic areas such as West Africa , almost everyone is infected with the HB virus during childhood . About 20% of those infected , and particularly those who acquire the infection very early in life , do not completely clear the infection and go on to become chronic carriers of the virus . Such carriers are at increased risk of chronic liver disease in adult life and liver cancer is a major cause of death in this region . The Gambian Hepatitis Intervention Study (GHIS ) is a national pro gramme of vaccination against HB , designed to reduce the incidence of HB carriage (Whittle et at., 1991) . The effectiveness of this programme will depend on the duration of immunity that HB vaccination affords . To study this , a cohort of vaccinated GHIS infants was followed up . Blood samples were periodically collected from each infant , and the amount of surface-antibody was measured . This measurement is called the anti -HBs titre , and is measured in milli -lnternational Units (mIU ). A similar study
HEPATITIS
in
neighbouring
where
is
Senegal
t
of
B
denotes
since
may
equivalent
to
a
where
y
anti
- HBs
the
infant
vary
linear
denotes
vaccine
Here
et
we
ale
a
would
In
,
via
2 .2 .
Figure
1
from - HBs
the the
titre
of
titre
288
post ,
to
Of
titre
the
et
titre
al
and
. ,
constant
1991
log
) .
time
This
:
t ,
Q' i
is
( 2 )
constant
validate
after
the
the 1 ,
as
in
the
findings
final
of
plausibility
( 2 ) .
predicting
76
dose
Coursaget
of
This
individuals
relationship
individual
a
or
with
a .
at
)
log
,
if
protectiol1
- log
These
the
true
,
against
,
of
the
a
made
of a
,
these
and
infants
infants
- monthly
106
baseline
vaccination
For ( 30
six
subset had
final .
approximately
for
each
subsequently were
at
scale
infants
time
taken
three
note
is
the
of
the
1
data
,
with
a
two
intervals
for
infant 1
each
labelled
mIU
at
over
measurements
,
being
.t
infant
,
but
and
possibly
days be
time
tha
with
826
could
titre
suggests
intercepts
behaviour
change of
the
Figure
different
from
atypical the
in
to
have
rose
after
a to
an
' * '
in
these
Figure
mIU as
outlying to
be
different
of
subject
might that
1329
thought
i .e .
it
1 , at
an
day
outlier
gradient
;
extraneous
or
error
,
. exploratory
line
[ Yij
]
expectation infant
analysis
,
for
each
infant
in
Figure
1
we
:
E
for
data
lines to
preliminary
denotes
the
straight allowed
both
straight
E
on study
measurements
observations a
plotted - up
of
to
one
As
,
taken
somewhat
outlying
where
log
investigate
for
data
particular
respect
observation
~ and
log
x
to
minus
follow
apparently
This
to
fitted
raw
- baseline
be
.
whose
i .e .
vaccination
( Coursaget
and
measurements
fit
should
gradients
due
we
tool
examination
reasonable
with
1
( 1 )
.
Initial
.
,
GHIS
vaccination
1077
-
data
measurement
two
measurements
lines
final
,
ANALYSIS
shows
least
final
~ t
( 1 ) .
infants
total
Q' i
titre
of
simple
PRELIMINARY
anti
' s
577
infants
infants
=
GHIS
particular
a
cx
MCMC
i .
gradient
provide
HB
at
) .
all
between
- REs
the
common
IN
for
between
infant
analyse
( 1991
having
anti
each
STUDY
titre
relationship
log
for
CASE
that
Y
of
A
concluded
time
proportionality
:
=
ai
+
and i .
We
standardized
, Bi
( log
tij
-
log
subscripts
ij log
t
730
index around
)
,
( 3 )
the
jth
log
730
post for
- baseline numerical
578
DAVIDtJ. SPIEGELHALTER ET AL.
0 0 0 0 0 .
-
0
8 0 .
-
0 0 0 -
0 0 -
0 -
-
300
400
500
600
time since final vaccination
700
800
900
1000
(days )
Figure 1. Itaw data for a subset of 106 GHIS infants: straight lines connect anti-HBS measurements for each infant .
(nIW) eJ~!~S8H~!~ue
stability ; thus the intercept Qi represents estimated log titre at two years post -baseline . The regressions were performed independently for each infant using ordinary least squares, and the results are shown in Figure 2. The distribution of the 106 estimated intercepts { ai } in Figure 2 appears reasonably Gaussian apart from the single negative value associated with " infant ' * ' mentioned above. The distribution of the estimated gradients { ,6i} also appears Gaussian apart from a few high estimates , particularly that for infant ' * ' . Thirteen ( 12%) of the infants have a positive estimated gradient , while four (4%) have a ' high ' estimated gradient greater than 2.0. Plotting estimated intercepts against gradients suggests independence of G:i and fJi, apart from the clear outlier for infant ' * ' . This analysis did not explicitly take account of baseline log titre , YiO: the final plot in Figure 2 suggests a positive relationship between YiOand Qi, indicating that a high baseline titre predisposes towards high subsequent titres . Our primary interest is in the population from which these 106 infants were drawn , rather than in the 106 infants themselves. Independently applying the linear regression model (3) to each infant does not provide a basis for inference about the population ; for this , we must build into our model assumptions about the underlying population distribution of Qi and ,6i. Thus we are concerned with 'random -effects growth -curve ' models . If
HEPATITIS B: A CASESTUDYIN MCMC
579
0 N a
-4 -2 0 2 4 6 8 10 Intercepts :logtitre at2years
tl) N 0 rl {) .
0
20
10 Gradients
CX ) ..q-
... I .~~ . 1\'8' " . -2 0 2 4 6 8 10 Intercept
. ~
.
"
~
.
.
. . ,- J;~
. ' .. ~ . .
.
.
)- .
81 '. . .
. .
.
.
NI
2
8
10
Baseline log titre
Figure 2. Results of independently fitting straight lines to the data for each of the infants in Figure 1.
\\'e are willing to make certain simplifying assumptions and asymptotic approximations , then a variety of techniques are available for fitting such models , such as restricted maximum likelihood or penalized quasi-likelihood (Breslow and Clayton , 1993) . Alternatively , we can take the more general approach of simulating 'exact ' solutions , where the accuracy of the solution depends only the computational care taken .
3. Modelling
Specification of model quantities and their qualitative conditional in dependence structure : we and other authors in this volume find it convenient to use a graphical representation at this stage. Specification of the parametric form of the direct relationships between these quantities : this provides the likelihood terms in the model . Each of these terms may have a standard form but , by connecting them together according to the specified conditional -independence structure , models of arbitrary complexity may be constructed . Specification of prior distributions for parameters : see Gilks et at. ( 1996) for a brief introduction to Bayesian inference .
-
-
I
_
-
I
11111111111I1111I111
This section identifies three distinct components in the construction of a full probability model , and applies them in the analysis of the GHIS data :
580
DAVIDJ. SPIEGELHALTER ET AL.
3.1. STRUCTURAL MODELLING We make the following minimal structural assumptions based on the exploratory analysis above. The Yij are independent conditional on their mean J.Lij and on a parameter 0" that governs the sampling error . For an individual i , each mean lies on a 'growth curve ' such that J.Lij is a deterministic func tion of time tij and of intercept and gradient parameters ai and {Ji. The ai are independently drawn from a distribution parameterized by ao and 0"Q, while the ,Bi are independently drawn from a distribution parameterized by ,Boand O"(j . Figure 3 shows a directed acyclic graph (DAG ) representing these assumptions ( directed because each link between nodes is an arrow ; acyclic because, by following the directions of the arrows , it is impossible to return to a node after leaving it ) . Each quantity in the model appears as a node in the graph , and directed links correspond to direct dependencies as specified above: solid arrows are probabilistic dependencies, while dashed arrows show functional (deterministic ) relationships . The latter are included to simplify the graph but are collapsed over when identifying probabilis tic relationships . Repetitive structures , of blood -samples within infants for example , are shown as stacked 'sheets' . There is no essential difference between any node in the graph in that each is con-sidered a random quantity , but it is convenient to use some graphical notation : here we use a double rectangle to denote quantities assumed fixed by the design (i .e. sampling times tij ) , single rectangles to indicate observed data , and circles to represent all unknown quantities . To interpret the graph , it will help to introduce some fairly self-explanat ory definitions . Let v be a node in the graph , and V be the set of all nodes. We define a ' parent ' of v to be any node with an arrow emanating from it pointing to v , and a ' descendant ' of v to be any node on a directed path starting from v . In identifying parents and descendants, deterministic links are collapsed so that , for example , the parents of Yij are ai , ,Bi and 0". The graph represents the following formal assumption : for any node v , if we know the value of its parents , then no other nodes would be inform ative concerning v except descendants of v . The genetic analogy is clear : if we know your parents ' genetic structure , then no other individual will gi ve any additional information concerning your genes except one of your descendants. Thomas and Gauderman ( 1996) illustrate the use of graphical models in genetics . Although no probabilistic model has yet been specified , the conditional independencies expressed by the above assumptions permit many prop erties of the model to be derived ; see for example Lauritzen et ale ( 1990) , Whittaker ( 1990) or Spiegelhalter et al. ( 1993) for discussion of how to read
HEPATITIS B: A CASESTUDYIN MCMC
Figure 9 .
off
independence
the
graph
served data initially
the example
Our cation we
now
joint
of
the show
a graph
, it
forms of
3 .2 .
PROBABILITY
The
preceding
pretation ( Lauritzen
, such
are
model
will
that
when
, Yi2 , Yi3
essentials
distribution
ditional
retained
is important
full
change
have
no
Cti
and
as
understand any
when
upon
, dependence
that
data
. is
0 b -
conditioning
common
' ancestor
fJi , this
conditioning
observed
to before
on ' will
be
independence
other
will
quantities
between
Cti
. For
and
, Bi ma .y
. use
of
Yil
. It
the
nodes
independent
, when
induced
DAGs
properties
, although
be
model for hepatitis B data .
of
independence
necessarily
example
from
properties
marginally
not
be
properties represents
, and . For
Graphical
581
in
this
of
the
a
all
convenient
model
discussion
we et
is
primarily
without basis
quantities
to
needing for
the
facilitate algebra
cornmuni . Howevel
specification
of
~, as
the
full
.
MODELLING
independence . If
example model
wish
al . , 1990
of
graphical
properties to
construct ) that
models
without a full a
DAG
has
been
necessarily pro
model
in
terms
of
a probabilistic
babili
ty
is
equivalent
model
, it to
can
con inter
be
assuming
shown that
-
582
DAVIDJ. SPIEGELHALTER ET AL.
the joint distribution of all the randomquantitiesis fully specifiedin terms of the conditionaldistribution of eachnodegivenits parents: P(1f) = II P(v I parents[v]), VfV
(4)
whel'e P ( .) denotes a probability distribution . This factorization not only allows extremely complex models to be built up from local components , but also provides an efficient basis for the implementation of some forms of MCMC
methods
.
For our example , we therefore need to specify exact forms of 'parent child ' relationships on the graph shown in Figure 3. We shall make the initial assumption of normality both for within - and between-infant variability , although this will be relaxed in later sections. We shall also assume a simple
linear relationship between expected log titre and log time , as in (3). The likelihood
terms
in the model
Yij
are therefore
I"V N(J1 ,ij , 0"2) ,
(5)
JLij == Qi + !3i(log tij - log 730),
(6)
ai rv N(ao, a; ),
(7)
,Bi I"V N(,Oo ,O "J),
(8)
where 'rv' means 'distributed as', and N(a, b) generically denotesa normal distribution with mean a and variance b. Scaling log t around log 730 makes the assumed prior independence of gradient and intercept more plausible , as suggested in Figure 2.
3.3. PRIORDISTRIBUTIONS To complete
the specification
of a full probability
model , we require prior
distributionson the nodeswithoutparents : 0'2, ao, 0'; , ,80andO '~. These nodes are known as 'founders ' in genetics. In a scientific context , we would often
like these priors
to be not too influential
in the final
conclusions ,
al though if there is only weak evidence from the data concerning some secondary aspects of a model , such as the degree of smoothness to be expected in a set of adjacent observations , it may be very useful to be able to include
external information in the form of fairly informative prior distributions . In hierarchical models such as ours , it is particularly important to avoid casual use of standard
improper
priors since these may result in improper
posterior
distributions (DuMouchel and Waternaux, 1992); see also Clayton (1996) and Carlin ( 1996) . The priors chosen for our analysis are
0:0, ,80 rv N(O, 10000),
a- 2, a~2, a~2 rv Ga(O.Ol, 0.01),
(9)
(10)
HEPATITIS B: A CASESTUDYIN MCMC
where
Ga
and
(
a
,
b
variance
might
generically
/
expect
have
(
)
a
b2
the
precisions
to
)
magnitude
We
at
all
.
a
estimate
our
,
of
1994
;
,
starting
-
any
full
total
the
,
:
In
,
(
'
or
forget
all
the
order
of
deviations
BUGS
.
at
.
(
software
1996
)
(
for
a
Gilks
et
description
sampling
unobserved
:
nodes
uno
(
parameters
of
the
of
length
of
the
a
be
of
the
,
for
con
burn
-
in
'
-
required
;
calculated
a
-
compu
from
unobserved
sampling
statistics
'
more
is
must
values
be
;
algorithm
interest
Gibbs
ust
upon
whether
MCMC
true
summary
the
identify
or
III
decided
on
to
about
node
them
decide
perhaps
quantities
examine
bserved
from
to
or
inference
each
of
volume
the
choice
any
other
'
widely
nodes
fifth
step
evidence
of
.
should
lack
of
fit
of
these
steps
briefly
;
further
details
are
provided
.
its
to
extreme
initial
the
the
useful
,
this
possibility
of
of
a
the
the
that
the
Gelman
very
,
long
posterior
of
-
in
However
(
the
is
the
no
guarantee
not
,
Raftery
,
main
1996
very
)
.
support
by
other
runs
are
.
aggravated
On
for
number
)
towards
.
Gibbs
enough
conclusions
1996
burn
converge
the
long
a
being
posterior
run
perform
check
to
since
be
to
(
to
fail
tails
mode
to
unimportant
should
values
may
,
the
is
lead
sampler
extreme
at
It
values
could
the
is
)
starting
distribution
in
simulation
.
of
values
,
posterior
values
sampler
states
choice
cases
starting
starting
the
starting
severe
instability
of
MCMC
dispersed
sensitive
the
to
this
principle
with
of
an
INITIALIZATION
to
In
for
for
discuss
in
sampler
it
80
.
now
.
of
least
Gibbs
each
implementation
added
We
1
et
parameterization
satisfactory
elsewhere
.
Examination
at
implement
sampling
,
statistics
model
the
Gilks
for
monitored
length
efficient
be
See
for
be
run
output
a
,
;
for
must
the
summary
also
.
to
distributions
tationally
For
)
provided
)
methods
output
the
and
components
standard
using
required
be
data
and
and
b
sampling
1996
are
must
conditional
the
-
steps
missing
structed
-
four
values
and
,
.
are
posterior
sampling
.
ao
variance
10
/
we
.
general
-
at
a
,
since
the
deviation
Gibbs
et
analysis
of
deviations
Gibbs
mean
distributions
the
inverse
corresponding
by
with
probability
on
the
standard
the
model
' oper
standard
using
sampling
In
and
prior
model
distribution
pl
effect
,
prior
Spiegelhalter
Gibbs
4
have
than
Fitting
gamma
are
minimal
these
greater
.
have
100
shows
a
these
deviation
results
4
Although
them
standard
final
denotes
.
583
numerical
hand
,
of
success
starting
if
584
DAVIDJ. SPIEGELHALTER ET AL.
the sampler is not miring well , i .e. if it is not moving fluidly around the support of the posterior . We performed three runs with starting values shown in Table 1. The first run starts at values considered plausible in the light of Figure 2, while the second and third represent substantial deviations in initial values. In particular , run 2 is intended to represent a situation in which there is low measurement error but large between-individual variability , while run :3 repl'esents very similar individuals with very high measurement error . From these parameters , initial values for for Qi and ,f3i were indepen dently generated from ( 7) and (8) . Such 'forwards sampling ' is the default strategy in the BUGSsoftware . Parameter
Run 1
Run2 Run 3
5.0
20.0
-10.00
- 1.0
-5.0
5.00
O" a
2.0
20.0
0.20
U {3
0.5
5.0
0.05
1.0
0.1
10.00
ao
,80
G'
TABLE 1. Starting valuesfor parameters in three runs of the Gibbs sampler
4.2. SAMPLING FROM FULL CONDITIONAL DISTRIBUTIONS Gibbs sampling works by iteratively drawing samples from the full condi tional distributions of unobserved nodes in the graph . The full conditional distribution for a node is the distribution of that node given current or known val lies for all the other nodes in the graph . For a directed graphical model , we can exploit the structure of the joint distribution given in (4). For any node v , we may denote the remaining nodes by V- v, and from (4) it follows that the full conditional distribution P (vIV - v) has the form
P(v I V-v)
cx
P(v, V-v)
cx
wEcht .
P (w I parents[w]),
( 11)
where cx means 'proportional to '. (The proportionality constant, which ensures that the distribution integrates to 1, will in general be a function of
HEPATITIS B: A CASESTUDYIN MCMC the
remaining
nodes
tribution
for
v
components
'
only
co
For
(
to
11
-
'
,
tells
the
us
(
5
,
6
)
,
)
We
see
prior
(
of
v
of
intercept
,t
v
I
tile
hill
pal
the
.
ai
the
,
given
v
The
number
of
(
7
)
,
)
dis
and
and
for
and
co
-
any
parents
,
.
prescription
ai
ni
is
proportional
likelihood
observations
-
likelihood
general
for
by
]
,
of
ai
v
conditional
children
children
term
[
full
,
conditional
' ents
distribution
for
is
tlla
(
' rhus
the
collditional
prior
)
parents
of
the
full
ni
11
.
its
parents
the
where
' om
P
child
other
the
fI
component
values
consider
of
by
.
the
are
that
product
given
v
each
on
parents
example
)
-
from
depends
where
I
a
arising
node
of
'
contains
585
OIl
terms
the
ith
,
infant
.
Thus
P
(
ai
I
.
)
cx
exp
-
2
{
(
2aa -
ai
exp
where
the
except
of
'
ai
(
12
)
,
,
it
.
(
'
i
in
. e
can
P
.
V
be
(
-
ai
CXi
shown
[
n =
I
)
.
"
X
~
O
)
2
}
Yij
-
Q '
i
-
13i
{
l
.
)
all
data
nodes
completing
that
(
log 2
tij
-
log
730
)
the
P
(
ai
I
.
)
is
., 2
and
square
a
,
]
2 a
denotes
By
(12)
-
111
j
(
normal
all
for
}
parameter
ai
distribution
in
nodes
the
exponent
with
mean
; f + ~ ~ j ~l Yij - j3i(logtij - log730) a
1
-0- 2
n "
+
~ 0- 2
a
and variance
1 l0 - 2
+
.!.!.L . 0- 2
a
The full conditionals for ~i , ao and !3ocan similarly be shown to be normal distributions
.
The full conditional distribution
for the precision parameter a~ 2 can
also be easily worked out . Let Ta denote a;;2. The general prescription ( 11) tells us that the full conditional for TCY . is proportional to the product of the
prior for TO ' , given by ( 10), and the 'likelihood' terms for TO ' , given by (7) for each i . These are the likelihood are the only children
P(Tcx I .)
<X
tel 'ills for Ta because the Q'i parameters
of TO ' . Thus we have
106 1 {I
}
T~.01- 1e- 0.O1Ta II TJexp - "iTQ ' (ai - ao)2 1= 1
-
cx
0 01 + 1Q . - 1
1
2
TO '. 2 exp {-TQ ' (0.01+'2~ 106 (ai- ao ) )} Ga ' 0.01+'2 ~ (ai- ao )2). (0.01+2106 1106
586
DAVIDJ. SPIEGELHALTER ET AL.
Tilus
the
The
full
full
to
be
In
ley
,
1987
plify
) .
so
MONITORING
The
values
must
be
many
three
method
of two
monitored
stabilize
of
length
runs
1
2
to
can
distribution
similarly
normal
.
be
. for
shown
gamma
are
Gilks
( 1996
dis
example
distributions
techniques see
or
( see
do
available
-
sim
-
con
-
not
for
-
Rip
efficiently
) .
generated
use
iterations
Rubin
are
of
of of
,
quickly
given
down
,
in
Gelman
10
called
CODA
values
of 3
1 .
took
) .
,
for
around
)
statistics
Details
of
Each
and
the
( Best , 80
sampler and
( 1992
Table
BUGS
Gibbs miring
( 1996
using
run
the
check Rubin
as
by
sampled
settled
and
started
S - functions the
to
Gelman
SPARCstation
suite
by
summarized
000
a
trace 2
gamma
-
conditional
quantities
5
the the
and
;
the
on
using shows
full
statistically
a .nd
minutes
4
,
illustrate
Gelman
a
straightforward
, several
unknown
we
another
OUTPUT
and
I ' tIllS
around
while
THE
Here
is
and
reduce is
distributions
the
2
conditionals
applications
such
for
.
full sampling
. However
graphically
vergence
to
In
from
4 .3 .
all
which
conveniently
sampling
Figure
,
from
T Q
a ~
.
exarnple ,
for
for
distributions
tllis
butions
distribution
distributions
galnrna
tri
on
conditional
conditional
et the
run runs
aI
. ,
three 700
the
took were
1995
) .
runs
:
iterations
.
Table 2 shows Gelman - Rubin statistics for four parameters being mon itored (b4 is defined below ) . For each parameter , the statistic estimates the reduction in the pooled estimate of its variance if the runs were continued indefinitely . The estimates and their 97.5% points are near 1, indicating that reasonable convergence has occurred for these parameters . Parameter
Estimate
,80
1.03
1 . 11
0" / 3
1.01
1 . 02
0-
1.00
1 .00
b4
1.00
1 . 00
TABLE
2 .
97.5% quantile
Gelman- Rubin statistics for
four parameters
The Gelman - Rubin statistics can be calculated sequentially as the runs proceed , and plotted as in Figure 5: such a display would be a valuable tool in parallel implementation of a Gibbs sampler . These plots suggest discarding the first 1 000 iterations of each run and then pooling the remaining 3 x 4000 samples.
587
mil
HEPATITIS B: A CASESTUDYIN MCMC
0
1000
2000
3000
4000
5000
Iteration
Figure 4. Sampled values for .80 from three runs of the Gibbs sampler applied to the model of Section 3; starting values are given in Table 1: run 1, solid line ; run 2, dotted line ; run 3, broken line .
588
DAVIDJ. SPIEGELHALTER ET AL.
b4
betaO 0
0 0 < 0 0 ~
' . : ' -,
. "
. .
.
. .
.
. .
.
I.()
.
: .
:1
~
0
:
C\ J
~
0j~ ,
,
.
0
0
2000
0
4000
iteration
O!Js!JeJs~-~
O!lS!lBlS ~-E)
0
0
2000
4000
iteration
.
sigma.beta
sigma 0
0
0
0
0
~
. . -
.. ..
: : :
. .. Ii 0 t
O!lS!le~S ~ -E)
a
. .
.
. .
: : :
. ..
.
.
.
.
.
.
.
.
.
.
0
:
U)
: . . .
.
.
.
.
l _-
0
2000
4000
iteration
O!Js!Jets~ -~
.
.
.
0 0 l ! )
L ~ 2000
_4000
iteration
Figure 5. Gelman - Rubin statistics for four parameters from three parallel runs . At each iteration the median and 97.5 centile of the statistic are calculated and plotted , based on the sampled values of the parameter up to that iteration . Solid lines : medians ; broken lines : 97.5 centiles . Convergence is suggested when the plotted values closely approach 1.
HEPATITIS B: A CASESTUDYIN MCMC
5S9
4.4. INFERENCEFROMTHE OUTPUT Figure
6
shows
evidence of
the
is
not
gradient
are
4 .5 .
,
provided
maximum - or
- fit
ically
to
classical sumptions ,
not
care
.
and
.
Gelfand
for ,
) ,
( 1996 assessing
we
measuring
simply
again
a
that
that
'd
) ,
our
' .
in
example
is
no
.
In
-
con
-
large theol
' Y
particular
( 1996
describe
) ~ a
George
variety
of
. standard
may
by
as
with
Meng )
which
model
consider
likelihood
and
-
- subject
require
( 1996
-
compared
models
adequacy
manner
good specif
also
subjects
- level
,n
be )
within
therefore
Smith
assumed
may
asymptotic
Gellna
for are
( 1986
multi
model
the
standardized
of this
' GG
basis
between
a ,nd
an
is for
estimates
standard
of
improving
from
emphasize
define
( 1996
illustrate
The
value
headed
natural
Solomon
comparison
Phillips
and
departures
a
and
standal and
and
this
column
deviance
Cox
fitting
which
Raftery )
0 .3 .
statistics
the
indepelldence the
criticism
( 1996
1 :
parameter
[ I ' om
assuming
for
) ~ in
the
of .
allow
Model
McCulloch
Here
,
minus
clear size
around
Summary
provide
since
departures
parameters
hold
techniques
,
measures
data
be
is
absolute
- FIT
models
methods of
does
18
There
the to
2 . 1 .
page
.
although
estimated
methods
detecting
such
MCMC
numbers
( see
- OF
minimize
for in
trast
3
nested
tests
0' { 3
values
~
Section
comparison
alternative
sampled
around
in
Table
model
designed
between
with
likelihood
and
the
gradients
concentrated
GOODNESS
Standard ness
,
noted
in
ASSESSING
is
as
of
the
great
, 80
interest
model
plots
between
variability
particular
We
density
variability
underlying
we
kernel
of
be
means
statistics
calculated a
,
definitive
for although analysis
.
residual
Yij Tij
-
/ lij
= a
with
mean
zero
and
uals
can
be
calculated
and
can
be
used
culate
a
normal
model
,
and
here
mean
and
fourth
is Meng
truly
Various
we
to
( 1996
)
.
values For
These
of
IIij
example
deviations
of
.
,
from
standardized
we
the
resid and can
a
) ,
cal
-
assumed
residuals
could
be
calculate
of
then for
model
current
detect
functions
moment
normal
( using
error
statistics
intended
4
the
assumed
iteration
b
bution
the
summary is
.
under
each
construct that
error
1
at to
statistic
considered
variance
more
the
this formal
-
1 288
~
r . . Z)
. . 4 Z) ,
standardized
statistic assessment
residual
should
be of
such
.
close
If
the
to summary
error
3 ;
see
distri
-
Gelman statistics
.
590
DAVIDJ. SPIEGELHALTER ET AL.
C\ I
N 0 ~
0
0
5
10
-1.5
-1.0
-0.5
betaO
b4
l {)
C\ J
0
0
0.0
0.5
sigma
1.0
0.0
0.5
1.0
1.5
sigma.beta
Figure 6. Kernel density plots of sampled valuesfor parameters of the model of Section 3 based on three pooled runs, each of 4 000 iterations after 1000 iterations burn-in . Results are shown for ,80, the mean gradient in the population ; u,B, the standard deviation of gradients in the population ; 0' , the sampling error; and b4, the standardized fourth moment of the residuals.
HEPATITISB: A CASESTUDYIN MCMC
591
Figure 6 clearly shows that sampled values of b4 are substantially greater
than 3 (mean = 4.9; 95% interval from 3.2 to 7.2). This strongly indicates that the residuals are not normally distributed . 5.
Model
5 .1.
elaboration
HEAVY
- TAILED
DISTRIBUTIONS
The data and discussion
in Section 2 suggest we should take account
of
apparent outlying observations . One approach is to use heavy-tailed distri butions , for example
t distributions
, instead of Gaussian distributions
for
the intercepts ai , gradients fJi and sampling errors Yij - lLij . Many
researchers
duced within
have shown how t distributions
can be easily intro -
a Gibbs sampling framework by representing the precision
(the inverse of the variance) of each Gaussian observation as itself being a random quantity
with a suitable gamma distribution ; see for example
Gelfand et al. (1992). However, in the BUGSprogram, a t distribution on v degrees of freedom can be specified directly as a sampling distribution , with priors being specified for its scale and location , as for the Gaussian distributions
in Section 3. Which value of v should we use? In BUGS, a prior
distribu tion can be placed over v so that the data can indicate the degree of support for a heavy- or Gaussian-tailed distribution . In the results shown below , we have assumed a discrete uniform
prior distribution
for 1/ in the
set
{
1, 1.5, 2, 2.5, . . . , 20, 21, 22, . . . , 30, 3.5, 40, 45, 50, 75, 100, 200, . . . , 500, 750, 1000 } .
We fitted the following models : GG GT
Gaussian sampling errors Yij - Jlij ; Gaussian intercepts Qi and gradients ,Bi, as in Section 3; Gaussian sampling errors ; t-distributed intercepts and gra dients ;
TG
i -distributed
samping
errors ; Gaussian intercepts
and gra -
dients ;
TT
t -distributed
samping
errors ; t -distributed
intercepts
and
gradients . ReGul ts for these models are given in Table 3 , each based on 5 000 i tera -
tions after a 1000-iteration correlations
in the parameters
burn -in . Strong auto -correlations and crossof the t distributions
were observed
, but
the
,80sequencewas quite stable in each model (results not shown). We note that the point estimate of ,8ois robust to secondary assumptions about distributional shape, although the width of the interval estimate is
592
DAVIDJ. SPIEGELHALTER ET AL. GG
Parameter /30
mean
O'fJ
- 1.05
- 1. 13
- 1.06
- 1. 11
- 1. 33 , - 0 . 80
- 1. 95 , - 0. 93
- 1. 24 , - 0 . 88
- 1 . 26 , - 0 . 99
mean
0.274 0.070, 0.698
0.033 0.004, 0.111
0 . 028 0. 007 , 0. 084
mean
95 % c.i .
3.
00
19 5 , 30
1 1, 1
mean
TABLE
2 .5 2, 3.5
;' , 20
95 % c.i .
VfJ
0 . 007 , 0 . 176
3.5
12
mean
0 .065
62 .5, 3.5
00
95 % c. i . Va
TT
95 % c . i .
95 % c. i . v
TG
GT
Results
of fitting
models
00
16 8 . 5 , 26
GG , GT , TG
and TT
to the GHIS
data :
posterior means and 95 % credible intervals ( c . i ) . Parameters 1/ , I/a and 1/13are the degrees of freedom in t distributions for sampling errors , intercepts and gradients , respectively . Degrees of freedom = 00 corresponds to a Gaussian distribution .
reduced dom
by for
errors
35 % when
both
and
error of
, due
alone
sampling and
belief
to
a heavy
butions
population
to
the
have
- tailed
at
all
levels
sampling
distributions coefficients
( va. ~
heavy
( model
TG
TT
) supports
2 . 5 ) and
11/3 ~
) leads
(v ~
a fairly
the
degrees
. Gaussian ( model
) tails ( v {3 ~
distribution
(v ~ 19 ,
( Cauchy
individuals
tails
( model
unknown
regression
outlying
heavy
of
for very
sampling
distribution
gradients
in
t distributions
and
t distributions
overwhelming gradients
allowing
for
the
sampling
GT ) leads
the
a confident
judgement
allowing
assumption
of a heavy shape
of
sampling
3 .5 ) , while
Gaussian
to
distribution
1 ) . Allowing to
of free -
for
t distri
-
- tailed
intercepts
16 ) .
5.2. INTRODUCING A COVARIATE
.~~ .-
.. ..
As noted in Section 2.2, the observed baseline log titre measurement , YiO, is correlated with subsequent titres . The obvious way to adjust for this is to replace the regression equation (6) with
_ . .~. .- ~
J1 ,ij == ai + , (YiO- Y.O) + fii(logtij - log 730),
(13)
where Y.o is the mean of the observations { YiO} . In ( 13), the covariate YiOis ' centred ' by subtracting Y.o: this will help to reduce posterior correlations between 'Y and other parameters , and consequently to improve miring in the Gibbs sampler . See Gilks and Roberts ( 1996) for further elaboration of this point .
HEPATITISB: A CASESTUDYIN MCMC As for all anti - RES titre
measurements , YiO is subject
593
to measllrernent
error . We are scientifically interested in the relationship between the ' true ' underlying log titres JLioand !lij , where ~iO is the unobserved ' trtle ' Jog titl 'e on the ith infant
at baseline . Therefore , instead of the obvious
regression
model ( 13), we should use the 'errors-in-variables' regressionmodel /1ij == ai + , (/1iO- Y.O) + ,6i(log tij - log 730). Information ment
YiO . We
( 14)
about the unknown }.LiDin ( 14) is provided by the mea,suremodel
this
with
YiOI"V N ( /-LiO, a 2).
( 15)
Note that we have assigned the same variance a2 to both Yij in (5) and YiO
in ( 15), becausewe believe that Yij and YiOare subject to the samesources of measurement(sampling) error. We must also specify a prior for /liD. We choose
lLiDr"'-I N((),
( 16)
where the hyperparameters 8 and r,p- 2 are assigned vague but proper C~aussian and gamma prior
distributions
.
Equations (14- 16) constitute a measurement-error model, as discussed further by Richardson ( 1996). The measurement-error model ( 14- 16) forms one component of our complete model , which includes equations (5) and ( 710) . The graph for the complete model is shown in Figure 7, and the results from fitting both this model and the simpler model with a fixed baseline ( 13)
instead of ( 14- 16), are shown in Table 4. Fixed
baseline
Parameter .80
0' / 3
- 1.08
- 1.32 , - 0.80
- 1.35 , - 0.81
0 . 31
95 % c. i .
the GHIS
4.
Results of fitting
data : posterior
0 . 24
0. 07 , 0. 76
0. 07 , 0. 62
0 .68
1 . 04
0.51 , 0.85
0. 76, 1.42
mean
95 % c. i . TABLE
baseline
- 1.06
mean
I
in
( 14- 16)
mean 95 % c. i .
Errors
( 13)
alternative
means
regression models to
and 95 % credible
intervals
We note the expected result : the coefficient I attached to the covari ate measured with error increases dramatically when that error is prop erly taken into account . Indeed , the 95% credible interval for / under the
594
DAVIDJ. SPIEGELHALTER ET AL.
- 1 -
- - -- - - - -
-
-
- - --
- - -
-
- - -.-
- - -- - .-
- -- -
- --
\ \ \ - - - -- \ . - - -
- ~
- -
- - -
--
-
- - -
.-
-
-
--
-
- -
--
- - --
\ \ \ \ \
I
""
\ \
~ \
I
~ ~
\ \ I
I
i I
------------------
I
-----
t ij
I I
I
I
!I
YiO
!
!I
Yij
blood -sample j
I
L - - - - --- - -- -
Figure 7.
infant i
,-
Graphical model for the GHIS data, showing dependenceon baseline titre ,
measured with error .
errors -in -baseline model does not contain the estimate for l' under the fixed baseline model . The estimate for 0' {3 from the errors -in -baseline model suggests that population variation in gradients probably does not have a major impact on the rate of loss of antibody , and results (not shown) from an analysis of a much larger subsample of the GHIS data confirm this . Setting 0' .0 == 0, with the plausible values of I == 1, f3o == - 1, gives the satisfyingly simple model :
titre at time t
1
ti tre at time 0 <X t ' which is a useful elaboration of the simpler model given by ( 1). 6. Conclusion We have provided a brief overview of the issues involved in applying MCMC to full probability modelling . In particular , we emphasize the possibility for constructing increasingly elaborate statistical models using 'local ' associations which can be expressed graphically and which allow strajghtforward
HEPATITISB: A CASESTUDYIN MCMC
595
implementation using Gibbs sampling . However , this possibility for complex modelling brings associated dangers and difficulties ; we refer the reader to other chapters in Gilks , Richardson and Spiegelhalter ( 1996) for deeper discussion of issues such as convergence monitoring and improvement , model checking and model choice. Acknowledgements
. IS
References
Best, N. G., Cowles, M . K . and Vines, S. K . (1995) CODA: ConvergenceDiagnosis and Output Analysis software for Gibbs Sampler output : Version 0.3. Cambridge : Medical Research Council Biostatistics Unit .
Breslow, N. E. and Clayton , D. G. (1993) Approximate inference in generalized linear
mixed
models
. J . Am . Statist
. Ass . , 88 , 9 - 25 .
Carlin , B. P. (1996) Hierarchical longitudinal modelling. In Markov Chain Monte Carlo in Practice (eds W . R. Gilks, S. Richardson and D. J. Spiegelhalter), pp . 303- 320. London : Chapman & Hall .
Clayton , D. G. (1996) Generalized linear mixed models. In Markov Chain Monte Carlo in Practice (eds W . R. Gilks, S. Richardson and D. J. Spiegelhalter), pp . 275- 302. London : Chapman & Hall . Coursaget , P., Yvonnet , B ., Gilks , W . R ., Wang , C . C ., Day , N . E ., Chiron , J . P.
and Diop-Mar , I . (1991) Scheduling of revaccinations against Hepatitis B virus
. Lancet
, 337 , 1180 - 3 .
Cox , D . R . and Solomon , P. J . ( 1986) Analysis of variability of small samples . Biometrika , 73 , 543- 54.
with large numbers
DuMouchel, W . and Waternaux, C. (1992) Discussionon hierarchical models for combining information and for meta-analyses (by C. N. Morris and S. L. Normand). In Bayesian Statistics 4 (eds J. M . Bernardo, J. O. Berger, A . P. Dawid and A . F. M . Smith), pp. 338- 341. Oxford: Oxford University Press. Gelfand, A . E. (1996) Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice (eds W . R . Gilks , S. Richardson and
D. J. Spiegelhalter), pp. 145- 162. London: Chapman & Hall . Gelfand, A . E., Smith , A . F. M . and Lee, T .-M . (1992) Bayesiananalysis of constrained parameter and truncated data problems using Gibbs sampling . J. Am . Statist
. Ass . , 87 , 523 - 32 .
596
DAVIDJ. SPIEGELHALTER ET AL.
Gelman, A. (1996) Inferenceand monitoringconvergence . In MarkovChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 131- 144. London: Chapman& Hall. Gelman, A. and Meng, X.-L. (1996) Model checkingand modelimprovement . In Markov ChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 189- 202. London: Chapman& Hall. Gelman, A. and Rubin, D. B. (1992) Inferencefrom iterative simulation using multiple sequences (with discussion ). Statist. Sci., 7, 457- 511. George , E. I. and McCulloch, R. E. (1996) Stochasticsearchvariableselection . In Markov ChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 203- 214. London: Chapman& Hall. Gilks, W. R. (1996) Full conditionaldistributions. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 75- 88. London: Chapman& Hall. Gilks, W. R. S. Richardsonand D. J. Spiegelhalter ). (1996) Strategiesfor improving MCMC. In Markov Chain Monte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 89- 114. London: Chapman& Hall. Gilks, W. R. and Roberts, G. O. (1996) Strategiesfor improving MCMC. In Markov ChainMante Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhal ter), pp. 89- 114. London: Chapman& Hall. Gilks, W. R., Richardson , S. and Spiegelhalter , D. J. (1996) Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ) , pp. 1- 20. London: Chapman & Hall. Gilks, W. R., Thomas, A. and Spiegelhalter , D. J. (1994) A languageand program for complexBayesianmodelling. TheStatistician, 43, 169- 78. Lauritzen, S. L., Dawid, A. P., Larsen, B. N. and Leimer, H.-G. (1990) Independ encepropertiesof directedMarkovfields. Networks , 20, 491- 505. Phillips, D. B. and Smith, A. F. M. (1996) Bayesianmodelcomparisonvia jump diffusions. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 215- 240. London: Chapman& Hall. Raftery, A. E. (1996) Hypothesistesting and modelselection . In Markov Chain Monte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegel halter), pp. 163- 188. London: Chapman& Hall. Richardson , S. (1996) Measurement error. In MarkovChainMonte Carlo in Practice (edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ) , pp. 401- 418. London: Chapman& Hall. Ripley, B. D. (1987) StochasticSimulation. NewYork: Wiley. Spiegelhalter , D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. G. (1993) Bayesiananalysisin expertsystems(with discussion ) Statist. Sci., 8, 219- 83.
HEPATITIS B: A CASESTUDYIN MCMC
597
Spiegelhalter, D. J., Thomas, A . and Best, N. C; . ( 1996) Computatioll on Bayesian graphical models. In Bayesian Statistics 5, (eds J. M. Bernardo, J. O. Berger, A . P. Dawid and A . F. M . Smith ,), -pp - 407- 425. Oxford: Oxford UniversityPress. Spiegelhalter, D. J., Thomas, A ., Best, N. G. and Gilks , W . R . ( 1994) BUGS : Bayesian inference Using GibbsSampling, Version 0.30. Cambridge : Medical ResearchCouncil Biostatistics Unit . Thomas
, In
D
. C
. and
Markov
and
D
Whittaker
.
,
Whittle
J .
J .
chester
:
, H
Gauderman
Chain
( 1990
. , lnskip
.
Appendix
a
:
is
a
can
be
Figure
Lancet
.
419
)
Gibbs
Practice
- 440
Models
. , Hall
, A
.
sampling ( eds
London
in
. J . , Mendy
Hepatitis ,
W :
337
,
which
language
747
provides
for from
- variables
Applied
B - 750
, M
and
. , Downes
protection
, &
S .
Hall
genetics
.
llichardson .
Analysis
.
Chi
-
,
R
. and
Hoare
viral
, S . ( 1991
carriage
in
)
The
.
running
the
model
model described
a
syntax
Gibbs
for sampling
description in
specifying
graphical
sessions shown
equations
( i
in
l
: I
beta [ iJ
7 -
*
} covariate
yO [ i ] muO [ i ]
with measurement error - dnorm ( muO[ i ] , tau ) ; - dnorm ( theta , phi ) ;
lines
beta [ i ] alpha
[i ]
dnorm ( betaO , tau . beta ) ; dnorm ( alphaO , tau . alpha ) ;
} # prior distributions tau - dgamma ( O.O1, O. O1) ; gamma alphaO
dnorm ( O, O.OOO1) ; dnorm ( O, O. OOO1) ;
.
below ( ~1 ,
) {
+
random
in
Gilks
against
for ( j in 1 : n [ i ] ) { y [i ,j ] - dnorm ( mu [ i , j ] , tau ) ; mu [ i , j ] <- alpha [ i ] + gamma*
#
.
Multivariate
{
#
R
Chapman
7 .
for
methods .
BUGS
obtained - in
, H
program
command
errors
Graphical
against
Gambia
pp
in
.
Vaccination
BUGS
) ,
)
. J . ( 1996
Carlo
Spiegelhalter
Wiley
. C
, W
Monte
An ,
10
,
models
idea
of
the
and syntax
corresponding 14
-
16
)
to a .nd
shown
the in
DAVIDJ. SPIEGELHA .LTERET AL.
598 betaO tau . beta tau . alpha theta phi
sigma ( - l / sqrt (tau ) ; sigma .beta <- 1/ sqrt (tau .beta ) ; sigma . alpha <- 1/ sqrt (tau . alpha ) ;
} The essential correspondencebetween the syntax and the graphical representation should be clear: the relational operator "'" correspondsto 'is distributed as' and <- to 'is logically defined by'. Note that BUGS parameterizesthe Gaussian distribution in terms of mean and precision (= l / variance). The program then interprets this declarative model description and constructs an internal representation of the graph, identifying relevant prior and likelihood terms and selecting a sampling method. Further details of the program are given in Gilks et at. (1994) and Spiegelhaltel' et ai. ( 1996). The software will run under UNIX and DOS, is available for a number of computer platforms , and can be freely obtained, together with a manual and extensive examples from http : / / www.mrc- bsu . cam. ac . uk/ bugs/ , or contact the authors at bugs~mrc- bsu . cam. ac .uk.
PREDICTION WITH GAUSSIAN PROCESSES : FROMLINEARREGRESSION TOLINEARPREDICTION ANDBEYOND C. K. I. WILLIAMS Neural ComputingResearchGroup, Aston University BirminghamB4 7ET, UK Abstract . The main aim of this paper is to provide a tutorial on regression with Gaussian processes. We start from Bayesian linear regression , and show how by a change of viewpoint one can see this method as a Gaussian process predictor based on priors over functions , rather than on priors over parameters . This leads in to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical mod elling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems .
1. Introduction In the last decade neural networks have been used to tackle regression and classification problems , with some notable successes. It has also been widely recognized that they form a part of a wide variety of non-linear statistical techniques that can be used for these tasks ; other methods include , for example , decision trees and kernel methods . The books by Bishop (1995) and Ripley ( 1996) provide excellent overviews . One of the attractions of neural network models is their flexibility , i .e. their ability to model a wide variety of functions . However , this flexibil ity comes at a cost , in that a large number of parameters may need to be determined from the data , and consequently that there is a danger of "overfitting " . Overfitting can be reduced by using weight regularization , but this leads to the awkward problem of specifying how to set the regularization parameters (e.g. the parameter Q in the weight regularization term QwT w for a weight vector w .) The Bayesian approach is to specify an hierarchical model with a prior distribution over hyperparameters such as Q, then to specify the prior distri 599
600
CI K. II WILLIAMS
bution of the weights relative to the hyperparameters . This is connected to data via an "observations " model ; for example , in a regression context , the value of the dependent variable may be corrupted by Gaussian noise. Given an observed dataset , a posterior distribution over the weights and hyperpa rameters (rather than just a point estimate ) will be induced . However , for neural network models this posterior cannot usually be obtained analyti cally ; computational methods used include approximations (MacKay , 1992) or the evaluation of integrals using Monte Carlo methods (Neal , 1996) . In the Bayesian approach to neural networks , a prior on the weights of a network induces a prior over functions . An alternative method of putting a prior over functions is to use a Gaussian process (GP ) prior over functions . This idea has been used for a long time in the spatial statistics community under the name of "kriging " , although it seems to have been largely ignored as a general-purpose regression method . Gaussian process priors have the advantage over neural networks that at least the lowest level of a Bayesian hierarchical model can be treated analytically . Recent work (Williams and Rasmussen, 1996, inspired by observations in Neal , 1996) has extended the use of these priors to higher dimensional problems that have been tradition ally tackled with other techniques such as neural networks , decision trees etc and has shown that good results can be obtained . The main aim of this paper is to provide a tutorial on regression with Gaussian processes. The approach taken is to start with Bayesian linear regression , and to show how by a change of viewpoint one can see this method as a Gaussian process predictor based on priors over functions , rather than performing the computations in parameter -space. This leads in to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical modelling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems . 2. Bayesian
regression
To apply the Bayesian method to a data analysis problem , we first specify a set of probabilistic models of the data . This set may be finite , countably infinite or uncountably infinite in size. An example of the latter case is when the set of models is indexed by a vector in ~m. Let a member of this set be denoted by 1l0 , which will have a prior probability P {1lo ) . On observing some data V , the likelihood of hypothesis 1la is P {VI1la ) . The posterior probability of 1 0 is then given by posterior
<X prior x likelihood
(1)
P (1laIV )
<X P (1la )P (VI1la ) .
(2)
PREDICTIONWITH GAUSSIAN PROCESSES
601
d
Figure 1. The four possiblecurves(labelled a, b, c and d) and the two data points (shownwith + signs).
The
P
an
proportionality
(
V
)
=
can
Eo
P
integration
VI1la
we
models
individual
models
prediction
is
P
(
turned
1la
)
into
(
where
)
now
say
asked
we
the
to
are
to
(
make
)
=
L
be
dividing
may
a
prediction
through
be
by
interpreted
using
some
for
y
equality
summation
as
.
predict
prediction
P
an
the
appropriate
are
;
be
)
where
Suppose
abilistic
(
P
y
(
yl1lo
is
quantity
given
)
by
P
(
1laIV
this
y
P
(
)
.
yl1la
.
set
Under
of
each
)
.
The
prob
of
-
the
combined
(
3
)
0
In this paper we will discuss the Bayesian approach to the regression problem, i .e. the discovery of the relationship between input (or independent) variables x and the output (or dependent) variable y. In the rest of this section we illustrate the Bayesianmethod with only a finite number of hypotheses; the caseof an uncountably infinite set is treated in section'3. We are given four different curves denoted fa (x ), fb(X), fc (x ) and fd (X) which correspond to the hypotheseslabelled by 1la, 1lb, 1lc and rid ; seethe illustration in Figure 1. Each curve 1la has a prior probability P (1la ); if there is no reason a priori to prefer one curve over another, then eachprior probability is 1/ 4. The data V is given as input -output pairs (Xl , tl ), (X2, t2), . . . , (xn , tn ). Assuming that the targets ti are generated by adding independent Gaussian noise of variance u~ to the underlying function evaluated at Xi, the likelihood of model1la (for a E { a, b, c, d} ) when the data point (Xi, ti ) is
602
C. K. I. WILLIAMS
observed is
P(tilxi,1{a) = (27r0 1v)1/22 exp - (ti - fa "2 O "v2(Xi))2.
(4)
The likelihood of each hypothesis given all n data points is simply lli P (ti lXi , 1 0). Let us assume that the standard deviation of the noise is much smaller (say less than 1/ 10) than the overall y-scale of Figure 1. On observing one data point (the left -most + in Figure 1), the likelihood (or data fit ) term given by equation 4 is much higher for curves a, band c than for curve d. Thus the posterior distribution after the first data point will now have less weight on 1ld and correspondingly more on hypotheses 1la , 1lb and 1lc . A second data point is now observed (the right -most + in Figure 1) . Only curve c fits both of of these data points well , and thus the posterior will have most of its mass concentrated on this hypothesis . If we were to make predictions at a new x point , it would be curve c that had the largest contribution to this prediction . From this example it can be seen that Bayesian inference is really quite straightforward , at least in principle ; we simply evaluate the posterior prob ability of each alternative , and combine these to make predictions . These calculations are easily understood when there are a finite number of hypotheses , and they can also be carried out analytically in some cases when the number of hypotheses is infinite , as we shall see in the next section . 3 . From
linear
regression
...
Let us consider what may be called "generalized linear regression" , which we take to mean linear regression using a fixed set of m basis functions { (x ) , for some vector of "weights " w . If there is a Gaussian prior distribution on the weights and we assume Gaussian noise then there are two equivalent ways of obtaining the regression function , (i ) by performing the computations in weight -space, and (ii ) by taking a Gaussian process view . In the rest of this section we will develop these two methods and demon strate their equivalence .
3.1. THE WEIGHT-SPACEVIEW Let the weights have a prior distribution which is Gaussian and centered
on the origin, w rv N(O, ~w), i.e.
P(w)=(27r 1:Ewll {-2w IT }J;1w}. )m /21 /2exp
(5)
604 We
C. K. I. WILLIAMS
can
also
mean
is
obtain
gi
ven
O' ~
To
( x
)
to
In
a
the
JRP
( Xl
)
,
of
J - Lws
.
for
the
variance
about
the
)
( W
-
WMP
) TJC
/ > ( x
*
)
*
)
it
is
necessary
due
to
to
the
add
noise
,
(
14
)
(
15
)
(
16
)
O' ~
to
since
the
.
regression
is
and
in
the
discussed
Tiao
(
in
1973
.
problem
.
to
the
)
most
texts
.
( x
)
is
Y
( Xk
)
)
in
a
covariance
matrix
and
.elow
x
- space
be
any
consider
A
finite
only
vari
usually
.
general
-
finite
processes
by
giving
of
Gaussian
a
stochas
any
Gaussian
subset
-
be
of
specified
-
view
random
will
.
way
can
for
x
di
points
- space
of
,
deal
the
function
distributions
consistent
that
at
collection
b
probability
processes
to
values
a
of
through
possible
or
considered
the
described
also
function
dimensionality
,
is
process
Y
cases
was
It
the
stochastic
giving
stochastic
the
weights
respect
the
.
( x
Box
the
'
)
.
t
process
,
) 2 ]
variance
linear
is
is
)
WMP
uncertainty
In
)
.
.
VIEW
by
( X2
-
/ > ( x
example
This
.
( x
var
stochastic
x
function
- space
that
can
.
random
bination
the
equation
5
view
be
Y
.
Y
( x
stochastic
at
)
=
only
points
.
processes
the
In
fact
which
have
is
are
is
clear
shown
set
Wj
< pj
point
( x
)
,
is
indicates
,
the
a
I ' V
weights
and
with
input
simply
W
the
mean
kinds
functions
in
where
that
the
the
basis
x
which
W
consider
of
space
linear
N
are
covariance
com
( O
,
~
w
-
)
as
viewed
as
functions
for
,
Ew
it
we
fixed
particular
calculate
Ew
it
j
W
can
regression
a
variables
notation
We
process
since
~
a
random
( The
. )
linear
from
value
Gaussian
variables
of
generated
The
variable
of
random
functions
-
lc
over
p
weights
variables
-
to
specialize
the
fact
;
.
random
In
)
uncorrelated
the
A
Y
.
[ ( w
A
with
further
function
the
prediction
additional are
and
mean
In
the
specified
vector
shall
this
variance
where
subset
mean
in
)
,
.
( Y
a
a
.
in
is
subset
is
( x
by
in
of
c/ >T
uncertainty
process
zero
=
section
problem
vector
for
( x
- SPACE
indexed
we
) Ewlv
interested
the
are
.
distribution
are
tic
( x
variation
previous
ables
[ ( y
approach
with
of
c/ >T
for
probability
we
=
FUNCTION
rectly
"
Ewlv
statistics
THE
bars
=
Bayesian
Bayesian
.
error
predictive
of
The
. 2
)
account
sources
on
.
the
*
two
3
( x
obtain
O' ~
"
by
[ Y
( x
[ Y
( x
) Y
derived
( x
as
that
Y
in
is
Figure
) ]
=
0
' ) ]
=
ct > T
a
a
linear
)
wct
> ( x
)
,
using
' )
of
process
( b
~
combination
Gaussian
2
( x
.
the
Some
basis
.
Gaussian
examples
functions
(
17
)
(
18
)
random
of
shown
sample
in
PREDICTION WITHGAUSSIAN PROCESSES
605
(b )
(a)
Figure 2. (a) shows four basis functions , and (b) shows three sample functions generated by taking different linear combinations
of these basis functions .
Figure 2(a). The sample functions are obtained by drawing samplesof W from N (O, ~ w). 3 .3 .
PREDICTION
USING
GAUSSIAN
PROCESSES
We will first consider the general problem of prediction using a Gaussian process, and then focus on the particular Gaussian process derived from linear regression . The key observation is that rather than deal with com plicated entities such as priors on function spaces , we can consider just the function values at the x points that concern us , namely the training points Xl , . . . , Xn and the test point x * whose y -value we wish to predict . Thus we need only consider finite - dimensional objects , namely covariance matrices .
Consider n + 1 random variables (Zl , Z2, . . . , Zn , Z * ) which have a joint Gaussian
distribution
with
mean
0 and
covariance
matrix
K + . Let
the
(n + 1) x (n + 1) matrix K + be partitioned into a n x n matrix K , a n x 1 vector
k and
a scalar
k*
K+=( :r ~)
(19 )
If particular values are observed for the first n variables , i .e. Zl = Zl , Z2 = Z2, . . . , Zn = Zn, then the conditional distribution for Z * is Gaussian (see, e.g. von Mises , 1964, section 9.3) with
E [Z*] = var[Z*] =
kTK - lz k* - kTK - lk
(20) (21)
where zT = (Zl , Z2, . . . , Zn). Notice that the predicted mean (equation 20) is a linear
combination
of the z 's .
C. K. I. WILLIAMS
606
3.4. LINEAR REGRESSIONUSING THE FUNCTION-SPACEVIEW Returning to the linear regressioncase, we are interested in calculating the distribution of Y (x *) = Y. , given the noisy observations (tl ' . . . ' in ). To do this we calculate the joint distribution of (Tl , T2, . . . , Tn, Y*) and then condition on the specific values Tl = tl , T2 = t2, . . . , Tn = tn to obtain the desired distribution P (Y. It ). Of coursethis distribution will be a Gaussian so we need only compute its mean and variance. The distribution for (TI , T2, . . . , Tn, Y. ) is most easily derived by first considering the joint distribution of Y + = (YI , Y2, . . . , Yn, Y. ). Under the linear regression model this is given by Y + ~ N (O, +Ew~ ), where <1>+ is an extended matrix with an additional bottom row c!>. = c!>(x . ) = PI (x . ), m(x . ) ). Under the partitioning used in equation 19, the structure of + ~ w~ can be written
~wct >* ct>; EwT ct >; ~wct >*
< I>+Ew < I>~=( < I>Ew < I>T The T
joint ' s
are
x
n
and n
distribution
hence
that
matrix
( The
a
reason
,
corrupting
( Tl
, . . . Tn
~ In
Tl
var
where
we
basis
functions
it
the will
the
a P
By
,
then
=
tl
N
a
~ In
, T2
=
t2
be
( O , +
the
not
can
found
corresponding
r....I
~
right
+
l
is
w
~
and that =
by Y
' s
+
E
realizing
with
+
tn
E
+
bands
predicting
and
the noise
where
by
are
that
Gaussian
) ,
bottom
we
, . . . , Tn
=
J. L / s ( x
[Y * ]
=
c / >;
P
)
is
be
T
multiplying can
with
= = T
+
through be
substituted
zero
c/ >;
+
f3 T Ew
A into
w
~
T
w
is
of Y * ,
using
so .
u ~ In
) . of
that
,
the
zeros
not
equations
T 20
. * . )
and
-
l <1> ~
wc
/ >*
that
if
the
( n
) ,
points will
be
rank
probability ,
the
addition
( 23
)
( 24
)
number as
of
will
often
- deficient of
points
of
O' ~ In
,
i .e .
lying ensures
. these
results
obtained 13
equation
p
- l 23
we
for in
we
= = ( 3 T ( O' ~ I
and
lt
Note
the
expressions
equation
p
T
in
- I
-
data
However
that
T
p
T
Ew
definite show
as
~
number
, be
the
by
-
matrix
to
mean
cJ >~
T
the
will
now
=
/ >*
I > Ew
than
positive
the
wc
eigenvalues
strictly is
for
~
* )
covariance
zero
consistent
formula
=
less
subspace
challenge are
[Y * ]
the
some linear
will
AEw
which
, Y * ) on
is
defined ( m
have
The ance
have
case
outside that
+
, Y *
the
(22)
obtain
E
be
, . . . Tn
bordered E
on
we
Tl
by
that
Conditioning 21
for
obtained
).
note
+
obtain to
yield
Ew
the
mean
section
and
3 . 1 .
To
vari
-
obtain
that
T )
=
~ w T p the
desired
{ 3 T P .
- l
=
, aA
result
( 25
)
- l T , .
PREDICTIONWITH GAUSSIAN PROCESSES The by
equivalence
using
1992
,
the
of
section
A
- I
=
+
YZ
( ~ ~ I
+
may
be
Given it
is
the
that
.
In
one
less
time
weight
tenable
4 .
. . .
As
we
just
a
linear
m
a
zero
- mean
covariance definite
on
One
an
lIn tions functions
be
isotropic
fact
Bochner
C ( h ) which .
x
) - IZX
to
give
et
- 1
equation
- space
is
m
n
x
al ,
( 26
)
( 27
)
P
thus
and
,
the
- space
inversion
of
which
takes
m
n
of
function
invert
function
method
kinds
the
form
. As
the
other
equivalent more
the
problems
for
and
to
choose
regression
are
for
matrix
.
computationally
necessary
to
24
views is
. Similarly n
. But
section
3 .3 ,
Gaussian of and
how opens
is
x ' .
In
in class
=
C
( lx
and
linear
so
a
the
prediction
- space
derived
is
as
the
probability
' s theorem are
the
x ' l)
continuous
as
view
is
of
E
points
or
a
the
y - value said
as
a
linear the
.
-
(x , x ' )
of
( x ' ) ] . Formally a
non
, . . . , Xk
condition
In
function
covariance C
generate , X2
be
regression
function (x ) Y
to
predictor
from
specifying
[Y
Gaussian
is
linear
seen ,
( Xl
this
general predicted
method
be
will
obey
a
, the - negative
) .
It
, although
is
non
-
several
.
C
and
( h ) , where
characteristic .
( see , e . g . Wong 0
and
isotropic
For
, 1971
satisfy
covariance
1 . 1 denotes function
densityl
at
of
that
stationary =
the )
a
the
covariance
defined
set
;
1990
way
function
that
is
can
the
any
prior , then
possibilities one
literature
-
,
further
( x ) is
functions the
t - values
just
any for
with
the
general Y
be
the
model
regression
up
weights
and
if
noise
Tibshirani
linear
matrix up
(x , x ' )
- Iy
methods
natural
linear
in a
process
- known
can
proved
. . .
can
well
These
infinite
function
known
C
be
Gaussian
are
where
can
the x
come
families
preferred
This
covariance to
be
seen .
points
trivial
simple
will
( Hastie
have
prior
between
be , Press
~ wcI >T ) - IcI > ~ w ,
it
the is
combination
viewpoint
with
can
formula
ZX
function two
m
invert
seen
smoother
space
and
( l3 ) , it
assume
linear
we
+
approach
for
already
3
+
16
the
dimension
0
(I
equation
of
and
- IY
~ w T ( O' ~ I
prediction
we
some
section
variance
.
linear
and
is
the
Woodbury
obtain
- space
time
one
have
-
X
- space
of
form
4 )
to
process
for ( the
-
to
which
is
view
- I
- 1
weight ask
. Usually
only
X
~ w
weight
to
- space section
=
into
which
takes
( see
of
the
has
1 matrix
=
the
A
1 x
- I
to
matrix
view
expressions identity
substituted
interesting
efficient
) - I
{ 3 <1>T
A
which
two
matrix
2 .7 )
( X
on
the
following
607
( or
example
) states 0 (0 ) =
the
C
that 1 are
functions
Euclidean
Fourier
( h )
the exactly
norm transform
=
exp
positive the
( -
. )
( h / u ) V )
definite characteristic
func
-
C. K. I. WILLIAMS
608 is
a
valid
covariance
corresponding v
=
1
and
length
v
=
- scale
those
2
of
, =
paths
function
mean
- square
20
and
up
of
to
21
;
to
come
from
or
a
as
Wo
)
=
e -
some
<
v
~
correlation
functions
may
2 ,
when
the
have
( e .g .
no
preferred
splines
widely
are
Cressie
splines power
lhl
.
On
=
)
/J( x
that
)
with
sample
the
that
,
very other
- line
noted
I - d
has the
wTc
straight
paths
a
the
, ;
is
is
covari
are
infinitely
use
equations
-
is
not
on
her )
Gaussian
.
a
processes
"
with
,
-
and
although
of
spline
Kimeldorf
and
overview
a
) .
- dimensional use
to
useful
; and
pro
( J ournel
three
back
' s ( 1963
" kriging
and
.
topic
1940
Gaussian
the
dates
provides
recent the
field
-
below
Whittle
promoting
work
5 .2
very
in
as
assumed
estimation
models
two
in
is
section
a
made
covariance
but
in
known
mostly
( 1990
in
geostatistics
it
the
)
discussed
the
If
data
be
likelihood
process
where
;
.
practice
Kolmogorov are
the
will
term
in
discussed
Gaussian
problems
simply
certainly
influential
Wahba
noise
and
focussed
characterize
maximum
this
in )
to
; we
function
case
then
Wiener
been
to
points
and
are
has
prediction in
.
A
- law
WM
the in
covariance function
a
be
in
e -
.
particular
Essen
choice
of
3 .
- free
3Technically
y
to
covariance
taken
1993
has
although
stationary
covariance
assumed
test
be
known ,
although
noise
is
processes
well
used ) ,
sample
regression
process
1989
example
from
leads
also
overall
be
series
also
function
Gaussian
the
to
correspond
covariance
arises )
to
new
class
regression ) ,
rise
properties
For
differentiable
should
always
back
naturally
( 1970
different .
function
( which
which
can
Wahba
for
It
for
will
time
;
'
gives
multivariate
1978
.
techniques
have
h2
Gaussian
is
spaces
2For
sets
very
- square
function
goes
literature
tions
)
have
( 0 ' 5 , O' r ) ) .
parametric
for
,
is
0'
covariance
covariance
O' rxx
WIX
3 .4
( as
to
,
0
.
with
Huijbregts
+
( O , diag
covariance
prediction
aI
case
function
mean
function
theory
can
not
+
section
models
and
for distributions
densities2
( with
0' 5 N
=
unknown
applications
tially
= "' -'
approach
basic
Wahba
this
covariance
predictions
in
Prediction
input
' )
y
make
Bayesian
this
, x w
( h
prior
is
ARMA
and
Gaussian
other
spectral
the
are
covariance
easy
the
cess
( x
and
C
function
the
in
although
process
form
a
is
,
p
and
processes
differentiable
Given
, it
field
of
which
C ) T
the
ance
dimensions
that
- law
choice
paths
( 1 , x of
Note
power
- Uhlenbeck
choosing
)
.
Gaussian the
sample
l/J( x
they
random
from
Ornstein
hand
the
respectively
the
on
rough
all Cauchy
.
Samples
the
for
multivariate
to
scale
depending
et
the
corresponding
length
V
function
to
analysis this
also of
suggested computer
application
connection
to
functions
it
neural
the
by
spectral
O
' Hagan
experiments is
assumed
( e .g that
networks
was
density
is
the
( 1978
the
made
Fourier
) ,
Sacks
observa
by
-
Poggio
transform
of
. require spectral
generalized density
covariance S ( c,,; )
cx: c,,; - { 3 with
functions .B >
( see O .
Cressie
5 .4 ) ,
and
WITHGAUSSIAN PROCESSES PREDICTION
609
and Girosi (1990) and Girosi, Jones and Poggio (1995) with their work on Regularization Networks. When the covariancefunction C(x , x ' ) depends only on h == Ix - xii , the predictor derived in equation 20 has the form ~ i ciC (lx - xii ) and may be called a radial basisfunction (or RBF ) network. 4.1. COVARIANCE FUNCTIONS AND EIGENFUNCTIONS It turns out that general Gaussian processes can be viewed as Bayesian linear regression with an infinite number of basis functions . One possible basis set is the eigenfunctions of the covariance function . A function
C(x , x')l/J(x)dx = Al/J(X')
(28)
is called an eigenfunction of C with eigenvalue A. In general there are an infinite number of eigenfunctions, which we label
00 C(x, X') ==L Ai< i=l />i(X)>i(X').
(29)
-
This decomposition is just the infinite -dimensional analogue of the diagonalization of a real symmetric matrix . Note that if n is JRP , then the summation in equation 29 can become an integral . This occurs , for example , in the spectral representation of stationary covariance functions . However , it can happen that the spectrum is discrete even if n is IRPas long as C (x , x ' ) decays fast enough . The equivalence with Bayesian linear regression can now be seen by tak ing the prior weight matrix ~ w to be the diagonal matrix A == diag (Al , A2, . . .) and choosing the eigenfunctions as the basis functions ; equation 29 and the equivalence of the weight -space and function -space views demonstrated in section 3 completes the proof . The fact that an input vector can be expanded into an infinite -dimensional space PI (x ) ,
610
C. K. I. WILLIAMS
to Gaussian noise) , a modified version of the 11error metric Iti - Yi I is used, called the e-insensitive loss function . Finding the maximum a posteriori (or MAP ) y-values for the training points and test point can now be achieved using quadratic programming (see Vapnik , 1995 for details ). 5 . . . . and beyond
...
In this section some further details are given on the topics of modelling issues, adaptation of the covariance function , computational issues, the covariance function for neural networks and classification with Gaussian processes.
5.1. MODELLING ISSUES As we have seen above, there is a wide variety of covariance functions that can be used. The use of stationary covariance functions is appealing as usually one would like the predictions to be invariant under shifts of the origin in input space. From a modelling point of view we wish to specify a covariance function so that nearby inputs will give rise to similar predictions . Experiments in Williams and Rasmussen (1996) and Rasmussen (1996) have demonstrated that the following covariance function seems to work well in practice :
C(X(i),X(j))
Voexp{- ~ tl=1Ql(X~i) - x~j))2} p +ao+ a1~ X~i)X~j) + v1r5 (i ,j ), l=1
(30)
where Od,;} (log vo, log VI , log aI , . . . , log ap , log ao, log al ) is the vector of adjustable parameters . The parameters are defined to be the log of the vari ables in equation (30) since they are positive scale-parameters . The covariance function is made up of three parts ; the first term , a linear regression term (involving ao and al ) and a noise term VI8(i , j ). The first term expresses the idea that cases with nearby inputs will have highly correlated outputs ; the a [ parameters allow a different distance measure for each input dimension . For irrelevant inputs , the corresponding al will become small , and the model will ignore that input . This is closely related to the Automatic Relevance Determination (ARD ) idea of MacKay and Neal (MacKay , 1993; Neal 1996) . The Vo variable gives the overall scale of the local correlations , ao and al are variables controlling the scale of the bias and linear contributions to the covariance . A simple extension of the linear regression part of the covariance function would allow a different
5.2. ADAPTATION OF THE COVARIANCE FUNCTION

It is also possible to express the partial derivatives of the log likelihood l with respect to the parameters analytically,

∂l/∂θ = -1/2 tr(K^{-1} ∂K/∂θ) + 1/2 t^T K^{-1} (∂K/∂θ) K^{-1} t,   (32)

which makes it straightforward to feed l and its partial derivatives to a standard optimization package in order to obtain a (local) maximum of the likelihood; each evaluation of the likelihood and its derivatives takes time O(n^3). The maximum likelihood estimation of covariance parameters in spatial regression is discussed, for example, by Mardia and Marshall (1984) and Handcock and Stein (1993). An alternative to maximum likelihood is to estimate the parameters by cross-validation (CV) or generalized cross-validation (GCV), as discussed in Wahba (1990). However, these methods may not work well when a large number of parameters are involved, the estimates may be poorly determined if the amount of data is small relative to the number of parameters, and the likelihood surface may have local maxima.

The Bayesian approach is attractive for these reasons. It proceeds by defining a prior distribution over the parameters θ and, given the data D, making predictions for a new test point x_* by averaging over the posterior distribution P(θ|D), i.e.

P(y_*|D) = ∫ P(y_*|θ, D) P(θ|D) dθ.   (33)

In general the integral in equation 33 cannot be evaluated analytically. If θ is low-dimensional then numerical integration techniques based on gridding θ-space (or importance sampling) may be used, but when the number of parameters is large this is not feasible, and the integral is usually approximated using samples from P(θ|D) obtained by Markov chain Monte Carlo (MCMC) methods. These work by constructing a Markov chain whose equilibrium distribution is the desired posterior P(θ|D); once the chain has reached equilibrium, the samples it generates are used in place of the integral in equation 33. Two standard MCMC techniques are the Metropolis-Hastings algorithm and Gibbs sampling (see, e.g., Gelman et al., 1995). Gibbs sampling is not attractive here because the conditional distributions of the parameters do not have a standard form, and the Metropolis-Hastings algorithm does not utilize derivative information and tends to display random-walk behaviour, which makes it inefficient. Following the work of Neal (1996) on the Bayesian treatment of neural networks, Williams and Rasmussen (1996) and Rasmussen (1996) have used the Hybrid Monte Carlo method of Duane et al. (1987) to obtain samples from P(θ|D). Rasmussen (1996) carried out a careful comparison of the Bayesian treatment of Gaussian process regression with several other state-of-the-art methods on a number of problems and found that its performance is comparable to that of Bayesian neural networks as developed by Neal (1996), and consistently better than the other methods.
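A minimal sketch of how equation 32 can be used in practice is given below. It is not code from the chapter: the covariance K_ij = v0 exp(-0.5 w |x_i - x_j|^2) + v1 δ_ij and its three log-parameters are illustrative assumptions, chosen only to show the pattern of handing the log likelihood and its gradient to an off-the-shelf optimizer. Working with log-parameters keeps the scale parameters positive, as in equation 30.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(log_params, X, t):
    """Negative GP log likelihood, l = -1/2 log|K| - 1/2 t^T K^{-1} t - n/2 log(2 pi),
    and its gradient via equation (32), for the illustrative covariance
    K = v0 exp(-0.5 w |x_i - x_j|^2) + v1 I with parameters (log v0, log w, log v1)."""
    v0, w, v1 = np.exp(log_params)
    n = len(t)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    E = np.exp(-0.5 * w * sq)
    K = v0 * E + v1 * np.eye(n)
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ t
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * logdet + 0.5 * t @ alpha + 0.5 * n * np.log(2 * np.pi)
    # dK/d(log theta) = theta * dK/dtheta for the log parameterization
    dK = [v0 * E, v0 * E * (-0.5 * sq) * w, v1 * np.eye(n)]
    grad = np.array([0.5 * np.trace(Kinv @ D) - 0.5 * alpha @ D @ alpha for D in dK])
    return nll, grad

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (30, 1))
t = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_likelihood_and_grad, x0=np.zeros(3), args=(X, t), jac=True)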
5.3. COMPUTATIONAL ISSUES
Equations 20 and 21 require the inversion of an n x n matrix. When n is of the order of a few hundred this is quite feasible with modern computers. However, once n ~ O(1000) these computations can become quite time consuming, especially if the calculation must be carried out many times in an iterative scheme, as discussed above in section 5.2. It is therefore of interest to consider approximate methods.
One possible approach is to approximate the matrix inversion step needed for prediction, i.e. the computation of K^{-1}z in equation 20. Gibbs and MacKay (1997a) have used the conjugate gradients (CG) algorithm for this task, based on the work of Skilling (1993). The algorithm iteratively computes an approximation to K^{-1}z; if it is allowed to run for n iterations it takes time O(n^3) and computes the exact solution to the linear system, but by stopping the algorithm after k < n iterations an approximate solution is obtained. Note also that when adjusting the parameters of the covariance matrix, the solution of the linear system that was obtained with the old parameter values will often be a good starting point for the new CG iteration.
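As a rough illustration of this idea (not code from Gibbs and MacKay), one can hand the linear system K a = t to an off-the-shelf conjugate gradient solver and cap the number of iterations; the covariance used here is just a placeholder squared exponential plus noise.

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.standard_normal((n, p))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Placeholder covariance matrix: squared exponential plus noise.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 0.01 * np.eye(n)

# Approximate K^{-1} t by stopping CG after a fixed number of iterations;
# x0 could be set to the solution obtained with the previous parameter values.
a_approx, info = cg(K, t, maxiter=50)
a_exact = np.linalg.solve(K, t)
print(np.linalg.norm(a_approx - a_exact) / np.linalg.norm(a_exact))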
When adjusting the parameters one also needs to be able to calculate quantities such as tr(K^{-1} ∂K/∂θ_i). For large matrices this computation can be approximated using the "randomized trace method". Observe that if d ~ N(0, I_n), then E[d^T M d] = tr M, and thus the trace of a matrix M may be estimated by averaging d^T M d over several draws of d. This method has been used in the splines literature by Hutchinson (1989) and Girard (1989), and also by Gibbs and MacKay (1997a) following the independent work of Skilling (1993). Similar methods can be brought to bear on the calculation of log det K.
An alternative approximation scheme for Gaussian processes is to project the covariance kernel onto a finite number of basis functions, i.e. to find the Σ_w in equation 5 which leads to the best approximation of the covariance function. This method has been discussed by Silverman (1985), Wahba (1990), Zhu and Rohwer (1996) and Hastie (1996).
However, if we can also choose which m eigenfunctions to use in a truncated expansion, then equation 29 suggests using those with the largest eigenvalues; this makes sense when the eigenvalues λ_i decay to zero fast enough, and it is an infinite-dimensional analogue of principal components analysis, as discussed by Zhu et al. (1997). Other methods speed up the computations by ignoring data points which are far away from the test point, thereby producing a smaller matrix to be inverted.

5.4. THE COVARIANCE FUNCTION OF NEURAL NETWORKS

My own interest in Gaussian processes was sparked by the observation of Radford Neal (Neal, 1996) that, under certain weight priors, the prior over functions produced by a Bayesian treatment of a neural network with a single hidden layer tends to a Gaussian process as the number of hidden units tends to infinity. In the remainder of this section this result is outlined and some of its consequences are noted.

Figure 4. The architecture of a network with a single hidden layer.

Consider a network which takes an input x, has one hidden layer with H units and then linearly combines the outputs of the hidden units with a bias b to obtain f(x). The mapping can be written

f(x) = b + Σ_{j=1}^{H} v_j h(x; u_j),   (34)

where h(x; u) is the hidden unit transfer function (which we shall assume is bounded) and u_j denotes the input-to-hidden weights for hidden unit j. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let b and the v_j's have independent zero-mean distributions of variance σ_b^2 and σ_v^2 respectively, and let the weights u_j for each hidden unit be independently and identically distributed. Denoting all of the weights collectively by w, we obtain (following Neal, 1996)

E_w[f(x)] = 0,   (35)
E_w[f(x) f(x')] = σ_b^2 + Σ_j σ_v^2 E_u[h_j(x; u) h_j(x'; u)]   (36)
               = σ_b^2 + H σ_v^2 E_u[h(x; u) h(x'; u)],   (37)

where equation 37 follows because all of the hidden units are identically distributed. The final term in equation 37 becomes ω^2 E_u[h(x; u) h(x'; u)] by letting σ_v^2 scale as ω^2/H.

The sum in equation 34 is then a sum over H independent, identically distributed and bounded random variables, so as H tends to infinity the Central Limit Theorem applies and the joint distribution of f at any finite set of input points converges to a Gaussian; that is, the prior over functions converges to a Gaussian process with covariance function given by equation 37. For certain weight priors and transfer functions the expectation E_u[h(x; u) h(x'; u)], and hence the covariance function of the limiting Gaussian process, can be calculated analytically; Williams (1997a, 1997b) gives explicit expressions for Gaussian weight priors with (i) a sigmoidal transfer function of the error-function type and (ii) Gaussian transfer functions.

A Bayesian treatment of a finite neural network requires integration over both the weights and the (hyper)parameters. The posterior over the weights is a complicated distribution in a high-dimensional weight space, and in general these integrals can only be tackled with MCMC
methods. With GPs, in effect the integration over the weights can be done exactly (using equations 20 and 21), so that only the integration over the parameters remains. This should lead to improved computational efficiency for GP predictions over neural networks, particularly for problems where n is not too large.
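The convergence described in section 5.4 is easy to check numerically. The sketch below is not from the chapter: the Gaussian weight priors, the erf transfer function and all variances are illustrative choices. It compares the empirical covariance of many wide random networks at two inputs with the limit σ_b^2 + ω^2 E_u[h(x; u) h(x'; u)], the latter estimated by Monte Carlo.

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(4)
H, n_nets = 1000, 2000                 # hidden units per network, number of random networks
sigma_b2, omega2, sigma_u2 = 1.0, 1.0, 10.0
x = np.array([0.3, -0.5])
xp = np.array([0.1, 0.4])

# Draw many independent single-hidden-layer networks, equation (34),
# with v_j ~ N(0, omega^2 / H) so that H * sigma_v^2 = omega^2.
U = rng.normal(0.0, np.sqrt(sigma_u2), size=(n_nets, H, 2))   # input-to-hidden weights
V = rng.normal(0.0, np.sqrt(omega2 / H), size=(n_nets, H))    # hidden-to-output weights
b = rng.normal(0.0, np.sqrt(sigma_b2), size=n_nets)           # output biases
f_x = b + (V * erf(U @ x)).sum(axis=1)
f_xp = b + (V * erf(U @ xp)).sum(axis=1)
empirical_cov = np.mean(f_x * f_xp)

# Limiting covariance sigma_b^2 + omega^2 * E_u[h(x;u) h(x';u)], estimated by Monte Carlo.
u = rng.normal(0.0, np.sqrt(sigma_u2), size=(200000, 2))
limit_cov = sigma_b2 + omega2 * np.mean(erf(u @ x) * erf(u @ xp))

print(empirical_cov, limit_cov)   # these should agree closely for large H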
5.5. CLASSIFICATION PROBLEMS

Given an input x, the aim of a classifier is to produce an estimate of the posterior probabilities for each class, P(k|x), where k = 1, ..., C indexes the C classes. Naturally we require that 0 ≤ P(k|x) ≤ 1 for all k and that Σ_k P(k|x) = 1. A naive application of the regression method for Gaussian processes using, say, targets of 1 when an example of class k is observed and 0 otherwise will not obey these constraints. For the two-class classification problem it is only necessary to represent
P(1|x), since P(2|x) = 1 - P(1|x). An easy way to ensure that the estimate π(x) of P(1|x) lies in [0, 1] is to obtain it by passing an unbounded value y(x) through the logistic transfer function σ(z) = 1/(1 + e^{-z}), so that π(x) = σ(y(x)). The input y(x) to the logistic function will be called the activation. In the simplest method of this kind, logistic regression, the activation is simply computed as a linear combination of the inputs plus a bias, i.e. y(x) = w^T x + b. Using a Gaussian process or other flexible methods allows y(x) to be a non-linear function of the inputs. An early reference to this approach is the work of Silverman (1978). For the classification problem with more than two classes, a simple extension of this idea using the "softmax" function (Bridle, 1990) gives the predicted probability for class k as
π(k|x) = exp y_k(x) / Σ_m exp y_m(x).   (38)

For the rest of this section we shall concentrate on the two-class problem; extension of the methods to the multi-class case is relatively straightforward.
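A small numerical illustration of equation 38 (not from the chapter), which also shows that for two classes the softmax reduces to the logistic function applied to the difference of the activations:

import numpy as np

def softmax(y):
    """pi(k|x) = exp(y_k) / sum_m exp(y_m), computed stably."""
    e = np.exp(y - y.max())
    return e / e.sum()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.7, -0.4])        # activations y_1(x), y_2(x)
print(softmax(y))                # two class probabilities
print(logistic(y[0] - y[1]))     # equals softmax(y)[0]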
Defining a Gaussian process prior over the activation y(x) automatically induces a prior over π(x), as illustrated in Figure 5. To make predictions for a test input x_* when using fixed parameters θ in the GP we would like to compute π̄_* = ∫ π_* P(π_*|t, θ) dπ_*, which requires us to find P(π_*|t, θ) = P(π(x_*)|t, θ) for the new input x_*. This can be done by finding the distribution P(y_*|t, θ) (y_* is the activation of π_*) as given by
P(y_*|t, θ) = ∫ P(y_*, y|t, θ) dy = (1/P(t|θ)) ∫ P(y_*, y|θ) P(t|y) dy   (39)
Figure 5. π(x) is obtained from y(x) by "squashing" it through the sigmoid function σ.
and then using the appropriate Jacobian to transform the distribution. When P(t|y) is Gaussian the integral in equation 39 can be computed exactly, giving equations 20 and 21. However, for classification data (where the t's take on values of 0 or 1) the usual expression P(t|y) = Π_i π_i^{t_i} (1 - π_i)^{1-t_i} means that the marginalization to obtain P(y_*|t, θ) is no longer analytically tractable.
Faced with this problem there are two routes that we can follow: (i) to use an analytic approximation to the integral in equation 39, or (ii) to use Monte Carlo methods, specifically MCMC methods, to approximate it. These two methods will be considered in turn.
The first analytic method we shall consider is Laplace's approximation, where the integrand P(y_*, y|t, θ) is approximated by a Gaussian distribution centered at a maximum of this function with respect to y_*, y, with an inverse covariance matrix given by -∇∇ log P(y_*, y|t, θ). Finding a maximum can be carried out using the Newton-Raphson (or Fisher scoring) iterative method on y, which then allows the approximate distribution of y_* to be calculated. This is also the method used to calculate the maximum a posteriori estimate of y_*. For more details see Green and Silverman (1994), section 5.3, and Barber and Williams (1997).
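A compact sketch of the mode-finding step just described is given below. It is illustrative rather than the chapter's algorithm: it assumes a logistic likelihood with 0/1 targets t and a covariance matrix K over the training activations y, and iterates the Newton-Raphson update for the (unnormalized) log posterior log P(t|y) - 1/2 y^T K^{-1} y. The conditioning step that then gives the approximate Gaussian over y_* is omitted.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, t, num_iter=20):
    """Newton-Raphson search for the mode of P(y | t, theta) under a GP prior
    with covariance K over the activations and a logistic (Bernoulli) likelihood.
    Returns the mode y_hat and the diagonal weight matrix W of the Laplace fit."""
    n = len(t)
    Kinv = np.linalg.inv(K)
    y = np.zeros(n)
    for _ in range(num_iter):
        pi = logistic(y)
        W = np.diag(pi * (1.0 - pi))
        grad = (t - pi) - Kinv @ y               # gradient of the log posterior
        y = y + np.linalg.solve(Kinv + W, grad)  # Newton step; Hessian is -(Kinv + W)
    return y, W

# Toy example: 1-d inputs, squared exponential covariance (an illustrative choice).
rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(40)
t = (x + 0.3 * rng.standard_normal(40) > 0).astype(float)
y_hat, W = laplace_mode(K, t)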
An alternative analytic approximation is due to Gibbs and MacKay (1997b). Instead of using the Laplace approximation they use variational methods to find approximating Gaussian distributions that bound the marginal likelihood P(t|θ) above and below, and then use these approximate distributions to predict P(y_*|t, θ) and thus π̄(x_*).

For the analytic approximation methods there is also the question of what to do about the parameters θ. Maximum likelihood and GCV approaches can again be used as in the regression case (e.g. O'Sullivan et al., 1986). Barber and Williams (1997) used an approximate Bayesian scheme based on the Hybrid Monte Carlo method, whereby the marginal likelihood P(t|θ) (which is not available analytically) is replaced by the Laplace approximation of this quantity. Gibbs and MacKay (1997b) estimated θ by maximizing their lower bound on P(t|θ).
Recently Neal (1997) has developed an MCMC method for Gaussian process classification in which samples are drawn from the joint posterior P(y, θ|D) over the latent activations and the parameters. For fixed θ, each of the individual y_i's is updated in turn using Gibbs sampling; a sweep through all n of the y_i's can be computed in time O(n^2) once K^{-1} is available (which takes O(n^3) time), so it makes sense to perform a number of Gibbs sweeps for each update of the parameters θ. Neal (1997) has also used this machinery for a robust regression model in which the noise is t-distributed rather than Gaussian ("robust" in the sense of being less sensitive to outliers), and a related hierarchical model, in which the noise variance is itself modelled as a function of the input by a second Gaussian process, has been used for regression with input-dependent noise (Goldberg et al., 1997).

6. Discussion

In this paper I have shown how Gaussian processes can be used for regression and classification problems, and how they are related to neural networks. There are a number of ways in which this work can be taken further. Firstly, the assumption of a stationary covariance function means, roughly speaking, that the same length-scales are assumed to apply everywhere in the input space, which may be too strong an assumption for some problems. One way to relax it is to warp the original input space x into another space with co-ordinates ξ_i(x), i = 1, ..., p, in which a stationary covariance function is then used, as proposed by Sampson and Guttorp (1992).4 Secondly, further hierarchical elaborations of the basic model are possible; Gaussian processes are themselves a very simple kind of prior over functions.

4 It may be desirable to impose the condition that the x to ξ mapping should be bijective.
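A minimal sketch of the input-warping idea mentioned above is given below; the particular warp ξ(x) and base covariance are arbitrary illustrative choices, not anything prescribed in the text.

import numpy as np

def warped_covariance(X1, X2, warp, base_cov):
    """Nonstationary covariance built by warping the inputs before a
    stationary base covariance: C(x, x') = base_cov(xi(x), xi(x'))."""
    return base_cov(warp(X1), warp(X2))

def squared_exponential(Z1, Z2):
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

# Illustrative warp: stretch the first co-ordinate more strongly far from the
# origin, so that the effective length-scale varies across the input space.
warp = lambda X: np.column_stack([np.sinh(X[:, 0]), X[:, 1]])

X = np.random.default_rng(6).standard_normal((5, 2))
K = warped_covariance(X, X, warp, squared_exponential)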
It is also interesting to consider the differences between finite neural network and Gaussian process priors. One difference is that in functions generated from finite neural networks the effects of individual basis functions can be seen. For example, with sigmoidal units a steep step may be observed where one basis function (with large weights) comes into play. A judgement about whether this type of behaviour is appropriate or not should depend on prior beliefs about the problem at hand. Of course it is also possible to compare finite neural networks and Gaussian process predictions empirically. This has been done by Rasmussen (1996), where GP predictions (using MCMC for the parameters) were compared to those from Neal's MCMC Bayesian neural networks. These results show that for a number of problems the predictions of GPs and Bayesian neural networks are similar, and that both methods outperform several other widely used regression techniques.

ACKNOWLEDGEMENTS
I thank David Barber, Chris Bishop, David MacKay, Radford Neal, Manfred Opper, Carl Rasmussen, Richard Rohwer, Francesco Vivarelli and Huaiyu Zhu for helpful discussions about Gaussian processes over the last few years, and David Barber, Chris Bishop and David MacKay for comments on the manuscript.
References
Aizerman, M. A., E. M. Braverman, and L. I. Rozoner (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821-837.
Barber, D. and C. K. I. Williams (1997). Gaussian processes for Bayesian classification via Hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9. MIT Press.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Reading, Mass.: Addison-Wesley.
Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault (Eds.), NATO ASI Series on Systems and Computer Science. Springer-Verlag.
Cressie, N. A. C. (1993). Statistics for Spatial Data. New York: Wiley.
Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195, 216-222.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data Analysis. London: Chapman and Hall.
Gibbs, M. and D. J. C. MacKay (1997a). Efficient implementation of Gaussian processes. Draft manuscript, available from http://vol.ra.phy.cam.ac.uk/mackay/homepage.html.
Gibbs, M. and D. J. C. MacKay (1997b). Variational Gaussian process classifiers. Draft manuscript, available from http://vol.ra.phy.cam.ac.uk/mackay/homepage.html.
Girard, D. (1989). A fast 'Monte-Carlo cross-validation' procedure for large least squares problems with noisy data. Numerische Mathematik 56, 1-23.
Girosi, F., M. Jones, and T. Poggio (1995). Regularization theory and neural networks architectures. Neural Computation 7(2), 219-269.
Goldberg, P. W., C. K. I. Williams, and C. M. Bishop (1997). Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems 10. MIT Press.
Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
Handcock, M. S. and M. L. Stein (1993). A Bayesian analysis of kriging. Technometrics 35(4), 403-410.
Hastie, T. (1996). Pseudosplines. Journal of the Royal Statistical Society B 58, 379-396.
Hastie, T. and R. J. Tibshirani (1990). Generalized Additive Models. London: Chapman and Hall.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.
Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics: Simulation and Computation 18, 1059-1076.
Journel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. London: Academic Press.
Kimeldorf, G. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics 41, 495-502.
MacKay, D. J. C. (1993). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, and K. Schulten (Eds.), Models of Neural Networks III. New York: Springer.
Mardia, K. V. and R. J. Marshall (1984). Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71, 135-146.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. New York: Springer.
Neal, R. M. (1997). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Dept. of Statistics, University of Toronto. Available from http://www.cs.toronto.edu/~radford/.
O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society B 40(1), 1-42.
O'Sullivan, F., B. S. Yandell, and W. J. Raynor (1986). Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association 81, 96-103.
Poggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE 78, 1481-1497.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C (second ed.). Cambridge University Press.
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. Ph.D. thesis, Dept. of Computer Science, University of Toronto. Available from http://www.cs.utoronto.ca/~carl/.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Sacks, J., W. J. Welch, T. J. Mitchell, and H. P. Wynn (1989). Design and analysis of computer experiments. Statistical Science 4(4), 409-423.
Sampson, P. D. and P. Guttorp (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association 87, 108-119.
Silverman, B. W. (1978). Density ratios, empirical likelihood and cot death. Applied Statistics 27(1), 26-33.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B 47(1), 1-52.
Skilling, J. (1993). Bayesian numerical analysis. In W. T. Grandy and P. Milonni (Eds.), Physics and Probability. Cambridge University Press.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
Williams, C. K. I. (1997a). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9. MIT Press.
Williams, C. K. I. (1997b). Computation with infinite neural networks. Technical report, Neural Computing Research Group, Aston University, Birmingham, UK.
Williams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. MIT Press.
Zhu, H. and R. Rohwer (1996). Bayesian invariant measurements of generalisation. Neural Processing Letters 4, 89-95.
Zhu, H., C. K. I. Williams, R. Rohwer, and M. Morciniec (1997). Gaussian regression and optimal finite dimensional linear models. Technical report, Neural Computing Research Group, Aston University, Birmingham, UK.
Contributors

Nicky G. Best, Department of Epidemiology and Public Health, Imperial College School of Medicine, London, UK
Christopher M. Bishop, Microsoft Research, Cambridge, UK
Joachim M. Buhmann, Institut für Informatik III, Universität Bonn, Germany
Gregory F. Cooper, University of Pittsburgh, Pittsburgh, PA, USA
Robert G. Cowell, School of Mathematics, Actuarial Science and Statistics, City University, London, UK
Rina Dechter, Information and Computer Science, University of California, Irvine, CA, USA
Nir Friedman, Computer Science Division, University of California, Berkeley, CA, USA
Dan Geiger, Computer Science Department, Technion, Haifa, Israel
Zoubin Ghahramani, Department of Computer Science, University of Toronto, Canada
Wally R. Gilks, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK
Moises Goldszmidt, SRI International, Menlo Park, CA, USA
David Heckerman, Microsoft Research, Redmond, WA, USA
Geoffrey E. Hinton, Department of Computer Science, University of Toronto, Canada
Hazel Inskip, MRC Environmental Epidemiology Unit, Southampton General Hospital, UK
Tommi S. Jaakkola, Department of Computer Science, University of California, Santa Cruz, CA, USA
Michael I. Jordan, Massachusetts Institute of Technology, Cambridge, MA, USA
Michael J. Kearns, AT&T Labs Research, Florham Park, NJ, USA
Uffe Kjærulff, Department of Computer Science, Aalborg University, Denmark
David J. C. MacKay, Cavendish Laboratory, Cambridge, UK
Yishay Mansour, Department of Computer Science, Tel-Aviv University, Israel
Christopher Meek, Microsoft Research, Redmond, WA, USA
Stefano Monti, Intelligent Systems Program, University of Pittsburgh, PA, USA
Radford M. Neal, Departments of Statistics and Computer Science, University of Toronto, Canada
Andrew Y. Ng, Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
Thomas S. Richardson, Department of Statistics, University of Washington, Seattle, WA, USA
Brian Sallans, Department of Computer Science, University of Toronto, Canada
Lawrence K. Saul, AT&T Labs Research, Florham Park, NJ, USA
Peter W. F. Smith, Department of Social Statistics, University of Southampton, UK
David J. Spiegelhalter, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK
Milan Studený, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic
Jirina Vejnarová, Laboratory of Intelligent Systems, University of Economics, Prague, Czech Republic
Joe Whittaker, Department of Mathematics and Statistics, Lancaster University, UK
Christopher K. I. Williams, Neural Computing Research Group, Aston University, Birmingham, UK