Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many industrial and scientific fields. Recently, several research communities have begun to converge on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications. This book collects recent research on representing, reasoning, and learning with belief networks. Belief networks (also known as graphical models and Bayesian networks) are a widely applicable formalism for compactly representing the joint probability distribution over a set of random variables. Belief networks have revolutionized the development of intelligent systems in many areas. They are now poised to revolutionize the development of learning systems. The papers in this volume reveal the many ways in which ideas from belief networks can be applied to understand and analyze existing learning algorithms (especially for neural networks). They also show how methods from machine learning can be extended to learn the structure and parameters of belief networks. This book is an exciting illustration of the convergence of many disciplines in the study of learning and adaptive computation.
Preface
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering - uncertainty and complexity - and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity - a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition, and statistical mechanics are special cases of the general graphical model formalism - examples include mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This has many advantages - in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems. This book presents an in-depth exploration of issues related to learning within the graphical model formalism. Four of the chapters are tutorial articles (those by Cowell, MacKay, Jordan et al., and Heckerman). The remaining articles cover a wide spectrum of topics of current research interest. The book is divided into four main sections: Inference, Independence, Foundations for Learning, and Learning from Data. While the sections can be read independently of each other and the articles are to a large extent self-contained, there also is a logical flow to the material. A full appreciation of the material in later sections requires an understanding
of the material in the earlier sections. The book begins with the topic of probabilistic inference. Inference refers to the problem of calculating the conditional probability distribution of a subset of the nodes in a graph given another subset of the nodes. Much effort has gone into the design of efficient and accurate inference algorithms. The book covers three categories of inference algorithms - exact algorithms, variational algorithms, and Monte Carlo algorithms. The first chapter, by Cowell, is a tutorial chapter that covers the basics of exact inference, with particular focus on the popular junction tree algorithm. This material should be viewed as basic for the understanding of graphical models. A second chapter by Cowell picks up where the former leaves off and covers advanced issues arising in exact inference. Kjærulff presents a method for increasing the efficiency of the junction tree algorithm. The basic idea is to take advantage of additional independencies which arise due to the particular messages arriving at a clique; this leads to a data structure known as a "nested junction tree." Dechter presents an alternative perspective on exact inference, based on the notion of "bucket elimination." This is a unifying perspective that provides insight into the relationship between junction tree and conditioning algorithms, and insight into space/time tradeoffs. Variational methods provide a framework for the design of approximate inference algorithms. Variational algorithms are deterministic algorithms that provide bounds on probabilities of interest. The chapter by Jordan, Ghahramani, Jaakkola, and Saul is a tutorial chapter that provides a general overview of the variational approach, emphasizing the important role of convexity. The ensuing article by Jaakkola and Jordan proposes a new method for improving the mean field approximation (a particular form of variational approximation). In particular, the authors propose to use mixture distributions as approximating distributions within the mean field formalism. The inference section closes with two chapters on Monte Carlo methods. Monte Carlo provides a general approach to the design of approximate algorithms based on stochastic sampling. MacKay's chapter is a tutorial presentation of Monte Carlo algorithms, covering simple methods such as rejection sampling and importance sampling, as well as more sophisticated methods based on Markov chain sampling. A key problem that arises with the Markov chain Monte Carlo approach is the tendency of the algorithms to exhibit random-walk behavior; this slows the convergence of the algorithms. Neal presents a new approach to this problem, showing how a sophisticated form of overrelaxation can cause the chain to move more systematically along surfaces of high probability. The second section of the book addresses the issue of Independence. Much of the aesthetic appeal of the graphical model formalism comes from
3 the "Markov properties " that graphical models embody . A Markov prop erty is a relationship between the separation properties of nodes in a graph
(e.g., the notion that a subset of nodes is separated from another subset of nodes, given a third subset of nodes) and conditional independenciesin the family of probability distributions associated with the graph (e.g., A is independent of B given C , where A , Band C are subsets of random
variables). In the case of directed graphs and undirected graphs the relationships are well understood (cf . Lauritzen , 1997) . Chain graphs , however, which are mixed graphs containing both directed and undirected edges, are less well understood . The chapter by Richardson explores two of the Markov properties that have been proposed for chain graphs and identifies natural "spatial " conditions on Markov properties that distinguish between these Markov
properties
and those for both directed
and undirected
graphs .
Chain graphs appear to have a richer conditional independence semantics than directed and undirected graphs The chapter by Studeny and Vejnarova addresses the problem of characterizing stochastic dependence. Studeny and Vejnarova discuss the proper ties of the multiinformation function , a general information -theoretic func tion from which many useful quantities can be computed , including the conditional
mutual
information
for all disjoint
subsets of nodes in a graph .
The book then turns to the topic of learning. The section on Foundations for Learning contains two articles that cover fundamental concepts that are used in many of the following articles. The chapter by Heckerman is a tutorial article that covers many of the basic ideas associated with learning in graphical models. The focus is on Bayesian methods, both for parameter learning and for structure learning. Neal and Hinton discuss the expectation-maximization (EM) algorithm. EM plays an important role in the graphical model literature, tying together inference and learning problems. In particular, EM is a method for finding maximum likelihood (or maximum a posteriori) parameter values, by making explicit use of a probabilistic inference (the "E step"). Thus EM-based approaches to learning generally make use of inference algorithms as subroutines. Neal and Hinton describe the EM algorithm as coordinate ascent in an appropriately-defined cost function. This point of view allows them to consider algorithms that take partial E steps, and provides an important justification for the use of approximate inference algorithms in learning. The section on Learning from Data contains a variety of papers concerned with the learning of parameters and structure in graphical models. Bishop provides an overview of latent variable models, focusing on probabilistic principal component analysis, mixture models, topographic maps, and time series analysis. EM algorithms are developed for each case. The
article by Buhmann complements the Bishop article , describing methods
for dimensionality reduction, clustering, and data visualization, again with the EM algorithm providing the conceptual framework for the design of the algorithms. Buhmann also presents learning algorithms based on approximate inference and deterministic annealing. Friedman and Goldszmidt focus on the problem of representing and learning the local conditional probabilities for graphical models. In particular, they are concerned with representations for these probabilities that make explicit the notion of "context-specific independence," where, for example, A is independent of B for some values of C but not for others. This representation can lead to significantly more parsimonious models than standard techniques. Geiger, Heckerman, and Meek are concerned with the problem of model selection for graphical models with hidden (unobserved) nodes. They develop asymptotic methods for approximating the marginal likelihood and demonstrate how to carry out the calculations for several cases of practical interest. The paper by Hinton, Sallans, and Ghahramani describes a graphical model called the "hierarchical community of experts" in which a collection of local linear models are used to fit data. As opposed to mixture models, in which each data point is assumed to be generated from a single local model, their model allows a data point to be generated from an arbitrary subset of the available local models. Kearns, Mansour, and Ng provide a careful analysis of the relationships between EM and the K-means algorithm. They discuss an "information-modeling tradeoff," which characterizes the ability of an algorithm to both find balanced assignments of data to model components, and to find a good overall fit to the data. Monti and Cooper discuss the problem of structural learning in networks with both discrete and continuous nodes. They are particularly concerned with the issue of the discretization of continuous data, and how this impacts the performance of a learning algorithm. Saul and Jordan present a method for unsupervised learning in layered neural networks based on mean field theory. They discuss a mean field approximation that is tailored to the case of large networks in which each node has a large number of parents. Smith and Whittaker discuss tests for conditional independence in graphical Gaussian models. They show that several of the appropriate statistics turn out to be functions of the sample partial correlation coefficient. They also develop asymptotic expansions for the distributions of the test statistics and compare their accuracy as a function of the dimensionality of the model. Spiegelhalter, Best, Gilks, and Inskip describe an application of graphical models to the real-life problem of assessing the effectiveness of an immunization program. They demonstrate the use of the graphical model formalism to represent statistical hypotheses of interest and show how Monte Carlo methods can be used for inference. Finally,
Williams provides an overview of Gaussian processes, deriving the Gaussian process approach from a Bayesian point of view, and showing how it can be applied to problems in nonlinear regression, classification, and hierarchical modeling. This volume arose from the proceedings of the International School on Neural Nets "E.R. Caianiello," held at the Ettore Maiorana Centre for Scientific Culture in Erice, Italy, in September 1996. Lecturers from the school contributed chapters to the volume, and additional authors were asked to contribute chapters to provide a more complete and authoritative coverage of the field. All of the chapters have been carefully edited, following a review process in which each chapter was scrutinized by two anonymous reviewers and returned to authors for improvement. There are a number of people to thank for their role in organizing the Erice meeting. First I would like to thank Maria Marinaro, who initiated the ongoing series of Schools to honor the memory of E.R. Caianiello, and who co-organized the first meeting. David Heckerman was also a co-organizer of the school, providing helpful advice and encouragement throughout. Anna Esposito at the University of Salerno also deserves sincere thanks for her help in organizing the meeting. The staff at the Ettore Maiorana Centre were exceedingly professional and helpful, initiating the attendees of the school into the wonders of Erice. Funding for the School was provided by the NATO Advanced Study Institute program; this program provided generous support that allowed nearly 80 students to attend the meeting. I would also like to thank Jon Heiner, Thomas Hofmann, Nuria Oliver, Barbara Rosario, and Jon Yi for their help with preparing the final document.
Finally, I would like to thank Barbara Rosario, whose fortuitous attendance as a participant at the Erice meeting rendered the future conditionally independent of the past.
Michael I. Jordan
INTRODUCTION TO INFERENCE FOR BAYESIAN NETWORKS
ROBERT COWELL
City University , London .
The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E OHT
1. Introduction

The field of Bayesian networks, and graphical models in general, has grown enormously over the last few years, with theoretical and computational developments in many areas. As a consequence there is now a fairly large set of theoretical concepts and results for newcomers to the field to learn. This tutorial aims to give an overview of some of these topics, which hopefully will provide such newcomers with a conceptual framework for following the more detailed and advanced work. It begins with a revision of some of the basic axioms of probability theory.
2. Basic axioms of probability

Probability theory is a system of inductive logic for reasoning under uncertainty: the degree of belief in a proposition, or event, A being true, in the absence of certainty, is encapsulated by a numerical measure of probability, and consistency is ensured by requiring that such measures obey a small set of basic axioms. Early expert systems in the AI community used sets of production rules within a framework of deductive Boolean logic. Attempts were made to cope with uncertainty by using probability theory within such systems, but the number of calculations required became computationally prohibitive and its use was largely abandoned. Probability theory has had a revival in recent years with the development of efficient algorithms for inference in the Bayesian belief network expert systems that are the subject of this chapter.

Let us begin with the basic axioms. The probability of an event A, denoted by P(A), is a number in the interval [0, 1] which obeys the following:

1 P(A) = 1 if and only if A is certain.
2 If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).

We will be dealing exclusively with discrete random variables and their probability distributions. Capital letters will denote a variable, or perhaps a set of variables; lower case letters will denote values of variables. Thus suppose A is a random variable having a finite number of mutually exclusive states (a_1, ..., a_n). Then P(A) will be represented by a vector of non-negative real numbers P(A) = (x_1, ..., x_n), where P(A = a_i) = x_i is a scalar and Σ_i x_i = 1. A basic concept is that of conditional probability, a statement of which takes the form: given the event B = b, the probability of the event A = a is x, written P(A = a | B = b) = x. It is important to understand that this is not saying: "If B = b is true then the probability of A = a is x". Instead it says: "If B = b is true, and any other information to hand is irrelevant to A, then P(A = a) = x". (To see this, consider what the probabilities would be if the state of A was part of the extra information.) Conditional probabilities are important for building Bayesian networks, as we shall see. But Bayesian networks are also built to facilitate the calculation of conditional probabilities, namely the conditional probabilities for variables of interest given the data (also called evidence) at hand. The fundamental rule of probability calculus is the product rule¹

P(A and B) = P(A | B) P(B).     (1)
This equation tells us how to combine conditional probabilities for individual variables to define joint probabilities for sets of variables.

¹ Or more generally, P(A and B | C) = P(A | B, C) P(B | C).

3. Bayes' theorem

The simplest form of Bayes' theorem relates the joint probability P(A, B) of two events or hypotheses A and B to its marginal and conditional probabilities:

P(A, B) = P(A | B) P(B) = P(B | A) P(A).     (2)

By rearrangement we easily obtain

P(A | B) = P(B | A) P(A) / P(B),     (3)

which is Bayes' theorem. This can be interpreted as follows. We are interested in A, and we begin with a prior probability P(A) for our belief about A, and then we observe
B. Then Bayes' theorem, (3), tells us that our revised belief for A, the posterior probability P(A | B), is obtained by multiplying the prior P(A) by the ratio P(B | A)/P(B). The quantity P(B | A), as a function of varying A for fixed B, is called the likelihood of A. We can express this relationship in the form:

posterior ∝ prior × likelihood,
P(A | B) ∝ P(A) P(B | A).

Figure 1 illustrates this prior-to-posterior inference process. Each diagram
Figure 1. Bayesian inference as reversing the arrows.
represents in different ways the joint distribution P(A, B); the first represents the prior beliefs while the third represents the posterior beliefs. Often we will think of A as a possible "cause" of the "effect" B; the downward arrow represents such a causal interpretation. The "inferential" upwards arrow then represents an "argument against the causal flow", from the observed effect to the inferred cause. (We will not go into a definition of "causality" here.) Bayesian networks are generally more complicated than the ones in Figure 1, but the general principles are the same in the following sense. A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that the conditional probabilities for these particular orientations are relatively straightforward to specify (from data or by elicitation from an expert). When data are observed, an inference procedure is typically required. This involves calculating marginal probabilities conditional on the observed data using Bayes' theorem, which is diagrammatically equivalent to reversing one or more of the Bayesian network arrows. The algorithms which have been developed in recent years
allow these calculations to be performed in an efficient and straightforward manner.

4. Simple inference problems

Let us now consider some simple examples of inference. The first is simply Bayes' theorem with evidence included on a simple two-node network; the remaining examples treat a simple three-node problem.

4.1. PROBLEM I
Suppose we have the simple model X → Y, and are given P(X), P(Y | X), and the evidence Y = y. The problem is to calculate P(X | Y = y). Now from P(X) and P(Y | X) we can calculate the marginal distribution P(Y) and hence P(Y = y). Applying Bayes' theorem we obtain
P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y).     (4)
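As a concrete illustration of equation (4), here is a minimal sketch in Python of the posterior calculation for binary X and Y; the numerical tables are hypothetical values invented purely for this example.

import numpy as np

# Hypothetical numbers for the two-node model X -> Y (states indexed 0, 1).
p_x = np.array([0.7, 0.3])            # P(X)
p_y_given_x = np.array([[0.9, 0.1],   # P(Y | X = 0)
                        [0.2, 0.8]])  # P(Y | X = 1); rows sum to 1

y_obs = 1                              # observed evidence Y = y

# P(Y = y) = sum_X P(Y = y | X) P(X)
p_y = p_y_given_x[:, y_obs] @ p_x

# Bayes' theorem: P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y)
p_x_given_y = p_y_given_x[:, y_obs] * p_x / p_y
print(p_x_given_y)                     # posterior over X, sums to 1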
4.2. PROBLEM II
Suppose now we have a more complicated model in which X is a parent of both Y and Z: Z ← X → Y, with specified probabilities P(X), P(Y | X), and P(Z | X), and we observe Y = y. The problem is to calculate P(Z | Y = y). Note that the joint distribution is given by P(X, Y, Z) = P(Y | X) P(Z | X) P(X). A 'brute force' method is to calculate:

1. The joint distribution P(X, Y, Z).
2. The marginal distribution P(Y) and thence P(Y = y).
3. The marginal distribution P(Z, Y) and thence P(Z, Y = y).
4. P(Z | Y = y) = P(Z, Y = y) / P(Y = y).

An alternative method is to exploit the given factorization:

1. Calculate P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y) using Bayes' theorem, where P(Y = y) = Σ_X P(Y = y | X) P(X).
2. Find P(Z | Y = y) = Σ_X P(Z | X) P(X | Y = y).

Note that the first step essentially reverses the arrow between X and Y. Although the two methods give the same answer, the second is generally more efficient. For example, suppose that all three variables have 10 states. Then the first method, in explicitly calculating P(X, Y, Z), requires a table of 1000 states, whereas the largest table required by the second method has size 100.
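A small sketch of the two methods, under the assumption of hypothetical random tables with 10 states per variable; it checks that the factorized route reproduces the brute-force answer while never needing more than a 10 by 10 table.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # states per variable, as in the text

# Hypothetical random conditional tables for Z <- X -> Y.
p_x = rng.dirichlet(np.ones(n))                    # P(X)
p_y_given_x = rng.dirichlet(np.ones(n), size=n)    # P(Y | X), shape (n, n)
p_z_given_x = rng.dirichlet(np.ones(n), size=n)    # P(Z | X), shape (n, n)
y_obs = 3

# Brute force: build the full joint P(X, Y, Z) -- a 10x10x10 table.
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]
p_z_brute = joint[:, y_obs, :].sum(axis=0) / joint[:, y_obs, :].sum()

# Factorized: reverse the X -> Y arrow, then sum over X; largest table is 10x10.
p_x_given_y = p_y_given_x[:, y_obs] * p_x
p_x_given_y /= p_x_given_y.sum()
p_z_fact = p_x_given_y @ p_z_given_x

print(np.allclose(p_z_brute, p_z_fact))  # True: same answer, smaller tables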
This gain in computational efficiency from exploiting the given factorization is the basis of the arc-reversal method for solving influence diagrams, and is also the basis of the junction tree propagation algorithms, as the next example shows.

4.3. PROBLEM III

Suppose now that we have the same network and probabilities as in the previous problem, and that again we wish to calculate P(Z | Y = y), but this time working with joint tables on the pairs of variables (Y, X) and (Z, X). The calculational steps are:

1. Find P(Y, X) = P(Y | X) P(X) and P(Z, X) = P(Z | X) P(X).
2. Find P(X | Y = y) = P(Y = y, X) / P(Y = y), where P(Y = y) = Σ_X P(Y = y, X).
3. Find P(Z, X | Y = y) = P(Z, X) P(X | Y = y) / P(X), where P(X) = Σ_Z P(Z, X).
4. Find P(Z | Y = y) = Σ_X P(Z, X | Y = y).

Note the 'message' P(X | Y = y) which is 'sent' from the table on (Y, X) to the table on (Z, X) in step 3. This calculational structure, in which the joint distribution is handled through smaller tables on the cliques XY and ZX of an undirected graph joined through the common variable X, is the one exploited by the junction tree propagation algorithms; it works because Z and Y are conditionally independent given X.

5. Conditional independence

In the previous example we used the fact that Z is conditionally independent of Y given X, written Z ⊥⊥ Y | X (Dawid, 1979). The joint distribution can be factorized according to any of the three directed graphs which express this conditional independence:

Z ← X → Y:   P(X, Y, Z) = P(X) P(Y | X) P(Z | X),
Z → X → Y:   P(X, Y, Z) = P(Y | X) P(X | Z) P(Z),
Z ← X ← Y:   P(X, Y, Z) = P(Z | X) P(X | Y) P(Y).

Thus the directed graph associated with a factorization of the joint distribution is not unique: here we obtain three distinct graphs.
Each of these factorizations follows from the conditional independence properties which each graph expresses, viz Z ⊥⊥ Y | X (which is to be read as "Z is conditionally independent of Y given X"), and by using the general factorization property:

P(X_1, ..., X_n) = P(X_1 | X_2, ..., X_n) P(X_2, ..., X_n)
                 = P(X_1 | X_2, ..., X_n) P(X_2 | X_3, ..., X_n) P(X_3, ..., X_n)
                 = ...
                 = P(X_1 | X_2, ..., X_n) ... P(X_{n-1} | X_n) P(X_n).

Thus for the third example P(X, Y, Z) = P(Z | X, Y) P(X | Y) P(Y) = P(Z | X) P(X | Y) P(Y). Note that the graph Z → X ← Y does not obey the conditional independence property Z ⊥⊥ Y | X and is thus excluded from the list; it factorizes as P(X, Y, Z) = P(X | Y, Z) P(Z) P(Y). This example shows several features of general Bayesian networks. Firstly, the conditional independence properties can be used to simplify the general factorization formula for the joint probability. Secondly, the result is a factorization that can be expressed by the use of directed acyclic graphs (DAGs).

6. General specification in DAGs
It is these features which work together nicely for the general specification of Bayesian networks. Thus a Bayesian network is a directed acyclic graph, whose structure defines a set of conditional independence properties. These properties can be found using graphical manipulations, eg d-separation (see eg Pearl (1988)). To each node is associated a conditional probability distribution, conditioning being on the parents of the node: P(X | pa(X)). The joint density over the set of all variables U is then given by the product of such terms over all nodes:

P(U) = Π_X P(X | pa(X)).
This is called a recursive factorization according to the DAG; we also talk of the distribution being graphical over the DAG. This factorization is equivalent to the general factorization but takes into account the conditional independence properties of the DAG in simplifying individual terms in the product of the general factorization. Only if the DAG is complete will this formula and the general factorization coincide (but even then only for one ordering of the random variables in the factorization).
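The recursive factorization is easy to evaluate numerically. The following sketch assumes a hypothetical three-node chain A → B → C with made-up binary tables, and verifies that the product of conditional distributions defines a proper joint distribution.

import itertools
import numpy as np

# A minimal sketch: the joint P(U) as a product of P(V | pa(V)) terms.
# Hypothetical three-node DAG A -> B -> C with binary variables.
p_a = np.array([0.6, 0.4])
p_b_given_a = np.array([[0.7, 0.3], [0.1, 0.9]])   # rows indexed by A
p_c_given_b = np.array([[0.8, 0.2], [0.5, 0.5]])   # rows indexed by B

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the recursive factorization."""
    return p_a[a] * p_b_given_a[a, b] * p_c_given_b[b, c]

# The terms are proper conditional distributions, so the joint sums to 1.
total = sum(joint(a, b, c) for a, b, c in itertools.product(range(2), repeat=3))
print(total)   # 1.0 (up to rounding)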
6.1. EXAMPLE

Consider the graph of Figure 2.
P(A, B, C, D, E, F, G, H, I) = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(H | B, E, F) P(I | C, F).

Figure 2. Nine node example.

It is useful to note that marginalising over a childless node is simply equivalent to removing it and the edges to it from its parents. Thus, for example, marginalising over the variable H in the above gives:

P(A, B, C, D, E, F, G, I) = Σ_H P(A, B, C, D, E, F, G, H, I)
  = Σ_H P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(H | B, E, F) P(I | C, F)
  = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(I | C, F) Σ_H P(H | B, E, F)
  = P(A) P(B) P(C) P(D | A) P(E | A, B) P(F | B, C) P(G | A, D, E) P(I | C, F),

which can be represented by Figure 2 with H and its incident edges removed.
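The reason the factor for H disappears is simply that Σ_H P(H | B, E, F) = 1 for every configuration of its parents. A one-line numerical check, using a hypothetical random table for P(H | B, E, F):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary table P(H | B, E, F): axes (B, E, F, H), rows over H sum to 1.
p_h_given_bef = rng.dirichlet(np.ones(2), size=(2, 2, 2))

# Summing a childless node out of its own conditional table gives all ones,
# so the factor simply disappears from the product.
print(np.allclose(p_h_given_bef.sum(axis=-1), 1.0))   # True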
Directed acyclic graphs can always have their nodes linearly ordered so
that for each node X all of its parents pa(X) precede it in the ordering. Such an ordering is called a topological ordering of the nodes. Thus for example (A, B, C, D, E, F, G, H, I) and (B, A, E, D, G, C, F, I, H) are two of the many topological orderings of the nodes of Figure 2. A simple algorithm to find a topological ordering is as follows. Start with the graph and an empty list. Then successively delete from the graph any node which does not have any parents, and add it to the end of the list. Note that if the graph is not acyclic, then at some stage a graph will be obtained in which every remaining node has at least one parent; hence this algorithm can also be used as an efficient way of checking that the graph is acyclic.
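A sketch of this delete-a-parentless-node algorithm, applied to the nine-node example of Figure 2 (the parent sets below are read off its factorization); returning None when no parentless node can be found doubles as the acyclicity check just described.

def topological_order(parents):
    """Repeatedly remove a parentless node, as described above.

    `parents` maps each node to the set of its parents.  Returns a
    topological ordering, or None if the graph contains a directed cycle.
    """
    remaining = {v: set(ps) for v, ps in parents.items()}
    order = []
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:              # every remaining node has a parent: a cycle
            return None
        v = roots[0]
        order.append(v)
        del remaining[v]
        for ps in remaining.values():
            ps.discard(v)
    return order

# The nine-node example of Figure 2 (parent sets taken from its factorization).
parents = {'A': set(), 'B': set(), 'C': set(),
           'D': {'A'}, 'E': {'A', 'B'}, 'F': {'B', 'C'},
           'G': {'A', 'D', 'E'}, 'H': {'B', 'E', 'F'}, 'I': {'C', 'F'}}
print(topological_order(parents))   # e.g. ['A', 'B', 'C', 'D', 'E', ...]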
Another equivalent way is to start with the graph and an empty list , and successively delete nodes which have no children and add them to the
beginning of the list (cf. the marginalisation of childless nodes).

6.2. DIRECTED MARKOV PROPERTY
An important property is the directed Markov property. This is a conditional independence property which states that a variable is conditionally independent of its non-descendants given its parents:

X ⊥⊥ nd(X) | pa(X).

Now recall that the conditional probability P(X | pa(X)) did not necessarily mean that if pa(X) = π*, say, then P(X = x) = P(x | π*), but included the caveat that any other information to hand is irrelevant to X for this to hold. For DAGs this 'other information' means, from the directed Markov property, knowledge about the node itself or any of its descendants. For if all of the parents of X are observed, but additionally observed are one or more descendants D_X of X, then because X influences D_X, knowing D_X and pa(X) is more informative than simply knowing about pa(X) alone. However, having information about a non-descendant does not tell us anything more about X, because either it cannot influence or be influenced by X, directly or indirectly, or, if it can influence X indirectly, it does so only through influencing the parents, which are all known anyway. For example, consider again Figure 2. Using the second topological ordering given previously, we may write the general factorization as:
P(A, B, C, D, E, F, G, I, H) = P(B) P(A | B) P(E | B, A) P(D | B, A, E) P(G | B, A, E, D) P(C | B, A, E, D, G) P(F | B, A, E, D, G, C) P(I | B, A, E, D, G, C, F) P(H | B, A, E, D, G, C, F, I),     (5)

but now we can use A ⊥⊥ B, from the directed Markov property, to simplify P(A | B) → P(A), and similarly for the other factors in (5), to obtain the factorization
in Figure 2. We can write the general pseudo-algorithm of what we have just done for this example as:
Topological ordering + General factorization + Directed Markov property ⇒ Recursive factorization.

7. Making the inference engine
We shall now move on to building the so-called "inference engine", to introduce new concepts and to show how they relate to the conditional independence and recursive factorization ideas that have already been touched upon. Detailed justification of the results will be omitted; the aim here is to give an overview, using the fictional ASIA example of Lauritzen and Spiegelhalter.

7.1. ASIA: SPECIFICATION
Lauritzen and Spiegelhalter describe their fictional problem domain as follows:
Shortness-of-breath (Dyspnoea) may be due to Tuberculosis, Lung cancer or Bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of Tuberculosis, while Smoking is known to be a risk factor for both Lung cancer and Bronchitis. The results of a single X-ray do not discriminate between Lung cancer and Tuberculosis, as neither does the presence or absence of Dyspnoea.
P(U) = P(A) P(S) P(T | A) P(L | S) P(B | S) P(E | L, T) P(D | B, E) P(X | E)

Figure 3. ASIA.
The network for this fictional example is shown in Figure 3. Each variable is binary with the states ("yes", "no"). The E node is a logical node taking the value "yes" if either of its parents takes a "yes" value, and "no" otherwise; its introduction facilitates modelling the relationship of the X-ray to Lung cancer and Tuberculosis.
Having specified the relevant variables, and defined their dependence with the graph, we must now assign (conditional) probabilities to the nodes. In real-life examples such probabilities may be elicited either from some large database (if one is available) as frequency ratios, or subjectively from the expert from whom the structure has been elicited (eg using a fictitious gambling scenario or probability wheel), or a combination of both. However, as this is a fictional example, we can follow the third route and use made-up values. (Specific values will be omitted here.)

7.2. CONSTRUCTING THE INFERENCE ENGINE
With our specified graphical model we have a representation of the joint density in terms of a factorization:

P(U) = Π_V P(V | pa(V))     (6)
     = P(A) ... P(X | E).     (7)
Recall that our motivation is to use the model specified by the joint distribution to calculate marginal distributions conditional on some observation of one or more variables. In general the full distribution will be computationally difficult to use to calculate these marginals directly. We will now proceed to outline the various stages that are performed to find a representation of P(U) which makes the calculations more tractable. (The process of constructing the inference engine from the model specification is sometimes called compiling the model.) The manipulations required are almost all graphical. There are five stages in the graphical manipulations. Let us first list them, and then go back and define new terms which are introduced.

1. Add undirected edges to all co-parents which are not currently joined (a process called marrying parents).
2. Drop all directions in the graph obtained from Stage 1. The result is the so-called moral graph (a small sketch of these first two stages follows the list).
3. Triangulate the moral graph, that is, add sufficient additional undirected links between nodes such that there are no cycles (ie closed paths) of length 4 or more distinct nodes without a short-cut.
4. Identify the cliques of this triangulated graph.
5. Join the cliques together to form the junction tree.
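As promised above, a minimal sketch of stages 1 and 2 for the Asia network of Figure 3; the parent sets below are read off its factorization, and the moral graph is represented simply as a set of undirected edges (all other detail is ignored).

import itertools

def moralise(parents):
    """Stages 1-2: marry co-parents, then drop directions.

    Returns the set of undirected edges (as frozensets) of the moral graph.
    """
    edges = set()
    for child, ps in parents.items():
        # original directed edges, with direction dropped
        for p in ps:
            edges.add(frozenset((p, child)))
        # marry every pair of co-parents
        for p1, p2 in itertools.combinations(ps, 2):
            edges.add(frozenset((p1, p2)))
    return edges

# The Asia network of Figure 3.
asia_parents = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
                'B': {'S'}, 'E': {'L', 'T'}, 'D': {'B', 'E'}, 'X': {'E'}}
moral = moralise(asia_parents)
print(frozenset(('L', 'T')) in moral)   # True: T and L are married (co-parents of E)
print(frozenset(('E', 'B')) in moral)   # True: E and B are married (co-parents of D)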
Now let us go through these steps, supplying some justification and defining the new terms just introduced as we go along. Consider first the joint density again. By a change of notation this can be written in the form
P(U) = Π_V a(V, pa(V))     (8)
     = a(A) ... a(X, E),     (9)
where a(V, pa(V)) = P(V | pa(V)). That is, the conditional probability factor for V can be considered as a function of V and its parents. We call such functions potentials. Now after steps 1 and 2 we have an undirected graph, in which for each node both it and its set of parents in the original graph form a complete subgraph in the moral graph. (A complete graph is one in which every pair of nodes is joined together by an edge.) Hence, the original factorization of P(U) on the DAG G goes over to an equivalent factorization on these complete subsets in the moral graph G^m. Technically we say that the distribution is graphical on the undirected graph G^m. Figure 4 illustrates the moralisation process for the Asia network.
Figure 4. Moralising Asia: two extra links are required, T - L and E - B. Directionality is dropped after all moral edges have been added.

Now let us denote the set of cliques of the moral graph by C^m. (A clique is a complete subgraph which is not itself a proper subgraph of a complete subgraph, so it is a maximal complete subgraph.) Then each of the complete subgraphs formed from {V} ∪ pa(V) is contained within at least one clique. Hence we can form functions a_C such that
P(U) = Π_{C ∈ C^m} a_C(V_C),

where a_C(V_C) is a function of the variables in the clique C. Such a factorization can be constructed as follows. Initially define each factor as unity, i.e.,
a_C(V_C) = 1 for every clique C. Then for each node V take one, and only one, clique C that contains the complete subgraph formed from {V} ∪ pa(V), and multiply its potential a_C by the factor P(V | pa(V)). When every node has been dealt with in this way, the result is a potential representation of the joint distribution P on the cliques of the moral graph. Note that the conditional probability distributions of the original specification are now 'buried' within the clique functions: they are no longer explicitly visible, though they are of course still present in the joint distribution.

8. Aside: conditional independence and the moral graph

When the extra edges are added in the moralisation process, some of the conditional independence properties that can be read from the original DAG are no longer visible in the moral graph, though they still hold in the joint distribution. Those which do remain visible are the ones exploited for the efficient local computations described later. Before elucidating this we require some further definitions.

A node A is an ancestor of a node B if either (i) A is a parent of B, or (ii) A is a parent of some node which is itself an ancestor of B. A set of nodes is called ancestral if it contains all of the parents (and hence all of the ancestors) of each of its nodes; the union of ancestral sets is again ancestral, so for any set of nodes there is a smallest ancestral set containing it. Finally, in an undirected graph a set of nodes S separates A from B if every path between a node in A and a node in B passes through S.

With these definitions we can state the following lemmas, which tell us which conditional independences of the original distribution can be read from moral graphs.

Lemma 1. Let P factorize recursively according to the DAG G. Then A ⊥⊥ B | S whenever A and B are separated by S in (G_{An(A∪B∪S)})^m, the moral graph of the smallest ancestral set containing A ∪ B ∪ S.

Lemma 2. A is d-separated from B by S in the DAG G if and only if S separates A from B in (G_{An(A∪B∪S)})^m.

Thus d-separation, an alternative way of reading conditional independences directly from the directed graph, picks out the same independences. To find the smallest ancestral set containing A ∪ B ∪ S we can use a simple algorithm: successively delete from the graph any node which has no children, provided it is not in A ∪ B ∪ S; the nodes which are left when no more deletions are possible form the required ancestral set.
Now recall that deleting a childless node is equivalent to marginalising over that node. Hence the marginal distribution on the minimal ancestral set containing A ∪ B ∪ S factorizes according to the corresponding sub-factors of the original joint distribution. So these lemmas are saying that, rather than go through the numerical exercise of actually calculating such marginals, we can read the result off from the graphical structure instead, and use that to test conditional independences. (Note also that the directed Markov property is lurking behind the scenes here.) The "moral" is that when ancestral sets appear in theorems like this it is likely that such marginals are being considered.

9. Making the junction tree
The remaining three steps of the inference-engine construction algorithm seem more mysterious, but are required to ensure we can formulate a consistent and efficient message passing scheme. Consider first step 3: adding edges to the moral graph G^m to form a triangulated graph G^t. Note that adding edges to the graph does not stop a clique of the moral graph from being a complete subgraph in G^t. Thus for each clique of the moral graph G^m there is at least one clique in the triangulated graph which contains it. Hence we can form a potential representation of the joint probability in terms of products of functions of the cliques in the triangulated graph:

P(U) = Π_{C ∈ C^t} a_C(X_C),
by analogy with the previous method outlined for the moral graph. The point is that after moralisation and triangulation there exists, for each node-parent set, at least one clique which contains it, and thus a potential representation can be formed on the cliques of the triangulated graph. While the moralisation of a graph is unique, there are in general many alternative triangulations of a moral graph. In the extreme, we can always add edges to make the moral graph complete; there is then one large clique. The key to the success of the computational algorithms is to form triangulated graphs which have small cliques, in terms of their state-space size. Thus after finding the cliques of the triangulated graph - stage 4 - we are left with joining them up to form a junction tree. The important property of the junction tree is the running intersection property, which means that if a variable V is contained in two cliques, then it is contained in every clique along the path connecting those two cliques. The edge joining two cliques is called a separator. This joining up can always be done, though not necessarily uniquely for each triangulated graph.
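The chapter does not spell out how the joining up is done in practice; one standard construction (an assumption here, not a claim about the text) is to connect the cliques by a maximum-weight spanning tree, weighting each candidate edge by the size of the separator, which for the cliques of a triangulated graph is known to yield the running intersection property. A sketch, using one common triangulation of the Asia moral graph:

import itertools

def junction_tree(cliques):
    """Join cliques into a tree by a maximum-weight spanning tree,
    weighting each candidate edge by the size of the separator
    (the intersection of the two cliques)."""
    cliques = [frozenset(c) for c in cliques]
    candidate = sorted(
        ((len(a & b), a, b) for a, b in itertools.combinations(cliques, 2)),
        key=lambda t: -t[0])
    component = {c: c for c in cliques}        # trivial union-find

    def find(c):
        while component[c] != c:
            c = component[c]
        return c

    tree = []
    for w, a, b in candidate:
        ra, rb = find(a), find(b)
        if ra != rb and w > 0:
            component[ra] = rb
            tree.append((a, b, a & b))          # (clique, clique, separator)
    return tree

# Cliques of one common triangulation of the Asia moral graph (chord L-B added).
asia_cliques = [{'A', 'T'}, {'T', 'L', 'E'}, {'L', 'E', 'B'},
                {'E', 'B', 'D'}, {'E', 'X'}, {'S', 'L', 'B'}]
for a, b, sep in junction_tree(asia_cliques):
    print(sorted(a), sorted(b), 'separator:', sorted(sep))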
However the choice of junction tree is immaterial except for considerations of computational efficiency. The junction tree retains some, but not necessarily all, of the conditional independence properties of the original DAG: the extra edges added by the moralisation and triangulation processes lose some conditional independences, but the conditional independence of the cliques given the separators between them is retained, and it is this fact that ensures that local message passing between neighbouring cliques gives consistent results for the joint probability distribution. The cliques become the basic units of the local computation, so that if the cliques become too large the granularity of the computation becomes too coarse and local computation loses its efficiency. Figure 5 shows a possible junction tree of cliques and separators for the Asia example.

Figure 5. A possible junction tree for Asia.

10. Inference on the junction tree

We summarise the results so far. We have seen that we can form a potential representation of the joint probability on the cliques of the junction tree, defined by functions a_C on the cliques:

P(U) = Π_{C ∈ C^t} a_C(X_C).

This potential representation can be generalized to include functions on the separators (the intersections of neighbouring cliques), to form the following so-called generalized potential representation:

P(U) = Π_{C ∈ C^t} a_C(X_C) / Π_{S ∈ S^t} b_S(X_S).

The previous representation is recovered as a special case
(for instance by making the separator functions the identity). Now, by sending messages between neighbouring cliques consisting of functions of the separator variables only, which modify the intervening separator and the clique receiving the message, but in such a way that the overall ratio of products remains invariant, we can arrive at the following marginal representation:

P(U) = Π_{C ∈ C} P(C) / Π_{S ∈ S} P(S).     (10)

Marginals for individual variables can be obtained from these clique (or separator) marginals by further marginalisation. Suppose that we observe "evidence" E: X_A = x*_A. Define a new function P* by

P*(x) = P(x) if x_A = x*_A, and 0 otherwise.     (11)

Then P*(U) = P(U, E) = P(E) P(U | E). We can rewrite (11) as

P*(U) = P(U) Π_{v ∈ A} l(v),     (12)

where l(v) is 1 if x_v = x*_v and 0 otherwise. Thus l(v) is the likelihood function based on the partial evidence X_v = x*_v. Clearly this also factorizes on the junction tree, and by message passing we may obtain the following clique marginal representation

P(U | E) = Π_{C ∈ C} P(C | E) / Π_{S ∈ S} P(S | E),     (13)

or, by omitting the normalization stage,

P(U, E) = Π_{C ∈ C} P(C, E) / Π_{S ∈ S} P(S, E).     (14)

Again marginal distributions for individual variables, conditional upon the evidence, can be obtained by further marginalisation of individual clique tables, as can the probability (according to the model) of the evidence, P(E).
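Equations (11) and (12) amount to multiplying tables by 0/1 likelihood indicators. The following sketch applies equation (11) directly to a small explicit joint table (hypothetical random numbers over three binary variables), then recovers P(E) and a conditional marginal by summation; on a junction tree the same zeroing is applied to a clique table containing the observed variable before propagation.

import numpy as np

# A minimal sketch of equation (11): entering evidence X_A = x*_A into an
# explicit joint table by zeroing every non-matching entry.
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # hypothetical P(A, B, C)

a_obs = 1                                        # evidence: A = 1
p_star = p.copy()
p_star[1 - a_obs, :, :] = 0.0                    # P*(x) = 0 wherever A != a_obs

p_evidence = p_star.sum()                        # P(E) = P(A = 1)
p_b_given_e = p_star.sum(axis=(0, 2)) / p_evidence   # P(B | E) by marginalisation
print(p_evidence, p_b_given_e)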
11. Why the junction tree?

Given that the moral graph has nice properties, why is it necessary to go on to form the junction tree? This is best illustrated by an example, Figure 6:

Figure 6. A non-triangulated graph.

The cliques are (A, B, C), (A, C, D), (C, D, F), (C, E, F) and (B, C, E), with successive intersections (A, C), (C, D), (C, F), (C, E) and (B, C). Suppose we have clique marginals P(A, B, C) etc. We cannot express P(A, B, C, D) in terms of P(A, B, C) and P(A, C, D) - the graphical structure does not imply B ⊥⊥ D | (A, C). In general there is no closed-form expression for the joint distribution of all six variables in terms of its clique marginals.

12. Those extra edges again
L aCt(Y UZl)aC2 {Y UZ2) ==f {Zl UZ2), y a function of the combined variables of the two cliques minus Y . Now this function cannot be accommodated by a clique in the moral graph because the variables ul and U2 are not joined (and there may be others ) .
INTRODUCTION TOINFERENCE FORBAYESIAN NETWORKS25 Hence
we cannot
P (U the
Y ) on
missing
is
can
P (U why
such
one
adds
Pearl
is one
reasoning book
properties
good
collection
which
covers
for
be
doing
scheme
in this
representation extra
able
out
fill
, then
edges
to
that
. This
accommodate one
so results
must
fill - in
in being
able
.
Bayesian intelligence
of
its
. This
in
junction
expert
systems
for
graphical
connected
DAGs with
is Shafer but
also
and
other
overviews
selected
by
papers
number
text
-
prob
-
models
;
( ie prior them
Pearl
.) A
( 1990 ) ,
formalisms
for
the
ex -
. An
is ( Spiegelhalter
a large
. His
introducing
propagating
good
of the
contain
community , from
reasoning
contains
uncertain
and
reasoning
also
for
singly
trees
uncertain
significance
references
material
use ; axiomatics
propagation
probabilistic
methods
editors
introductory
et ai . , 1993 ) . Each
of
references
for
further
. ( 1979 ) introduced
dependence phasis
. More
on
showed
Asia how
' s using
J unction in
do
relational
on this
given
in other
also
the
by
for
treating
conditional
independence
properties
latter
Lauritzen
also
are
in -
with
given
contains
discussion
textbook
reprinted
and
) ; see
by
proofs
Spiegelhalter
em Whit of
-
the
( Lauritzen section
Bayesian
( 1988 ) , who
in multiply
in
are known
in junction on
and
calculations
is also
areas
of propagation
introductory
Markov
probability , ( it
databases
and
their
basis conditional
8 .)
was
consistent
arise
of
( 1996 ) . ( The
section
propagation
trees
formulation
A recent
in
to
and
Lauritzen
example
axiomatic
accounts
models
and
stated
The
the
recent
graphical
( 1990 )
lemmas
more
, and
artificial
for
on
probabilistic
Dawid
eral
these
, to
a wealth
of making
only
a potential
of
, if we
cliques
find
turns
helped
in the
; etc , to
historical
three
reading
who
arguments
uncertainty
these
taker
popular
not
two
having
. It
distribution
. However
of the
graph
graph
joint
reading
of papers
the
moral
passing
pioneers
and
removed
graph
expressions
, 1988 ) contains
development
review
the
Y
we can
moral
message
the
to the
node
of the
of variables
, and
to
further
theory
plaining
pairs
a triangulated
become
handling
with
the
marginal
Markov
trees
edges
of
( Pearl
ability
graph
reduced
a consistent
Suggested
DAG
the
to form
up
representation
be accommodated
13 .
of
moral
intermediate
set
a potential
between
Y ) on
sufficiently to
the
edges
marginal for
form
( Shafer
and
by different and
connected Pearl
, 1990 ) ) .
names
( eg join
Spiegelhalter
of that
paper
trees
is given
networks
, 1988 ) for
. A recent by
and
Dawid
is ( Jensen
gen -
( 1992 ) .
, 1996 ) .
References

Dawid, A. P. (1979). Conditional independence in statistical theory (with discussion). Journal of the Royal Statistical Society, Series B, 41, pp. 1-31.
Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, pp. 25-36.
Jensen, F. V. (1996). An Introduction to Bayesian Networks. UCL Press, London.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford.
Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, pp. 157-224.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, California.
Shafer, G. R. and Pearl, J. (eds.) (1990). Readings in Uncertain Reasoning. Morgan Kaufmann, San Mateo, California.
Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8, pp. 219-247.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley and Sons, Chichester.
ADVANCED INFERENCE IN BAYESIAN NETWORKS
ROBERT COWELL
City University , London .
The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E OHT
1. Introduction

The previous chapter introduced inference in discrete-variable Bayesian networks. This used evidence propagation on the junction tree to find marginal distributions of interest. This chapter presents a tutorial introduction to some of the various types of calculations which can also be performed with the junction tree, specifically:

- Sampling.
- Most likely configurations.
- Fast retraction.
- Gaussian and conditional Gaussian models.
A common theme of these methods is that of a localized message-passing algorithm, but with different 'marginalisation' methods and potentials taking part in the message passing operations.

2. Sampling

Let us begin with the simple simulation problem. Given some evidence E on a (possibly empty) set of variables X_E, we might wish to simulate one or more values for the unobserved variables.
2.1. SAMPLING IN DAGS

Henrion proposed an algorithm called probabilistic logic sampling for DAGs, which works as follows. One first finds a topological ordering of the nodes of the DAG G. Let us denote the ordering by (X_1, X_2, ..., X_n), say, after relabelling the nodes, so that all parents of a node precede it in the ordering; hence any parent of X_j will have an index i < j.
Assume at first that there is no evidence. We begin by sampling a state for X_1 from P(X_1); suppose we obtain the state x*_1. Next we sample a state for X_2: if X_1 is a parent of X_2 we sample from P(X_2 | X_1 = x*_1), otherwise we sample from P(X_2); in either case we obtain a state x*_2. We proceed through the nodes in the topological ordering, sampling a state for each X_j from P(X_j | pa(X_j)), where the states of the parents of X_j will already have been sampled because they have smaller index. When all of the nodes have been sampled, the result (x*_1, x*_2, ..., x*_n) is one complete case sampled from the joint distribution P(U); repeating the procedure generates further independent samples.

Now suppose that we have evidence on one or more of the nodes. Henrion's probabilistic logic sampling deals with this by rejection: cases are generated exactly as before, but any case whose sampled states are not consistent with the evidence is discarded. The rejection step ensures that the retained cases are drawn from the correct conditional distribution P(U | E). However, the probability of rejecting a case increases with the amount of evidence, in general exponentially in the number of evidence nodes, so that even generating a small set of samples can become prohibitive. Instead we can sample more efficiently using the junction tree, as we now describe.

2.2. SAMPLING USING THE JUNCTION TREE

Suppose that the evidence E has been propagated on the junction tree, so that we have the clique-marginal representation

P(U | E) = Π_C P(C | E) / Π_S P(S | E).

To draw an analogy with direct sampling from the DAG, let us now make the junction tree 'directed'. Choose a clique C_0 to act as a root, and direct all of the edges of the junction tree so that they point away from the root. Label the cliques (C_0, C_1, ..., C_m) and the separators (S_1, ..., S_m) so that the edges point away from the root and S_i is the parent separator of the clique C_i; the result is a directed junction tree, shown in Figure 1.

Figure 1. A directed junction tree.

(Note that this notation has a subtlety in that it assumes the tree is connected, but the disconnected case is easily dealt with.) Then we may divide the contents of the parent separator S_i into the clique table of C_i to obtain the following representation:
P(U | E) = P(X_{C_0} | E) Π_{i=1}^{m} P(X_{C_i \ S_i} | X_{S_i}, E).
This is called the set-chain representation; it is now in a form similar to the recursive factorization on the DAG discussed earlier, and can be sampled from in a similar manner. The difference is that instead of sampling individual variables one at a time, one samples groups of variables in the cliques. Thus one begins by sampling a configuration in the root clique, drawing from P(X_{C_0} | E) to obtain x_{C_0}, say. Next one samples from P(X_{C_1 \ S_1} | X_{S_1}, E), where the states of the variables in X_{S_1} are fixed by x_{C_0} because X_{S_1} ⊂ X_{C_0}. One continues in this way, so that when sampling in clique C_i the variables X_{S_i} will already have been fixed by earlier sampling, as in direct sampling from the DAG. Thus one can sample directly from the correct distribution and avoid the inefficiencies of the rejection method.
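For comparison with the junction tree scheme, here is a small sketch of the rejection approach of Section 2.1 on a tiny two-node network X → Y with hypothetical tables: forward-sample in topological order, discard samples inconsistent with the evidence, and estimate P(X | Y = 1) from the retained fraction. The clique-by-clique sampler described above would instead draw directly from the P(C | E) tables and needs no rejection.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical tables for X -> Y.
p_x = np.array([0.7, 0.3])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
y_obs = 1

kept = []
for _ in range(20000):
    x = rng.choice(2, p=p_x)                    # sample X from P(X)
    y = rng.choice(2, p=p_y_given_x[x])         # then Y from P(Y | X = x)
    if y == y_obs:                              # rejection step: keep only
        kept.append(x)                          # samples that match the evidence

kept = np.array(kept)
print(np.bincount(kept, minlength=2) / len(kept))   # ~ [0.226, 0.774]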
3. Most likely configurations

One contributing reason why local propagation on the junction tree to find marginals "works" is that there is a "commutation behaviour" between the operation of summation and the product form of the joint density on the tree, which allows one to move summation operations through the terms in the product, for example:
Σ_{A,B,C} f(A, B) f(B, C) = Σ_{A,B} f(A, B) Σ_C f(B, C).

However, summation is not the only operation which has this property; another very useful operation is maximization, for example:

max_{A,B,C} f(A, B) f(B, C) = max_{A,B} ( f(A, B) max_C f(B, C) ),

provided the factors are non-negative, a condition which will hold for clique and separator potentials representing probability distributions.
3.1. MAX-PROPAGATION

So suppose we have a junction tree representation of a probability distribution, Π_C a(C) / Π_S b(S), and we perform a propagation in which the usual sum-marginalisation used to form the messages is replaced by max-marginalisation,

b*(S) = max_{C\S} a(C)

(which can be performed locally through the commutation property above). What do we get? The answer is the max-marginal representation of the joint density:
P(U, E) = Π_C P^max_C(C, E) / Π_S P^max_S(S, E),    where P^max_C(C, E) = max_{U\C} P(U, E), and similarly for the separators.

The interpretation is that for each configuration c* of the variables in the clique C, the value P^max_C(c*) is the highest probability value that any configuration of all the variables can take subject to the constraint that the variables of the clique have states c*. (One simple consequence is that this most likely value appears at least once in every clique and separator.) To see how this can come about, consider a simple tree with two sets of variables in each clique:
$$P(A,B,C,\mathcal{E}) \;=\; a(A,B)\,\frac{1}{b(B)}\,a(B,C).$$
Now recall that the message passing leaves invariant the overall distribution. So take the clique $[A B]$ to be the root clique, and send the first message, a maximization over $C$:

$$b^*(B) \;=\; \max_{C} a(B,C).$$

After "collecting" this message we have the representation:
$$P(A,B,C,\mathcal{E}) \;=\; \Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,a(B,C).$$

The root clique now holds the table obtained by maximizing over $C$, because

$$\begin{aligned}
P^{\max}(A,B,\mathcal{E}) \;:=\; \max_{C} P(A,B,C,\mathcal{E})
&= \max_{C}\Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,a(B,C) \\
&= \Bigl( a(A,B)\,\frac{b^*(B)}{b(B)} \Bigr)\,\frac{1}{b^*(B)}\,\max_{C} a(B,C) \\
&= a(A,B)\,\frac{b^*(B)}{b(B)}.
\end{aligned}$$
By symmetry, the distribute message results in the second clique table holding the max-marginal value $\max_A P(A,B,C,\mathcal{E})$ and the intervening separator holding $\max_{A,C} P(A,B,C,\mathcal{E})$. The more general result can be obtained by induction on the number of cliques in the junction tree. (Note that one can pass back to the sum-marginal representation from the max-marginal representation by a sum-propagation.) A separate but related task is to find the configuration of the variables which takes this highest probability. The procedure is as follows: first, from a general potential representation with some clique $C_0$ chosen as root, perform a collect operation using maximization instead of summation. Then search the root clique for the configuration of its variables, $\hat{c}_0$ say, which has the highest probability. Distribute this as extra "evidence", fixing successively the remaining variables in the cliques further from the root by finding a maximal configuration consistent with the neighbouring clique which has already been fixed, and including the states of the newly fixed variables as evidence, until all cliques have been so processed. The union of the "evidence" yields the most likely configuration. If there is "real" evidence then this is incorporated in the usual way in the collect operation. The interpretation is that the resulting configuration acts as a most likely explanation for the data.
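As an illustration of the two-clique example above, the following Python sketch (my own, not from the chapter) performs the max-collect to the root clique $\{A,B\}$, reads off the best root configuration, and then extends it by choosing the best consistent state of $C$; the brute-force check at the end confirms the result. The numerical potentials are invented for the example.

```python
import itertools

# Clique potentials a(A,B) and a(B,C) with binary variables; separator b(B) = 1.
a_AB = {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.30, (1, 1): 0.20}
a_BC = {(0, 0): 0.60, (0, 1): 0.40, (1, 0): 0.15, (1, 1): 0.85}

# Collect: max-marginalise a(B,C) over C to get the separator message b*(B).
b_star = {b: max(a_BC[(b, c)] for c in (0, 1)) for b in (0, 1)}

# Root clique now holds the max-marginal a(A,B) * b*(B).
root = {(A, B): a_AB[(A, B)] * b_star[B] for A, B in a_AB}
(A_hat, B_hat) = max(root, key=root.get)

# Distribute: fix (A,B) and pick the best consistent state of C.
C_hat = max((0, 1), key=lambda c: a_BC[(B_hat, c)])
print("max-propagation:", (A_hat, B_hat, C_hat))

# Brute-force check over all joint configurations.
best = max(itertools.product((0, 1), repeat=3),
           key=lambda abc: a_AB[(abc[0], abc[1])] * a_BC[(abc[1], abc[2])])
print("brute force:   ", best)
```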
Note the similarity to simulation, where one first does a collect to the root using ordinary marginalisation, then does a distribute by first randomly selecting a configuration from the root, and then randomly selecting configurations from cliques successively further out.
3.2. DEGENERACY OF MAXIMUM

It is possible to find the degeneracy of the most likely configuration, that is, the total number of distinct configurations which have the same maximum probability $P^{\max}(U \mid \mathcal{E}) = p^*$, by a simple trick. (For most realistic applications there is unlikely to be any degeneracy, although this might not be true for e.g. genetic-pedigree problems.) First one performs a max-propagation to obtain the max-marginal representation. Then one sets each value in each clique and separator to either 0 or 1, depending on whether or not it has attained the maximum probability value, thus:
$$I_C(x_C \mid \mathcal{E}) = \begin{cases} 1 & \text{if } P^{\max}_C(x_C \mid \mathcal{E}) = p^* \\ 0 & \text{otherwise,} \end{cases}
\qquad
I_S(x_S \mid \mathcal{E}) = \begin{cases} 1 & \text{if } P^{\max}_S(x_S \mid \mathcal{E}) = p^* \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$I(U \mid \mathcal{E}) \;=\; \frac{\prod_{C} I_C(x_C \mid \mathcal{E})}{\prod_{S} I_S(x_S \mid \mathcal{E})}$$

is a potential representation of the indicator function of most likely configurations; a simple sum-propagation on this will yield the degeneracy as the normalization.
3.3. TOP N CONFIGURATIONS

In finding the degeneracy of the most likely configuration in the previous section, we performed a max-propagation and then set to zero those clique elements which did not have the value of the highest probability. One might be tempted to think that if instead we set to zero all those elements which are below a certain threshold $p < 1$, then we will obtain the number of configurations having probability $\geq p$. It turns out that one can indeed find these configurations after one max-propagation, but unfortunately not by such a simple method. We will discuss a simplified version of an algorithm by Dennis Nilsson which allows one to calculate the top $N$ configurations
by a sequence of max-propagations. (Nilsson (1997) has recently shown how they can be found after a single max-propagation.)

To begin, assume we have an ordering $X_1, X_2, \ldots, X_n$ of the nodes, and do a max-propagation to find the most likely configuration, denoted $M^1 = (x_1^1, \ldots, x_n^1)$. The second most likely configuration, $M^2$, necessarily must differ from $M^1$ in the state of at least one variable. So we now perform a further $n$ max-propagations, using as "pseudo-evidence" the sets

$$\begin{aligned}
\mathcal{E}_1 &= \{X_1 \neq x_1^1\} \\
\mathcal{E}_2 &= \{X_1 = x_1^1 \text{ and } X_2 \neq x_2^1\} \\
\mathcal{E}_3 &= \{X_1 = x_1^1,\; X_2 = x_2^1 \text{ and } X_3 \neq x_3^1\} \\
&\;\;\vdots \\
\mathcal{E}_n &= \{X_1 = x_1^1,\; \ldots,\; X_{n-1} = x_{n-1}^1 \text{ and } X_n \neq x_n^1\}.
\end{aligned}$$

By this procedure we partition the set of configurations, excluding the most likely one already found, into $n$ sets; $M^2$ is the most likely of the $n$ configurations found (one from each max-propagation, read off from the max-normalizations). Suppose that $M^2$ was found by propagating the $j$-th set, so that it first disagrees with $M^1$ in the state of $X_j$ and we can write $M^2 = (x_1^1, \ldots, x_{j-1}^1, x_j^2, \ldots, x_n^2)$ with $x_j^2 \neq x_j^1$. To find the third most likely configuration $M^3$, we need only partition further the $j$-th set, excluding $M^2$, and perform an additional $n - j + 1$ max-propagations with pseudo-evidence

$$\begin{aligned}
\mathcal{E}_{j,1} &= \mathcal{E}_j \cup \{X_j \neq x_j^2\} \\
\mathcal{E}_{j,2} &= \mathcal{E}_j \cup \{X_j = x_j^2 \text{ and } X_{j+1} \neq x_{j+1}^2\} \\
&\;\;\vdots \\
\mathcal{E}_{j,\,n-j+1} &= \mathcal{E}_j \cup \{X_j = x_j^2,\; \ldots,\; X_{n-1} = x_{n-1}^2 \text{ and } X_n \neq x_n^2\}.
\end{aligned}$$

This further partitions the configurations of the $j$-th set, excluding $M^2$. After propagating these, $M^3$ can be found by looking at the most likely configuration of each partition found so far and taking the one of highest probability. We then partition the set in which $M^3$ was found, and so on: the fourth, fifth, etc. most likely configurations are found in essentially the same way. The main problem is to develop a suitable notation to keep track of which partitions have been searched; the idea itself is quite simple.
If we have prior evidence, then we simply take this into account at the beginning, and ensure that the partitions do not violate the evidence. Thus, for example, if we have evidence about $m$ nodes being in definite states, then instead of $n$ propagations being required to find $M^2$ after having found $M^1$, we require only $n - m$ further propagations. One application of finding a set of such most likely explanations is to explanation, i.e., answering what the states of the unobserved variables are likely to be for a particular case. We have already seen that the most likely configuration offers such an explanation. If instead we have the top 10 or 20 configurations, then in most applications most of these will have most variables in the same state. This can confirm the diagnosis for most variables, but also shows up where the diagnosis is not so certain (in those variables which differ between these top configurations). This means that if one is looking for a more accurate explanation one could pay attention to those variables which differ between the top configurations; they hence serve to guide one to what could be the most informative test to do (cf. value of information). The use of partitioned "dummy evidence" is a neat and quite general idea, and will probably find other applications.¹
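To make the bookkeeping concrete, here is a small Python sketch of the partitioning search (my own illustration; it uses brute-force maximization over a tiny explicit joint table in place of a real max-propagation, so it demonstrates only the partition logic, not the efficiency). The variable and function names are assumptions for the example.

```python
import heapq
import itertools

# A tiny joint distribution over three binary variables, as an explicit table.
states = list(itertools.product((0, 1), repeat=3))
probs = (0.02, 0.08, 0.10, 0.05, 0.20, 0.25, 0.12, 0.18)
P = dict(zip(states, probs))

def constrained_max(fixed, forbidden):
    """Stand-in for one max-propagation: the most likely configuration with
    clamps `fixed[i] = value` and exclusions `x[i] not in forbidden[i]`."""
    best = None
    for x, p in P.items():
        if any(x[i] != v for i, v in fixed.items()):
            continue
        if any(x[i] in vals for i, vals in forbidden.items()):
            continue
        if best is None or p > best[1]:
            best = (x, p)
    return best

def top_n(n):
    n_vars = len(states[0])
    m1 = max(P, key=P.get)
    results = [(m1, P[m1])]
    heap = []   # entries: (-prob, config, fixed, forbidden, first free position)

    def split(m, fixed, forbidden, j0):
        """Partition the current set minus {m}, pushing each sub-partition's maximum."""
        for t in range(j0, n_vars):
            new_fixed = dict(fixed)
            new_fixed.update({i: m[i] for i in range(j0, t)})
            new_forb = {i: set(v) for i, v in forbidden.items()}
            new_forb.setdefault(t, set()).add(m[t])
            cand = constrained_max(new_fixed, new_forb)
            if cand is not None:
                heapq.heappush(heap, (-cand[1], cand[0], new_fixed, new_forb, t))

    split(m1, {}, {}, 0)
    while heap and len(results) < n:
        negp, cfg, fixed, forb, j0 = heapq.heappop(heap)
        results.append((cfg, -negp))
        split(cfg, fixed, forb, j0)   # only the partition where cfg was found is split further
    return results

print(top_n(4))
```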
4. A unification

One simple comment to make is that minimization can be performed in a similar way to maximization. (In applications with logical dependencies the minimal configuration will have zero probability and there will be many such configurations; for example, in the ASIA example half of the 256 configurations have zero probability.) Another less obvious observation is that sum-, max- and min-propagation are all special cases of a more general propagation based upon $L^p$ norms, as used in functional analysis. Recall that the $L^p$ norm of a non-negative real-valued function is defined to be
$$L^p(f) \;=\; \Bigl( \int_{x \in \mathcal{X}} f^p(x)\, dx \Bigr)^{1/p}.$$
For $p = 1$ this gives the usual integral, for $p \to \infty$ this gives the maximum of the function over the region of integration, and for $p \to -\infty$ we obtain the minimum of $f$. We can use this in the message propagation in our junction tree: the marginal message we pass from clique to separator is the $L^p$ marginal, defined by

$$b_S(X_S) \;=\; \Bigl( \sum_{X_C \setminus X_S} a_C^p(X_C) \Bigr)^{1/p}.$$

¹See for example (Cowell, 1997) for sampling without replacement from the junction tree.
In this way we obtain the $L^p$-marginal representation:
$$P(U \mid \mathcal{E}) \;=\; \frac{\prod_{C \in \mathcal{C}} P^{L^p}_C(X_C \mid \mathcal{E})}{\prod_{S \in \mathcal{S}} P^{L^p}_S(X_S \mid \mathcal{E})},$$
which is an infinite family of representations. Apart from the $L^2$ norm, which may have an application to quadratic scoring of models, it is not clear if this general result is of much practical applicability, though it may have theoretical uses.

5. Fast retraction
Suppose that for a network of variables $X$ we have evidence on a subset of $k$ variables $U^*$, $\mathcal{E}_{U^*} = \{\mathcal{E}_u : u \in U^*\}$, with $\mathcal{E}_u$ of the form "$X_u = x_u^*$". Then it can be useful to compare each item of evidence with the probabilistic prediction given by the system for $X_u$ on the basis of the remaining evidence $\mathcal{E}_{\setminus \{u\}}$: "$X_v = x_v^*$ for $v \in U^* \setminus \{u\}$", as expressed in the conditional density of $X_u$ given $\mathcal{E}_{\setminus \{u\}}$. If we find that abnormally low probabilities are being predicted by the model, this can highlight deficiencies of the model which could need attention, or may indicate that a rare case is being observed.

Now one "brute force" method to calculate such probabilities is to perform $k$ separate propagations, in which one takes out in turn the evidence on each variable in question and propagates the evidence for all of the remaining variables. However, it turns out that yet another variation of the propagation algorithm allows one to calculate all of these predictive probabilities in one propagation, at least for the case in which the joint probability is strictly positive, which is the case we shall restrict ourselves to here. (For probabilities with zeros it may still be possible to apply the following algorithm; the matter depends upon the network and junction tree. For the Shafer-Shenoy message passing scheme the problem does not arise, because divisions are not necessary.) Because of the computational savings implied, the method is called fast-retraction.

5.1. OUT-MARGINALISATION
The basic idea is to work with a potential representation of the prior joint probability even when there is evidence. This means that , unlike the earlier
sections , we do not modify the clique potentials by multiplying them by the evidence likelihoods . Instead we incorporate the evidence only into forming the messages, by a new marginalisation method called out-marginalisation , which will be illustrated for a simple two- clique example :
[A B] —— (B) —— [B C]
Here $A$, $B$ and $C$ are disjoint sets of variables, and the clique and separator potentials are all positive. Suppose we have evidence on variables $\alpha \in A$, $\beta \in B$ and $\gamma \in C$. Let us denote the evidence functions by $h_\alpha$, $h_\beta$ and $h_\gamma$, where $h_\alpha$ is the product of the evidence likelihoods for the variables $\alpha \in A$, etc. Then we have
$$\begin{aligned}
P(ABC) &= g(AB)\,\frac{1}{g(B)}\,g(BC) \\
P(ABC, \mathcal{E}_\alpha) &= P(ABC)\,h_\alpha \\
P(ABC, \mathcal{E}_\gamma) &= P(ABC)\,h_\gamma \\
P(ABC, \mathcal{E}_\alpha, \mathcal{E}_\gamma) &= P(ABC)\,h_\alpha\, h_\gamma,
\end{aligned}$$
where the $g$'s are the clique and separator potentials. We take the clique $[A B]$ as root. Our first step is to send an out-marginal message from the clique $[B C]$ to $[A B]$, defined as

$$g^*(B) \;=\; \sum_{C} g(BC)\, h_\gamma.$$
That is, we only incorporate into the message that subset of evidence about the variables in $C$, thus excluding any evidence that may be relevant to the separator variables $B$. Note that because we are using the restriction that the joint probability is nonzero for every configuration, the potentials and messages are also nonzero. Sending this message leaves the overall product of junction tree potentials invariant as usual:
$$P(ABC) \;=\; \Bigl( g(AB)\,\frac{g^*(B)}{g(B)} \Bigr)\,\frac{1}{g^*(B)}\,g(BC).$$
Now let us use this representation to look at the content of the root clique. The clique $[A B]$ now holds the out-margin

$$P^{\text{out}}(AB, \mathcal{E}_{\setminus A \cup B}) \;:=\; \sum_{C} P(ABC, \mathcal{E}_\gamma) \;=\; \sum_{C} P(ABC)\, h_\gamma
\;=\; \Bigl( g(AB)\,\frac{g^*(B)}{g(B)} \Bigr)\,\frac{1}{g^*(B)}\,\sum_{C} g(BC)\, h_\gamma
\;=\; g(AB)\,\frac{g^*(B)}{g(B)},$$

which is the joint probability of the clique variables together with the evidence on the variables outside the clique, but with no evidence on $A$ or $B$ itself included. By symmetry, sending back from the root an out-marginal message which incorporates only the evidence $h_\alpha$ (thus excluding any evidence about the separator variables $B$), the clique $[B C]$ comes to hold the out-margin $P^{\text{out}}(BC, \mathcal{E}_{\setminus B \cup C})$ and the separator holds $P^{\text{out}}(B, \mathcal{E}_{\setminus B})$. In the general case, after propagating with out-marginalisation the junction tree holds the out-marginal representation

$$P(U) \;=\; \frac{\prod_{C} P^{\text{out}}(X_C, \mathcal{E}_{\setminus C})}{\prod_{S} P^{\text{out}}(X_S, \mathcal{E}_{\setminus S})},$$

where $P^{\text{out}}(X_C, \mathcal{E}_{\setminus C})$ denotes the joint probability of the clique variables and of the evidence on the variables not in the clique. Note also that multiplying a clique out-margin by the evidence likelihoods of some of its own variables recovers out-margins with respect to smaller sets, for example $P^{\text{out}}(AB, \mathcal{E}_{\setminus A \cup B})\, h_\alpha = P^{\text{out}}(AB, \mathcal{E}_{\setminus B})$. From these clique out-margins the desired predictive probabilities follow by simple marginalisation: for each variable on which there is evidence, marginalise the out-margin of a clique containing it down to that variable to obtain its predictive density given all of the remaining evidence. Fast retraction thus yields, in a single propagation, all of the predictive probabilities needed for comparing each item of evidence against the prediction made by the rest of the evidence.

There is also another use, besides comparing evidence against predictive probabilities. Because the clique potentials retain a representation of the prior joint probability, there is no need to re-initialise the junction tree when one wishes to deal with another case having different evidence; with the previous propagation schemes, a re-initialisation of the potential representation would be required before propagating the evidence for a new case.
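As a check on the two-clique derivation above, here is a small Python sketch (an illustration of mine, not code from the chapter) that builds clique potentials for the tree $[A B]$–$(B)$–$[B C]$, applies an evidence likelihood only on $C$, and confirms that the out-marginal message leaves the root clique holding $\sum_C P(ABC)\,h_\gamma$. The numbers are invented for the example.

```python
# Clique potentials for [A,B] -- (B) -- [B,C]; binary variables, g(B) = 1.
g_AB = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}   # = P(A,B)
g_BC = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}       # = P(C|B)
h_gamma = {0: 1.0, 1: 0.2}   # evidence likelihood on C only

# Out-marginal message from [B,C] to the root [A,B]: g*(B) = sum_C g(B,C) h_gamma(C).
g_star = {b: sum(g_BC[(b, c)] * h_gamma[c] for c in (0, 1)) for b in (0, 1)}

# Root clique now holds g(A,B) * g*(B) / g(B)  (here g(B) = 1).
root = {(a, b): g_AB[(a, b)] * g_star[b] for a, b in g_AB}

# Direct computation of the out-margin: sum_C P(A,B,C) * h_gamma(C).
direct = {(a, b): sum(g_AB[(a, b)] * g_BC[(b, c)] * h_gamma[c] for c in (0, 1))
          for a, b in g_AB}

for ab in sorted(root):
    assert abs(root[ab] - direct[ab]) < 1e-12
print("root clique out-margin:", root)
```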
6. Modelling with continuous variables

All examples and discussion so far have been restricted to the special case of discrete random variables. In principle, however, there is no reason why we should not build models having continuous random variables as well as, or instead of, discrete random variables, with more general conditional probability densities to represent the joint density, and use local message passing to simplify the calculations. In practice the barrier to such general applicability is the inability to perform the required integrations in closed form representable by a computer. (Such general models can be analyzed by simulation, for example Gibbs sampling.) However, there is a case for which such message passing is tractable, and that is when the random variables are such that the overall distribution is multivariate Gaussian. This further extends to the situation where both discrete and continuous random variables coexist within a model having a so-called conditional-Gaussian joint distribution. We will first discuss Gaussian models, and then discuss the necessary adjustments to the theory enabling analysis of mixed models with local computation.

7. Gaussian models
Structurally, the directed Gaussian model looks very much like the discrete models we have already seen. The novel aspect is in their numerical specification. Essentially, the conditional distribution of a node given its parents is a Gaussian distribution with expectation linear in the values of the parent nodes, and variance independent of the parent nodes. Let us take a familiar example, the chain $[Y] \to [X] \to [Z]$. Node $Y$, which has no parents, has a normal distribution given by

$$N_Y(\mu_Y; \sigma_Y^2) \;\propto\; \exp\Bigl( -\frac{(y - \mu_Y)^2}{2\sigma_Y^2} \Bigr),$$

where $\mu_Y$ and $\sigma_Y$ are constants. Node $X$ has node $Y$ as a parent, and has the conditional density

$$N_X(\mu_X + \beta_{X,Y}\, y;\; \sigma_X^2) \;\propto\; \exp\Bigl( -\frac{(x - \mu_X - \beta_{X,Y}\, y)^2}{2\sigma_X^2} \Bigr),$$
where $\mu_X$, $\beta_{X,Y}$ and $\sigma_X$ are constants. Finally, node $Z$ has only $X$ as a parent; its conditional density is given by

$$N_Z(\mu_Z + \beta_{Z,X}\, x;\; \sigma_Z^2) \;\propto\; \exp\Bigl( -\frac{(z - \mu_Z - \beta_{Z,X}\, x)^2}{2\sigma_Z^2} \Bigr).$$

In general, if a node $X$ has parents $\{Y_1, \ldots, Y_n\}$ it has a conditional density

$$N_X\Bigl(\mu_X + \sum_i \beta_{X,Y_i}\, y_i;\; \sigma_X^2\Bigr) \;\propto\; \exp\Bigl( -\frac{(x - \mu_X - \sum_i \beta_{X,Y_i}\, y_i)^2}{2\sigma_X^2} \Bigr).$$
Now the joint density is obtained by multiplying together the separate component Gaussian distributions:

$$\begin{aligned}
P(X, Y, Z) &= N_Y(\mu_Y; \sigma_Y^2)\, N_X(\mu_X + \beta_{X,Y}\, y; \sigma_X^2)\, N_Z(\mu_Z + \beta_{Z,X}\, x; \sigma_Z^2) \\
&\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x - \mu_X,\; y - \mu_Y,\; z - \mu_Z)\, K\, (x - \mu_X,\; y - \mu_Y,\; z - \mu_Z)^T \Bigr),
\end{aligned}$$

where $K$ is a symmetric (positive definite) $3 \times 3$ matrix, and $T$ denotes transpose. In a more general model with $n$ nodes, one obtains a similar expression with an $n \times n$ symmetric (positive definite) matrix. Expanding the exponential, the joint density can be written (up to a constant factor) as

$$\exp\left( (x\;\; y\;\; z) \begin{pmatrix} h_x \\ h_y \\ h_z \end{pmatrix} \;-\; \frac{1}{2}\, (x\;\; y\;\; z) \begin{pmatrix} K_{xx} & K_{xy} & K_{xz} \\ K_{yx} & K_{yy} & K_{yz} \\ K_{zx} & K_{zy} & K_{zz} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \right),$$

where the
$h$'s and the elements of $K$ consist of functions of the $\mu$'s, $\beta$'s and $\sigma$'s of the conditional specifications (for example $h_x$ involves $\mu_X/\sigma_X^2$ and $\beta_{Z,X}\,\mu_Z/\sigma_Z^2$), etc. This form of the joint density is the most useful for constructing local messages, and indeed we shall be using potential functions of this type. Let us now define them and list the properties we shall be using.

7.1. GAUSSIAN POTENTIALS

Suppose we have $n$ continuous random variables $X_1, \ldots, X_n$. A Gaussian potential on a subset $\{Y_1, \ldots, Y_k\}$ of the variables is a function of the form

$$\phi(y_1, \ldots, y_k) \;=\; \exp\left( g + (y_1 \cdots y_k) \begin{pmatrix} h_1 \\ \vdots \\ h_k \end{pmatrix} \;-\; \frac{1}{2}\,(y_1 \cdots y_k) \begin{pmatrix} K_{1,1} & \cdots & K_{1,k} \\ \vdots & & \vdots \\ K_{k,1} & \cdots & K_{k,k} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix} \right),$$
where $K$ is a constant positive definite $k \times k$ matrix, $h$ is a $k$-dimensional constant vector and $g$ is a number. For shorthand we write this as a triple, $(g, h, K)$. Gaussian potentials can be multiplied together by adding their respective triples:
$$\phi_1 * \phi_2 \;=\; (g_1 + g_2,\; h_1 + h_2,\; K_1 + K_2).$$

Similarly, division is easily handled:

$$\phi_1 / \phi_2 \;=\; (g_1 - g_2,\; h_1 - h_2,\; K_1 - K_2).$$
These operations will be used in passing the "update factor" from separator to clique. To initialize cliques we shall require the extension operation combined with multiplication. A Gaussian potential defined on a set of variables $Y$ is extended to a larger set of variables by enlarging the vector $h$ and matrix $K$ to the appropriate size and setting the new slots to zero. Thus for example $\phi(x) = \exp(g + x^T h - \tfrac{1}{2} x^T K x)$ extends to
$$\phi(x, y) \;=\; \phi(x) \;=\; \exp\left( g + (x\;\; y)\begin{pmatrix} h \\ 0 \end{pmatrix} - \frac{1}{2}\,(x\;\; y)\begin{pmatrix} K & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \right).$$

Finally, to form the messages we must define marginalisation, which is now an integration. Let us take $Y_1$ and $Y_2$ to be two sets of distinct variables, and

$$\phi(y_1, y_2) \;=\; \exp\left( g + (y_1\;\; y_2)\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} - \frac{1}{2}\,(y_1\;\; y_2)\begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right),$$

so that the $h$ and $K$ are in blocks. Then integrating over $y_1$ yields a new vector $\tilde{h}$ and matrix $\tilde{K}$ as follows:

$$\tilde{h} \;=\; h_2 - K_{2,1} K_{1,1}^{-1} h_1, \qquad \tilde{K} \;=\; K_{2,2} - K_{2,1} K_{1,1}^{-1} K_{1,2}.$$

(Discussion of the normalization will be omitted, because it is not required except for calculating probability densities of evidence.) Thus integration has a simple algebraic structure.

7.2. JUNCTION TREES FOR GAUSSIAN NETWORKS

Having defined the directed Gaussian model, the construction of the junction tree proceeds exactly as for the discrete case, as far as the structure is concerned. The difference is with the initialization. A Gaussian potential of the correct size is allocated to each clique and separator. They are initialized with all elements equal to zero.
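Before continuing, here is a short numpy sketch (my own check, not from the chapter) verifying the integration formulas above: it marginalises a Gaussian potential in canonical form and compares with the route through moment characteristics (mean $K^{-1}h$, covariance $K^{-1}$). The random test matrix is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
K = A @ A.T + 4 * np.eye(4)          # positive definite precision matrix
h = rng.normal(size=4)
p = 2                                 # integrate out the first p coordinates (y1)

K11, K12 = K[:p, :p], K[:p, p:]
K21, K22 = K[p:, :p], K[p:, p:]
h1, h2 = h[:p], h[p:]

# Canonical-characteristics formulas for integrating out y1.
h_tilde = h2 - K21 @ np.linalg.solve(K11, h1)
K_tilde = K22 - K21 @ np.linalg.solve(K11, K12)

# Check via moment characteristics: the marginal of y2 keeps the
# corresponding blocks of the mean K^{-1}h and covariance K^{-1}.
Sigma = np.linalg.inv(K)
mu = Sigma @ h
K_check = np.linalg.inv(Sigma[p:, p:])
h_check = K_check @ mu[p:]

assert np.allclose(K_tilde, K_check)
assert np.allclose(h_tilde, h_check)
print("marginalisation formulas agree")
```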
Next, for each conditional density of the DAG model, a Gaussian potential is constructed to represent it and multiplied into any one clique which contains the node and its parents, using extension if required. The result is a junction tree representation of the joint density. Assuming no evidence, sending the clique marginals as messages then results in the clique-marginal representation, as for the discrete case:
$$P(U) \;=\; \prod_{C} P(X_C) \Big/ \prod_{S} P(X_S).$$
Care must be taken in propagating evidence. By evidence $\mathcal{E}$ on a set of nodes $Y$ we mean that each node in $Y$ is observed to take a definite value. (This is unlike the discrete case, in which some states of a variable could be excluded but more than one could still be entertained.) Evidence about a variable must be entered into every clique and separator in which it occurs. This is because when evidence is entered on a variable it reduces the dimensions of every $h$ vector and $K$ matrix in the cliques and separators in which it occurs. Thus, for example, let us again take $Y_1$ and $Y_2$ to be two sets of distinct variables, and
$$\phi(y_1, y_2) \;\propto\; \exp\left( (y_1\;\; y_2)\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} - \frac{1}{2}\,(y_1\;\; y_2)\begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right),$$

so that the $h$ and $K$ are in blocks. Suppose we now observe the variables $Y_2$ to take the values $y_2^*$. Then entering this evidence modifies the potential to one over $Y_1$ alone, with

$$\tilde{h} \;=\; h_1 - K_{1,2}\, y_2^*, \qquad \tilde{K} \;=\; K_{1,1}.$$

After such evidence has been entered on the observed variables in every clique and separator in which they are included, a propagation with standard marginalisation will yield the clique-marginal representation of the density given the evidence: the individual clique and separator potentials become Gaussian densities for their remaining variables, and further marginalisation within a clique then gives the marginal density of any individual node.
7.3. EXAMPLE

Let us take our three-node example $[Y] \to [X] \to [Z]$ again, with initial conditional distributions as follows:

$$N(Y) = N(0, 1), \qquad N(X \mid Y) = N(y, 1), \qquad N(Z \mid X) = N(x, 1).$$
The cliques for this tree are $[X Y]$ and $[X Z]$. After initializing and propagating, the clique potentials are
$$\phi(x, y) \;\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x\;\; y)\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \Bigr),
\qquad
\phi(x, z) \;\propto\; \exp\Bigl( -\tfrac{1}{2}\,(x\;\; z)\begin{pmatrix} \tfrac{3}{2} & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} x \\ z \end{pmatrix} \Bigr),$$

with separator $\phi(x) \propto \exp(-x^2/4)$. Now if we enter evidence $X = 1.5$, say, then the potentials reduce to
$$\phi(X{=}1.5,\, y) \;\propto\; \exp(1.5\,y - y^2)
\qquad\text{and}\qquad
\phi(X{=}1.5,\, z) \;\propto\; \exp(1.5\,z - \tfrac{1}{2} z^2),$$

because in this example $X$ makes up the separator
between the two cliques .
The marginal densities are then :
$$P(Y) = N(0.75,\, 0.5) \qquad\text{and}\qquad P(Z) = N(1.5,\, 1).$$

Alternatively, suppose that we first enter evidence that $Z = 1.5$. As we have seen, the message from the clique $[X Z]$ to the root clique $[X Y]$ is then given by

$$\phi(x) \;\propto\; \exp(1.5\,x - 0.75\,x^2),$$

so that after propagating, the root clique potential is

$$\phi(x, y) \;\propto\; \exp\Bigl( (x\;\; y)\begin{pmatrix} 1.5 \\ 0 \end{pmatrix} - \tfrac{1}{2}\,(x\;\; y)\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \Bigr),$$

and the marginal densities are $P(X) = N(1,\, 2/3)$ and $P(Y) = N(1/2,\, 2/3)$.
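The example can be checked by ordinary multivariate-Gaussian conditioning. The following numpy sketch (my own verification, not from the chapter) builds the joint covariance of $(Y, X, Z)$ for this model and conditions on $X = 1.5$ and on $Z = 1.5$, reproducing the marginals quoted above.

```python
import numpy as np

# Joint covariance of (Y, X, Z) for Y ~ N(0,1), X | Y ~ N(y,1), Z | X ~ N(x,1).
Sigma = np.array([[1.0, 1.0, 1.0],
                  [1.0, 2.0, 2.0],
                  [1.0, 2.0, 3.0]])
mu = np.zeros(3)

def condition(mu, Sigma, idx, value):
    """Mean and covariance of the remaining variables given variable `idx` = value."""
    keep = [i for i in range(len(mu)) if i != idx]
    S_kk = Sigma[np.ix_(keep, keep)]
    S_ko = Sigma[np.ix_(keep, [idx])]
    S_oo = Sigma[idx, idx]
    mu_c = mu[keep] + (S_ko[:, 0] / S_oo) * (value - mu[idx])
    Sigma_c = S_kk - S_ko @ S_ko.T / S_oo
    return mu_c, Sigma_c

print(condition(mu, Sigma, idx=1, value=1.5))   # given X=1.5: Y ~ N(0.75, 0.5), Z ~ N(1.5, 1)
print(condition(mu, Sigma, idx=2, value=1.5))   # given Z=1.5: Y ~ N(0.5, 2/3), X ~ N(1.0, 2/3)
```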
8. Conditional Gaussian models

The treatment of Gaussian networks has been much the same as for the discrete networks. The minor differences are (1) the nature of the potentials employed, and (2) that evidence has to be entered into every clique and separator. Conditional-Gaussian models, for mixed discrete and continuous variables, also proceed in much the same way, but with some more important differences. The first is a restriction at the modelling stage: continuous variables
are not allowed to be parents of discrete variables; that is, discrete nodes may have only discrete parents, while continuous nodes may have both discrete and continuous parents. The conditional distributions of the continuous variables are Gaussian, as before, with means linear in the continuous parent values, but the constants, regression coefficients and variances are now allowed to differ for each configuration of the discrete parents; the discrete variables have conditional probability tables indexed, as in the discrete case, only by configurations of discrete parents. We give only a brief guide to the theory here, which follows closely the original paper by Lauritzen; the reader should consult it for more careful details.

8.1. CG-POTENTIALS

Let $\Delta$ denote the set of discrete variables and $\Gamma$ the set of continuous variables, with $i$ denoting a typical configuration of the discrete variables and $y$ the values of the continuous variables. The joint density of a CG (conditional-Gaussian) distribution has the form

$$f(i, y) \;=\; \chi(i)\,\exp\Bigl( g(i) + y^T h(i) - \tfrac{1}{2}\, y^T K(i)\, y \Bigr),$$

where $\chi(i) \in \{0, 1\}$ is an indicator function denoting whether the density is positive for the discrete configuration $i$, and, for each such $i$, $g(i)$ is a number, $h(i)$ a vector and $K(i)$ a positive definite matrix. The triple $(g, h, K)$ of functions of $i$ is called the canonical characteristics of the distribution. Inverting, we obtain the moment characteristics $\{p(i), \xi(i), \Sigma(i)\}$, consisting of the probabilities of the discrete configurations and the conditional means and covariances of the continuous variables:

$$\Sigma(i) = K(i)^{-1}, \qquad \xi(i) = K(i)^{-1} h(i), \qquad
p(i) = (2\pi)^{|\Gamma|/2}\,\bigl(\det K(i)\bigr)^{-1/2}\,\exp\Bigl( g(i) + \tfrac{1}{2}\, h(i)^T K(i)^{-1} h(i) \Bigr).$$

As in the pure Gaussian case, the local computations are carried out not on densities but on potentials. A CG potential has the same functional form,

$$\phi(i, y) \;=\; \chi(i)\,\exp\Bigl( g(i) + y^T h(i) - \tfrac{1}{2}\, y^T K(i)\, y \Bigr),$$

with canonical characteristics $(g, h, K)$, except that now
$K(i)$ is restricted only to be symmetric, though not necessarily invertible. However, we still call the triple $(g, h, K)$ the canonical characteristics, and if for all $i$, $\chi(i) > 0$ and $K(i)$ is positive definite, then the moment characteristics are given as before. Multiplication, division and extension proceed as for the Gaussian potentials already discussed. Marginalisation is, however, different,
because adding two CG potentials in general will result in a mixture of CG potentials - a function of a different algebraic structure . Thus we need to distinguish two types of marginalisation - strong and weak.
8.2. MARGINALISATION
Marginalising continuous variables corresponds to integration. Let

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad h = \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}, \qquad K = \begin{pmatrix} K_{1,1} & K_{1,2} \\ K_{2,1} & K_{2,2} \end{pmatrix},$$

with $y_1$ having dimension $p$ and $y_2$ dimension $q$, and with $K_{1,1}$ positive definite. Then the integral $\int \phi(i, y_1, y_2)\, dy_1$ is finite and equal to a CG potential $\tilde{\phi}$ with canonical characteristics given by

$$\begin{aligned}
\tilde{g}(i) &= g(i) + \tfrac{1}{2}\Bigl\{ p \log(2\pi) - \log\det K_{1,1}(i) + h_1(i)^T K_{1,1}(i)^{-1} h_1(i) \Bigr\} \\
\tilde{h}(i) &= h_2(i) - K_{2,1}(i)\, K_{1,1}(i)^{-1} h_1(i) \\
\tilde{K}(i) &= K_{2,2}(i) - K_{2,1}(i)\, K_{1,1}(i)^{-1} K_{1,2}(i).
\end{aligned}$$

Marginalising discrete variables corresponds to summation. Since in general the addition of CG potentials results in a mixture of CG potentials, an alternative definition based upon the moment characteristics $\{p, \xi, \Sigma\}$ is used which does result in a CG potential; however, it is only well defined for $K(i, j)$ positive definite. Specifically, the marginal over the discrete states $j$ of $\phi$ is defined as the CG potential with moment characteristics $\{\tilde{p}, \tilde{\xi}, \tilde{\Sigma}\}$, where
$$\tilde{p}(i) = \sum_{j} p(i,j), \qquad
\tilde{\xi}(i) = \sum_{j} \xi(i,j)\, p(i,j) \Big/ \tilde{p}(i),$$

$$\tilde{\Sigma}(i) = \sum_{j} \Sigma(i,j)\, p(i,j)\Big/\tilde{p}(i) \;+\; \sum_{j} \bigl(\xi(i,j) - \tilde{\xi}(i)\bigr)\bigl(\xi(i,j) - \tilde{\xi}(i)\bigr)^T p(i,j)\Big/\tilde{p}(i).$$
Note that the latter can be written as
$$\tilde{p}(i)\,\bigl(\tilde{\Sigma}(i) + \tilde{\xi}(i)\,\tilde{\xi}(i)^T\bigr) \;=\; \sum_{j} p(i,j)\,\bigl(\Sigma(i,j) + \xi(i,j)\,\xi(i,j)^T\bigr),$$
so that if $\Sigma(i,j)$ and $\xi(i,j)$ are independent of $j$ then they can be taken through the summations as constants. This observation is used to define a marginalisation over both continuous and discrete variables: first marginalise over the continuous variables and then over the discrete variables. If, after marginalising over the continuous variables, the resulting pair $(h, K)$ is independent of the discrete variables to be marginalised over (summation over these discrete variables then leaves the pair $(h, K)$
unmodified ) , we say that we have a strong marginalisation . Otherwise one sums over the discrete variables using the moment characteristics , and the overall marginalisation is called a weak marginalisation . Weak and strong marginalisation satisfy composition :
$$\sum_{A} \sum_{B} \phi_{A \cup B \cup C} \;=\; \sum_{A \cup B} \phi_{A \cup B \cup C},$$

but in general only the strong marginalisation satisfies

$$\sum_{A} \bigl( \phi_{A \cup B}\, \psi_B \bigr) \;=\; \psi_B \sum_{A} \phi_{A \cup B}.$$
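To illustrate the weak marginalisation of discrete states, the following Python sketch (my own, using numpy) applies the moment-matching formulas above to collapse a two-component Gaussian mixture over a discrete index $j$ into a single Gaussian with matching mean and covariance; the example numbers are assumptions.

```python
import numpy as np

def weak_marginal(p, xi, sigma):
    """Collapse a mixture over the discrete index j.

    p:     array of shape (m,)       -- probabilities p(j)
    xi:    array of shape (m, d)     -- component means xi(j)
    sigma: array of shape (m, d, d)  -- component covariances Sigma(j)
    Returns (p_tilde, xi_tilde, sigma_tilde) of the collapsed CG potential.
    """
    p_tilde = p.sum()
    w = p / p_tilde
    xi_tilde = (w[:, None] * xi).sum(axis=0)
    diff = xi - xi_tilde
    sigma_tilde = (w[:, None, None] * sigma).sum(axis=0) \
                + (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return p_tilde, xi_tilde, sigma_tilde

# Two components over one continuous variable.
p = np.array([0.3, 0.7])
xi = np.array([[0.0], [2.0]])
sigma = np.array([[[1.0]], [[0.5]]])
pt, xt, st = weak_marginal(p, xi, sigma)
print(pt, xt, st)   # matches the mixture's mean 1.4 and variance 1.49
```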
Under both types of marginalisation the result has the correct moments; that is, for the marginal of a CG distribution we have $P(I = i) = \tilde{p}(i)$, $E(Y \mid I = i) = \tilde{\xi}(i)$ and $V(Y \mid I = i) = \tilde{\Sigma}(i)$. A weak marginalisation does not in general preserve the full density, only these first two moments.

8.3. MAKING THE JUNCTION TREE

The construction of the junction tree from the DAG proceeds as in the previous cases: we moralize, triangulate, and form a tree of cliques. Unlike the purely discrete case, however, we cannot triangulate using just any elimination ordering. Instead the ordering is restricted: we must eliminate all of the continuous variables first, and only then eliminate the discrete variables.² The resulting junction tree then has a distinguished clique $R$, called a strong root, with the property that for any pair $(A, B)$ of neighbouring cliques with $A$ closer to $R$ than $B$, either the separator is purely discrete, $B \cap A \subseteq \Delta$, or the variables of $B$ beyond the separator are all continuous, $B \setminus A \subseteq \Gamma$. When propagating we choose such a strong root as the root clique towards which the collect operation is performed.

²One way to construct a triangulation with the restricted ordering is the usual elimination procedure: take an ordering of the nodes (with all continuous nodes before any discrete node), work through the nodes in turn, join all pairs of as-yet unmarked neighbours of the current node, and then mark it. Finding a good restricted ordering is then essentially equivalent to finding a good elimination ordering in the pure cases.
8.4. PROPAGATION ON THE JUNCTION TREE

The point of this restriction is that on the collect operation, only strong marginalisations are required to be performed. This is because our restricted elimination ordering (getting rid of the continuous variables first) is equivalent to doing the integrations over the continuous variables before marginalising any of the discrete variables. Thus our message passing algorithm takes the form:

1. Initialization: set all clique and separator potentials to zero with unit indicators, and multiply in the model-specifying potentials, using the extension operation where appropriate.
2. Enter evidence into all clique and separator potentials, reducing vector and matrix sizes as necessary.
3. Perform a collect operation to the strong root, where the messages are formed by strong marginalisation: first integrating out the redundant continuous variables, and then summing over discrete variables.
4. Perform a distribute operation, using weak marginalisation where appropriate, i.e. when mixtures might be formed on marginalising over the discrete variables.
The result is a representation of the joint CG-distribution including evidence, because of the invariant nature of the message passing algorithm. Furthermore, because of the use of weak marginalisation for the distribute operation, the marginals on the cliques will themselves be CG-distributions whose first two moments match those of the full distribution. The following is an outline sketch of why this should be. First, by the construction of the junction tree, all collect operations are strong marginals, so that after a collect-to-root operation the root clique contains a strong marginal. Now suppose, for simplicity, that before the
distribute operation we move to a set-chain representation (cf. Section 2.2). Then, apart from the strong root, each clique will have the correct joint density $P(X_{C_i \setminus S_i} \mid X_{S_i})$, where $S_i$ is the separator adjacent to the clique $C_i$ on the path between it and the strong root. Now on the distribute operation the clique $C_i$ will be multiplied by a CG-potential which will be either a strong marginal or a weak marginal. If the former, then the clique potential will be the correct marginal joint density. If the latter, then we may write the clique potential as the product $P(X_{C_i \setminus S_i} \mid X_{S_i}) * Q(X_{S_i})$, where $Q$ is the correct weak marginal for the variables $X_{S_i}$. Now consider taking an expectation of any linear or quadratic function of the $X_{C_i}$ with respect to this "density". We are free to carry out the integrations in stages. Choosing to integrate with respect to $X_{C_i \setminus S_i}$ first means that we form the expectation with respect to the correct CG-density $P(X_{C_i \setminus S_i} \mid X_{S_i})$, and thus end up with a correct expectation (which will be a linear or quadratic function in the $X_{S_i}$) multiplied by the correct weak marginal $Q(X_{S_i})$. Hence, performing these integrations, we obtain the correct expectation of the original function with respect to the true joint density. For brevity some details have been skipped over here, such as showing that the separator messages sent are correct weak marginals. Detailed justifications and proofs use induction combined with a careful analysis of the messages sent from the strong root on the distribute operation. See the original paper for more details.

9. Summary

This tutorial has shown the variety of useful applications to which the junction-tree propagation algorithm can be put. It has not given the most general or efficient versions of the algorithms, but has attempted to present the main points of each so that the more detailed descriptions in the original articles will be easier to follow. There are other problems, not discussed here, to which the junction-tree propagation algorithm can be applied or adapted, such as:
- Influence diagrams: discrete models, with random variables, decisions and utilities. Potentials are now doublets representing probabilities and utilities. The junction tree is generated with a restricted elimination ordering, generalising that for CG problems, to emulate solving the decision tree.
- Learning probabilities: nodes representing parametrisations of probabilities can be attached to networks, and Bayesian updating performed using the same framework.
- Time series: a network can represent some state at a given time, and copies can be chained together to form a time-window for dynamic modelling. The junction tree can be expanded and contracted to allow forward prediction or backward smoothing.
Doubtless new examples will appear in the future.
10. Suggested further reading

Probabilistic logic sampling for Bayesian networks is described by Henrion (1988). A variation of the method, likelihood-weighting sampling, in which rejection steps are replaced by a weighting scheme, is given by Shachter and Peot (1989). Drawing samples directly from the junction tree is described by Dawid (1992), which also shows how the most likely configuration can be found from the junction tree. The algorithm for finding the N most likely configurations is due to Nilsson (1994), who has also developed a more efficient algorithm requiring only one max-propagation on the junction tree. The $L^p$-propagation of Section 4 is not described anywhere but here.
Fast retraction is introduced in (Dawid , 1992) and developed in more detail in (Cowell and Dawid , 1992) .
Gaussian networks are described by Shachter and Kenley (1989), who use arc-reversal and barren-node reduction algorithms for their evaluation. (The equivalence of various evaluation schemes is given in (Shachter et al., 1994).) The treatment of Gaussian and conditional-Gaussian networks is based on the original paper by Lauritzen (1992). For pedagogical reasons this chapter specialized the conditional-Gaussian presentation of (Lauritzen, 1992) to the pure Gaussian case, to show that the latter is not so different from the pure discrete case. Evaluating influence diagrams by junction trees is treated in (Jensen et al., 1994). For an extensive review on updating probabilities see (Buntine, 1994). Dynamic junction trees for handling time series are described by Kjærulff (1993). See also (Smith et al., 1995) for an application using dynamic junction trees not derived from a DAG model.

References

Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, pp. 159-225.
Cowell, R. G. (1997). Sampling without replacement in junction trees. Research Report 15, Department of Actuarial Science and Statistics, City University, London.
Cowell, R. G. and Dawid, A. P. (1992). Fast retraction of evidence in a probabilistic expert system. Statistics and Computing, 2, pp. 37-40.
Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, pp. 25-36.
Henrion, M. (1988). Propagation of uncertainty by probabilistic logic sampling in Bayes' networks. In Uncertainty in Artificial Intelligence (ed. J. Lemmer and L. N. Kanal), pp. 149-64. North-Holland, Amsterdam.
Jensen, F., Jensen, F. V., and Dittmer, S. L. (1994). From influence diagrams to junction trees. Technical Report R-94-2013, Department of Mathematics and Computer Science, Aalborg University, Denmark.
Kjærulff, U. (1993). A computational scheme for reasoning in dynamic probabilistic networks. Research Report R-93-2018, Department of Mathematics and Computer Science, Aalborg University, Denmark.
Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87, pp. 1098-108.
Nilsson, D. (1994). An algorithm for finding the most probable configurations of discrete variables that are specified in probabilistic expert systems. M.Sc. Thesis, Department of Mathematical Statistics, University of Copenhagen.
Nilsson, D. (1997). An efficient algorithm for finding the M most probable configurations in a probabilistic expert system. Submitted to Statistics and Computing.
Shachter, R. D., Andersen, S. K., and Szolovits, P. (1994). Global conditioning for probabilistic inference in belief networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 514-522.
Shachter, R. and Kenley, C. (1989). Gaussian influence diagrams. Management Science, 35, pp. 527-50.
Shachter, R. and Peot, M. (1989). Simulation approaches to general probabilistic inference on belief networks. In Uncertainty in Artificial Intelligence 5 (ed. M. Henrion, R. D. Shachter, L. Kanal, and J. Lemmer), pp. 221-31. North-Holland, Amsterdam.
Smith, J. Q., French, S., and Raynard, D. (1995). An efficient graphical algorithm for updating the estimates of the dispersal of gaseous waste after an accidental release. In Probabilistic Reasoning and Bayesian Belief Networks (ed. A. Gammerman), pp. 125-44. Alfred Waller, Henley-on-Thames.
INFERENCE IN BAYESIAN NETWORKS USING NESTED JUNCTION TREES
UFFE KJÆRULFF
Department of Computer Science, Aalborg University,
Fredrik Bajers Vej 7E, DK-9220 Aalborg Ø, Denmark

Abstract. The paper presents a technique for reducing the computational costs of inference in large Bayesian networks, applicable to both the Hugin and the Shafer-Shenoy message passing architectures. The computations involved in forming a message to be sent from one clique of a junction tree to a neighbouring clique may themselves be structured as propagation in a smaller junction tree, induced by the potentials held in the sending clique; that is, junction trees may be nested inside the cliques of a junction tree. By exploiting in this way the factorization of clique potentials, both the space and the time costs of inference may be reduced, and an empirical evaluation on real-world networks shows that the reductions can be large.

1. Introduction

Inference in Bayesian networks was first formulated in terms of message passing for singly connected networks by Kim and Pearl (1983). In such networks the posterior distribution of a variable of interest can be computed by passing messages inward from the leaf nodes toward that variable, each node sending a message to its inward neighbour once it has received messages from all of its other neighbours; an inward pass followed by an outward pass makes every node contain the correct posterior distribution of its variable, up to a normalizing constant which is the probability of the evidence. Later, the message passing approach was extended to multiply connected networks through propagation in a junction tree (or join tree) of cliques, exploiting the independence structure of the network (Lauritzen and Spiegelhalter, 1988); the Hugin architecture (Jensen et al., 1990) and the Shafer-Shenoy architecture (Shafer and Shenoy, 1990) are the two conventional ways in which such junction tree propagation is performed.
The Hugin and the Shafer-Shenoy propagation methods will be reviewed briefly in the following; for more in-depth presentations, consult the above references. We shall assume that all variables of a Bayesian network are discrete.

2. Bayesian Networks and Junction Trees
A Bayesian network can be defined as a pair $(\mathcal{G}, p)$, where $\mathcal{G} = (V, E)$ is an
Xv with domain
Xv . Similarly , a subset A ~ V corresponds
to a set
of variables XA with domain XA = XvEAXv. Elements of XA are denoted XA = (xv ) , v E A .
A probability function "' v(xv, Xpa(v)) is specified for each variable Xv , where pa(v) denotesthe parents of v (i.e., the set of vertices of 9 from which there are directed links to v). If pa(v) is non-empty, "'v is a conditional probability distribution for Xv given Xpa(v); otherwise, "'v is a marginal probability distribution for Xv . The joint probability function pv = P for X V is Markov with respect to the acyclic , directed graph g . That is, Q is a map of the independence relations represented by p : each pair of vertices not connected by a directed edge represents an independence relation . Thus , 9 is often referred to as the independence graph ofp . The probability function p being Markov with respect to 9 implies that p factorizes according to g :
p = n ~V' vEV
For a more detailed account of Markov fields over directed graphs , consult
e.g. the work of Lauritzen et al. (1990. Exact inference in Bayesian networks involves the computation of marginal probabilities for subsets C ~ V , where each C induces a fully connected (i .e., complete ) subgraph of an undirected graph derived from Q. (Formally , a set A ~ V is said to induce a subgraph QA = (A , EnAxA ) ofQ = (V, E ) .) A fixed set C of such complete subsets can be used no matter which variables have been observed
and no matter
for which variables
we want posterior
marginals . This is the idea behind the junction tree approach to inference . The derived undirected graph is created through the processes of moral ization and triangulation . In moralization an undirected edge is added between each pair of disconnected vertices with common children and , when this has been completed , all directed edges are replaced by undirected ones. In the triangulation process we keep adding undirected edges (fill -ins ) to the moral graph until there are no cycles of length greater than three with -
out a chord (i .e., an edge connecting two non-consecutive vertices of the cycle ). The maximal complete subsets of the triangulated graph are called cliques. It can easily be proved that an undirected graph is triangulated if and only if the cliques can be arranged in a tree structure such that the intersection between any pair of cliques is contained in each of the cliques on the path between the two . A tree of cliques with this property is referred to as a junction tree. Now , inference can be performed by passing messages between neighbouring cliques in the junction tree .
3. Inference in Junction Trees
In the following we will briefly review the junction tree approach to inference. For a more in-depth treatment of the subject consult e.g. the book of Jensen (1996). In the Hugin architecture, a potential table $\phi_C$ is associated with each clique, $C$. The link between any pair of neighbouring cliques, say $C$ and $D$, shall be denoted a separator, and it is labelled by $S = C \cap D$. Further, with each separator, $S$, is associated one or two potential tables ('mailboxes'). The Hugin algorithm uses one mailbox per separator, referred to as $\phi_S$; the Shafer-Shenoy algorithm uses two, referred to as $\phi^{\text{in}}_S$ and $\phi^{\text{out}}_S$. In addition, we assign each function $\kappa_v$, $v \in V$, to a clique, $C$, such that $\{v\} \cup \mathrm{pa}(v) \subseteq C$. That is, with each clique, $C$, is associated a subset of the conditional probabilities specified for the Bayesian network. Let $\mathcal{K}_C$ denote this subset, and define $\psi_C = \prod_{\kappa \in \mathcal{K}_C} \kappa$. As mentioned above, inference in a junction tree is based on message passing. The scheduling of message passes is controlled by the following rule: a clique, $C$, is allowed to send a message to a neighbour, $D$, if $C$ has received messages from all of its neighbours, except possibly from $D$, and $C$ has not previously sent a message to $D$. Thus propagation of messages is initiated in leaf cliques and proceeds inwards until a 'root' clique, $R$, has received messages from all of its neighbours. (Note that any clique may be root.) The root clique is now fully informed in the sense that it has got information (via its neighbours) from all cliques of the junction tree, and thus $p_R \propto \phi_R$, with the proportionality constant being the probability of the evidence (i.e., observations), if any. An outward propagation of messages from $R$ will result in $p_C \propto \phi_C$ for all $C \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of cliques.
A message is passed from clique $C$ to clique $D$ via separator $S = C \cap D$ as follows: $C$ generates the message $\phi^*_S = \sum_{C \setminus S} \phi_C$, and $D$ absorbs it into its clique potential. Now, assume that we are performing the inward pass, and that clique $C$ is absorbing messages $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$ from neighbouring cliques $C_1, \ldots, C_n$ and sending a message $\phi^*_{S_0}$ to clique $C_0$:
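To fix ideas, here is a minimal Python sketch (my own, with binary variables and dictionaries as potential tables — the names are illustrative assumptions) of a single Hugin-style message pass from a clique $C = \{A,B\}$ to a neighbour $D = \{B,C\}$ over the separator $S = \{B\}$: $C$ marginalises onto $S$, and $D$ multiplies by the update factor, the new message divided by the old separator table.

```python
import itertools

phi_C = {(a, b): 0.1 + 0.2 * a + 0.3 * b for a, b in itertools.product((0, 1), repeat=2)}
phi_D = {(b, c): 0.2 + 0.1 * b + 0.4 * c for b, c in itertools.product((0, 1), repeat=2)}
phi_S = {b: 1.0 for b in (0, 1)}   # separator table, initially unit

# C generates the message: marginalise phi_C onto the separator {B}.
message = {b: sum(phi_C[(a, b)] for a in (0, 1)) for b in (0, 1)}

# D absorbs: multiply by the update factor message/phi_S, then store the new separator table.
phi_D = {(b, c): phi_D[(b, c)] * message[b] / phi_S[b] for b, c in phi_D}
phi_S = message

print(phi_D)
```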
1. Absorb messages
L\S< Pc C i=l
2. Generate message
\f' Si
f -
,.,h*
. -
\f' Si ' ~ -
1
, . . . , n
4. If algorithm = Hugin, then store clique potential : c/Jc f - c/Jc
5. Discard $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$. 6. Send $\phi^*_{S_0}$ to $C_0$. (Note that if $C$ is the root clique, we skip steps 2 and 6.) Thus, considering the inward pass, the only difference between the Hugin and the Shafer-Shenoy algorithms is that the Hugin algorithm stores the clique potential. Storing the clique potential increases the space cost, but (most often) reduces the time cost of the outward pass, as we shall see shortly. Next, assume that we are performing the outward pass, and that clique $C$ receives a message, $\phi^*_{S_0}$, from its 'parent clique' $C_0$ (i.e., the neighbouring clique on the path between $C$ and the root clique), and sends messages $\phi^*_{S_1}, \ldots, \phi^*_{S_n}$ to its remaining neighbours, $C_1, \ldots, C_n$. (The *'s should not be confused with the similar potentials of the inward pass.) This is done as follows.
1. Absorb messagecPSo : if algorithm = Hugin then
~* -- 'fiG ~ 4>80 'fiG
NESTEDJUNCTIONTREES
55
.C L\Si ~ 1,..,i-nl,i+1,..,n< ~Si =C L\Si
else
2. Store
if algorithm = Hugin then
4>Sot - 4>80 else ~ out ' f ' So
3
.
If
4
.
Discard
5
.
Send
+
=
< Pc
c / > Sl
'
4
index
)
,
{
sets
=
and
~
time
c
Ui
Ui
the
}
U1
)
'
Thus
,
,
potential
:
4 > c
+
-
4 > c
,
if
very
the
expensive
generation
be
of
obtained
approach
due
.
to
the
these
This
is
generation
can
what
be
is
relaxed
,
exploited
in
.
Trees
U
.
be
can
trees
( g
( fam
v
clique
may
savings
network
=
store
trees
subgraph
( v
then
ciJSn
junction
Bayesian
fam
( U
,
Junction
complete
1 -
.
junction
Nested
a
.
and
nested
In
.
,
< PSo
potentials
space
.
Hugin
and
in
clique
both
the
~ * ' f ' So
algorithm
Inference
of
-
.
where
pa
.
,
( v
Urn
( v
)
.
~
each
=
)
( V
,
In
,
fam
( v
general
V
Ui
)
)
an
induces
, p
x
,
induce
~
E
)
each
fam
a
potential
( v
set
)
)
of
undirected
a
complete
of
~
the
v
,
moral
,
subgraph
E
V
,
graph
potentials
graph
v
,
~
1i
Ul
of
'
=
induces
.
.
.
( U
( Ui
'
,
,
Q
~
Ui
Ui
a
,
um
where
,
with
Ui
x
x
Ui
Vi
)
)
of
potential
~u = II ~Ui 1;
is Markov with respect to 1 (i.e., 1 is an independencegraph of ~u). Thus, if triangulating 1 does not result in a complete graph, we can build a junction tree corresponding to the triangulated graph, and then exploit the messagepassing principle for computing marginals of ~u . Assume that ~Ul' . . . , ~um are the potentials involved in computing a message(i .e., a marginal), say >80' in a clique C. That is, ~Ul' . . . , ~Umare the incoming messagesand the I'\,v'S associatedwith C. Then, instead of computing the
56
UFFEKJlERULFF
clique potential ~* \fiG -
n ~Ui 'l;
(cf. Step 1 of the above inward algorithm ), inducing a complete graph, and
marginalizingfrom so (cf. Step2 of the aboveinwardalgorithm ), we might be able to exploit conditional independencerelationships between variables of So given the remaining variables of C in computing
cPSo through messagepassingin a junction tree residinginside C (which, remember , is a clique of another , larger junction tree) . This , in essence, is the idea pursued in the present paper . In principle , this nesting of junc tion trees can extend to arbitrary depth . In practice , however, we seldomly encounter instances with depth greater than three to four . As a little sidetrack , note the following property of triangulations . Let 0 be a clique of a junction tree corresponding to a triangulation of a moral
graph 9 , let KG = { ~Vl' . . . '
the graph induced by
graph , unless the triangu -
lation of (] contains a superfluous fill -in : a fill -in between vertices u and v is
superfluous if {u, v} ~ Si for each i = 1, . . . ,n , becausethen C can be split into two smaller neighbouring cliques 0 ' = C \ {u} and C" = C \ { v} with Ci , i == 1, . . . , n , connected to either
to C ' if v E Ci , to C " if u E Ci , or , otherwise ,
0 ' or C " .
Therefore , assuming the triangulation to be minimal in the sense of not containing any superfluous fill -ins, the nested junction tree principle cannot be applied
when a clique , C , receives messages from all of its neighbours ,
or when a clique potential cPais involved in the computation (cf. the out ward pass of the Hugin algorithm ) . However , in the inward pass and in the outward pass of the Shafer-Shenoy algorithm , any non -root clique receives messages from all but one, say Co, of its neighbours , making it possible to exploit the junction tree algorithm for generating the message to Co. To illustrate the process of constructing nested junction trees, we shall consider the situation where clique 016 is going to send a message to clique 013 in the junction tree of a subnet , here called Munin1 , of the Munin net -
work (Andreassenet ai., 1989). Clique 016 and its neighbours are shown in Figure 1. The variables of 016 = { 22, 26, 83, 84, 94, 95, 97, 164, 168} (named corresponding to their node identifiers in the network ) have 4, 5, 5, 5, 5, 5, 5, 7, and 6 possible states , respectively .
The undirectedgraph inducedby the potentialscPs1 ' cPs2 , cPs3 , and Vl may be depicted as in Figure 2. At first sight this graph looks quite messy, and it might be hard to believe that its triangulated graph will be anything but complete . However , a closer examination reveals that the graph
NESTED JUNCTION TREES G19
G13
I/>s~ c/J~S Figure
1
.
Clique
respectively
a
message
is
22
must
be
26
,
83
,
Figure
2
SO
,
table
and
-
Thus
4 >
82
so
. e
.
,
95
,
,
625
,
clique
cPs1
)
-
to
'
clique
its
and
< ! > S3
from
probability
cliques
G19
potential
013
,
,
cPVl
026
=
,
P
{
and
G63
97122
,
,
26
)
,
.
cliques
are
{
83
,
84
,
97
,
164
,
168
}
and
induced
i
.
e
.
been
,
by
clique
potentials
containing
reduced
tables
of
cps1
a
size
< / JS2
nine
to
total
,
,
< / JS3
tree
000
(
'
and
< / JV1
variables
junction
381
'
)
with
a
including
a
'
with
5
-
a
clique
separator
.
be
shall
,
the
further
cPV1
try
5
to
-
)
continue
clique
broken
three
and
4 > S2
{ 22, 26, 83, 84, 94, 95, 168} { 83, 84, 97, If >4, 168} { 94, 95, 97} { 22, 26, 97}
.
(
has
we
,
}
'
the
that
168
clique
000
remaining
cPs3
,
tree
cannot
,
sent
graph
with
750
the
97
undirected
junction
it
got
i
,
encouraged
clique
'
94
9
2
size
and
and
original
8
of
has
generated
The
size
an
table
,
.
< ! > Sl
messages
and
84
the
of
-
messages
these
triangulated
,
two
on
81 = 82 = 83 = VI =
I'\-S3
receives
Based
already
{
(
016
.
57
down
potentials
induce
our
has
clique
associated
.
The
8
-
clique
associated
the
graph
break
in
it
,
with
shown
-
with
on
it
Figure
the
.
down
only
other
These
3
.
the
hand
potentials
.
In
potential
,
NESTEDJUNCTIONTREES
59
(Cb n So) \ Ca = { 22, 26, 94, 95} ; that is, 4 x 5 x 5 x 5 = 500 messages
must be sentvia separator{83, 84,97, 168} in order to generate>80. Sending a message from Cb to Ca involves inward propagation of messages in
the (Cc, Cd) junction tree, but , again, since neither Cc nor Cd contain all variables of the (Ca, Cb) separator, we need to send multiple messagefrom Cd to Cc (or vice versa). For example, letting Cc be the root clique of the inward pMS at this level, we need to send 5 messagesfrom Cd (i .e., one for each instantiation of the variables in (Cd n { 83, 84, 97, 168} ) \ Cc = { 97} ) for each of the 500 messages to be sent from Cb (actually from Cc) to Ca. Similarly , for each message sent from Cd to Cc, 20 messages must be sent
from Ce to Cf (or 25 messagesfrom Cf to Ce). Clearly, it becomesa very
time consumingjob to generate4>80' exploitingnestingto this extent.
525
872, 750,000
000
22 , 26 , 83 , 84 , 94 , 95 , 164 , 168
2t625
5t250
( 7x )
, 000
75 , 000
( 150x
. .
500
83
22
97
84
94
168
95
26
4
!100 ( 20x )
..
( 500x .
It 210t 000
)
~
750
)
17 , 000
( 5x )
/' ..I '
~
,."
, ,
Figure 5. The nested junction tree for clique C16 in Muninl . Only the connection to neighbour C13 is shown. The small figures on top of the cliques and separators indicate table sizes, assuming no nesting. The labels attached to the arrows indicate (1) the time cost of sending a single message, and (2) the number of messagesrequired to compute the separator marginal one nesting level up.
A proper balance between space and time costs will most often be of interest . We shall address that issue in the next section . Finally , however, let us briefly analyze the case where Cb is chosen as root instead of Ca (see Figure 6) . First , note that , since Cb contains
60
UFFEKJlERULFF
the
three
potentials
4>1 , . . . , 4>1 , from subgraph
.
Ca
The
via
time
S
and
=
cost
of
and
receives
message
{ 83 , 84 , 97 , 168 } , Cb
collapses
computing
clique
the
first
potentials to
a
,
complete
potential
, 4>Cb
=
525 , 000 7, 429 , 500
22 , 26 , 83 , 84 , 94 , 95 , 164 , 168
2, 625 , 000
5, 250 ( 7x ) .. 750 83 84 97 168
Figure
6.
The
( i .e . , the with
Cb as root
a single
4>81
one
*
4>83
the
( 750
+
cost
of
and
using
Cb 4
The
nesting
is
2 , 625
, 000
, using
43 % , respectively
order must
junction
~
C16 in Muninl is computed
the of
time
arrows
cost
is
tables
, with
via
of
4
x
375
marginalization
of
measure
, the
total
time
.
the
clique
cost
clique
of sending
the
separator
the
clique
Hugin
potentials
potentials
based
time
cost
Using
six
root
propagation
compute
, 000
remaining
a conservative
) . Thus
to
the
message
( I ) the
required
the
Cb being
inward
indicate
messages
computing
) . Each
( 750
costs
of
and
5 x
nested
Time
on of
has
the
a
larger
generating
of 4>80 '
375
, 000
) +
conventional 2 , 625
, 000
space
525
, 000
( i . e . , non =
13 , 125
trees and
7 x
time
costs
- nested
, 000
approach
=
7 , 427
, 500
) message
, respectively provides
savings
. gener
-
. Thus
, in
of
85 %
.
Costs the
its
+
junction
, of
evaluate
comnare -
a of
6 x
the
and to
to
number
( this
time
and
we
has
cost
+
case
In
4>1 ,
clique C16
, is
, 000
and
Space
up .
, 000
this
5.
level
, 000
root
375
space
ation
375
the
as x
*
for by
attached
( 2 ) the
time
525
tree
generated
labels
, and
, the
6 x
time
be
) . The
*
algorithm
junction
to
message
marginal
is
nested
message
applicability
~
snace -
tree approaches .
and
of time
the
costs
nested with
junction those
of
tree the
approach
conventional
,
S331 :IL NOI~~Nflr a3LS~N
' HddV'IVNOI~N'aANOO log
. -
r
L _
11-
-
1881
.
1
1
1 So I
. J -,
1
C
n
19
HOVO
62
UFFEKJ.lERULFF
space
required
pass
,
In
be
cost
The
clique
,
Co
of
<1>c
shall
assume
is
it
.
) .
For
( m
n
) IXcl
time
during
that
' l / Jc
,
.
is
is
the
outward
stored
in
the
clique
Note
potential
that
if
utilized
shall
,
,
,
,
information
this
however
< Pc
cost
,
may
refrain
from
.
c/ Jc
only
not
generated
we
compute
will
the
are
combination
always
C
+
simplicity
of
If
to
potentials
will
is
message
the
and
compute root
store
c/ Jc
clique
it
in
when
( i .e . ,
< Pc
, whereas
more
So
=I
than
0 ) ,
the
one
message
,
.
processing
8 .
Note
if
message
,
always
Using
that
have
to <1>so
n
times
algorithm
an
Note
marginalize when
V; c
, <1>80
' Jig ! , . . . , J~
So
=
or
0
generating
Kc
=
0
once
that the
-
( but
! , J~ not
=
,
to
.
both
. . . , Jign ) ,
and
the
The
)
if n
to
# 1
if
in
0
pass
,
# =
case
the
Hugin
algorithm which
+
Kc So
.
do
case
we
contributing
n
and
we
which
tables
equals
both
in :
4>c of
Ci
0
the that
inward
- Shenoy
to
So -
#
Shafer
Note
clique
algorithms
number '
an
clique on
.
leaf
Kc two
shown
a
based
by
unless the
4 > Si
was
a
is be
generated
being
contributes
,
will
preceded
whereas
table
+ 1 '
be
C
'
pass
there pass
must be
Jso
table
message
outward ,
inward )
between
one
the
with
Jc
only
from
4> 0
since
only
directly
to
the < Pc
pass
difference
unless
in ( =
algorithm
4> c ,
during algorithm
C
4Jc
outward
, the
computes it
,
- Shenoy
C
Hugin in
otherwise
anything
.
clique
the
processing ;
Shafer
do
v; c
in using
assume
the
=
does
1996
method
shall
4> c
we
extra
Outward
cost
not
little
contributing
algorithm to
conventional we
a
tables c/ Jc
algorithm - Shenoy
Figure
table
n
methods
contribute
5 . 1 .2 .
use
contributing
,
refined
Hugin
to
+
the
( Shenoy
Shafer
'
,
computing of
such
table
and
Anyway
m
of
reduced
The
in
with
domain
using
< Pso
,
time
.
I,
.
general
the
the
" ' UVm ' l / Jc
pass
the
IXvlu
recomputing
inward
on
,
0
1
0 , and
( i .e . ,
n Kc
if
tables
either =
0 .
5.2. NESTEDAPPROACH In describing the costs associated with the nested junction tree approach , we shall distinguish between message processing at level 1 (i .e., the outer most level) and at deeper levels.
5.2.1. Levell The processing of messages in a non-root clique C (i .e., receiving messages from neighbours C 1, . . . , Cn via separators 81, . . . , 8n , and sending to clique Co) may involve inference in a junction tree induced by VI U . . . U Vm U 81 U . . . U 8n (see Figure 9). Note that , in this setup , C may either be involved in inward message passing using either algorithm , or it may be involved in outward Shafer-Shenoy message passing . In the inward case, C1, . . . , Cn are
63
NESTEDJUNCTIONTREES I I
I
ct >So
I
r
I
I So I
t
-
L
I-
-
I-
,
~
I
C
c/>ST&
c!>Sl . .
81
Sn
ConvCost~ut (C) if algorithm
= Hugin
if So # 0
{conventional inward messageprocessing in C }
if
{ computing ~c = 4>0 ~ ""So}
Ct +- lx501 + IXcl else
Ct +- ( m + n + l ) IXcl { computing
c = so I17 = 14>Si I17 = 1
for i = 1 to n Ct +- Ct + IXcl
{ computing
else if ( So # 0 ) 1\ (n = 0 ) 1\ ( m > 0 ) Ct +- Ct + 21Xc I
{ C is a leaf clique
{ computing
with
Kc
# 0}
CPo = 'l/JcCPso}
for i = 1 to n if r > 1
{ r is the number
Ct f - Ct + rlXcl
of tables
contributing
to ~ c }
{ computing
4>0 = Il/Jc
Ct f - Ct + IXcl
{ computing
<1>Si = EC \ Si c/>c }
Cs f - Cs + IXsi I
{ storing
c/>Si in
Figure 8 . Space and time costs of receiving a message from the inward neighbour and sending to the outward neighbours . Note that r = n + 1 if m > 0 1\ So # 0 , r = n if m > 0 V So :1: 0 , and r = n - 1 if m = 0 1\ So = 0 .
the
outward
outward the
neighbours neighbour
inward
neighbour
assuming cannot The
done
message , the inward
and
Co
which
C
plus
a minimal be
tree . Thus performing
to
the
inference
of
message
propagation
inward
is going
remaining
triangulation
through
the
. In to
outward
, computing in
an
through processing towards
the
outward
send , and
case , Co
C1 , . . . , Cn
neighbours a marginal
induced
inference in a root
clique clique
. ( Recall
that
a root
clique
in
junction
is the
includes ,
tree . )
in
the C
level
equals
in the
level
2 junction the
cost
2 junction
of
64
UFFEKJlERULFF
cPSo
4>51
c/JSn .
.
.
NestedCostfn /out(C) CSf - c~oot
{space cost of inward propagation towards clique 'root '} { time cost of inward propagation towards clique 'root '} {storing ~So}
Ct -+- c~oot
Csf - Cs+ Ix801 Figure 9.
Spaceandtimecostsof receivingn messages andsending0 or 1, wherethe marginalis computed in a junctiontreeinducedby VI U. . . UVmU81U. . . U8n.
tree
( see
below
describe clique
is
5 .2 .2 . We
shall
inference
the
clique
2
( more
outward
operation Now
, since
at
cPSo . In
Section
propagation
of
situation
, say
separator
levell
6 , we
towards
shall a
root
10
processing
an
S * - marginal
that , and
the
message
purpose
generate tree
where
to
of
perform
a message
, we
-
that
going
at
1 , wanting
the tree
to
clique tree
C * can
j unction is
to
clique
junction
levell
S * . Note
defining
1 is
conventional
the
should
from
only
compute
-
per
-
a marginal
.
a
C * , at
of
junction a root
Figure in
costs . Since
>
the
marginals in
those
time deeper
towards
neighbours
message are
2 or
tree
passing
a clique
and
level
1 containing
) a set
the
via
or
l -
outward in
storing inward
space
at
a junction
typically
neighbour
of
conventional
the
message
contained
cost
performing
-
trees
level
Consider its
deeper
analyze
in
inward
from
or
j unction
at
space
of
.
now in
the
cost
calculated
ing
or
plus
the
Level
passing
form
)
how
E::: ~
relevant
be
be
C
receives
level
l
to
send
engaged
generated
1
which
a message in
potentials containing
messages
>
either
involved clique and
C
is to
inward in
that
C . might
share
a
NESTEDJUNCTIONTREES
65
ConvCost ~1(C) if algorithm
= Hugin
if m + n >
1
Ct t - (m + n) IXcl
{
computing
Cs t - IXcl
{
storing
{
comp
{
a
{
computing
{
storing
{
computing
the
c
.
=
rem
first
c
}
2:::~=1(ri - 1)n~~~rj 4>0'8}
.
number
of
1/ ; 0
1 / ; 0
~
}
absorptions
=
I1i
}
CPVi
}
}
I1i
Ti
CPo
' s
}
if (a > 0) V (m > 1) Ct-+- Ct+ (m + n) f1i rilXcl {computingf1i Ti I } if So =F0 Cs-+- Cs+ IXSoI {storing
{computingf1i ri marginals}
~
~
Figure 10. Spaceand time costsof receivingmessages from outward neighboursand sendingto the inward neighbourin a junction tree at nestinglevel greaterthan 1.
66
UFFEKJlERULFF
variables to
with
send
( assuming
That to
be
message
able
to
the
of of
This
way as
of
that
in
the
Denote to
by
send
that
to
chrononizes C
of
to
ICI
<
to
the
first
) 4>0
remaining
Ili
replacing messages
C
the
messages
Co .
for
Co
* ) \ co
neighbour
messages
will
be
from
the
outward
in
a junction
alternative
C ,
by
the
neighbours tree
will
to the
to
multiplied
called
, but
, or
, if , for
is
.
referred
variable
shall
that
outward
that
the
messages
are
messages
in
one
, and
such
, for
prop
refrain
-
from
each
>
neighbours
combination
of
C 1 syn
received
, for
either of
the
messages
.
( Note
-
from
, we
each
sends
part
have such
that
C2 , . . . , Cn
1 . Thus , C
the
Ci
scheduled
message
from
i
== 0 , generates
neighbour
batch
each
messages
I for
outward
So
that of
I ) I XSi
actual
algorithm
=
I1i
: l >~
r i
-
I
Using of
greater
. time
the
time
except once .
1
of
marginal
and
of
is
the
the
IIi
IXcl
>
root ri time .
not
have
time
need
combina
IX ( cns
that
-
* ) \ co 1
S * - marginal
=
=
( ri
( ii ) IIi
4>~ is
a
we
cor
-
, typically
>
+ IIi
S * - marginal
we
must an
is
(m
+
-
0
,
I Xc
m
>
1 ) IIi 4>~
TilXcl is
is
compute
case
the
re -
:
.
(i)
be
( i ) , it
( ii ) to
be
is
similar
computed
composed
the
pays 4>0 ' 8 ,
. For
corresponding equals
the
, a ) is
multiple
. Case
to
that
cases
the
going
by
absorptions
1 . In
going
. The
performed
. Note
two
only
S * - marginal
n ) IXcl
I multiplications
of
computing
compute
processed
1 ) Il ; ~ l rj
number or
to
are
distinguish
before (n
C
are
and
( i . e . , the
, since
, the
,
cost
messages
I divisions
algorithm
by
r1 . replacements
ri
I XSi
therefore
of
taken
that
Ci , where
computed
cost
step
of
such
1 , and
clique
first
messages
1/ Jc
which is
. The and
m
compute
1/ Jc
time
involves
of
cost
C
Ci
the
= l 4>Sj ' which
from
from
- Shenoy
to
,
r7 . combinations a
Shafer
)
that
at
combinations
than
( wrt
Ilj Ei
originating
the
number
=
message
a message
each
with
from
Hugin
one
placing
If
to
x ( cns
IS * I . )
Using ( the
of
their
(r i -
the
C
is going
* ) \ co I messages
multiple
from
) ; an
it
. Furthermore
considering
combinations
Co , or , if
responding
sent
which
inward
sending
marginals
of
size
messages
messages
IXsol
activity
of
its
clique
received
( 1996
to
.
all
all
IX ( cns
to
root
be
Co
configuration
are
worth
number
send
its
storage
tion
for
the
processes
extra
the
to
Jensen be
send
each
== 1 , . . . , n . Assuming
C2 , . . . , Cn
C 1,
by
paper
r i
C , i
is
messages
) might
present
Co
arbitrary
firing , 1994
clique
messages
needed
computing
( Xu
to
for
neighbours
of
with
have
sent
if
number
variable
agation
share
appropriate
messages
the
will
be
, C ' s outward
number
not
0) , C
S * - marginal
reason
product
#
does
must
generate
the
same
it
So
is , a
generate
to
S * which
S *larger
of
NESTED JUNCTION TREES
67
5.2.3. Level .2 or deeper- nested The processing of messagesin a non-root clique C at levell > 1, where the message, <1>80' to be sent is generated through inference in an induced junction tree at levell + 1, is shown in Figure 11. This situation resembles
8 .
(ro x)4>80 t So C
4>V1' . ' " 4>v'tJI .' 4>81' . ' " 4>sn
(rl x) rPs1 I / 81
8n .
.
\ . rPSn(TnX) '\
.
NestedCostt ; (C) CS-+- c~oot
{ space cost of inward prop . towards 'root '}
Ct -+- c~oot Cs-+- Cs+ IXso I
{ time cost of inward prop . towards 'root '} { storing 4>50}
Ct -+- CtIX(cns. )\ col lli Ti Cs-+- Cs+ E ~=2(ri - l )IXsi I
{inward prop. IX(cns. )\ col lli Ti times} {storing multiple messages for eachi > I } and
.
Figure 11. Space and time costs of receiving messages from outward neighbours sending to the inward neighbour in a junction tree at nesting level greater than 1.
~
the situation shown in Figure 9, the only difference being that C may receive multiple messagesfrom each outward neighbour, and that it may have to send multiple messagesto clique Co. Since C needs to perform IIi ri absorptions, with each absorption correspondingto an inward passin
UFFE KJlERULFF
68
the junction tree at levell + 1, and IX(cns*)\coI marginalizationsfor each combinationof messages , a total of IX(cns*)\col ni ri inward passesmust be performedin the levell + 1 tree. 5.3. SELECTINGCOSTFUNCTION Now, depending on the level, directionality of propagation, and algorithm used, we should be able to select which of the five cost functions given in Figures 7- 11 to use. In addition , we need a function for comparing two pairs of associated space and time costs to select the smaller of the two. To determine which of two costs, say c = (cs, Ct) and c' = (c~, c~), is the smaller, we compare the linear combinations Cs+ , Ct and C~ + 'Yc~, where the time factor , is chosenaccording to the importance of time cost. The algorithm Cost(C) for selecting the minimum cost is shown in Figure 12, where ' -<' refers to the cost comparison mentioned above.
Cost(C) if level = 1 if (direction
= inward
if ConvCostfn
) V ( algorithm
( C ) - < NestedCostfn
c = ConvCostfn
= Shafer - Shenoy ) / out ( C )
(C )
else c = NestedCostfn
/ out ( C )
else c = ConvCost
~ut ( C )
else if ConvCost
~ l ( C ) - < NestedCost
c = ConvCost
~ l (C )
~ 1( C )
else c = NestedCost
~ l (C )
Figure 12 .inclique Selecting functions and the minimum cost associated with message processing C.cost 5.4. SUMOF COSTS
Undera giventimefactor," theoverallminimumspaceandtimecosts of inwardor outwardpropagation of messages towards /froma givenroot
NESTEDJUNCTIONTREES clique, R, can
now be computed
69
as
CR= } : Cost (C) + C;emp , CEC
(1)
where
max IXsl +{CIct Hugill prop . SES > cmax not stored }IXclifoutward c;emp=
max CECIXc I
if Shafer-Shenoyprop.
0
otherwise.
During outward H ugin propagation we need auxiliary spacewhen generating messages ; thus, a space of size maxSESIXsl suffices. Further , for each clique, C, for which
Costs
All of the cost functions mentioned aboveare relative to a given root clique, R, and we are therefore only able to compute the cost of probability propagation (inward or outward) with R as root . However, we want to select the root clique such that the associatedcost is minimal , and, therefore, we must be able to compute the 'root cost' cC for each clique C E C. Assuming that root clique R has neighbours C1, . . . , On, Equation 1 can be re-expressedas n cR = Cost(R) + L cCi\ R + c~emp, i=l
(2)
where cCi\ R denotes the root cost in the subtree rooted at Ci and with the R-branch cut off. Note that , since cCi\ R = Cost(Ci ) +
L CC\Ci, CEneighbours (Ci)\ { R}
70
UFFEKJlERULFF
the root costscan be computedthrough inwardpropagationof costs. This is illustrated in Figure 13, whereeachclique C sendsthe cost message Cost(C) + l::~ 1CCi\C to its inward neighbourCo.
CCl \C
. . .
Cost(G) + E ~=l CCi\C . . .
.
.
CCYL \C
Figure 13.
Propagating costs of probability propagation .
Thus, what lacks to compute the root cost for Ci , i = 1, . . . , n, is cR\ Ci (i.e., the root cost for the subtree rooted at R and with the Ci-branch cut off ). However, this is nothing but the cost messagesent from R to Ci if we perform outward propagation of costs from R. That is, after a full propagation of costs (i.e., inward and outward) we can easily compute the root cost of any clique. Note that Cost() dependson the directionality (i.e., inward or outward). So, to compute the root costs for both inward and outward probability propagation we need to perform two full cost propagations. 7. Experiments To investigate the practical relevance of nested junction trees, the cost propagation schemedescribed above has been implemented and run on a variety of large real- world networks. The following networks were selected. The KK network (50 variables) is an early prototype model for growing barley. The link network (724 variabIes) is a version of the LQT pedigreeby ProfessorBrian Suarezextended for linkage analysis (Jensenand Kong, 1996). The Pathfindernetwork (109 variables) is a tool for diagnosing lymph node diseases(Heckerman et al., 1992). The Pignet network (441 variables) is a small subnet of a pedigree of breeding pigs. The Diabetesnetwork (413 variables) is a time-sliced network for determining optimal insulin dose adjustments (Andreassen et al., 1991). The Muninl - 4 networks (189, 1003, 1044, and 1041variables, respectively ) are different subnets of the Munin system (Andreassenet al., 1989).
NESTED JUNCTION TREES
71
The Water network (32 variables ) is a time -sliced model of the biological processes of a water treatment plant (Jensen et al., 1989) . The average space and time costs of performing a probability propaga tion is measured for each of these ten networks . Tables 1- 4 summarize the results obtained for inward Hugin propagation , inward Shafer-Shenoy prop agation , full H ugin propagation (i .e., inward and outward ) , and full ShaferShenoy propagation , respectively . All space/ time figures should be read as millions of floating point numbers / arithmetic operations . The first pair of space/ time columns lists the costs associated with conventional junction tree propagation . The remaining three pairs of space/ time columns show, respectively , the least possible space cost with its associated time cost , the costs corresponding to the highest average relative saving , and the least possible time cost with its associated space cost. The largest average relative savings were found by running the algorithm with various , -values for each network . The optimal values, , * , are shown in the rightmost columns .
TABLE
1
tional
(
.
Space
and
approach
,
minimum
space
and
(
iv
)
time
and
cost
minimum
costs
the
)
for
nested
,
(
iii
time
)
inward
propagation
trees
maximum
cost
Hugin
junction
average
relative
Space
Time
saving
15
. 1
50
. 3
0
. 7
28
. 1
83
. 3
0
. 3
0
. 7
0
. 1
2
. 2
Diabetes
. 8
. 0
207
Munin3
Munin4
As
728
9
3
. 7
12
.
. 3
64
. 6
28
. 3
anticipated
ated
with
costs
of
minimum
,
1119
. 9
because
cost
may
,
be
ii
(
)
i
)
the
conven
maximum
space
-
nesting
and
that
. 8
0
. 3
0
. 9
0
. 5
4
. 9
1128
0
. 6
3532
. 7
is
not
time
costs
,
(
1 '
. 0
2
. 7
26
.
9
. 5
9
. 9
Pathfinder
. 8
0
. 25
.
1
0
. 15
. 3
0
. 30
8
. 4
0
. 30
9
. 25
2
. 4
2
. 4
3
. 4
0
. 0
0
. 30
. 9
22
. 3
0
.
time
costs
general
than
,
since
network
very
limited
15
associ
the
-
time
nesting
Pathfinder
a
315
50
larger
to
1
29
. 9
maximum
The
0
. 8
12
the
much
although
possible
68
1
,
in
only
. 25
. 3
63
recommended
is
0
. 6
. 3
.
. 30
. 6
5
. 5
large
. 30
0
0
1
are
0
. 7
. 6
6
)
. 5
61
. 0
. 8
0
37
2
. 9
*
. 3
. 2
315
,
. 3
0
. 1
,
12
. 6
. 5
=
Thus
100
Time
0
34
. 4
=
8
1
but
unacceptably
nesting
68
. 7
85
.
. 5
. 0
1
costs
. 9
62
90
203
,
Space
40
61
. 2
.
. 5
. 2
95818
networks
,
. 7
0
. 2
it
11
. 1
. 5
space
6
. 3
1
propagation
cost
of
Time
. 1
29692
. 5
=
Space
0
all
minimum
space
time
19
. 7
for
conventional
,
. 04
0
. 7
. 2
8
0
. 1
3
18
Water
33
. 7
Munin2
differently
0
11
Munin1
ated
. 2
0
Time
link
0
=
Space
KK
Pignet
(
Nested
,
Pathfinder
with
.
Conventional
Network
using
approach
yields
the
associ
-
behaves
degree
.
72
UFFEKJlERULFF
TABLE 2. Space and time costs for inward Shafer-Shenoy propagation using (i) the conventional approach, and the nested junction trees approach with (ii ) maximum nesting (minimum space cost) , (iii ) maximum average relative saving of space and time costs, and (iv ) minimum time cost. Conventional
Nested 1 = 0
Network
Space
Time
Space
'Y=
Time
' Y*
'Y =
Space
Time
100
Space
Time
-
--y* .
KK
7 .9
52 . 2
6 .0
787 .9
6 .8
42 .7
7 .0
42 .4
0 . 15
link
4 .5
83 .3
2 .1
587 .4
3 .6
76 . 1
4 .0
73 .8
0 .05
Pathfinder
0 .1
0 .7
0 .1
1 .3
0 .1
0 .7
0 .1
0 .7
0 . 10
Pignet
0.3
2.2
0.2
4.0
0.2
2.4
0.3
2.1
0.15
Diabetes
1 .2
33 .9
0 .5
97 .7
728 .8
79 .2
Munin2
0 .7
9 .9
0 .2
Muninl
51 .0
0 .7
36 .2
1.2
33 .3
0 .05
.7
85 .3
379 .3
88 . 0
368 .6
0 . 10
50 .7
0 .4
11 .1
0 .7
9 .1
0 .05
31260
Munin3
0 .7
12 .2
0 .2
324 .0
0 .5
12 .0
0 .7
10 .5
0 .05
M unin4
3 .0
64 .3
1.1
235 . 7
2 .1
52 .6
2 .4
51 .4
0 .05
Water
6 .1
29 .2
5 .3
115 .8
5 .9
27 . 1
6 .1
26 .9
0 .15
However , as the , = , * columns show, a moderate increase in the space costs tremendously reduces the time costs. (The example in Figure 5 demonstrates the dramatic effect on the time cost as the degree of nesting is varied .) In fact , for , = , * , the time costs of nested computation are either roughly identical or smaller than those of conventional computation , while space costs are still significantly reduced for most of the networks . Interestingly , for all networks the minimum time costs (, = 100) are less than the time costs of conventional propagation , and , of course, the associated space costs are also less than in the conventional case, since the saving on the time side is due to nesting which inevitably reduces the space cost . Comparing Tables 3 and 4 with , = , * , we note , somewhat surprisingly , that the time costs of a full H ugin propagation are consistently smaller than those obtained using the Shafer-Shenoy algorithm , while the space costs are either comparable or smaller for the H ugin algorithm . Note , however , that the , * 's are significantly smaller in the Shafer-Shenoy case, indicating an attempt to keep the space costs under control .
ACKNOWLEDGEMENTS I wish to thank Steffen L . Lauritzen for suggesting the cost propagation scheme, Claus S. Jensenfor providing the link and Pignet networks, David
73
JUNCTION TREES NESTED
TABLE3. Space andtimecostsfora fullHuginpropagation using(i) theconventional approach , andthenested junctiontreesapproach with(ii) maximum nesting (minimum space cost), (iii) maximum average relativesavingof space andtimecosts , and(iv) minimum timecost. Conventional Nested , =0 1= , . l' = 100 Network Space Time Space Time SpaceTime SpaceTime , . KK 15.8 93.3 1.4 1162 .2 2.7 104 .1 9.0 80.5 0.20 link 28.4 164 .1 0.5 29773 .2 7.2 164 .6 12.6 142 .6 0.20 Pathfinder 0.2 1.2 0.1 1.6 0.2 1.1 0.2 1.1 0.15 Pignet 0.9 4.5 0.1 64.1 0.4 4.3 0.6 4.1 0.20 Diabetes 11.0 59.5 0.6 116 .5 1.0 61.0 5.3 55.5 0.15 Munin1 213.3 1362 .3 25.5 96451 .8 74.0 949.6 74.4 948.8 0.15 Munin2 3.2 18.7 0.6 212.6 1.4 19.0 2.4 17.3 0.20 Munin3 3.8 23.6 0.5 97.3 1.4 21.6 2.5 20.8 0.15 Munin4 18.4 119 .4 5.0 1183 .2 5.5 122 .1 13.0 105 .2 0.20 Water 9.0 46.3 1.0 3550 .7 3.2 44.0 4.4 40.3 0.15
TABLE 4. Spaceand time costsfor a full Shafer-Shenoypropagationusing (i) the conventional approach, and the nestedjunction trees approachwith (ii) maximumnesting (minimum spacecost) , (iii ) maximumaveragerelativesavingof spaceand time costs, and (iv) minimum time cost. Conventional
Network KK link Pathfinder Pignet Diabetes Munin1 Munin2 Munin3 Munin4 Water
Nested , =0 "Y= , * , = 100 Space Time Space Time Space Time Space Time
,*
9.0 6.8 0.1
153.6 263.1 2.1
7.1 4.5 0.1
889.3 767.1 2.6
7.9 6.0 0.1
114.9 222.5 2.1
8.1 6.4 0.2
114.4 220.8 2.0
0.05 0.05 0.05
0.5 1.8 117.0 1.1 1.2 4.9 6.7
7.0 80.9 2411.9 31.6 44.8 209.9 60.6
0.3 1.1 98.4 0.6 0.7 3.0 5.9
8.9 97.9 32943.8 72.4 356.6 381.3 147.3
0.4 1.3 105.8 0.8 1.0 3.1 6.5
7.1 82.0 1309.1 31.2 44.1 199.4 55.5
0.5 1.8 110.9 1.1 1.2 5.7 6.6
6.8 79.0 1263.4 29.2 42.5 154.9 55.3
0.05 0.05 0.05 0.05 0.05 0.01 0.15
74
UFFEKJlERULFF
References Andreassen, S., Hovorka, R., Benn, J., Olesen, K . G. and Carson, E. R. (1991) A modelbased approach to insulin adjustment , in M . Stefanelli , A . Hasman , M . Fieschi and
J. Talmon (eds), Proceedings of the Third Conference on Artificial Intelligence in Medicine , Springer -Verlag , pp . 239- 248. Andreassen , S., Jensen , F . V ., Andersen , S. K ., Falck , B ., Kjrerulff , U ., Woldbye , M .,
S0rensen, A . R., Rosenfalck, A . and Jensen, F . (1989) MUNIN - an expert EMG assistant, in J. E. Desmedt (ed.) , Computer-Aided Electromyography and Expert Systems, Elsevier Science Publishers B. V . (North -Holland ) , Amsterdam , Chapter 21. Frydenberg, M . ( 1989) The chain graph Markov property , Scandinavian Journal of Statistics , 17 , pp . 333 - 353 .
Jensen , F . V . ( 1996) An Introduction
to Bayesian Networks . UCL Press , London .
Jensen, F . V ., Kjrerulff , U., Olesen, K . G. and Pedersen, J. (1989) Et forprojekt til et ekspertsystem for drift af spildevandsrensning (an expert system for control of waste water treatment - a pilot project ) , Technical report , Judex Datasystemer A / S, Aalborg , Denmark . In Danish .
Jensen, C. S. and Kong , A . (1996) Blocking Gibbs sampling for linkage analysis in large pedigrees with many loops , Research Report R -96-2048 , Department of Computer Science , Aalborg University , Denmark , Fredrik Bajers Vej 7, DK -9220 Aalborg 0 .
Jensen, F. V ., Lauritzen , S. L . and Olesen, K . G. (1990) Bayesian updating in causal probabilistic
networks by local computations , Computational
Statistics
Quarterly , 4 ,
pp . 269 - 282 .
Heckerman , D ., Horvitz , E . and Nathwani , B . ( 1992) Toward normative expert systems : Part I . The Pathfinder project , Methods of Information in Medicine , 31 , pp . 90- 105.
Kim , J. H . and Pearl, J. (1983) A computational model for causal and diagnostic reasoning in inference systems . In Proceedings of the Eighth International on Artificial Intelligence , pp . 190- 193.
Joint Conference
Lauritzen , S. L ., Dawid , A . P., Larsen, B. N . and Leimer, H.-G. (1990) Independence properties of directed Markov fields , Networks , 20 , pp . 491- 505. Lauritzen , S. L . and Spiegelhalter , D . J . ( 1988) Local computations with probabilities on graphical structures and their application to expert systems . J oumal of the Royal Statistical Society , Series B , 50 , pp . 157- 224.
Shafer, G. and Shenoy, P. P. (1990) Probability propagation , Annals of Mathematics and Artificial -
Intelli -.qence,. 2 ., -pp- . 327- 352 .
Shenoy, P. P. (1996) Binary Join Trees, in D . Geiger and P. Shenoy (eds.) , Proceedingsof the Twelfth Conference on Uncertainty Publishers
, San Francisco
, California
in Artificial
,- -pp- . 492 - 499 .
Intelligence , Morgan Kaufmann
Xu , H . (1994) Computing marginals from the marginal representation in Markov trees, in Proceedings of the Fifth International
Conference on Information
Processing and
Management of Uncertainty in Knowledge-Based Systems (IPMU ) , Cite Interna tionale
Universitaire
, Paris
, France
, pp . 275 - 280 .
BUCKET
ELIMINATION
PROBABILISTIC
: A
UNIFYING
FRAMEWORK
FOR
INFERENCE
R . DECHTER Department
of Information
and
Computer
Science
University of California , Irvine dechter @ics.uci .edu
Abstract . Probabilistic inference algorithms for belief updating , finding the most probable explanation , the maximum a posteriori hypothesis , and the maxi mum expected utility are reformulated within the bucket elimination frame work . This emphasizes the principles common to many of the algorithms appearing in the probabilistic inference literature and clarifies the relation ship of such algorithms to nonserial dynamic programming algorithms . A general method for combining conditioning and bucket elimination is also presented . For all the algorithms , bounds on complexity are given as a function of the problem 's structure .
1. Overview Bucketeliminationis a unifyingalgorithmicframework that generalizes dynamicprogramming to accommodate algorithms for manycomplex problem solvingandreasoning activities , includingdirectionalresolution for propo sitionalsatisfiability(Davisand Putnam, 1960 ), adaptiveconsistency for constraintsatisfaction(Dechterand Pearl, 1987 ), Fourierand Gaussian eliminationfor linearequalitiesand inequalities , and dynamicprogram mingfor combinatorial optimization(BerteleandBrioschi , 1972 ) . Here, after presenting theframework , wedemonstrate that a numberof algorithms for probabilisticinference canalsobe expressed asbucket -eliminationalgorithms. The main virtuesof the bucket -eliminationframeworkare simplicity and generality . By simplicity , we meanthat a completespecification of 75
76
R. DECHTER
bucket -elimination algorithms is feasible without introducing extensive ter -
minology (e.g., graph conceptssuch as triangulation and arc-reversal) , thus making the algorithms accessible to researchers in diverse areas. More im portant , the uniformity of the algorithms facilitates understanding , which encourages cross-fertilization and technology transfer between disciplines . Indeed , all bucket -elimination algorithms are similar enough for any im provement to a single algorithm to be applicable to all others expressed in this framework . For example , expressing probabilistic inference algorithms a.g bucket -elimination methods clarifies the former 's relationship to dynamic programming and to constraint satisfaction such that the knowledge accumulated in those areas may be utilized in the probabilistic framework . The generality of bucket elimination can be illustrated with an algorithm in the area of deterministic reasoning . Consider the following algo-
rithm for deciding satisfiability . Given a set of clauses (a clause is a disjunction of propositional variables or their negations) and an ordering of the propositional variables , d = Ql , ..., Qn, algorithm directional resolution
(DR) (Dechter and Rish, 1994) , is the core of the well-known Davis-Putnam algorithm for satisfiability (Davis and Putnam, 1960). The algorithm is described using buckets partitioning the given set of clauses such that all the clauses containing Q i that do not contain any symbol higher in the ordering are placed in the bucket of Q i , denoted bucketi .
The algorithm (seeFigure 1) processesthe buckets in the reverseorder of d. When processing bucketi , it resolves over Q i all possible pairs of clauses in the bucket and inserts the resolvents into the appropriate
lower buckets .
It was shown that if the empty clause is not generated in this process then the theory is satisfiable and a satisfying truth assignment can be generated in time linear in the size of the resulting theory . The complexity of the
algorithm is exponentially bounded (time and space) in a graph parameter called induced width (also called tree-width) of the interaction graph of the theory , where a node is associated with a proposition and an arc connects
any two nodes appearing in the same clause (Dechter and Rish, 1994) . The belief-network algorithms we present in this paper have much in common with the resolution procedure above. They all possess the prop erty of com piling a theory into one from which answers can be extracted easily and their complexity is dependent on the same induced width graph parameter . The algorithms are variations on known algorithms and , for the most part , are not new , in the sense that
the basic ideas have existed for
some time (Cannings et al., 1978; Pearl, 1988; Lauritzen and Spiegelhalter, 1988 ; Tatman
and
Shachter
Favro , 1990 ; Bacchus
and
, 1990 ; Jensen van
et al . , 1990 ; R .D . Shachter
Run , 1995 ; Shachter
, 1986 ; Shachter
and
, 1988 ;
Shimony and Charniack, 1991; Shenoy, 1992) . What we are presenting here is a syntactic and uniform exposition emphasizing these algorithms ' form
BUCKETELIMINATION
77
Algorithm directional resolution Input : A set of clausescp, an ordering d == Q! , ..., Qn. Output : A decision of whether
Algorithm directional resolution
as a straightforward elimination algorithm . The main virtue of this presentation , beyond uniformity , is that it allows ideas and techniques to flow across the boundaries between areas of research. In particular , having noted that elimination algorithms and clustering algorithms are very similar in the context of constraint processing (Dechter and Pearl , 1989) , we find that this similarity carries over to all other tasks . We also show that the idea of conditioning , which is as universal as that of elimination , can be incorporated and exploited naturally and uniformly within the elimination framework . Conditioning is a generic name for algorithms that search the space of partial value assignments , or partial conditionings . Conditioning means splitting a problem into subproblems based on a certain condition . Al gorithms such as backtracking and branch and bound may be viewed as conditioning algorithms . The complexity of conditioning algorithms is exponential in the conditioning set, however , their space complexity is only linear . Our resulting hybrid of conditioning with elimination which trade off time for space (see also (Dechter , 1996b; R . D . Shachter and Solovitz , 1991)) , are applicable to all algorithms expressed within this framework . The work we present here also fits into the framework developed by Arnborg and Proskourowski (Arnborg , 1985; Arnborg and Proskourowski , 1989) . They present table -based reductions for various NP -hard graph prob lems such as the independent -set problem , network reliability , vertex cover , graph k-colorability , and Hamilton circuits . Here and elsewhere (Dechter and van Beek, 1995; Dechter , 1997) we extend the approach to a different set of problems .
78
R. DECHTER Following
preliminaries
algorithm
for
we
extend
(
4
and
to
)
.
2
-
9
.
tree
present
and
of
We
(
section
8
Then
poly
,
-
a
-
pos
-
utility
tree
for
.
.
expected
' s
)
)
maximum
schemes
section
3
explana
maximum
Pearl
describe
elimination
probable
the
the
to
then
most
finding
finding
-
(
the
algorithms
elimination
bucket
performance
finding
taBks
for
the
algorithms
combining
the
Conclusions
are
given
in
.
provide
of
nodes
a
signify
,
the
and
.
A
of
Xi
,
of
parent
Xi
in
)
E
E
E
,
we
G
,
of
while
(
Xi
its
)
pai
Xi
Let
=
acyclic
=
,
graph
between
having
i
Ipai
Xl
.
.
.
,
,
and
)
the
}
.
=
)
(
.
Xi
)
=
{
IXi
V
,
,
(
Xi
)
,
of
chi
.
.
i
:
#
For
,
{
of
the
arcs
.
.
.
,
Xn
set
(
,
are
is
a
set
edges
,
.
the
If
set
pointing
Xi
)
,
,
to
comprises
,
Pi
has
.
}
of
Xi
arise
Xi
it
,
variables
Ch
if
Xl
the
can
family
graph
variable
denoted
confusion
of
=
.
linked
conditional
directed
is
each
acyclic
directions
}
the
Xi
is
=
j
comprises
The
graph
a
V
,
domains
by
of
the
we
abbreviate
includes
no
Xi
directed
ignored
:
-
over
the
quantified
where
Xi
no
by
,
V
to
nodes
the
}
E
Whenever
directed
E
Xj
given
between
notion
un
graph
from
are
points
child
,
.
Dn
and
Xi
X
Ch
A
G
Xj
value
the
beliefs
acyclic
influences
influences
pa
to
takes
on
partial
directed
and
cycles
(
Xi
,
Xj
.
)
and
.
{
D1
,
Xi
of
identical
X
,
(
set
.
are
,
Xi
denoted
graph
)
domains
,
and
undirected
,
these
pair
(
points
variables
an
Xj
{
Xi
Xi
by
child
a
=
a
causal
relies
that
the
that
is
say
nodes
variables
=
about
by
direct
network
graph
and
Xi
of
reasoning
defined
that
of
belief
directed
elements
is
variables
strength
A
It
existance
the
probabilities
for
.
random
arcs
variables
formalism
uncertainty
representing
The
X
.
with
networks
P
the
the
clustering
conditions
In
of
to
)
relates
we
Preliminaries
Belief
pa
5
7
,
its
taBks
it
section
)
analyze
the
extend
method
der
{
and
Section
join
section
(
,
(
6
conditioning
(
)
2
and
to
hypothesis
section
section
updating
algorithm
section
teriori
(
belief
the
tion
(
.
product
.
,
Xn
A
P
its
The
.
}
be
a
belief
=
=
{
Pi
parents
belief
set
of
random
network
network
}
,
,
namely
is
where
Pi
a
variables
pair
(
G
denotes
over
P
)
where
probability
is
a
directed
relationships
probability
a
multivalued
G
probabilistic
conditional
represents
,
matrices
distribution
Pi
: . . . :
over
form
P(Xl, ...., Xn) ==Ili =lP (Xilxpai ) where an assignment (X I = Xl , ..., Xn = xn ) is abbreviated to x = (Xl , ..., xn) and where Xs denotes the projection of a tuple X over a subset of variables S . An evidence set e is an instantiated subset of variables . A = a denotes a partial assignment to a subset of variables A from their respective domains . We use upper case letter for variables and nodes in a graph and lower case letters for values in variable 's domains .
79
BUCKETELIMINATION
(a)
Figure
2.
Example
belief
2 .1
(b)
network
P (g , f , d , c , b, a ) = P ( glf ) P ( flc , b) P ( dlb , a ) P ( bla ) P ( cla )
Consider
the
belief
network
P ( g , f , d , c , b , a ) == P ( glf Its
acyclic The
graph following
namely
, given
each
proposition
some
observed
rest
of the
are
some
function
evidence
an
is known
that
multiply
, and
small
cycle
elimination relationship We partial
with
conclude tuple
( us , xp ) to Definition subset
an
tasks
variables
are
or for
existing this
singly
the
cycle
. In above
methods
will
be
with
some
of
tuple
S , where
variables Us
, and
functions XES
Xp by )
, the
for
sparse
called -
networks
sections be
,
bucket
presented
-
and
.
a variable a value
functions
( Pearl
Spiegelhal
conventions
Given
permit
, also
and
will
discussed
that
algorithm
approach
subsequent tasks
a util -
networks
well
notational
appended
all
propagation
- cutset
work
hypothesis also
, they
, 1988 ; Lauritzen
the
( map ) ,
variables
- connected this
the
section
the
. Nevertheless
clusters of
decision
NP - hard
extending
of
, 4 . given
( meu ) .
methods
small
a subset
of
to
hypothesis
finally
a subset
( Pearl
each
( mpe ) , or , given
problem
for to
, 1986 ) . These
2 . 2 ( elimination of
are
networks
, S a subset denote
the
of
Msignment
aposteriori to
updating
probability
explanation
, and
{ B ,C } .
: 1 . belief
posterior
probability
assignment
to
of
algorithm
- cutsets
networks
probable
probability
approaches
algorithms
case , pa ( F ) =
the
maximum
tree - clustering
ter , 1988 ; Shachter
belief
their
these
- connected
conditioning
the
utility
main
over
, finding
propagation two
this
most
assignment
expected
2a . In
a maximum
3 . Finding
, finding
a polynomial
with
the
by
, b ) P ( dlb , a ) P ( bla ) P ( cla ) .
, computing
, finding
maximizes
the
1988 ) . The
defined
observations
variables
maximizes
Figure
, 2 . Finding
that
to
queries
variables
variables
It
in
a set
or , given
ity
is given
) P ( flc
defined
not
xp
. Let in
u be
S . We
a
use
of Xp .
a function ( minxh
h defined ),
( maxxh
over ),
80
R. DECHTER
(meanxh), and (Ex h) are definedover U = S - {X } as follows. For every U = u, (minxh )(u) = minxh(u, x), (maxxh) (u) = maxxh(u, x), (Ex h)(u) = Ex h(u, x), and (meanxh)(u) =" Ex~ , where' ,XI is Ithe I cardinality of X 's domain. Given a set of functions h1, ..., hj defined over the subsets81, ..., 8j , the productfunction (lljhj ) and EJ hj are defined over U == UjSj . For every U = u, (lljhj )(u) = lljhj (usj)' and (Ej hj ) (u) = Ej hj (uSj) . 3 . An Elimination
Algorithm
for Belief
Assessment
Belief updating is the primary inference task over belief networks . The task is to maintain the probability of singleton propositions once new evidence arrives . Following Pearl 's propagation algorithm for singly -connected net works (Pearl , 1988) , researchers have investigated various approaches to belief updating . We will now present a step by step derivation of a general variable - elimination algorithm for belief updating . This process is typical for any derivation of elimination algorithms . Let X I = XI be an atomic proposition . The problem is to assess and update the belief in Xl given some evidence e. Namely , we wish to compute P (X I = Xl ie) = Q' . P (X I = Xl , e) , where Q' is a normalization constant . We will develop the algorithm using example 2.1 (Figure 2) . Assume we have the evidence 9 = 1. Consider the variables in the order d1 = A , C , B , F , D , G . By definition we need to compute
P(gif )P(flb,c)P(dla,b)P(cla)P(bla)P(a)
L
P (a , 9 == 1) ==
c,b,j ,d ,g = l
We can now apply some simple symbolic manipulation , migrating each conditional probability table to the left of summation variables which it does not reference, we get == P ( a ) L
P ( cla
) L
C
Carrying pute defined
the the
P ( bla ) L b
computation
rightmost by : AG ( f ) =
from
summation Lg
P ( flb
, c ) LP
f
= 1 P ( glf
right which ) and
( dlb
, a ) L
d
to
left
( from
generates place
it
P ( glf
)
( 1)
9= 1
as
to
A ) , we
a
function
G
over
far
to
the
left
first
com
-
f , AG ( f ) as
possible
,
yielding
P(a) L P(cla)L P(bla ) L P(flb,c)Aa(f) L P(dlb,a) C b f d
(2)
BUCKETELIMINATION bucketa bucketD bucketF bucketB bucketc bucketA
= = = = = =
81
P(glf), 9 = 1 P(dlb,a) P(flb, c) P(bla) P(cla) P(a)
The answer to the query P (alg == 1) can be computed by evaluating the last product and then normalizing . The bucket -elimination algorithm mimics the above algebraic manipu lation using a simple organizational devise we call buckets, as follows . First , the conditional probability tables (CPT s, for short ) are partitioned into buckets , relative to the order used d1 == A , C , B , F , D , G , as follows (going from last variable to first varaible ) : in the bucket of G we place all functions mentioning G . From the remaining CPTs we place all those mentioning D in the bucket of D , and so on . The partitioning rule can be alternatively stated as follows . In the bucket of variable X i we put all functions that mention Xi but do not mention any variable having a higher index . The resulting initial partitioning for our example is given in Figure 3. Note that observed variables are also placed in their corresponding bucket . This initialization step corresponds to deriving the expression in Eq . ( 1) . Now we process the buckets from top to bottom , implementing the
82
R. DECHTER
Bucket G Bucket D Bucket F Bucket B Bucket C
P(a )
Bucket A Figure 4.
Bucket elimination along ordering d 1 = A , C, B , F, D , G.
right to left computation of Eq. (1) . Bucketa is processedfirst . Processing a bucket amounts to eliminating the variable in the bucket from subsequent computation
. To eliminate
G , we sum over all values of g . Since , in this case
we have an observed value 9 = 1 the summation is over a singleton value .
Namely, AG(f ) = L9 =1P(glf ), is computedand placedin bucketF (this correspondsto deriving Eq. (2) from Eq. (1)). New functions are placed in lower buckets using the same placement rule . BucketD is processed next . We sum-out D getting AD(b, a) = Ed P (dlb, a) ,
that is computed and placed in bucketB, (which correspondsto deriving Eq. (3) from Eq. (2)). The next variable is F . BucketF contains two functions P (fJb, c) and Aa (f ), and thus, following Eq. (4) we generate the function
AF(b, c) := Ll P(flb , c) . Aa(f ) whichis placedin bucketB(this corresponds to deriving Eq. (4) from Eq. (3)) . In processingthe next bucketB, the function AB(a, c) == Lb (P (bla) . An (b, a) . AF(b, c)) is computed and placed in bucketc (deriving Eq. (5) from Eq. (4)). In processing the next bucketc, Ac (a) = LCECP (cla) . AB(a, c) is computed (which correspondsto deriving Eq. (6) from Eq. (5)). Finally , the belief in a can be computed in bucketA, P (alg == 1) == P (a) . AC(a) . Figure 4 summarizes the flow of computation of the bucket elimination algorithm for our example . Note that since throughout this process we recorded two -dimensional functions at the most ,
the complexity the algorithm using ordering d1 is (roughly) time and space quadratic in the domain sizes. What will occur if we use a different variable ordering ? For example , lets apply the algorithm using d2 = A , F , D , C , B , G . Applying algebraic manipulation from right to left along d2 yields the following sequence of deri vations
:
P(a, 9 = 1) = P (a) Ll Ld Lc P(cla) Lb P(bla) P(dla, b)P(flb , c) Lg =1P (glf )=
BUCKETELIMINATION
bucket
B
bucket
C
bucket
D
bucket
F
bucket
A
83
-
-
= P(a)
(a) Figure 5.
P(a) Ll P(a) Ll P(a) Ll P(a) Ll
(b)
The buckets output
when processing along d2 = A , F , D , C , B , G
AG(f ) Ld Lc P(cla) Lb P(bla) P(dla, b)P(flb , c)== Aa(f ) Ld Lc P(Cla)AB(a, d, c, f ) == A9(f ) Ld Ac(a, d, f ) == Aa(f )AD(a, f ) =
P (a)AF(a) The bucket elimination
process for ordering
d2 is summarized
in Figure
5a. Each bucket contains the initial CPTs denoted by P 's, and the functions generated throughout the process, denoted by AS. We summarize with a general derivation of the bucket elimination algo-
rithm , called elim-bel. Consider an ordering.ofthe variablesX = (Xl , ..., Xn). Using the notation Xi = (Xl ' ..., Xi) and xi = (Xi, Xi+l , ..., Xj), where Fi is the family of variable Xi , we want to compute :
P(Xl, e) = ~ P(Xn,e) = ~ ~ IIiP(xi, elXpai )= X= X2n
_(nX2
l ) Xn
Seperating X n from the rest of the variables we get :
P(Xn,elxpan )llxiEchnP (xi,elXpai )= }=: IIXiEX - FnP(Xi, elXpai ) .E xn _ (n -l) X=X2
where
An(XUn ) = L P(xn, eIXpan )I1XiEChnP (Xi, e!Xpai ) xn
(7)
84
R. DECHTER
Figure 6.
Algorithm
elim - bel
Where Un denoted the variables appearing with X n in a probability component , excluding Xn . The process continues recursively with Xn - l . Thus , the computation performed in the bucket of Xn is captured by Eq . (7) . Given ordering Xl , ..., Xn , where the queried variable appears first , the C PTs are partitioned using the rule described earlier . To process each bucket , all the bucket 's functions , denoted AI , ..., Aj and defined over subsets SI , ..., Sj are multiplied , and then the bucket 's variable is eliminated by summation . The computed function is Ap : Up - t R , Ap == }::;x ni = IAi , where Up = UiSi - Xp . This function is placed in the bucket of it : largest index variable in Up. The procedure continues recursively with the bucket of the next variable going from last variable to first variable . Once all the buckets are processed, the answer is available in the first bucket . Algorithm elim -bel is described in Figure 6.
Theorem 3.1 Algorithm elim-bel compute the posterior belief P (xlle ) for any given ordering of the variables. 0 Both the peeling algorithm for genetic trees (Cannings et al., 1978), and Zhang and Poole's recent algorithm (Zhang and Poole, 1996) are variations of elim-bel.
BUCKETELIMINATION
(a )
85
(c)
(b)
Figure 7.
Two ordering of the moral graph of our example problem
3.1. COMPLEXITY We see that although elim -bel can be applied using any ordering , its complexity varies considerably . Using ordering d1 we recorded functions on pairs of variables only , while using d2 we had to record functions on four variables
(see Bucketc in Figure 5a) . The arity of the function recorded in a bucket equals the number of variables appearing in ing the bucket 's variable . Since recording a space exponential in r we conclude that the exponential in the size of the largest bucket
that processed bucket , exclud function of arity r is time and complexity of the algorithm is which depends on the order of
.
processIng
.
Fortunately , for any variable ordering bucket sizes can be easily read in advance
from
an ordered
associated
with
the elimination
process . Consider
the moral graph of a given belief network . This graph has a node for each propositional variable , and any two variables appearing in the same CPT are connected in the graph . The moral graph of the network in Figure 2a is given in Figure 2b . Let us take this moral graph and impose an ordering on its nodes. Figures 7a and 7b depict the ordered moral graph using the two orderings d1 = A , C , B , F , D , G and d2 = A , F , D , C , B , G . The ordering is pictured from bottom up . The width of each variable in the ordered graph is the number of its earlier neighbors in the ordering . Thus the width of G in the ordered graph along d1 is 1 and the width of F is 2. Notice now that using ordering d1, the
number
of variables
in the
initial
buckets
of G and
2 respectively . Indeed , in the initial partitioning
F , are also
1 , and
the number of variables
mentioned in a bucket (excluding the bucket's variable) is always identical to the width of that node in the corresponding ordered moral graph . During processing we wish to maintain the correspondance that any
86
R. DECHTER
two nodes in the graph are connected if there is function (new or old ) deefined on both . Since, during processing , a function is recorded on all the variables apearing in a bucket , we should connect the corresponding nodes in the graph , namely we should connect all the earlier neighbors of a processed variable . If we perform this graph operation recursively from last node to first , (for each node connecting its earliest neighbors ) we get the the induced graph . The width of each node in this induced graph is identical to the bucket 's sizes generated during the elimination process (see Figure 5b) . Example 3 .2 The induced moral graph of Figure 2b, relative to ordering d1 == A , C , B , F , D , G is depicted in Figure 7a. In this case the ordered graph and its induced ordered graph are identical since all earlier neighbors of each node are already connected. The maximum induced width is 2. Indeed, in this case, the maximum arity of functions recorded by the elimination algorithms is 2. For d2 == A , F , D , C , B , G the induced graph is depicted in Figure 7c. The width of C is initially 1 (see Figure 7b) while its induced width is 3. The maximum induced width over all variables for d2 is 4, and so is the recorded function 's dimensionality . A formal definition of all the above graph concepts is given next . Definition 3 .3 An ordered graph is a pair (G , d) where G is an undirected graph and d = Xl , ..., Xn is an ordering of the nodes. The width of a node in an ordered graph is the number of the node 's neighbors that precede it in the ordering . The width of an ordering d, denoted w (d) , is the maximum width over all nodes. The induced width of an ordered graph , w* (d) , is the width of the induced ordered graph obtained as follows : nodes are processed from last to first ,. when node X is processed, all its preceding neighbors are connected. The induced width of a graph , w * , is the minimal induced width over all its orderings . The tree-width of a graph is the minimal induced width plus one (Arnborg , 1985) . The established connection between buckets ' sizes and induced width motivates finding an ordering with a smallest induced width . While it is known that finding an ordering with the smallest induced width is hard (Arnborg , 1985) , usefull greedy heuristics as well as approximation algorithms are available (Dechter , 1992; Becker and Geiger , 1996) . In summary , the complexity of algorithm elim -bel is dominated by the time and space needed to process a bucket . Recording a function on all the bucket 's variables is time and space exponential in the number of variables mentioned in the bucket . As we have seen the induced width bounds the arity of the functions recorded ; variables appearing in a bucket coincide with the earlier neighbors of the corresponding node in the ordered induced moral graph . In conclusion :
BUCKETELIMINATION
87
Theorem 3 .4 Given an ordering d the complexity of elim -bel is (time and space) exponential in the induced width w* (d) of the network 's ordered moral graph . 0
3.2. HANDLINGOBSERVATIONS Evidence should be handled in a special way during the processing of buck ets. Continuing with our example using elimination on order d1, suppose we wish to compute the belief in A = a having observed b = 1. This observation is relevant only when processing bucket B . When the algorithm arrives at that bucket , the bucket contains the three functions P (bla) , AD (b, a) , and AF (b, c) , as well as the observation b == 1 (see Figure 4) . The processing rule dictates computing AB(a, c) == P (b = 1Ia) AD(b == 1, a) AF(b == 1, c) . Namely , we will generate and record a two -dimensioned function . It would be more effective however , to apply the assignment b == 1 to each function in a bucket separately and then put the resulting func tions into lower buckets . In other words , we can generate P (b == 11a) and AD (b == 1, a) , each of which will be placed in the bucket of A , and AF (b = 1, c) , which will be placed in the bucket of C . By so doing , we avoid increasing the dimensionality of the recorded functions . Processing buckets containing observations in this manner automatically exploits the cutset conditioning effect (Pearl , 1988) . Therefore , the algorithm has a special rule for processing buckets with observations : the observed value is assigned to each function in the bucket , and each resulting function is moved individ ually to a lower bucket . Note that , if the bucket of B had been at the top of our ordering , as in d2, the virtue of conditioning on B could have been exploited earlier . When processing bucketB it contains P (bla) , P (dlb, a) , P (flc , b) , and b == 1 (see Figure 5a) . The special rule for processing buckets holding observations will place P (b == lla ) in bucket A, P (dlb == 1, a) in bucketD , and P (flc , b == 1) in bucketF . In subsequent processing , only one-dimensional functions will be recorded . We see that the presence of observations reduces complexity . Since the buckets of observed variables are processed in linear time , and the recorded functions do not create functions on new subsets of variables , the corresponding new arcs should not be added when computing the induced graph . Namely , earlier neighbors of observed variables should not be connected . To capture this refinement we use the notion of adjusted induced graph which is defined recursively as follows . Given an ordering and given a set of observed nodes, the adjusted induced graph is generated by processing from top to bottom , connecting the earlier neighbors of unobserved nodes only . The adjusted induced width is the width of the adjusted induced graph .
88
R. DECHTER
Theorem
3 .5 Given a belief network having n variables , algorithm elim -
bel when using ordering d and evidencee, is (time and space) exponential in the adjusted inducedwidth w* (d, e) of the network'8 orderedmoral graph. 0
3 .3 .
FOCUSING
ON
RELEVANT
SUBNETWORKS
We will now present an improvement to elim -bel whose essenceis restricting the computation to relevant portions of the belief network . Such restrictions are already available in the literature in the context of existing algorithms
(Geiger et al., 1990; Shachter, 1990) . Since summation over all values of a probability function is 1, the recorded functions of some buckets will degenerate to the constant 1. If we could recognize such cases in advance, we could avoid needless compu tation by skipping some buckets . If we use a topological ordering of the
belief network's acyclic graph (where parents precede their child nodes) , and assuming that the queried variable starts the ordering ! , we can recognize skipable buckets dynamically , during the elimination process. Proposition 3 .6 Given a belief network and a topological ordering ~Y 1, ..., Xn , algorithm elim -bel can skip a bucket if at the time of processing, the bucket contains no evidence variable , no query variable and no newly computed function
. 0
Proof : If topological ordering is used, each bucket (that does not contain the queried variable) contains initially at most one function describing its probability conditioned on all its parents . Clearly if there is no evidence, summation will yield the constant 1. 0 Example 3 .7 Consider again the belief network whose acyclic graph is given in Figure 2a and the ordering d1 = A , C , B , F , D , G , and assume we want to update the belief in variable A given evidence on F . Clearly the buckets of G and D can be skipped and processing should start with bucket F . Once the bucket of F is processed, all the rest of the buckets are not skipable. Alternatively , the relevant portion of the network can be precomputed by using a recursive marking procedure applied to the ordered moral graph .
(seealso (Zhang and Poole, 1996)). Definition 3 .8 Given an acyclic graph and a topological ordering that starts with the queried variable , and given evidence e, the marking process works as follows . An evidence node is marked, a neighbor of the query variable is marked, and then any earlier neighbor of a marked node is marked . 1otherwise , the queried variable can be moved to the top of the ordering
BUCKETELIMINATION Algorithm .
.
89
elim -bel
.
2 . Backward
: For p -(- n downto
1 , do
for all the matrices AI , A2, ..., Aj in bucketp, do . (bucket with observed variable) if Xp == xp appears in bucketp, then substitute Xp = xp in eachmatrix Ai and put eachin appropriate bucket.
. else, if bucketpis NOT skipable , t~en
Up-(- Ui=1Si - {Xp} Ap== LXp lli =l Ai. Add Apto the largest -index variable in Up' .
.
.
Figure 8.
Improved algorithm elim-bel
The marked belief subnetwork, obtained by deleting all unmarkednodes, can be processed now by elim-bel to answer the belief-updating - query. - It is eaBY to see that Theorem ponential graph .
3 .9 The complexity of algorithm elim - bel given evidence e is exin the adjusted induced width of the marked ordered moral sub -
Proof : Deleting the unm larked nodes from the belief network results in a belief subnetwork whose distribution is identical to the marginal distribu ~~ tion over the marked variables . o .
4. An Elimination
Algorithm
for mpe
In this section we focus on the task of finding the most probable explana tion . This task appears in applications such as diagnosis and abduction . For example , it can suggest the disease from which a patient suffers given data on clinical findings . Researchers have investigated various approaches to finding the mpe in a belief network . (See, e.g., (Pearl , 1988; Cooper , 1984; Peng and Reggia , 1986; Peng and Reggia, 1989)) . Recent proposals include best first -search algorithms (Shimony and Charniack , 1991) and algorithms based on linear programming (Santos, 1991) . The problem is to find XOsuch that P (xO) == maxx IIiP (xi , elxpai) where x == (Xl ' ..., xn ) and e is a set of observations . Namely , computing for a given ordering Xl , ..., Xn ,
M==llXn !axP(x)==IJlax ) Xn -l max Xnlli=lP(Xi,elXpai
(8)
This can be accomplished as before by performing the rnxirnization operation along the ordering from right to left , while migrating to the left , at
90
R. DECHTER
each step , all components that do not mention the maximizing variable . We get ,
M == m~x P (xn , e) == Jllax max IIiP (xi , elXpai) == X = Xn
X (n - l )
Xn
ill_ax IlXiEX - FnP(Xi, elXpai) . max P (Xn, elxpan)IlXiEchnP(Xi, elXpai) =
X =
Xn
-
l
Xn
ill_ax IIXiEX - FnP(Xi, elXpai) . hn(xun)
X =
Xn
-
l
where
hn(xun) == ~ ~x P(Xn, eIXpan )I1XiEChnP (Xi, elXpai ) Where Un are the variables appearing in components defined over X n. Clearly , the algebraic manipulation of the above expressions is the same as the algebraic manipulation for belief assessment where summation is replaced by maximization . Consequently , the bucket -elimination procedure elim -mpe is identical to elim -bel except for this change. Given ordering Xl , ..., X n, the conditional probability tables are partitioned as before . To process each bucket , we multiply all the bucket 's matrices , which in this case
are denoted hi , ..., hj and defined over subsets51, ..., 5j , and then eliminate the bucket 's variable by maximi .zation . The computed function in this case
is hp : Up-t R, hp = maxxpni =lhi , whereUp= UiSi - Xp. The function obtained by processing a bucket is placed in the bucket of its largest -index
variable in Up. In addition , a function x~(u) = argmaxxphp (u) , which relates an optimizing value of Xp with each tuple of Up, is recorded and placed in the bucket
of X p .
The procedure continues recursively , processing the bucket of the next variable while going from last variable to first variable . Once all buckets are processed, the mpe value can be extracted in the first bucket . When this backwards phase terminates the algorithm initiates a forwards phase to compute an mpe tuple . Forward phase: Once all the variables are processed, an mpe tuple is computed by assigning values along the ordering from X I to Xn , consult ing the information recorded in each bucket . Specifically , once the partial
assignment x == (Xl , ..., Xi- I) is selected, the value of Xi appended to this tuple is xi (X) , where XOis the function recorded in the backward phase. The algorithm is presented in Figure 9. Observed variables are handled as in elim - bel .
Example 4 .1 Consider again the belief network of Figure 2. Given the or dering d == A , C , B , F , D , G and the evidence g == 1, process variables from last to the first after partitioning the conditional probability matrices into
buckets, such that bucketa = { P (glf ), 9 = 1} , bucketD == { P (dlb, a)} ,
BUCKETELIMINATION
Figure 9.
bucketF bucket
A
Ilf
) ,
is
placed
==
{ P
( f
==
{
( a ) }
and
maxd
( dl
( dlb
matrices
..
and
place
the
function .
P
( a )
tension
of
==
.
is
buckets
,
we
in
F
,
to
determined
the ,
1 ,
bucket be
( bl
B
.
h D hc
mpe
.
( a )
==
==
the
( f
=
.
)
==
also ,
now
maxi
h F
p
P
given
( flb
tuple
( c /a )
( g
== ( f
( b , a )
DO
( b , a )
by
.
hc
we
hB
==
( f
it
( a A
going
) ,
record
place .
) ==
two
, c )
bucket ,
P
and
hD
,
and
in
,
contains
B
( b , c )
maxc
mpe
( c /a ) }
argmaxhG
eliminate
is
with
)
Record
To
( b , a )
ha ( f
next
value
along
.
( b , c ) B
{ P
computing
processed
hF
a )
get GO
by
bucket
compute ,
==
==
function
Compute
P
bucketc
bucketD
in
Finally
, c ) ,
in and
M
==
forward
.
process
which
) .
maxb
C
( a ) ,
( f
,
we
partial
can
compile
tuples
be
viewed
as
information to
a
compilation
regarding
variables
higher
in
the the
( or most
o .rdering
learning
)
probable ( see
ex
also
-
section
.2 ) .
Similarly bounded are
in
hG
) } 9
The
result of
( bla
assign
Process
function
A
backward ,
and
, .
the
eliminate
. hG
.
bucket
( a , c )
bucket
the
The
7
, c )
G
{ P
bucketF
well
resulting
To in
through
phase
putting
h B
it
maxa
as
The
==
process in
and
Algorithm elim-mpe
bucketB
To
, a ) .
( flb the
bucketc place
P
,
result
bucketc
b , a )
argmaxDP
.
the
in P
P
place
/b , c ) }
91
to
the
case
the
induced
exponentially bounded
by
of in
the
belief
updating
dimension width
, of
w
the
the
* ( d , e )
complexity recorded
of
the
of matrices
ordered
elim
- mpe
, and moral
graph
is
those .
In
92
R. DECHTER
summary : Theorem 4 .2 Algorithm elim -mpe is complete for the mpe task. Its complexity (time and space) is O (n . exp (w* (d, e) )) , where n is the number of variables and w * (d, e) is the e-adjusted induced width of the ordered moral graph . 0
5. An Elimination
Algorithm
for MAP
We next present an elimination algorithm for the map task . By its defini tion , the task is a mixture of the previous two , and thus in the algorithm some of the variables are eliminated by summation , others by maximization . Given a belief network , a subset of hypothesized variables A = { AI , ..., Ak } , and some evidence e, the problem is to find an assignment to the hypoth esized variables that maximizes their probability given the evidence. For- . mally , we wish to compute maXa;k P (x , e) = maXa;k EXk +l IIi = lP (Xi , elxpai) where x == (aI , ..., ak, Xk+ l , ..., xn) . In the algebraic manipulation of this expression , we push the maximization to the left of the summation . This means that in the elimination algorithm , the maximized variables should initiate the ordering (and therefore will be processed last ) . Algorithm elim map in Figure 10 considers only orderings in which the hypothesized vari ables start the ordering . The algorithm has a backward phase and a forward phase, but the forward phase is relative to the hypothesized variables only . Maximization and summation may be somewhat interleaved to allow more effective orderings ; however , we do not incorporate this option here. Note that the relevant graph for this task can be restricted by marking in a very similar manner to belief updating case. In this case the initial mark ing includes all the hypothesized variables , while otherwise , the marking procedure is applied recursively to the summation variables only . Theorem 5 .1 Algorithm elim -map is complete for the map task. Its complexity is O (n . exp (w* (d, e) ) , where n is the number of variables in the relevant marked graph and w* (d, e) is the e-adjusted induced width of its marked moral graph . 0 6 . An Elimination
Algorithm
for MEU
The last and somewhat more complicated task we address is that of find ing the maximum expected utility . Given a belief network , evidence e, a real-valued utility function u (x ) additively decomposable relative to func tions 11, ..., Ij defined over Q = { Ql , ..., Qj } , Qi <; X , such that u (x ) = LQjEQ fj (xQj ) ' and a subset of decision variables D = { D1 , ...Dk } that are assumed to be root nodes, the meu task is to find a set of decisions
BUCKETELIMINATION
Figure 10.
93
Algorithm elim-map
dO== (dOl, ..., dOk) that maximizes the expected utility . We assumethat the variables not appearing in D are indexed Xk+l , ..., Xn . Formally, we want to compute E ==
and
max L1,..Xn d1 ,..,dXk k+
lli =l P (Xi, elXpai' d1, ..., dk)U(X) ,
do == argmaxDE
As in the previous tasks , we will begin by identifying the computation associated with Xn from which we will extract the computation in each bucket . We denote an assignment to the decision variables by d == (d1, ..., dk) and xi == (Xk, ..., xi ) . Algebraic manipulation yields
E=m ;xXk IIi=!P(X,elXpai i 'd)Q }:J"EQ -}:n+ -ll}:Xn fj (xQj)
We can now separate the components in the utility functions into those mentioning X n, denoted by the index set tn , and those not mentioning X n, labeled with indexes in = { I , ..., n} - tn. Accordingly we get
E=max :+l)}= :n~lP(Xi ,elxpai 'd).(}= :nfj(xQj )+JEtn }= dXk .1 .:fj (xQj)) _(}= n -l Xn JE
94
R. DECHTER
E == m; x[_(nL L IIi : lP (Xi, elXpai ' d} jEln L fj (xQj} Xk+l l) Xn L L IIi=lP (Xi, e\Xpai ' d) L fj (xQj)] _(n- l) Xn jEtn Xk+l By migrating to the left of Xn all of the elements that are not a function of Xn , we get
max L-l llXiEX -Fn P(Xi,elxpai ' d).(L fj(xQj ))LXnllXiEFnP (Xi,elxpai ' d) d [Xk . l -n JE n +l (9) +-n L-l IIxiEx -FnP (xi,elXpai ' d).LXnIIxiEFnP (xi,elXpai ' d)jEtn L fj (xQj )] Xk +l
An(xunld)==L llXiEFiP (Xi, elxpai ' d) Xn
We define en over W n as
8n(xWnld ) ==L llXiEFnP (Xi, elXpai ' d) L Ij (XQj)) Xn
jEtn
1111111111111
We denote by Un the subset of variables that appear with Xn in a proba bilistic component , excluding Xn itself , and by W n the union of variables that appear in probabilistic and utility components with Xn , excluding Xn itself . We define An over Un as (x is a tuple over Un U Xn )
After substituting Eqs. (10) and (11) into Eq. (9), we get """""
"""""
en (x WnId) ]
E == m: x -Ln.., llXiEX- FnP(Xi, elxpai'd).'\n(xUnld)[~ Ij (xQj)+ An (XUn Id) - l JEln Xk + l
(12) The functions (In and An compute the effect of eliminating Xn . The result
(Eq. (12)) is an expression, which does not include Xn , where the product has one more matrix An and the utility
components have one more
element Tn== ~ . Applyingsuchalgebraic manipulation to therestof the variables in order yields the elimination algorithm elim -meu in Figure 11. We assume here that decision variables are processed last by elim -meu . Each bucket contains utility components f)i and probability components , Ai . When
there is no evidence , An is a constant
and we can incorporate
the marking modification we presented for elim -bel . Otherwise , during processing, the algorithm generates the Ai of a bucket by multiplying all its
BUCKETELIMINATION
Figure 11.
Algorithm
95
elim - meu
probability components and summing over Xi . The () of bucket Xi is computed as the average utility of the bucket ; if the bucket is marked , the average utility of the bucket is normalized by its A. The resulting () and A are placed into the appropriate buckets . The maximization over the decision variables can now be accomplished using maximization as the elimination operator . We do not include this step explicitly , since, given our simplifying assumption that all decisions are root nodes, this step is straightforward . Clearly , maximization and summation can be interleaved to some degree, thus allowing more efficient orderings . As before , the algorithm 's performance can be bounded as a function of the structure of its augmented graph. The augmented graph is the moral graph augmented with arcs connecting any two variables appearing in the same utility component Ii , for some i . Theorem 6 .1 Algorithm elim -meu computes the meu of a belief network augmented with utility components (i . e., an influence diagram ) in 0 (n .
96
R. DECHTER
Z3 Z2 ~
VI V2 U3 Xl (a)
(b)
Figure 12.
(a) A poly-tree and (b) a legal processing ordering
exp (w * (d , e) ) , where w * (d , e) is the induced width along d of the augmented moral graph . 0 Tatman and Schachter (Tatman and Shachter , 1990 ) have published an algorithm that is a variation of elim - meu , and Kjaerulff 's algorithm (Kjreaerulff , 1993 ) can be viewed as a variation of elim - meu tailored to dynamic probabilistic networks .
7. Relation of Bucket Elimination
to Other Methods
7.1. POLY-TREE ALGORITHM When the belief network is a poly -tree , both belief assessment, the mpe task and map task can be accomplished efficiently using Pearl 's poly -tree
algorithm (Pearl, 1988) . As well, when the augmentedgraph is a tree, the meu can be computed efficiently . A poly -tree is a directed acyclic graph whose underlying undirected graph has no cycles. We claim that if a bucket elimination algorithm process variables in a
topological ordering (parents precedetheir child nodes), then the algorithm coincides (with someminor modifications) with the poly-tree algorithm . We will demonstrate the main idea using bucket elimination for the m pe task . The arguments are applicable for the rest of the tasks . Example
7 .1 Consider the ordering Xl , U3, U2, VI , YI , ZI , Z2, Z3 of the poly -
tree in Figure 12a, and assumethat the last four variablesare observed(here we denote an observed value by using primed lowercase letter and leave other
variables in lowercase ) . Processingthe bucketsfrom last to first , after the first four buckets have been processed as observation buckets, we get
bucket(U3) = P (U3) , P (xllul , U2, U3) , P (ZI3Iu3) bucket(U2) = P (U2), P (z/2Iu2)
BUCKETELIMINATION
97
bucket(U1) = P(Ul), P(Z/IIU1) bucket(,X 1)- := P(y/llx1) . -. . When processing bucket(U3) by elim-mpe, we get hU3(Ul , u2, U3), which is placed in bucket(U2) . The final resulting bucketsare bucket(U3) == P (U3), P (xllul , u2, U3) , P (ZI3Iu3) bucket(U2) == P (U2), P (ZI2Iu2), hu3(Xl , U2, UI) bucket(UI) == P (UI) , P (z' lluI ), hu2(Xl , UI) bucket(XI ) == P (Y' llxI ), hUl(XI) We can now choose a value Xl that maximizes
the product
in X I 's bucket ,
then choose a value Ul that maximizes the product in VI 's bucket given the selected value of X I , and so on. I t is easy to see that if elim -mpe uses a topological ordering of the
poly-tree, it is time and spaceO(exp(IFI )), where IFI is the cardinality of the maximum family size. For instance , in Example 7.1, elim -mpe records
the intermediate function hU3(Xl , U2, Ul ) requiring O(k3) space, where k bounds
the
domain
size for
each
variable
. Note , however
, that
Pearl ' s al -
gorithm (which is also time exponential in the family size) is better , as it records functions on single variables only . In order to restrict space needs, we modify elim -mpe in two ways . First , we restrict processing to a subset of the topological orderings in which sibling nodes and their parent appear consecutively as much as possible. Second, whenever the algorithm reaches a set of consecutive buckets from the same family , all such buckets are combined and processed aE one superbucket. With this change, elim -mpe is similar to Pearl 's propagation algorithm on poly -trees .2 Processing a super-bucket amounts to eliminating all the super-bucket 's variables without recording intermediate results . Example 7 .2 Consider Example 7.1. Here, instead of processing each b1tcket OfUi separately, we compute by a br'ute-force algorithm the function hu1,U2,U3(Xl ) in the super-bucket of VI , V2, U3 and place the function in the bucket of XI We get the unary function hCI1 ,U2,(T3(XI ) == maXttl,U211t3 P (U3)P (XII 'UI, U2, 'U3)P (Z/3Iu3)P ('lt2)P (Z/2Iu2)P ('ltl ) P (Z/1Iul ) .
The details for obtaining an ordering such that all families in a poly -tree can be processed a5 super-buckets can be worked out , but are beyond the scope
of this
Proposition
paper . In summary
,
7 .3 There exist an ordering of a poly-tree, such that bucket-
elimination algorithms (elim-bel, elim-mpe, etc.) with the super-bucketmodification have the same time and space complexity as Pearl 's poly -tree algorithm for the corresponding tasks. The modified algorithm 's time complexity is exponential in the family size, and it requires only linear space. 0 2Actually , Pearl 's algorithm rooted
tree
in
order
to
be identical
should be restricted with
ours .
to message passing relative
to one
98
R. DECHTER
F
Figure 13 .
7 .2 .
@
Clique - tree associated with the induced graph of Figure 7a
JOIN-TREECLUSTERING
Join -tree clustering (Lauritzen and Spiegelhalter , 1988) and bucket elimi nation are closely related and their worst -case complexity (time and space) is essentially the same. The sizes of the cliques in tree-clustering is identical to the induced -width plus one of the corresponding ordered graph . In fact , elimination may be viewed as a directional (i .e., goal- or query -oriented ) version of join - tree clustering . The close relationship between join -tree clustering and bucket elimination can be used to attribute meaning to the intermediate functions computed by elimination . Given an elimination ordering , we can generate the ordered moral induced graph whose maximal cliques (namely , a maximal fully -connected subgraph ) can be enumerated as follows . Each variable and its earlier neighbors are a clique , and each clique is connected to a parent clique with whom it shares the largest subset of variables (Dechter and Pearl , 1989) . For example , the induced graph in Figure 7a yields the clique-tree in Figure 13, If this ordering is used by tree-clustering ,the same tree may be generated . The functions recorded by bucket elimination can be given the following meaning (details and proofs of these claims are beyond the scope of this paper ) . The function hp(u) recorded in bucketp by elim -mpe and defined over UiSi - { Xp } , is the maximum probability extension of u , to variables appearing later in the ordering and which are also mentioned in the clique subtree rooted at a clique containing Up' For instance , hF (b, c) recorded by elim -mpe using d1 (see Example 3.1) equals maxf ,g P (b, c, f , g) , since F and G appear in the clique-tree rooted at (FC B ) . For belief assessment, the function Ap = l:::xp ni =lAi , defined over Up = UiSi - Xp , denotes the probability of all the evidence e+P observed in the clique subtree rooted at a clique containing Up, conjoined with u . Namely , Ap(U) = P (e+P, u) .
BUCKETELIMINATION
99 P(d=llb,a)P(g=Olf =O) P(d=lib,s)P(g=Olf=1) P(d=lib,s)P(g=Olf=O)
...
...
...
...
P(d=Ilb,a)P(g=Olf=1) Figure 14. probability tree
8 . Combining
Elimination
and Conditioning
A serious drawback of elimination algorithms is that they require considerable memory for recording the intermediate functions . Conditioning , on the other hand , requires only linear space. By combining conditioning and elimination , we may be able to reduce the amount of memory needed yet still have performance guarantee . Conditioning can be viewed as an algorithm for processing the algebraic expressions defined for the task , from left to right . In this case, partial results cannot be assembled; rather , partial value assignments (conditioning on subset of variables ) unfold a tree of subproblems , each associated with an assignment to some variables . Say, for example , that we want to compute the expression for m pe in the network of Figure 2: M
=
max P (glf ) P ( flb , c ) P ( dla , b) P ( cla ) P ( bla ) P ( a ) a,c,b,f ,d,g
= max P ( a ) maxP a c
( cla ) maxP b
( bla ) maxP f
( flb , c ) maxP d
( dlb , a ) maxP 9
(glf ) . ( 13 )
We can compute along
the
traversed algorithms
the
ordering either such
expression
from
breadth
first - first
as best - first
by traversing variable
to
or depth - first search
the tree
last
variable
and will
and branch
in Figure .
result
and bound
The
14 , going
tree
in known .
can
be
search
100
R. DECHTER
Algorithm
elim
Input d ;
: A a
subset :
:
1 .
For
2 .
of
The
Initialize
PI
.
P
P
=
=
probable
We
C
output
max
of
{ p , PI p
and to
will
on
and
}
V C
mpe
a
== X
-
C
. Clearly
m :
x
P
responding
value
denote
by
v
== m ' i(x
m ~ x
every
the
partial
a
of
the
is
computed 0
When graph
( n
Given
. exp
on
be
the
variables ordered
P
with
conditioned
( c , v , e ) = = maXc
c , we
elimina
variables
V
and
by
, vlliP
compute
= lP
( Xi
maxv
while
, C
c an
-
~
X
,
assignment
, c , v , elXpai
P
kept
. The
the
a
( v , c , e )
)
and
a
cor
-
without variables
of
( w * ( d , eUc ordered
) +
O ICI
moral
in such
e
U
that
( n ) ) ,
U
all
variables
of
conditioned
15
.
complexity
the c )
the
Figure
space of
the
nodes
connecting
earlier
adjusted
, C
( w * ( d , cUe
where
constitute its
in
and
variables . exp
graph
c
for retaining
is
ordered .
In
this
neighbors
.
conditioning is
tuple
rest
w * ( d , e
and
variables
enumerated
presented
time the
width
observed
- mpe
the
conditioned
the
is
over
)
be
, and
c ,
generated
set
the will
algorithm
induced
conditioned
- cond
, e , CIXpai
treating
variables
both is
( Xi
computation
assignment
for
elim
the can
. ) .
- mpe
to
probability
graph
8 . 1
O
be
by
and
algorithm
c .
observations
conditioning of
assignment
basic
value
induced
is
an
- cond
combining
argmaxvIIi
maximum
evidence
plexity
as
probability
elim
conditioned
will
adjusted
Theorem of
the
particular
graph
=
This
exponentially
both
,
.
subset
algorithm .
the
bounded
ity
) (c )
probability
Given
of
of a
tuple
elimination
computing
case
variables
tuple
variables
maximum
moral
idea be
combinations
of
cUe
Algorithm
C
( Xv
using
with maximum
tuple
15 .
Let
maximizing
observed
the
c , do
maximizing
.
. We
( x , e )
, for
as
of e .
,
Therefore
by
ordering
.
- mpe the
the
task
=
elim
( update
demonstrate
the
} ; an
; observations
assignment
Figure
tion
, . . . , Pn
o .
The
Return
{ PI
variables
assignment
f ~
BN
conditioned
most
every
.
- mpe
network
C
Output
- cond
belief
the
that
a
induced
was
adjusted
cycle
- cutset
induced
,
the
space
) ) , while
width
its
width
-
com
w * ( d , cUe
relative
of
complex time
the equals
to
e
graph 1 . In
and
,
) ,
the this
BUCKETELIMINATION case
elim
- cond
- mpe
1988 ; Dechter Clearly take
ables
. There that
can
an
by
the
one
super
some
, during
We
mon
bucket
the
of
elim - mpe
possible
The
for
are
and to
belief
algorithms be
more
performance
reduced
suffer
of from
: exponential
using
which
avoiding
into
recording
Dechter
, 1996 ) .
in
aB similar
probabilis
-
to
-
observing
appeared
bucket the
both
com -
in the
past
et al . , 1997 ) .
shown
bucket the
and
, that
- tree
toward
by
the
standard
- elimination
and
usual and
difficulty exponential
Rish
, 1994 ; Dechter . We
have
portions - width
tree - clustering with in
and
that
which
the
the allows of the
complexity bound
.
algorithms dynamic
worst
constraint
, 1997 ) . Space
shown
, and
- based
induced
time
resolution
algo -
( always
highlights
relevant graph
associated
plague
the
.
clustering
the
accompanied
than
then orderings
framework
join
on
algorithms
elim - meu , for
derived
ex -
effort
if
network some and
pro -
expressing
, algorithms
example
proposed and
of
conscience
elim - map
also
conditioning
framework
explicitly
the
dynamic
way
without
, 1988 ) for
procedures
refined
this
- connected
to
not
were
uniform
, for
( Pearl
of
generalizes
and
. In
- elimination
space
deficiencies ( Dechter
also
network
applies
- assessment
are
performance
elimination
buckets
algorithms
, which
a singly
elegance
which
gramming
given
bucket
bounds
to
have
were
enhancements
The
decides
method
and
viewed
( Bistarelli
of the
same
. Such
likely
, thus
frameworks
a concise
algorithms
common
be
had
in
' s algorithms
network
is
by
elim -
method
and
consecutive
many
can
reasoning
) . The
simplicity
paper
framework
. We
elim - bel
trees
focusing
of
vari and
- mpe . One
a bucket
; EI - Fattah
, unifying
presented
to Pearl
on
conditioned
Conclusion
properties
tree - propagation
features
,
effectively
recorded
conditioning
that
recently
designer
and
reduces
the
, 1996 ) ) . Another a set
algorithms
probabilistic
the
more
elim - cond
to process
, 1996b
this
- elimination
topological
part
rithm
more
have
for
the
in
Rish
by
( Dechter
various
and
, we
algorithms
( Pearl
conditioning
functions
, collects
processes
to
between
of
and
. In addition
between
the
assignments
arity
reaBoning
, 1992 ) and
Using
be implemented
hybrids
, whether
throughout
Summary
ploit
it
algorithms
gramming
algorithms
k
mentioned
features
10 .
loop - cutset
procedure
( Dechter
results
deterministic
( Shenoy
can
the
approach
that
war
elimination
- mpe
basic
processing
- bucket
- bucket
had and
known
partial
on
( see
super
Related
the
bound
intermediate
9 .
the
of possible
refine
upper
conditioning
uses
tic
of shared
is a variety
dynamically or
elim - cond
advantage
imposes
to
, 1990 ) .
, algorithm
if we
ination
reduces
101
pro -
case . Such - satisfaction
complexity
conditioning
can can
be
102
R. DECHTER
implemented
naturally
quirement
ing
and
still
Finally
for
,
be
.
no
attempt
These
was
.
ditional
In
can
made
and
particular
reducing
.
the
space
Combining
the
,
1996
re
-
condition
virtues
paper
exploit
of
be
addressed
Poole
to
optimize
-
forward
as
,
and
)
the
the
be
run
-
bucket
-
structure
recently
can
algorithms
vs
within
presented
1997
the
compilation
exploiting
,
;
this
to
improvements
matrices
Boutilier
elimination
thus
combining
in
nor
should
,
probability
;
a5
,
issues
framework
,
features
viewed
computation
sources
elimination
.
distributed
press
of
topological
can
search
top
exploiting
elimination
backward
time
of
in
(
incorporated
re
the
Santos
con
et
on
-
elimination
top
al
of
.
-
,
in
bucket
-
.
In
summary
eral
,
tasks
,
which
what
we
applicable
the
importantly
buckets
,
Dechter
11
,
.
A
This
as
have
1997
)
exposition
bucket
done
,
-
by
via
approximation
areas
of
associated
-
,
research
with
algorithms
combining
sev
reasoning
several
elimination
either
or
across
deterministic
between
benefit
the
be
uniform
and
ideas
organizational
shown
a
.
the
to
use
be
conditioning
with
algorithms
of
improved
as
elimi
is
-
shown
in
.
Acknowledgment
preliminary
like
version
to
thank
of
Irina
different
grant
IRI
F49620
-
-
of
9157636
96
-
America
-
0224
,
and
paper
and
this
,
1
this
Rish
versions
of
is
probabilistic
of
all
can
we
both
the
allow
.
nation
here
transfer
,
should
uniformly
provide
to
facilitates
More
(
on
while
paper
Air
appeared
Nir
.
Force
This
work
Electrical
,
1996a
.
I
on
by
Research
ACM
-
grant
20775
and
Institute
would
comments
supported
Scientific
Research
)
useful
partially
grants
Power
Dechter
their
was
of
MICRO
(
for
Office
Rockwell
in
Freidman
AFOSR
95
grant
NSF
-
043
,
RP8014
-
Amada
06
.
References
S
.
Arnborg
and
to
S
. A
.
A
partial
k
Arnborg
-
.
.
Proskourowski
trees
.
of
A
.
S
.
.
and
.
D
In
.
Bertele
and
,
Boutilier
Artificial
van
.
Run
.
.
BIT
.
A
.
.
.
Brioschi
N
,
Journal
of
95
UAI
-
96
onserial
A
,
1985
Rossi
Cassis
ssociation
-
11
)
,
pages
France
-
csps
,
for
81
89
.
of
problems
24
-
In
1995
,
restricted
1989
.
graphs
with
bounded
finding
,
and
close
1996
to
optimal
Academic
jnmction
Press
constraint
Computing
Practice
.
.
based
Principles
.
Programming
Semiring
hard
-
on
in
,
algorithm
.
:
.
ordering
,
Dynamic
F
np
23
problems
23
)
fast
(
and
the
-
variable
AI
.
Montanari
: 2
-
sufficiently
in
F
U
25
Dynamic
CP
for
,
combinatorial
,
(
Geiger
algorithms
Mathematics
for
survey
Uncertainty
timization
C
.
a
time
Applied
Programming
Bistarelli
1997
P
Constraints
Becker
trees
U
and
Linear
and
algorithms
-
Bacchus
Discrete
Efficient
decomposability
F
.
.
,
satisfaction
Machinery
(
J
A
C
M
)
,
to
1972
.
and
op
-
appear
,
.
.
Context
Intelligence
-
specific
independence
(
UAI
-
96
)
in
,
pages
115
-
bayesian
123
,
networks
1996
.
.
In
Uncertainty
in
BUCKETELIMINATION
103
C. Cannings, E.A . Thompson, and H .H. Skolnick. Probability functions on complex pedigrees. Advances in Applied Probability , 10:26- 61, 1978. G.F . Cooper. Nestor: A computer-based medical diagnosis aid that integrates causal and probabilistic knowledge. Technical report , Computer Science department , Stanford University , Palo-Alto , California , 1984. M . Davis and H . Putnam . A computing procedure for quantification theory. Journal of the Association of Computing Machinery , 7(3) , 1960. R. Dechter and J. Pearl. Network-based heuristics for constraint satisfaction problems. A rtificial Intelligence , 34:1- 38, 1987. R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelligence , pages
353
R . Dechter In
- 366
and
Principles
1994
.
of
.
Directional
resolution
I ( nowledge
: The
Representation
davis
and
- putnam
procedure
Reasoning
, revisited
( I ( R - 94 ) , pages
.
134 - 145
,
.
R . Dechter
and
I . Rish
Constraint
. To
Practice R . Dechter
and
P . van
of
Constraint
Beek
.
to
think
- 96 ) , 1996
Local
and
. Constraint
for
Artificial .
for
sat
consistency
- 95 ) , pages
constraint
240 - 257
processing
Intelligence
networks
algorithms
relational
( CP
schemes .
? hybrid
.
In
Principles
of
.
global
programming
decomposition
R . Dechter
or
( CP
. Enhancement
cutset
1992
guess
Programming
R . Dechter
, 41 : 273
Encyclopedia
of
. , 1995
In
Principles
: Backjumping
- 312
, 1990
Artificial
and
. , learning
and
.
Intelligence
, pages
276
- 285
,
algo
-
.
R . Dechter
.
rithms
.
R . Dechter
Bucket In
A rtificial
Ijcai
: A
reasoning
and :
R . Dechter
results
on
- 96 ) , pages
D . Geiger
, 20 : 507
F . V . Jensen
, S .L
networks U . Kjreaerulff
by . A
structures
An
J . Pearl
, 1990
of
the
inference
tradeoffs
. In
211
- 219
Uncertainty
, 1996 in
.
Artificial
. generating
Fifteenth
evaluation
of
circuits
, 1996
approximations
in
International
Joint
automated
Conference
on
.
structural In
parameters
Uncertainty
in
for
probabilistic
Artificial
Intelligence
. .
Identifying
independence
in
bayesian
networks
.
Net
-
. , and
computation
computational in
of
probabilistic
- 96 ) , pages
. .
Lauritzen local
Uncertainty
S . L . Lauritzen
, and
- 534
for ( UAI
- space
, 1996
scheme
benchmark
244 - 251
, T . Verma
works
time
220 - 227
general
, 1997
framework
Intelligence
- 97 : Proceedings
Intelligence
Y . El - Fattah
unifying
for
- 96 ) , pages
- buckets
. In
A
Artificial
parameters
( UAI . Mini
reasoning
: in
. Topological
R . Dechter
( UAI
elimination
Uncertainty
Intelligence
In
, 1989
I . Rish
Artificial
K . G . Olesen .
scheme
for
Intelligence
Bayesian
and
D . J . Spiegelhalter
. Local
their
to
expert
updating
Statistics
reasoning ( UAI
and
application
.
Computational
in
dynamic
- 93 ) , pages
computation systems
with .
in
causal
Quarterly
Journal
probabilistic
, 4 , 1990
probabilistic
121 - 149
, 1993
the
.
.
probabilities of
. networks
on Royal
graphical Statistical
Society , Series B , 50( 2) :157- 224, 1988. J . Pearl . Probabilistic Reasoning in Intelligent Systems . Morgan Kaufmann , 1988. Y . Peng and l .A . Reggia . Plausability of diagnostic hypothesis . In National Conference on Artificial Intelligence (AAAI86 ) , pages 140- 145, 1986. Y . Peng and l .A . Reggia . A connectionist model for diagnostic problem solving , 1989. D . Poole . Probabilistic partial evaluation : Exploiting structure in probabilistic inference . In Ijcai - 97 : Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence , 1997. S.K . Anderson R . D . Shachter and P. Solovitz . Global conditioning for probabilistic inference in belief networks . In Uncertainty in Artificial Intelligence ( UAI - 91) , pages 514- 522 , 1991. B . D 'Ambrosio R .D . Shachter and B .A . Del Favro . Symbolic probabilistic inference in belief networks . A utomated Reasoning , pages 126- 131, 1990. E . Santos , S.E . Shimony , and E . Williams . Hybrid algorithms for approximate belief updating in bayes nets . International Journal of Approximate Reasoning , in press .
104
R. DECHTER
E. Santos. On the generation of alternative explanations with implications for belief revision. In Uncertainty in Artificial Intelligence ( UAI -91) , pages 339- 347, 1991. R.D . Shachter. Evaluating influence diagrams. Operations Research, 34, 1986. R.D . Shachter. Probabilistic inference and influence diagrams. Operations Research, 36, 1988. R. D . Shachter. An ordered examination of influence diagrams. Networks, 20:535- 563, 1990. P.P. Shenoy. Valuation -based systems for bayesian decision analysis. Operations Research, 40:463- 484, 1992. S.E. Shimony and E. Chamiack . A new algorithm for finding map assignments to belief networks. In P. Bonissone, M. Henrion , L . Kanal , and J. Lemmer ed., Uncertainty in Artificial Intelligence , volume 6, pages 185- 193, 1991. J.A . Tatman and R.D . Shachter. Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man, and Cybernetics, 1990. N .L . Zhang and D. Poole. Exploiting causal independencein bayesian network inference. Journal of Artificial Intelligence Research(JAIR ) , 1996.
AN INTRODUCTION TO V.t\.RIATION AL METHODS FOR GRAPHICAL MODELS
MICHAEL
I . JORDAN
Massachusetts Institute of Technology Cambridge , MA ZO UBIN
G HAHRAMANI
University of Toronto Toronto
, Ontario
TOMMI
S. JAAKKOLA
University Santa
of California
Cruz , CA
AND LAWRENCE
K . SAUL
AT & T Labs
- Research
Florham
Park
, NJ
Abstract . This paper presents a tutorial introduction to the use of varia tional methods for inference and learning in graphical models . We present a number of examples of graphical models , including the QMR -DT database , the sigmoid belief network , the Boltzmann machine , and several variants of hidden Markov models , in which it is infeasible to run exact inference algorithms . We then introduce variational methods , showing how upper and lower bounds can be found for local probabilities , and discussing methods for extending these bounds to bounds on global probabilities of interest . Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.
105
MICHAELI. JORDAN ET AL.
106 1. The
Introduction problem
of probabilistic
computing the nodes nodes
( the
( the
probability
" hidden
" evidence
set of hidden wish
inference
a conditional
to calculate
" or " unobserved
" or " observed
nodes
and
letting
P ( HIE
General
exact
inference
( Cowell
tematic
We often
volume
of the
as inferred
of the parameters quantity
evaluation
of the likelihood
tor
and
maximize
complexity have recourse tion
tree
of P ( HIE are many
of which
we discuss
in this
calculation
to approximation
construction
clique
architectural
Even
in particular
the exact
resentation
of the joint
model
another
; put
the
of distributions implied
or clusters
of nodes
algorithms
probability
particular
pendencies
to consider
probability that
exact
produce
provide
, there
are other
the time
and
or space
it is necessary
the context
complexity
of the junc -
of the exact
is man -
procedures
associated
with
" nearly " conditionally
consideration the
rep -
a graphical
have the same complexity
is consistent
. Note
numerical with
regard within
conditional
be situations independent
are
cliques .
algorithms
no use of the
in
see , there
lead to large
under
to
is exponential
approximation
may
, al make
algorithms
problems
distribution
by the graph . There are
the
necessarily
make
-
) . Moreover
tree . As we will
distribution
way , the algorithms
less of the family
that
).
.
. Within time
of P ( HIE
) generally
paper , in which
the complexity
can be reason that
, the
as
the numera
generally
quantities
is unacceptable
in the junction
assumptions
in cases in which
ageable , there
in fact
learning
procedures
, for example
the size of the maximal natural
cases in which
mod -
E , P ( E ) is an
compute
of P ( HIE
) as a subroutine and
exact
simply
, they
joint
by Eq . ( 1) , the
to the calculation
( and related
to inference
in the
in graphical
model , for fixed
divide
sys -
, P ( E ) . Viewed
. As is suggested
do not
this
take
edges in the graph .
evidence
of the calculation
likelihood
to perform
algorithms
probabilities
related
solution of the
the
nodes , we
present
of missing
marginal
algorithms
gorithms
cases , several
of other
H represent
developed
of the observed
of Eq . ( 1 ) and
there
the values
( 1)
independencies
is closely
inference
use of the calculation
of
of some of
set of evidence
been
as the likelihood
as a by - product
Although
have
of the graphical
known
denominator
a satisfactory
.
the pattern
the likelihood that
) = % r2
to calculate
important
, although
the
; Jensen , 1996 ) ; these
the probability
a function
Indeed
" nodes ) , given
E represent
conditional
from
also wish
els , in particular
is the problem
" nodes ) . Thus , letting
algorithms
, this
advantage
distribution
models
over the values
):
P ( HIE
calculation
in graphical distribution
in which
inde nodes
, situations
in
AN INTRODUCTION TO VARIATIONAL METHODS
107
which node probabilities are well determined by a subset of the neighbors of the node , or situations in which small subsets of configurations of variables contain most of the probability mass. In such cases the exactitude achieved by an exact algorithm may not be worth the computational cost . A variety of approximation procedures have been developed that attempt to identify and exploit such situations . Examples include the pruning algorithms of Kjrerulff (1994) , the "bounded conditioning " method of Horvitz , Suermondt , and Cooper (1989) , search-based methods (e.g., Henrion , 1991), and the "localized partial evaluation " method of Draper and Hanks (1994) . A virtue of all of these methods is that they are closely tied to the exact meth ods and thus are able to take full advantage of conditional independencies . This virtue can also be a vice , however, given the exponential growth in complexity of the exact algorithms . A
related
approach
cations
of graphical
MacKay
, &
as an iterative
graphical
Another
approach
Carlo
for
design
and
in
in
appli -
( McEliece
Pearl ' s algorithm
inference
, for
in non - singly - connected
( see MacKay
problem
Kjrerulff
can be slow this
ational
of the
to converge chapter
approach
generally
provide
underlying averaging
It
yields bounds
exploit solution
to diagnose
can come
with into
of values
phenomena
their
yet
. an -
. Vari -
procedures basic
that
intuition
can be probabilis
connectivity
play , rendering
can lead
convergence
provide
. The
there
nodes
neighbors
relatively
. Taking
to simple , accurate
are
advan approxi
. to emphasize
outlined
complementary to any given
are
features problem
that
the
by no means
various
approaches
mutually
exclusive
of the graphical may well
involve
model
to inference ; indeed
formalism
an algorithm
.
algorithms
algorithms
graphs
of their
the
, which
dense
their
of convergence
inference
complex
, in graphs
include
are that
of interest
is that
settings
guarantees
approximation
on probabilities
that
& Luby , 1993 ; Fung
algorithms
methods
deterministic
methods
averaging
is important we have
design
; in particular
procedures
it can be hard
involves algorithms
, 1994 ; Jensen , Kong , &
approach
of approximate
to particular
of these
and
theoretical Carlo
Carlo
, and Neal , 1993 ) and applied ( Dagum
of these
variational
phenomena
insensitive
and Monte
algorithms
of Monte
, & Spiegelhalter
the
variational
simple
mation
models
we discuss
to
methodology
tically
, this volume
in graphical
of implementation
disadvantages
In
. A variety
, 1995 ; Pearl , 1988 ) . Advantages
simplicity
other
of approximation
methods
Favero , 1994 ; Gilks , Thomas
that
, Kim
arisen decoding
( Pearl , 1988 ) has been used successfully
method
to the
use of Monte
to the inference
tage
has
to error - control
particular
models
approximate
have been developed
The
inference
inference
.
making
&
approximate
Cheng , 1996 ) . In
singly - connected graphs
to
model
that
they
. The best combines
-
108
MICHAELI. JORDANET AL.
aspects of the different methods . In this vein , we will present variational methods in a way that emphasizes their links to exact methods . Indeed , as we will see, exact methods often appear as subroutines within an overall variational approximation (cf . Jaakkola & Jordan , 1996; Saul & Jordan , 1996) . It should be acknowledged at the outset that there is as much "art " as there is "science" in our current understanding of how variational meth ods can be applied to probabilistic inference . Variational transformations form a large , open-ended class of approximations , and although there is a general mathematical picture of how these transformations can be exploited to yield bounds on probabilities in graphical models , there is not as yet a systematic algebra that allows particular variational transformations to be matched optimally to particular graphical models . We will provide illustrative examples of general families of graphical models to which varia tional methods have been applied successfully, and we will provide a general mathematical framework which encompasses all of these particular examples, but we are not as yet able to provide assurance that the framework will transfer easily to other examples . We begin in Section 2 with a brief overview of exact inference in graph ical models , basing the discussion on the junction tree algorithm . Section 3 presents several examples of graphical models , both to provide motivation for variational methodology and to provide examples that we return to and develop in detail as we proceed through the chapter . The core material on variational approximation is presented in Section 4. Sections 5 and 6 fill in some of the details , focusing on sequential methods and block methods , respectively . In these latter two sections, we also return to the examples and work out variational approximations in each case. Finally , Section 7 presents conclusions and directions for future research.
2. Exact inference In this section we provide a brief overview of exact inference for graphical models , as represented by the junction tree algorithm (for relationships between the junction tree algorithm and other exact inference algorithms , see Shachter , Andersen , and Szolovits , 1994; see also Dechter , this volume , and Shenoy, 1992, for recent developments in exact inference ) . Our intention here is not to provide a complete description of the junction tree algorithm , but rather to introduce the "moralization " and "triangulation " steps of the algorithm . An understanding of these steps, which create data structures that determine the run time of the inference algorithm , will suffice for our
AN INTRODUCTION TO VARIATIONALMETHODS
109
Figure 1. A directed graph is parameterized by associating a local conditional probability with each node. The joint probability is the product of the local probabilities .
purposes.l For a comprehensiveintroduction to the junction tree algorithm see Cowell (this volume) and Jensen (1996). Graphical models come in two basic flavors- directed graphical models and undirected graphical models. A directed graphical model is specified numerically by associating local conditional probabilities with each of the nodes in an acyclic directed graph. These conditional probabilities specify the probability of node Si given the values of its parents, i .e., P (8iI87r(i)), where 1[ (i ) representsthe set of indices of the parents of node 8i and 87r(i) representsthe correspondingset of parent nodes (seeFig. 1).2 To obtain the joint probability distribution for all of the N nodesin the graph, i .e., P (8 ) == P (8l , 82, . . . , SN), we take the product over the local node probabilities :
N P(S) = n P(SiIS7r (i)) i=l Inference involves the calculation of conditional
(2) probabilities
under this
joint distribution . An undirected graphical model (also known as a "Markov random field " ) is specified numerically by associating "potentials " with the cliques of the graph .3 A potential is a function on the set of configurations of a clique lOur presentation will take the point of view that moralization and triangulation , when combined with a local message-passing algorithm , are sufficient for exact inference . It is also possible to show that , under certain conditions , these steps are necessary for exact inference . See Jensen and Jensen ( 1994) . 2Here and elsewhere we identify the ith node with the random variable Si associated with the node . 3We define a clique to be a subset of nodes which are fully connected and maximal ; i .e., no additional node can be added to the subset so that the subset remains fully connected .
110
MICHAELI. JORDANET AL.
S4 (/
- .' ,
I
\
S
51(". -- - '
3
,= 3(C3 ) .~) S6 -
/
"
S2 (
-
-
,
--'.'
.--Ss -)
Figure 2. An undirected graph is parameterized by associating a potential with each clique in the graph. The cliques in this example are C1 = { 81, 82, 83} , C2 = { 83, 84, 85} , and C3 = { 84, 85, 86} . A potential assigns a positive real number to each configuration of the corresponding clique. The joint probability is the normalized product of the clique potentials .
(that is, a setting of values for all of the nodes in the clique ) that Msociates a positive real number with each configuration . Thus , for every subset of nodes Ci that forms a clique , we have an associated potential
(3) where M is the total number of cliques and where the normalization factor Z is obtained by summing the numerator over all configurations :
Z= L II c/Ji(Ci)} . {5} { i=l
(4)
In keeping with statistical mechanical terminology we will refer to this sum as a "partition function ." The junction tree algorithm compiles directed graphical models into undirected graphical models ; subsequent inferential calculation is carried out in the undirected formalism . The step that converts the directed graph into an undirected graph is called "moralization ." (If the initial graph is already undirected , then we simply skip the moralization step). To under stand moralization , we note that in both the directed and the undirected cases, the joint probability distribution is obtained as a product of local
AN INTRODUCTION
TO VARIATIONAL
A
DAD
B
C
METHODS
B
(a)
111
C
(b)
Figure 3. (a) The simplest non-triangulated graph. The graph has a 4-cycle without a chord. (b) Adding a chord between nodes B and D renders the graph triangulated .
functions
. In
the
directed
case , these
functions
are the
node
conditional
probabilities P (Si IS7r (i)). In fact, this probability nearly qualifies as a potential function ; it is certainly a real-valued function on the configurations
of the set of variables { Si, S7r(i)} . The problem is that these variables do not always appear together within a clique . That is, the parents of a common child are not necessarily linked . To be able to utilize node conditional probabilities as potential functions , we "marry " the parents of all of the nodes with undirected edges. Moreover we drop the arrows on the other edges in the graph . The result is a "moral graph ," which can be used to represent the probability distribution on the original directed graph within the
undirected
formalism
.4
The second phase of the junction tree algorithm is somewhat more complex . This phase, known as "triangulation ," takes a moral graph as input and produces as output an undirected graph in which additional edges have (possibly ) been added . This latter graph has a special property that allows recursive calculation of probabilities to take place . In particular , in a triangulated graph , it is possible to build up a joint distribution by proceeding sequentially through the graph , conditioning blocks of interconnected nodes only on predecessor blocks in the sequence. The simplest graph in which this is not possible is the "4-cycle ," the cycle of four nodes shown in Fig . 3(a). If we try to write the joint probability sequentially as, for
example, P (A)P (BIA )P (CIB )P (DIC ), we seethat we have a problem. In particular , A depends on D , and we are unable to write the joint probability as a sequence
of conditionals
.
A graph is not triangulated if there are 4-cycles which do not have a chord, where a chord is an edge between non-neighboring nodes. Thus the 4Note in particular
that Fig . 2 is the moralization
of Fig . 1.
112
MICHAELI. JORDAN ET AL.
graph
in
chord
as
Fig
.
in
sequentially More
in
A
any
,
in cliques
The
the
local
data
In models
models
.
all
graph need
If
a
that
to
junction
node
lie
appears
on
based
cliques in
the
path that
on
achieving
assign
common
,
a
consequence be
the
possible
the
) . ,
local
In
a
same
junction .,
consistency
implies
will by
cliques
,
will
able
.
will
either to
,
of
be
achieve
of
in
per
particular
the
For
for
potential
efficient
is
inference
specific exact
able
lower
,
it
for Kjrerulff
for
display the
in
the
the
size moral
1990
these
" obvious
of
"
cliques
graph
triangulation ,
graphical
inference
to
bound
cliques
e .g . ,
;
investigate
algorithms see
tree to
complexity
represent
clique
junction as
.
we
the
,
cliques
costs
specific
algorithms
time
the to
the
the so
The
of
required in
considering
consider
.
size
paper
we be
on potentials
cliques the
computational
cases
we
performed clique
.
( for
in Thus
a we
discussion
) .
Examples
In
this
section
inference
is
system
in
remaining is
to
these
or
triangulation
.
,
is as
:
cliques
have
are
small
this
the
of
triangulated not
is
the
nodes
obtain
of
,
will
) . it
important
property
values
of
to
triangulation
of
, D
a
probability
known
can
they
on
of
consider
In
all the
( That
that
number
remainder and
. that
neighboring
number
the
critical
the
in has
nodes
( CIB
structure
inference
depends
the in
therefore
) P
adding
joint
property
appears
rescaling
between
discrete
it
by the
triangulated
data
intersection
and
calculation
exponential
, DIA
been a
triangulated write
intersection
calculations
consistency this
( B
has
cliques
running
marginalizing
forming
) P
property
the
be can
.
probabilistic
involve
,
This
can we
probabilistic
to
of
( A
into
tree .
for
consistency
P
running
between
because
=
graph
the
it
graph
graph
the
probability
,
)
a the
has
two
;
latter
, D
once
algorithm
bal
the , C
of
consistency
tree
triangulated
In
, B
cliques the
marginal
3
( A
tree
two
local
glo
not
) .
cliques
general
is
P
junction
between a
is
3 ( b
generally the
.
.
as
arrange tree
3 ( a )
Fig
fit
we
generally which
data
and
examples
infeasible a
examples to
present
fixed involve
subsequently
of
.
graphical
Our
graphical
first model
estimation used
example is
problems for
models
prediction
in
which
involves
used
to in
which or
a
answer
diagnostic
queries a
diagnosis
exact
graphical
.
The
model .
3.1. THE QMR-DT DATABASE The QMR-DT database is a large-scale probabilistic database that is intended to be used as a diagnostic aid in the domain of internal medicine.5
5The acronym "refers tothe "Decision Theoretic "version ofthe "Quick Medical Reference ."QMR " -DT
113
AN INTRODUCTION TO VARIATIONALMETHODS
diseases
symptoms Figure
. 4
.
The
evidence
structure
nodes
We
provide
see
The
per
-
of
(
Fig
.
4
,
use
of
, 6
the
.
a
.
The
the
a
of
"
bipartite
i
f
d
)
=
probabilities
diseases
OR
of
archival
data
,
"
model
P
{
fild
.
.
)
That
,
The
and
obtained
,
the
;
represent
di
the
for
further
which
the
up
-
represent
disease
nodes
and
are
binary
P
(
[
fid
Il
P
by
diseases
(
d
)
P
expert
nodes
.
bipartite
are
Making
form
nodes
findings
{
dj
]
~
,
of
we
the
obtain
:
)
,
P
(
were
probabilities
probability
with
All
variables
the
to
)
fild
,
refer
findings
.
symptom
P
(
of
diseases
random
and
)
we
vector
of
unobserved
from
conditional
represent
nodes
henceforth
the
vector
diseases
conditional
were
is
the
;
in
of
600
the
=
prior
and
over
,
model
layer
symptoms
over
(
nodes
here
lower
implied
probability
shaded
database
approximately
"
marginalizing
joint
DT
the
denotes
f
The
.
findings
d
.
graphical
and
observed
components
P
from
-
independencies
following
The
QMR
are
symbol
conditional
and
the
database
set
model
. "
.
There
the
as
thus
the
graph
)
in
symptoms
f
)
graphical
findings
diseases
is
symbol
DT
"
is
nodes
observed
1991
database
evidence
binary
(
represent
see
-
~
of
ale
DT
symptom
The
the
et
nodes
symptoms
QMR
to
overview
,
QMR
the
referred
brief
Shwe
layer
4000
are
a
details
of
and
dj
)
.
obtained
of
by
the
Shwe
findings
assessments
the
ith
a
symptom
5
)
(
6
)
et
given
under
that
,
(
ale
the
"
noisy
-
is
6In particular , the pattern of missing edgesin the graph implies that (a) the diseases are marginally independent, and (b) given the dise~ es, the symptoms are conditionally independent .
114
MICHAEL
I . JORDAN
ETAL.
absent, P (fi = Old), is expressedas follows:
P(fi = Old ) = (1- qiO ) II (1 - qij)dj
(7)
jE1r (i )
where the qij are parameters
obtained
from the expert
Msessments . Con -
sidering casesin which only one diseMe is present, that is, { dj = I } and { dk = 0; k ~ j } , we see that qij can be interpreted M the probability that the ith finding is present if only the jth disease is present . Considering the case in which
all diseases
are absent , we see that
the qiO parameter
can be
interpreted as the probability that the ith finding is present even though no disease
is present
.
We will find it useful to rewrite the noisy -OR model in an exponential
form :
P (fi = Old) = e- EjE1r (i) Bijdj- Bio
(8)
where (}ij := - In ( 1 - qij ) are the transformed parameters . Note also that the probability of a positive finding is given as follows :
P (fi = lid ) = I - e- EjE7r(i) Oijdj- lJio
(9)
These forms express the noisy - OR model as a generalized
If we now form the joint probability
distribution
linear model .
by taking products of
the local probabilities P (fild ) as in Eq. (6), we seethat negative findings are benign with respect to the inference problem . In particular ~ . a nroduct -
of exponential factors that are linear in the diseases(cf. Eq. (8)) yields a joint probability that is also the exponential of an expression linear in the diseases. That is, each negative finding can be incorporated into the joint probability in a linear number of operations . Products of the probabilities of positive findings , on the other hand , yield cross products terms that are problematic for exact inference . These cross product terms couple the diseases (they are responsible for the "explaining away" phenomena that arise for the noisy -OR model ; see Pearl , 1988) . Unfortunately , these coupling terms can lead to an exponential growth in inferential complexity . Considering a set of standard diagnos-
tic cases (the "CPC cases" ; see Shwe, et al. 1991), Jaakkola and Jordan (1997c) found that the median size of the maximal clique of the moralized QMR -DT graph is 151.5 nodes. Thus even without considering the trian gulation step , we see that diagnostic calculation under the QMR - DT model is generally
infeasible .7
7Jaakkola and Jordan (1997c) also calculated the median of the pairwise cutset size. This
value was found
for the QMR - DT .
to be 106 .5 , which
also rules out exact
cutset
methods
for inference
115
ANINTRODUCTION TOVARIATIONAL METHODS Input
Hidden Output Figure
5
output
nodes
3
. 2
.
.
The
NEURAL
Neural
graphical the
networks
are
each
logistic
j
( z
model
interpreting
the
,
)
=
Fig
/
(
.
5
)
.
e
-
of
one
the
of
us
one
Z
)
.
,
We
The
input
nodes
nonlinear
consider
those
and
treat
such
a
Si
.
from
neural
example
as
node
the
,
the
network
each
that
For
"
functions
with
probability
values
activation
obtained
variable
the
"
activation
as
can
as
two
a
such
binary
node
its
with
Let
and
+
.
MODELS
a
activation
write
1
network
endowed
associating
takes
we
1
neural .
GRAPHICAL
zero
by
variable
function
( see
between
a
nodes
graphs
node
function
graphical
of
evidence
layered
bounded
binary
of
AS
are
at
that
structure
set
NETWORKS
function
a
layered comprise
and
associated
using
the
logistic
:
1
P
where
j
( ) ij
and
is
"
i
,
=
a
is
neural
on
,
way
,
=
1
+
"
parameter
~
e
"
bias
"
to
requires
that
6
'
(
parent
with
(
1992
include
,
and
.
problem
)
.
The
i
.
This
advantages
to
treat
perform
unsupervised
Realizing
be
nodes
node
ability
to
learning
inference
)
the
data
10
10
between
Neal
manner
supervised
the
- J
edges
missing
as
- ' S 1. J
associated
by
this
handle
footing
( i )
with
introduced
in
,
same
6
L . . . jE7r
these
solved
in
bene
an
-
efficient
.
In
ered
fact
,
it
neural
parents
is
all
of
these
ticular
,
nodes
,
cally
during
are
of
of
are
inference
a
do
N
in
of
exact
hidden
least
in
the
0
( 2N
the
a
)
general
( see
is
nodes
hidden
,
the
-
are
.
in
par
additional
)
.
-
-
layers
layer
6
evidence
probabilisti
hidden
particular
Fig
clear
become
preceding
-
as
neural
layer
output
ignoring
lay
has
moralized
this
layer
in
in
the
in
general
generally
Thus
penultimate
units
at
.
nodes
network
in
network
inference
the
infeasible
neural
the
ancestors
is
is
a
layer
all
neural
their
in
preceding
for
units
inference
node
the
necessary
as
there
exact
A
between
hidden
,
if
in
links
training
the
that
.
nodes
links
thus
see
models
has
dependent
Thus
to
the
graph
That
easy
network
network
complexity
the
network
the
however
)
network
calculations
learning
( i )
associated
( ) iO
belief
treating
1IS7r
parameters
and
sigmoid
diagnostic
fits
the
node
the
of
are
( Si
.
,
the
time
growth
116
MICHAELI. JORDAN ET AL.
Hidden
OD J
j
. /
..;". /
..
Output Figure 6. Moralization of a neural network . The output nodes are evidence nodes during training . This creates probabilistic dependencies between the hidden nodes which are captured by the edges added by the moralization .
Figure 7. A Boltzmann machine . An edge between nodes Si and Sj is associated with a factor exp ( (Jij Si Sj ) that contributes multiplicatively to the potential of one of the cliques containing the edge. Each node also contributes a factor exp ((}iOSi) to one and only one potential .
in clique
size due to triangulation
or even hundreds of hidden neural network using exact
. Given
that
neural
networks
with
units are commonplace , we see that inference is not generally feasible .
dozens
training
a
3.3. BOLTZMANNMACHINES A Boltzmann nodes
and
ular , the
machine
is an undirected
a restricted clique
potentials
graphical
set of potential
functions
are formed
by taking
model
with
binary
( see Fig . 7 ) . products
In
- valued partic
-
of " Boltzmann
factors " - exponentials of terms that are at most quadratic in the Si ( Hin ton & Sejnowski , 1986 ) . Thus each clique potential is a product of factors exp { 8ijSiSj
} and factors
exp { (JiOSi } , where
Si E { O, 1} .8
8It is also possible to consider more general Boltzmann machines with multivalued nodes , and potentials that are exponentials of arbitrary functions on the cliques . Such models are essentially equivalent to the general undirected graphical model of Eq . (3)
ANINTRODUCTION TOVARIATIONAL METHODS
117
A given pair of nodes 8i and 8j can appear in multiple , overlapping cliques. For each such pair we ~ sume that the expressionexp{ OijSiSj} appears as a factor in one and only one clique potential . Similarly , the factors
exp{(}io8i } are ~ sumed to appear in one and only one clique potential . Taking the product over all such clique potentials (cf. Eq. (3)), we have:
P (S) =
e~ i<j BijSiSj+ ~ i BiOSi Z '
( 11)
where we have set (}ij = 0 for nodes Si and Sj that are not neighbors in the graph - this convention allows us to sum indiscriminately over all pairs
Si and Sj and still respect the clique boundaries. We refer to the negative of the exponent in Eq . (11) as the energy. With this definition the joint probability in Eq . (11) has the general form of a Boltzmann distribution . Saul and Jordan (1994) pointed out that exact inference for certain special cases of Boltzmann machine - such as trees, chains , and pairs of coupled chains- is tractable and they proposed a decimation algorithm for this purpose . For more general Boltzmann machines , however, decimation is not immune to the exponential time complexity that plagues other exact methods . Indeed , despite the fact that the Boltzmann machine is a special class of undirected graphical model , it is a special class only by virtue of its parameterization , not by virtue of its conditional independence struc ture . Thus , exact algorithms such as decimation and the junction tree algorithm , which are based solely on the graphical structure of the Boltzmann machine , are no more efficient
for Boltzmann
machines
than they are for
general graphical models . In particular , when we triangulate generic Boltz mann machines , including the layered Boltzmann machines and grid -like Boltzmann machines , we obtain intractably large cliques . Sampling algorithms have traditionally
been used to attempt to cope
with the intractability of the Boltzmann machine (Hinton & Sejnowski, 1986). The sampling algorithms are overly slow, however, and more recent work has considered the faster "mean field" approximation (Peterson & Anderson , 1987) . We will describe the mean field approximation for Boltz mann machines later in the paper - it is a special form of the variational approximation approach that provides lower bounds on marginal probabili ties . We will also discuss a more general variational algorithm that provides upper and lower bounds on probabilities (marginals and conditionals ) for
Boltzmann machines (Jaakkola & Jordan, 1997a). (although the latter can represent zero probabilities while the former cannot) .
118
MICHAEL I. JORDAN ET AL.
XT
Xj
7t ,
~ I
B
-
-
~ (,'
B
.
.
.
B
~
('-) B . YT
Figure 8. A HMM represented as a graphical model . The left -to -right spatial dimension represents time . The output nodes Yi are evidence nodes during the training process and the state
3 .4 .
nodes
HIDDEN
Xi
are hidden .
MARKOV
MODELS
In this section , we briefly review hidden Markov models . The hidden Markov model (HMM ) is an example of a graphical model in which exact inference is tractable ; our purpose in discussing HMMs here is to lay the groundwork for the discussion of intractable variations on HMMs in the following sections . See Smyth , Heckerman , and Jordan (1997) for a fuller discussion of the HMM as a graphical model . An HMM is a graphical model in the form of a chain (see Fig . 8) . Consider
a sequence of multinomial
conditional probability
" state " nodes Xi and assume that the
of node Xi , given its immediate predecessor Xi - I ,
is independent of all other precedingvariables. (The index i can be thought of as a time index). The chain is assumedto be homogeneous ; that is, the matrix of transition probabilities , A = P (Xi !Xi - I ), is invariant acrosstime . We also require a probability
distribution
7r = P (XI ) for the initial state
Xl . The HMM
model also involves a set of "output " nodes
i and an emis -
sion probability law B = P ( i !Xi ), again assumedtime-invariant. An HMM is trained by treating the output nodes as evidence nodes and the state nodes as hidden nodes. An expectation -maximization (EM ) algorithm (Baum , et al ., 1970; Dempster , Laird , & Rubin , 1977) is generally used to update the parameters A , B , 7r; this algorithm involves a simple iter -
ative procedure having two alternating steps: (1) run an inferencealgorithm to calculate the conditional probabilities P (Xil { i } ) and P (Xi , Xi - ll { i } ); (2) update the parameters via weighted maximum likelihood where the weights are given by the conditional probabilities calculated in step (1) . It
is easy to see that
exact
II
alization and triangulation
inference
is tractable
for HMMs
. The
mor -
steps are vacuous for the HMM ; thus the time
ANINTRODUCTION TOVARIATIONAL METHODS ~J J)
~v<J)
119
~3J) . . .
. . .
. . .
YJ
Y2
Yj
Figure 9. A factorial HMM with threechains. The transition matricesare A (1), A (2), and A (3) associatedwith the horizontal edges , and the output probabilitiesare determined by matricesB (1), B (2), and B (3) associatedwith the vertical edges .
complexity can be read off from Fig. 8 directly . We see that the maximal clique is of size N2 , where N is the dimensionality of a state node. Inference therefore scalesas O(N2T ), where T is the length of the time series.
3.5. FACTORIALHIDDENMARKOVMODELS In many problem domains it is natural to make additional structural assumptions about the state space and the transition probabilities that are not available within the simple HMM framework. A number of structured variations on HMMs have been considered in recent years (see Smyth, et al., 1997); generically these variations can be viewed as "dynamic belief networks" (Dean & Kanazawa, 1989; Kanazawa, Koller , & Russell, 1995). Here we consider a particular simple variation on the HMM theme known as the "factorial hidden Markov model" (Ghahramani & Jordan, 1997; Williams & Hinton , 1991). The graphical model for a factorial HMM (FHMM ) is shown in Fig. 9. The system is composedof a set of M chains indexed by m . Let the state node for the mth chain at time i be representedby Xi(m) and let the transition matrix for the mth chain be representedby A (m). We can view the effective state space for the FHMM as the Cartesian product of the state spacesassociatedwith the individual chains. The overall transition probability for the system by taking the product acrossthe intra -chain transition
120
MICHAELI. JORDANET AL. . . .
. . .
YJ
~
~
Figure 10 . A triangulation of an FHMM with two component chains . The morali step links states at a single time step . The triangulation step links states diago between neighboring time steps . probabilities : M P(XiIXi -l)=m n=lA(m )(X(m )IXi i (~{), (12 ) where the symbol Xi stands for the M -tuple (Xi (l ), Xi (2), . . . , Xi (M)) . Ghahramani and Jordan utilized a linear -Gaussian distribution for the emission probabilities of the FHMM . In particular , they assumed: P (~ IXi ) = N (L B (m)Xi (m), E ), m
(13)
where the B (m) and ~ are matrices of parameters . The FHMM is a natural model for systems in which the hidden state is realized via the joint configuration of an uncoupled set of dynamical systems . Moreover , an FHMM is able to represent a large effective state space with a much smaller number of parameters than a single unstructured Cartesian product HMM . For example , if we have 5 chains and in each chain the nodes have 10 states , the effective state space is of size 100,000, while the transition probabilities are represented compactly with only 500 parameters . A single unstructured HMM would require 1010parameters for the transition matrix in this case. The fact that the output is a function of the states of all of the chains implies that the states become stochastically coupled when the outputs are observed . Let us investigate the implications of this fact for the time complexity of exact inference in the FHMM . Fig . 10 shows a triangulation for the case of two chains (in fact this is an optimal triangulation ) . The cliques for the hidden states are of size N3 ; thus the time complexity of
ANINTRODUCTION TOVARIATIONAL METHODS
121
. . .
Figure 11. A triangulation of the state nodes of a three -chain FHMM with three com ponent chains . (The observation nodes have been omitted in the interest of simplicity ) .
Figure 12.
This graph is not a triangulation
.
.
.
.
.
.
.
.
.
of a three -chain FHMM .
exact inference is 0 (N3T ) , where N is the number of states in each chain
(we assumethat each chain has the same number of states for simplicity ). Fig . 11 shows the case of a triangulation
of three chains ; here the triangula -
tion (again optimal ) creates cliques of size N4 . (Note in particular that the graph in Fig . 12, with cliques of size three , is not a triangulation ; there are 4-cycles without a chord ) . In the general case, it is not difficult to see that cliques of size NM + l are created , where M is the number
of chains ; thus
the complexity of exact inference for the FHMM scales as O (NM + IT ) . For a single unstructured Cartesian product HMM having the same number of
states as the FHMM - i .e., NM states- the complexity scalesas O(N2MT ), thus exact inference
for the FHMM
is somewhat
less costly , but the expo -
111111111111
nential growth in complexity in either case shows that exact inference is infeasible for general FHMMs .
122
MICHAELI. JORDANET AL.
Uj
U2
U3 .
Yj
~
.
.
Y3
Figure 13. A hidden Markov decision tree . The shaded nodes { Ui } and { i } represent a time series in which each element is an (input , output ) pair . Linking the inputs and outputs are a sequence of decision nodes which correspond to branches in a decision tree . These decisions are linked horizontally to represent Markovian temporal dependence .
3 .6 .
HIGHER
- ORDER
HIDDEN
MARKOV
MODELS
A related variation on HMMs considers a higher -order Markov model in which each state depends on the previous K states instead of the single previous state . In this case it is again readily shown that the time complex ity is exponential in K . We will not discuss the higher - order HMM further in this chapter ; for a variational algorithm for the higher - order HMM see Saul and Jordan (1996) .
3.7. HIDDENMARKOVDECISIONTREES Finally , we consider a model in which a decision tree is endowed with Marko vian dynamics (Jordan , et al ., 1997) . A decision tree can be viewed as a graphical model by modeling the decisions in the tree as multinomial ran dom variables , one for each level of the decision tree . Referring to Fig . 13, and focusing on a particular time slice, the shaded node at the top of the diagram represents the input vector . The unshaded nodes below the input nodes are the decision nodes. Each of the decision nodes are conditioned on the input and on the entire sequence of preceding decisions (the vertical arrows in the diagram ) . In terms of a traditional decision tree diagram , this dependence provides an indication of the path followed by the data point as it drops through the decision tree . The node at the bottom of the diagram is the output variable . If we now make the decisions in the decision tree conditional not only
AN INTRODUCTION
TO VARIATIONAL
METHODS
123
on the current data point , but also on the decisions at the previous moment in time , we obtain a hidden Markov decision tree (HMDT ) . In Fig . 13, the horizontal edges represent this Markovian temporal dependence. Note in particular that the dependency is assumed to he level-specifIc- the proba bility of a decision depends only on the previous decision at the same level of the decision
tree .
Given a sequence of input vectors Ui and a corresponding sequence of output vectors i , the inference problem is to compute the conditional probability distribution over the hidden states . This problem is intractable for general HMDTs - as can be seen by noting that the HMDT includes the FHMM as a special case. 4 . Basics of variational
methodology
Variational methods are used as approximation methods in a wide variety of settings , include finite element analysis (Bathe , 1996), quantum mechanics (Sakurai , 1985) , statistical mechanics (Parisi , 1988) , and statistics (Rustagi , 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem , where the simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem . This decoupling is achieved via an expansion of the problem to include additional parameters , known as variational parameters , that must be fit to the problem
at hand .
The terminology comes from the roots of the techniques in the calculus of variations . We will not start systematically from the calculus of varia tions ; instead , we will jump off from an intermediate point that emphasizes the important role of convexity in variational approximation . This point of
view turns out to be particularly well suited to the development of varia tional methods for graphical models . 4 .1 .
EXAMPLES
Let us begin by considering a simple example . In particular , let us express the logarithm function variationally : In (x ) = min { "\x - In ''\ - I } . A
(14)
In this expression A is the variational parameter , and we are required to perform the minimization for each value of x . The expression is readily verified by taking the derivative with respect to A, solving and substituting . The situation is perhaps best appreciated geometrically , as we show in Fig . 14. Note that the expression in braces in Eq . (14) is linear in x with slope A. Clearly , given the concavity of the logarithm , for each line having
124
MICHAELI. JORDAN ET AL. 5
. .
.'
3 .
,,
.
"
,
I
.
.
.
.
,
,
.
,
.
,
.
.
.
.
.
.
.
.
.
'
'
'
'
'
'
,..'
,
.
.
"
,...,..,.. ..,.,.,.-' -~'.~." .,,,-' ,
.'
.
,
.
.
.
.-
,
.
.
.
-
-
.
'
. -'
-
.
~
-
_
.
-
-
-
~
. , ' ','.'~ . , ' : : '. '. ', '~'. ~, ' . . " , . ' - . ~ - - - . '
-1
-3
-5 0
0 .5
1
1 .5
2
2 .5
3
X
Figure 14.
Variational
transformation
of the logarithm
function . The linear functions
(Ax - In A - I ) form a family of upper bounds for the logarithm , each of which is exact for a particular
value of x .
slope .,\ there is a value of the intercept such that the line touches the logarithm at a single point . Indeed , - In A - I in Eq . (14) is precisely this intercept . Moreover , if we range across "\ , the family of such lines forms an upper envelope of the logarithm function . That is, for any given x , we have:
In(x) ~ AX - In A - I ,
(15)
for all A. Thus the variational transformation provides a family of upper bounds on the logarithm . The minimum over these bounds is the exact value of the logarithm . The pragmatic converted
justification
a nonlinear
have obtained
function
a free parameter
for such a transformation into
a linear
A that
function
. The
is that cost
we have
is that
we
must be set , once fOl' each x . For
any value of A we obtain an upper bound on the logarithm ; if we set A well we can obtain a good bound . Indeed we can recover the exact value of logarithm for the optimal choice of A. Let us now consider a second example that is more directly relevant to graphical models . For binary -valued nodes it is common to represent the probability that the node takes one of its values via a monotonic nonlinear ity that is a simple function - e .g., a linear function - of the values of the parents of the node . An example is the logistic regression model :
1 f (x) = 1+ e-X'
(16)
AN
which
we
the
have
values
seen
of
The
the
TO
previously
in
parents
logistic
bound
of
function
will
the
INTRODUCTION
not
a
is
work
.
VARIATIONAL
Eq
node
( 10
) .
Here
x
is
the
weighted
sum
of
.
neither
However
.
125
METHODS
convex
,
the
=
-
nor
logistic
concave
,
function
is
so
log
a
simple
linear
concave
.
That
is
,
function
g
is
a
concave
second
function
derivative
of ) .
functions
and
thereby
particular
,
can
we
( x
Thus
)
x
( as
we
can
( 1
can
+
e -
X )
readily
bound
bound
write
In
the
( 17
be the
verified
log
logistic
by
logistic
calculating
the
function
function
by
with
the
)
linear
exponential
.
In
:
g
( x
)
=
min
{ Ax
-
H
( A ) }
,
( 18
)
A
where
H
( We
( A )
will
is
the
binary
explain
suffices
to
logistic
think
function
the
entropy
how
minimum
the
of
it
) .
We
now
the
exponential
and
is
a
plotted
variational
in
obtain
Fig
an
as
.
15
.
)
for
of
i ty
in
a
graphical
with
( x
)
=
( 1
a
exponentials the
e -
on
CONVEX
Can
we
find variational
~
This
is
H
( A ) ]
-
In
.
,
Such
a
.
is
function
joint
Eq
For
.
to
x
it
the
log
noting
that
examples
any
of
of x
A
we
:
( 20
)
are the
a
a
.
If
variational as
obtained
by
,
-
the
of
form
the
local repre
functions
simple
including
is
the
transformations
more that
have
instead
we
parameters in
Eq
.
( 20
taking
-
) -
we
see
products
particularly
so
systematically been
utilized
-
form
of
given
that
.
transformations
)
probabil
over probabilities
of
in
in
joint
product
conditional
not
significant
DUALITY
variational
)
are
value
values
obtain
take
computationally in
;
for all
variationally
probability
tractable
linear
function
products
by
A ) .
( 20
,
obtain
product
,
-
now
for
sides
that
to
( 2 ) ) .
( l
.
particular
we
term
In
for
( A ) .
required
Eq
A ) ;
( 19
for
in
are
-
:
again
H
( I
below
.
function
eAx
A -
both
logistic
bounds
. we
logistic
the
-
once
representation
are
4 .2 .
the
X ) .
of
transformation
( cf
each
.
exponents
)
regression
network
bound
( x
models
representing
that
of
+
our
note
In
intercept
the
logistic
A
commute
[ eAX
better the
model
logistic
1 /
augment i .e .
of
we
the
provide
probabilities
sented
f
A
graphical
conditional
min
-
arises
exponential
for
, of
advantages
context
the
=
==
appropriate
function
( x
Finally
bound
( A )
function
transformation
upper
choices The
the
take
f
Good
, H
entropy
simply
f
This
function
binary
? in
Indeed the
, literature
many
126
MICHAELI. JORDAN ET AL. 1. 8
,
,
,
,
,
,
,
,
.
,
,
,
,
,
,
,
'
,
, .
,
,
,
.
,
,
,
,
.
,
1 4
,
'
,
'
,
,
"
.
,
,
,
.
,
,
,
,
,
,
,
.
' ,
.
,
,
,
"
,
,
,
.
.
-
" ,
,
,
,
'
"
.
,
,
'
,
"
.
.
'
' )
"
'
. . ..
r
,
. ,. ,," ' -, 1"
. '
,
. - - _. - ' _
" " "
,
,
"
,
,
, ,
..
,' ,
,
,
"
,
, ,
0 6
"
" , ,
"
,, , ~~.::,;~.:~.$"~;'=----'-------_._,
,
,
,
, , '
'
"
.
,
,
, ,
,
.
,
,
.
1 0
,
,
,
"
s -
..' - , ' .
' )
----------~~~~~~~~~~~-I'.-'.-I,I~~~~" 0 .2
----' ----- -------------
- 0 ..2 -3
-2
-1
0
1
2
3
x
Figure 15.
Variational transformation of the logistic function .
on graphical models are examplesof the general principle of convexduality. It is a general fact of convex analysis (Rockafellar, 1972) that a concave function f (x ) can be representedvia a conjugateor dual function as follows:
j (x) = min{ A ATx - j *(A)} ,
(21)
where we now allow x and A to be vectors . The conjugate function j * (A) can be obtained from the following dual expression :
j *(A) = min{ ATx - j (x )} . x
(22)
This relationship is easily understood geometrically , as shown in Fig . 16. Here we plot f (x ) and the linear function AX for a particular value of A. The short vertical segments represent values AX- f (x ) . It is clear from the figure that we need to shift the linear function AX vertically by an amount which is the minimum of the values AX - f (x ) in order to obtain an upper bounding line with slope A that touches f (x ) at a single point . This observation both justifies the form of the conjugate function , as a minimum over differences Ax - f (x ) , and explains why the conjugate function appears as the intercept
in Eq. (21). I t is an easy exercise to verify that the conj ugate function for the logarithm is f * (A) = In A + 1, and the conjugate function for the log logistic function is the binary entropy H (A) . Although we have focused on upper bounds in this section , the frame work of convex duality applies equally well to lower bounds ; in particular
AN INTRODUCTION TO VARIATIONALMETHODS
127
f (x)
x Figure
16
tions
-
.
The
conjugate
represented
function
as
dashed
f
lines
-
*
(
A
)
is
between
obtained
by
AX
and
f
(
x
minimizing
)
across
the
devia
-
.
for convex f (x ) we have : j
(
x
)
=
maX
{
ATx
-
j
*
(
A
)
}
(23)
,
. , \
where f
*
(
A
)
=
max
{
AT
x
-
f
(
x
)
(24)
}
x
is
the
conjugate
We
not
function
have
restricted
on
to
transforming
linear
the
of
the
in
x2
function
we
.
focused
write
bounds
bounds
.
argument
(
can
linear
Jaakkola
of
&
the
Jordan
in
More
this
section
general
of
1997a
but
bounds
function
,
,
)
.
interest
For
convex
can
duality
be
rather
example
obtained
by
than
,
if
f
is
the
(
x
)
is
value
concave
:
f (x) = min{AX2 - l *(A)} , .,\
(
25
)
where J*('\) is the conjugatefunction of J(x) == f (x2). Thus the transformation yields a quadratic bound on f (x). It is also worth noting that suchtransformationscanbe combinedwith the logarithmictransformation utilized earlier to obtain Gaussianrepresentations for the upper bounds. This can be useful in obtaining variational approximationsfor posterior distributions (Jaakkola& Jordan, 1997b). To summarize , the generalmethodologysuggestedby convexduality is the following. We wish to obtain upper or lower boundson a function of interest. If the function is already convexor concavethen we simply
128
MICHAELI. JORDANET AL.
calculate
the
then
we
convex
the
a
then
also
,
such
the
function
is
as
For
or
function
approach
in
to
logarithm
,
whose
be
concave
the
of
conjugate
this
the
convex
renders
transformations
the
.
not
that
consider
calculate
back
transform
If
transformation
may
transform
properties
the
argument
the
transformed
useful
we
inverse
has
,
function
need
to
useful
to
algebraic
.
. 3
.
APPROXIMATIONS
FOR
CONDITIONAL
discussion
thus
ability
far
distributions
has
at
proximations
interest
,
in
interest
AND
our
Let
in
us
lower
on
P
(
and
(
Si
SiIS7r
and
I
S7r
i
)
(
At
are
the
(
'
(
the
.
That
)
,
providing
,
assume
product
of
)
(
S
(
HIE
ap
-
of
)
that
is
P
(
that
E
our
)
we
conditional
forms
that
,
(
Si
I S7r
(
respectively
n
P
(
upper
bound
SiIS7r
(
i
a
)
,
-
Af
)
and
where
Af
appropriate
the
upper
i
,
parameterizations
first
an
have
probabil
pU
bounds
Consider
is
=
Suppose
local
have
variational
.
)
the
lower
bounds
P
of
we
and
bounds
upper
-
these
probability
.
each
that
upper
lower
do
prob
probabilities
P
marginal
local
How
global
concreteness
for
different
and
the
for
bound
is
.
the
?
graphs
upper
)
model
the
distribution
and
directed
generally
upper
that
i
graphical
conditional
problems
an
At
a
for
for
problem
learning
focus
bound
ities
the
inference
interest
of
approximations
approximations
for
the
on
nodes
into
particular
in
for
PROBABILITIES
focused
the
translate
pL
JOINT
PROBABILITIES
The
is
We
We
.
invertible
.
.
and
find
an
concave
function
space
function
for
or
of
4
conjugate
look
,
bounds
we
.
have
Given
:
)
'"
: : : ;
n
pU
(
SiIS7r
(
i
)
'
AY
(26)
)
' "
for
Eq
any
.
fixed
ing
For
settings
(
26
,
sums
example
)
of
must
thus
values
hold
upper
of
for
any
bounds
over
the
,
on
and
H
(
E
)
parameters
S
whenever
on
be
a
the
disjoint
Af
some
probabilities
form
E
P
variational
of
marginal
variational
letting
the
subset
other
can
right
-
partition
{LH }P(H,E)
hand
be
~ L n pU(SiIS7r (i)' Af), {H} i
Moreover
,
is
obtained
side
of
.
subset
of
S
,
by
the
we
equation
have
held
tak
-
.
:
(27)
where , as we will see in the examples to discussed below , we choose the vari ational forms pU (Si IS7r(i ), Af ) so that the summation over H can be carried
AN INTRODUCTION
TO VARIATIONAL
METHODS
129
out efficiently (this is the key step in developing a variational method). In either Eq. (26) or Eq. (27), given that these upper bounds hold for any
settingsof valuesthe variationalparametersAf , they hold in particular for optimizing settings of the parameters . That is, we can treat the right -hand
side of Eq. (26) or the right -hand side Eq. (27) as a function to be minimized
with respectto Af . In the latter case,this optimizationprocesswill induce interdependencies betweenthe parametersAf . Theseinterdependencies are desirable ; indeed they are critical for obtaining a good variational bound on the marginal probability of interest . In particular , the best global bounds are obtained when the probabilistic dependencies in the distribution are reflected in dependencies in the approximation . To clarify the nature of variational bounds , note that there is an im portant distinction to be made between joint probabilities (Eq . (26) ) and
marginal probabilities (Eq. (27)). In Eq. (26), if we allow the variational parameters to be set optimally
for each value of the argument S , then it
is possible (in principle ) to find optimizing settings of the variational parameters that recover the exact value of the joint probability . (Here we assume that the local probabilities P (SiIS -rr(i)) can be represented exactly via a variational transformation , as in the examples discussed in Section
4.1). In Eq. (27), on the other hand, we are not generally able to recover exact values of the marginal by optimizing over variational parameters that depend only on the argument E . Consider , for example , the case of a node Si E E that has parents in H . As we range across { H } there will be summands on the right -hand side of Eq . (27) that will involve evaluating the
local probability P (SiIS-rr(i)) for different values of the parents S-rr(i). If the
variationalparameterAf dependsonly on E, we cannotin generalexpect to obtain an exact representation for P (SiIS-rr(i)) in each summand. Thus, some of the summands in Eq . (27) are necessarily bounds and not exact values
.
This observation provides a bit of insight into reasons why a variational bound might be expected to be tight in some circumstances and loose in others . In particular , if P (SiISrr(i)) is nearly constant as we range across Srr(i )' or if we are operating at a point where the variational representation
is fairly insensitiveto the setting of Af (for examplethe right-hand sideof the logarithm in Fig . 14) , then the bounds may be expected to be tight . On the other hand , if these conditions are not present one might expect that the bound would be loose. However the situation is complicated by the
interdependencies betweenthe Af that areinducedduringthe optimization process . We will
return
to these
issues in the discussion
.
Although we have discussed upper bounds , similar comments apply to lower bounds , and to marginal probabilities obtained from lower bounds on the joint
distribution
.
130
MICHAELI. JORDAN ET AL.
The two
conditional
per
and
and
distributions lower
lower
bounds
, then
fewer
est
and
Finally
, it
simply
as
strict
a
ability
to
conditional bound
on
substitute (H
tion
)
and
to
P
( HIE
In
the
been
as
important
forward
to
ease
and
on
case
a
utility
.
have
simplify
more
readily
, the
in
development
require
9Note necessarily
way
for
a
-
to
inter
this
is
prob
obtained , we
into
might
the
obtain
a
parameters
. We
variational
form
an
-
provide do
marginal
calculate
we
practical
for
approxima
-
choices
effective
a
in
of
node
properties
.
Also
,
Eq
. ( 27 are
variational
)
are
simple
depend functions
which
the
transfor
certain
approximations
-
architectures others
functions not
,
model
probability
variational
than
currently
-
. The
section
conditional
transformation
issues
this
in
well
; in some
par cases
understood can
in
some
and cases
.
P ( H , E ) in E
general jointly
a. s a marginal exhaust
the
probability set
of
nodes
-
.
architecture
in
to
con
straight
probability
others
of
architectures
new
node
-
All
examples
necessarily
regime
than
variational in
for
of
the
related
not
parameter
algebraic
and
to is
.
provide
degree
outlined
choice
and
certain
it
frame
examples
interest a
that
methods
variational - out
generalized ,
readily
. These
H
be
general
worked
. To
certain more
creativity
that
of
the
bounds
treat
the of
particular ,
others of
assume
the
under
substantial
that
of
that
One
example
approximation
the
useful
marginal
complex
the
be
methods
) .
to
illustrate
however
, including
particular
that
and
also
variational
number
can ,
applying
themselves
mations
ticular
that
and
In
lend
a
variational
of
topology
operated
denomi
involves
parameterized
methodology
details
functions
the
numerator
thus
form
architectures
emphasize
develop
graph
on
upper
Generally
, the
bound
fitting
the
will in
histories to
the
we
variational
architectural
the
a
, for
by
variational
applied
involve of
serve is
this
sections
has
examples
also It
it
into
utilize
.
-
) .
examples
crete
parameters
then
have
involves
as
used
is
. Thus )
of up
numerator
parameters
(E
must
can
than are
variational
P
ratio
obtain
the
methods
that
likelihood
the
To
.
rather
distribution
the
following
as
these
the
these
, E
work
is
substitute
) .9
bounds
= = HUE
m ~ thods
approximation
probability
lower
S
variational
sampling
variational
, and
can
as
lower
evaluation
, is
(E
denominator
, because
which
that
hand
) j P
the
and
approximations
( much
obtain
upper
in
, E
, we
and
finished
case
other
(H
distribution
obtain
noting
tractable
P
numerator
function
the
=
conditional
essentially
a
worth
) , on )
the
the
simply
is
bounds
to
, in
is
( HIE ( HIE
can
is
. Indeed
sums
we
labor
P ; i .e . , P the
both
, if
our
sums
no
on
, however
nator
on
bounds
speaking
P
distribution
marginal
; that S .
is , we
do
not
-
AL METHODS ANINTRODUCTION TOVARIATION 4.4. SEQUENTIAL Let
us now
AND BLOCK
consider
in
131
METHODS
somewhat
more
detail
how
variational
methods
can be applied to probabilistic inference problems . The basic idea is that suggested above - we wish to simplify the joint probability distribution by transforming the local probability functions . By an appropriate choice of variational transformation , we can simplify the form of the joint probability distribution and thereby simplify the inference problem . We can transform some or all of the nodes. The cost of performing such transformations is that we obtain bounds or approximations to the probabilities rather than exact
results
.
The option of transforming only some of the nodes is important implies a role for the exact methods as subroutines within a variational
; it ap -
proximation . In particular , partial transformations of the graph may leave
someof the original graphical structure intact and/ or introduce new graphical structure to which exact methods can be fruitfully applied . In general , we wish to use variational approximations in a limited way, transforming the graph into a simplified graph to which exact methods can be applied . This will in general yield tighter bounds than an algorithm that transforms the entire graph without regard for computationally tractable substructure . The majority of variational algorithms proposed in the literature to date can be divided into two main classes: sequential and block. In the sequential approach , nodes are transformed in an order that is determined during the inference process. This approach has the advantage of flexibility and generality , allowing the particular pattern of evidence to determine the best choices of nodes to transform . In some cases, however, particularly when there are obvious substructures in a graph which are amenable to exact methods , it can be advantageous to designate in advance the nodes to be transformed . We will see that this block approach is particularly natural
in the setting
5 . The sequential
of parameter
estimation
.
approach
The sequential approach introduces variational transformations for the nodes in a particular order . The goal is to transform the network until the result ing transformed
network
is amenable
to exact methods . As we will see in
the examples below , certain variational transformations can be understood graphically as a sparsification in which edges are removed from the graph . A series of edge removals eventually renders the graph sufficiently sparse that an exact method becomes applicable . Alternatively , we can variation ally transform all of the nodes of the graph and then reinstate the exact node probabilities sequentially while making sure that the resulting graph stays computationally tractable . The first example in the following section
132
MICHAELI. JORDAN ET AL.
illustrates
the latter
approach Many example run
approach
time
of the exact
methods
provide
, one can run
a greedy
triangulation
of the junction
is sufficiently
small , in terms
the exact Ideally
made
the
the resulting be as small
of the
is , an ordering
would
as possible
order
clique
) . Thus
have been . In
the
Jaakkola
and
Jordan
is a bipartite
findings
are based
QMR - DT graph
i .e ., symptoms of the joint
impact
on inference
present
no difficulties
associated
Repeating
. Moreover
nodes
and
would
be
so that
( in particu graph
, particularly
-
would
given
that
1 -
to
approach
-
of a positive
there
for the findings
therefore
they
, the negative form that
been made
have
findings
probabilities updates
and focus findings
representation
:
concave ; thus , as in the
to express
the
on
.
= lid ) = I - e - EjE1r (i ) 6ijdj - 6iO
e - x is log
no
of the prob the
are positive
, we have the following
are not
be marginal
on the disease
have already
varia -
nodes that
assume
for
QMR - DT
probabilities
the exponential
findings
when
finding
can be used
can simply
and
of
QMR - DT
of sequential
symptom
us therefore
findings inference
context
the
( Eq . ( 8 ) for the negative that
observed
given
Eq . ( 9 ) for convenience
, we are able
we return
the conditional
of negative
.
in the
, as we have discussed -
orderings
. As we have seen , the
) . Note
time . Let
the negative
node
presented
by omission
for inference
of performing
probability
function
function
inference
be chosen
at each step
an application
network
are not
distribution
P ( fi The
the
run time
to the
triangulated
variational
findings that
in linear
with
the problem for the
section
in which
in Eq . ( 8 ) , the effects be handled
bound
transformations
would
problem
best
on the noisy - OR model
and Eq . ( 9 ) for the positive
can
allotted
as possible
( 1997c ) present
to the
network
ability
time . For
estimated
to transform
used to choose
following
QMR - DT NETWORK
out
time
run
to upper
variational
of the resulting
is perhaps
5 .1. THE
-
their
. If this
of the nodes
and show how a sequential in this network .
methods
bound
algorithm
in which
is a difficult
approach
example
network inference
ized
the former
is unlikely to produce the simplest graph at each step ; partial orders must be considered . In the literature to date
sequential
a specific
that
overall
be as simple
the maximal
procedures
The
findings
illustrates
.
, that
a single ordering that is , different heuristic
of the
choice
graph
lar , such that
tional
example
algorithm
can stop introducing
procedure
optimally
tests
tree inference
proced ure , the system run
and the second
.
variational
upper
( 28 ) case of the bound
logistic
in terms
of
134
MICHAELI. JORDANET AL.
The sequential methodology utilized by Jaakkola and Jordan for infer ence in the QMR -DT network actually proceeds in the opposite direction . They first transform all of the nodes in the graph . They then make use of a simple heuristic to choose the ordering of nodes to reinstate , basing the choice on the effect of reinstating each node individually starting from the completely transformed state . (Despite the suboptimality of this heuristic , they found that it yielded an approximation that was orders of magnitude more accurate than that of an algorithm that used a random ordering ). The algorithm then proceeds as follows : (1) Pick a node to reinstate , and consider the effect of reintroducing the links associated with the node into the current graph . (2) If the resulting graph is still amenable to exact methods , reinstate the node and iterate . Otherwise stop and run an exact method . Finally , (3) we must also choose the parameters Ai so as to make the approximation as tight as possible . It is not difficult to verify that products of the expression in Eq . (32) yield an overall bound that is a convex function of the Ai parameters (Jaakkola & Jordan , 1997c) . Thus standard optimization algori thms can be used to find good choices for the Ai. Jaakkola and Jordan (1997c) presented results for approximate inference on the " CPC cases" that were mentioned earlier . These are difficult cases which have up to 100 positive findings . Their study was restricted to upper bounds because it was found that the simple lower bounds that they tried were not sufficiently tight . They used the upper bounds to determine vari ational parameters that were subsequently used to form an approximation to the conditional posterior probability . They found that the variational approach yielded reasonably accurate approximations to the conditional posterior probabilities for the CPC cases, and did so within less than a minute of computer time .
5.2. THEBOLTZMANN MACHINE Let us now consider a rather different example . As we have discussed, the Boltzmann machine is a special subset of the class of undirected graph ical models in which the potential functions are composed of products of quadratic and linear "Boltzmann factors ." Jaakkola and Jordan (1997a) in troduced a sequential variational algorithm for approximate inference in the Boltzmann machine . Their method , which we discuss in this section , yields both upper and lower bounds on marginal and conditional probabilities of interest . Recall the form of the joint probability machine :
distribution
for the Boltzmann
P(S)=eLi <j8ij Si Sj +Li8iOSi Z .
(33)
AN INTRODUCTION TO VARIATIONALMETHODS
135
To obtain marginal probabilities such as P (E ) under this joint distribu tion , we must calculate sums over exponentials of quadratic energy func tions . Moreover , to obtain conditional probabilities such as P (HIE ) := P (H , E )/ P (E ), we take ratios of such sums, where the numerator requires fewer sums than the denominator . The most general such sum is the par tition function itself , which is a sum over all configurations { S } . Let us therefore focus on upper and lower bounds for the partition function as the general case; this allows us to calculate bounds on any other marginals or conditionals of interest . Our approach is to perform the sums one sum at a time , introducing variational transformations to ensure that the resulting expression stays computationally tractable . In fact , at every step of the process that we describe , the transformed potentials involve no more than quadratic Boltz mann factors . (Exact methods can be viewed as creating increasingly higher order terms when the marginalizing sums are performed ) . Thus the trans formed Boltzmann machine remains a Boltzmann machine . Let us first consider lower bounds . We write the partition function as follows :
~ eEj
InSiE L{O ,l} -
-
>
L .(}jkSjSk + j:l:i L (}joSj+ In SiE L{O,l}eLj~i (}ijSiSj +(}iOSi {J.
(}jkSjSk+ L
j#i
(35)
(JjoSj + Af L (}ijSj+ (JiO+ H{Af), (36) j#i
where the sum in the first term on the right -hand side is a sum over all pairs j < k such that neither j nor k is equal to i , where H (.) is as before the binary entropy function , and where Af is the variational parameter associated with node Si . In the first line we have simply pulled outside of the sum all of those terms not involving Si , and in the second line we
136
MICHAEL10JORDANET ALo
~ I;~ =::=:~:I
sl. (a)
(b)
Figure 18. The transformation of the Boltzmann machine under the approximate marginalization over node Si for the case of lower bounds . ( a) The Boltzmann machine before the transformation . (b ) The Boltzmann machine after the transformation , where Si has become delinked . All of the pairwise parameters , ()jk , for j and k not equal to i , have remained unaltered . As suggested by the wavy lines , the linear coefficients have changed for those nodes that were neighbors of Si .
have perfoI 'med the sum over the two values of Si . Finally , to lower bound the expression in Eq . (35) we need only lower bound the term In (1 + eX) on the right -hand side. But we have already -found variational bounds for a related expression in treating the logistic function ; recall Eq . (18) . The upper bound in that case translates into the lower bound in the current case:
In(l + e- X) ~ AX+ H(A).
(37)
This is the bound that we have utilized in Eq . (36) . Let us consider the graphical consequences of the bound in Eq . (36) (see Fig . 18) . Note that for all nodes in the graph other than node Si and its neighbors , the Boltzmann factors are unaltered (see the first two terms in the bound ) . Thus the graph is unaltered for such nodes. From the term in parentheses we see that the neighbors of node Si have been endowed with new linear terms ; importantly , however, these nodes have not become linked (as they would have become if we had done the exact marginalization ) . Neighbors that were linked previously remain linked with the same (}jk parameter . Node Si is absent from the transformed partition function and thus absent from the graph , but it has left its trace via the new linear Boltzmann factors associated with its neighbors . We can summarize the effects of the transformation by noting that the transformed graph is a new Boltzmann machine with one fewer node and the following parameters :
AN INTRODUCTION TOVARIATIONAL METHODS -
j ,k
()jk ()jO+ Af()ij
j
#
137
. 'l
=1= i .
Note finally that we also have a constant term Af (JiO+ H (Af ) to keep track of. This term will have an interesting interpretation when we return to the Boltzmann machine later in the context of block methods. Upper bounds are obtained in a similar way. We again break the partition function into a sum over a particular node Si and a sum over the configurations of the remaining nodes S\ Si. Moreover, the first three lines of the ensuing derivation leading to Eq. (35) are identical. To complete the derivation we now find an upper bound on In( 1 + eX). Jaakkola and Jordan (1997a) proposed using quadratic bounds for this purpose. In particular , they noted that : In(l + eX) = In(ex/ 2 + e- x/2) + x / 2
(38)
and that In(eX/2 + e- x/2) is a concavefunction of x2 (as can be verified by taking the secondderivative with respect to x2). This implies that In(l + eX) must have a quadratic upper bound of the following form: In(l + eX) S Ax2 + x/ 2 - g* (A).
(39)
where g* (A) is an appropriately defined conjugate function . Using these upper bounds in Eq. (35) we obtain:
InSiE L{O
(}jkSjSk + L OjoSj
+
j :f=i
Afj# LifJijSj +fJio
+_ 21 jL# i (JijSj+ (JiG - g*(Af), (40)
-
where Af is the variational parameter associated with node Si . The graphical consequences of this transformation are somewhat dif ferent than those of the lower bounds (see Fig . 19) . Considering the first two terms in the bound , we see that it is still the case that the graph is unaltered for all nodes in the graph other than node Si and its neighbors , and moreover neighbors of Si that were previously linked remain linked . The quadratic term , however, gives rise to new links between the previ ously unlinked neighbors of node Si and alters the parameters between previously linked neighbors . Each of these nodes also acquires a new linear term . Expanding Eq . (40) and collecting terms , we see that the approxi mate marginalization has yielded a Boltzmann machine with the following par ameters :
138
MICHAELI. JORDANET AL.
41 ;~~:=~~ 1
.
(a) Figure
19
.
The
transformation
marginalization
over
chine
before
where
the
Si
have
edges
,
The
with
a
+
2AfOjiOik
OjO
=
=
OjO
+
Oij
is
,
vantage
Jaakkola
does
given
as
by
reveal
6 . The block
the
/
AfO
2
,
a
ma
the
transformation
the
neighbors
-
,
of
Si
values
linear
algorithm
)
.
coefficients
that
= 1 =
i
j
= 1 =
i
g
upper
In
*
(
As
.
All
.
when
at
are
nodes
this
point
un
an
,
of
This
seeming
is
a
tighter
a
-
exact
transformation
.
bound
the
nodes
neighbors
readily
that
delinks
;
-
given
transformations
bound
as
.
transforma
,
links
the
)
bound
these
revealed
upper
Af
particular
upper
the
k
simply
is
between
structure
fact
-
combine
tree
The
,
additional
the
.
links
)
approximate
Boltzmann
parameter
new
Aforo
and
to
subroutine
1997a
of
[ j
+
lower
no
as
the
The
new
have
.
,
such
tractable
by
Jordan
Oio
natural
introducing
mitigated
&
a
the
)
all
have
also
+
by
particular
(
called
not
is
Si
2AfOiOOij
of
In
,
linked
of
+
more
.
,
formerly
a
after
suggest
consequences
structure
hand
edges
introduces
methods
tractable
other
node
2
is
somewhat
(
.
consequences
is
.
j
/
term
under
bounds
machine
dashed
computational
it
algorithm
the
(
Ojk
upper
Boltzmann
unaltered
=
exact
til
The
were
transformation
,
)
machine
of
neighbors
=
bound
Boltzmann
case
the
the
Ojk
have
delinked
b
As
are
constant
also
lower
,
parameters
the
(
that
lines
graphical
tions
those
wavy
the
the
.
and
the
and
Finally
for
delinked
linked
by
other
Si
.
become
become
of
node
transformation
has
suggested
(b)
on
delinked
disad
-
bound
.
approach
An alternative approach to variational inference is to designate in advance a set of nodes that are to be transformed . We can in principle view this "block approach " as an off-line application of the sequential approach . In the case of lower bounds , however, there are advantages to be gained by
AN INTRODUCTION
TO VARIATIONAL
METHODS
139
developing a methodology that is specific to block transformation . In this section , we show that a natural global measure of approximation accuracy can
be obtained
for
lower
bounds
via
a block
version
of the
variational
formalism . The method meshes readily with exact methods in cases in which tractable substructure can be identified in the graph . This approach was first presented by Saul and Jordan (1996) , as a refined version of mean field theory
for Markov
random
fields , and has been developed
further
in a
number of recent studies (e.g., Ghahramani & Jordan, 1997; Ghahramani & Hinton , 1996; Jordan, et al., 1997). In the block approach , we begin by identifying a substructure in the graph of interest that we know is amenable to exact inference methods (or , more generally , to efficient approximate inference methods ) . For example , we might pick out a tree or a set of chains in the original graph . We wish to use this simplified structure to approximate the probability distribution on the original graph . To do so, we consider a family of probability distri butions that are obtained from the simplified graph via the introduction of variational
parameters . We choose a particular
approximating
distribution
from the simplifying family by making a particular choice for the variational parameters . As in the sequential approach a new choice of variational parameters
must
be made
each time
new evidence
is available
More formally , let P (S) represent the joint distribution
.
on the graphical
model of interest , where as before S represents all of the nodes of the graph and Hand E are disjoint subsets of S representing the hidden nodes and
the evidence nodes, respectively . We wish to approximate the conditional probability P (HIE ). We introduce an approximating family of conditional probability distributions , Q (HIE , A) , where A are variational parameters . The graph representing Q is not generally the same as the graph repre senting P ; generally it is a sub-graph . From the family of approximat ing distributions
Q , we choose a particular
distribution
by minimizing
the
Kullback -Leibler (KL ) divergence, D (QIIP), with respect to the variational parameters :
A* == argminA D (Q(HIEix ) II P (HIE )), where for any probability is defined
as follows
distributions
(41)
Q (S) and P (S) the KL divergence
:
Q (S)
D(QIIP) = L Q(S) InP( S) " { S} The minimizing
values of the variational
(42)
parameters , A* , define a partic -
ular distribution , Q(HIE , A*), that we treat as the best approximation of P (HIE ) in the family Q (HIE , A).
140
MICHAELI. JORDANET AL.
One
simple
justification
approximation
for
accuracy
ability
of
the
mations
Q
evidence
( HIE
inequality
is
as
P
, ,\ ) .
( E
Indeed
follows
using
that
it
) ,
( i .e . , we
the
the
( E
)
==
In
=
the
L
In
~
a
between
seen
KL
to
be
L
lower
obtain
We
can
also
,
as ,
the
the
numbers
log ,
. ,
for
vector
the
"
in
Eq
.
( HIE
) .
KL
indeed Eq
convex .
( 23
( E
of )
using
prob
-
approxi
-
Jensen
' s
in
the
values
) .
~~..:~
2
( HIE
)
]
sides
Thus
the
,
right
.
( 43
of by
this
the side
according
)
equation
is
positivity
- hand A
of
of to
by
,
( H
for
simplicity
( HIE
) .
can
Eq Eq
. .
the
( 43
( 41
the
)
) ,
viewed
} : =
P
( H
f In
eln
P
( x
P
( E
( H
,E
case
)
H in as
H
.
to
Theat be
to
the
variables
be
define
appeal
viewed
" >" "
also
for
the
be
of
,
In
In
can
configurations
Finally
=
,
, >" ) of
)
an with
parameter
, E
of
making
approach
configuration
)
~ .:~
[ ~
block
Q
set
( E
In
divergence the
expression
P
:~
is we
DIVERGENCE
variational
( 23
.
hand
choosing
Consider
P
following
)
( QIIP
{ H
in
family
ofP
of
the
)
)
by
KL
each
the
( HIE
) ,
THE
of
In
In
is
measure on
.
- valued
on
Q
1991
distribution
one
, E
right
linking
probability
" x that
,
choice
1997
( H
D ,
bound
The
defined
variable
verified
the
logarithm
P
and
Moreover
thereby ,
H
numbers
vector
) .
the
theory
left
AND
( Jaakkola
real
the
( E lower
nodes
over
P
justify
approach valued
a
bound
in
Q
Q
Thomas
DUALITY
duality
)
}
divergence &
tightest
CONVEX
vex
of
on
the
6 .1 .
KL
( Cover
bound
~
the
the
divergence
as
lower
}
{ H
easily
best
likelihood
bound
{ H
difference
divergence
the
:
InP
The
KL
yields
InP
Eq a
.
of
discrete
as
a
.
-
( 23
) .
this
-
vector
Treat
vector
( E
con
sequential
this More of
real
vector ) .
It
as can
be
) :
)
( 44
)
}
, E
) .
Moreover
,
by
direct
substitution
) :
f * (Q) = mill
(HIE ,;\)InP(H,E)-lnP (E) {2:::H}Q
(45)
AN INTRODUCTION TO VARIATIONAL METHODS
141
and minimizing with respect to In P (H , E ), the conjugate function f * (Q) is seento be the negative entropy function ~ {H} Q(HIE ) In Q(HIE ). Thus, using Eq. (23), we can lower bound the log likelihood as follows: In P (E ) ~ :2:::: Q(HIE ) In P (H , E ) - Q(HIE ) In Q(HIE ) {H}
(46)
This is identical to Eq. (43). Moreover, we seethat we could in principle recover the exact log likelihood if Q were allowed to range over all probability distributions Q(HIE ). By ranging over a parameterized family Q(HIE , A), we obtain the tightest lower bound that is available within the family. 6.2. PARAMETER ESTIMATION VIA VARIATIONAL METHODS Neal and Hinton (this volume) have pointed out that the lower bound in Eq. (46) has a useful role to play in the context of maximum likelihood parameter estimation. In particular , they make a link between this lower bound and parameter estimation via the EM algorithm . Let us augment our notation to include parameters0 in the specification of the joint probability distribution P (SIO). As before, we designatea subset of the nodes E as the observedevidence. The marginal probability P (EIO), thought of as a function of (), is known as the likelihood. The EM algorithm is a method for maximum likelihood parameter estimation that hillclimbs in the log likelihood . It does so by making use of the convexity relationship between In P (H , EIO) and In P (EIO) described in the previous section. In Section 6 we showedthat the function
(Q,8) = L Q(HIE) InP(H, EIO) - Q(HIE) InQ(HIE) {H}
(47)
is a lower bound on the log likelihood for any probability distribution Q(HIE ). Moreover, we showed that the difference between InP (EI (}) and the bound (Q, O) is the KL divergence between Q(HIE ) and P (HIE ). Supposenow that we allow Q(HIE ) to range over all possible probability distributions on H and minimize the KL divergence. It is a standard result (cf. Cover & Thomas, 1991) that the KL divergenceis minimized by choosing Q(HIE ) == P (HIE , 0), and that the minimal value is zero. This is verified by substituting P (HIE , ()) into the right -hand side of Eq. (47) and recovering In P (E I0) . This suggeststhe following algorithm . Starting from an initial parameter vector 0(0), we iterate the following two steps, known as the "E (expectation) step" and the "M (maximization) step." First , we maximize the bound (Q, O) with respect to probability distributions Q. Second, we fix
142
MICHAELI. JORDAN ET AL.
Q and maximize the bound I:,(Q, 0) with respect to the parameters O. More formally , we have: (E step) :
Q(k+ l )
=
argmaxQ
(Q , (}(k))
(48)
(M step) : fJ(k+l ) =
argmax(}
(Q(k+l ), fJ)
(49)
which is coordinate ascent in (Q , ()) . This can be related to the traditional presentation of the EM algorithm (Dempster , Laird , & Rubin , 1977) by noting that for fixed Q , the right -
hand side of Eq. (47) is a function of fJonly through the In P (H , ElfJ) term . Thus ma:ximizing (Q , ()) with respect to () in the M step is equivalent to ma:ximizing the following function :
L P(HIE,(}(k)) InP(H, ElfJ).
(50)
{H }
Maximization
of this function , known as the "complete log likelihood " in
the EM literature Let
us now
, defines the M step in the traditional return
to
the
situation
in
which
presentation
we are
unable
of EM . to
com -
pute the full conditional distribution P (HIE , fJ). In such casesvariational methodology suggests that we consider a family of approximating distribu tions . Although we are no longer able to perform a true EM iteration given
that we cannot avail ourselvesof P (HIE , fJ), we can still perform coordinate ascent in the lower bound imizing
the KL divergence
(Q , fJ) . Indeed , the variational strategy of min with respect to the variational
parameters
that
define the approximating family is exactly a restricted form of coordinate ascent in the first argument of (Q, (J) . We then follow this step by an "M step " that increases the lower bound with respect to the parameters
(J.
This point of view , which can be viewed as a computationally tractable approximation to the EM algorithm , has been exploited in a number of recent architectures , including the sigmoid belief network , factorial hidden Markov
model
and
cuss in the following
hidden
Markov
decision
tree
architectures
sections , as well as the " Helmholtz
that
we
dis -
machine " of Dayan ,
et ale (1995) and Hinton , et ale (1995). 6 .3 .
EXAMPLES
We now return to the problem of picking a tractable variational parame terization for a given graphical model . We wish to pick a simplified graph which is both rich enough to provide distributions that are close to the true distribution , and simple enough so that an exact algorithm can be uti lized efficiently for calculations under the approximate distribution . Similar consideration ~ hold for the variational parameterization : the variational parameterization must be representationally rich so that good approximations
AN
are
available
KL
divergence
stuck
and
that
some
6 .3 . 1 .
Mean
. In
. It
field
Section and
bounds
machine
bounds
Recall written
also
the
as follows
Consider in
lJij8iSj Hand that Si
0 for
now
the
Sj we
for
the
E E
also
context
to
the
two
become
sums
8i
and
node
sum The
mann
, we
are
" mean machines
<j OijSiSj
+ Li
(JiOSi
Sj
reasonably
examples
ac -
.
that now
yielded
revisit and
the
discuss
machine
can
be
and
not
E E
. In
range
contributions
Zc
over
.lize . If
norma
a linear associated
summary
, we
, lJ )
Si
is given
on
E
contribution with
can
nodes
express
the
( 52 )
in
with
H the
and
the
evidence
updated nodes
} : = 8ijSjo jEE
machine
:
( 53 )
as follows
the
( Peterson form
we
terms
nodes
< j (Jij Si Sj + Ei
" approximation
P ( HIE contribution
.
:
associated
8io +
} : = { eEi {H }
graph
LIt . O~ t.OS t. < j Oij SiSj + """"' Zc '
eLi
function
the
E E , the
becomes
vanish
to
8j
when
, linear
in
distribution
and
vanishes
( 51 )
neighbors
conditional
8i
, 0 ) as follows
, (} ) =
is a particular
are
contribution
a Boltzmann field
'
that
Si . Finally
restricted
partition
have
found
. Boltzmann
, which
P ( HIE
Zc =
In
been
approach
the
nodes
constants
io include
updated
has
. We
block
for
of the
. For
8io =
The
the
Z
quadratic
with
distribution
the
it
algorithm
approaches
e
a constant
E E , the
parameters
of these
yield
machine
of
probability
representation
P ( HIE
where
getting
all
such
variational
Boltzmann
the
relate
nodes
associate
conditional
not
machine
joint
machine
reduces
and
to realize
of cases can
the
:
=
a Boltzmann
minimizes
parameters
several
143
that
possible
we discuss
P ( SlfJ ) = Oij
good
, in a number
Li
where
a procedure
necessarily
a sequential
within
. We
that
METHODS
approximations
section
discussed
lower
Boltzmann
of finding
is not
Boltzmann
5 .2 we
so that
variational this
VARIATIONAL
enough
; however
simple
solutions
TO
hope
minimum
relatively
lower
simple
simultaneously
curate
upper
yet
has
in a local
desiderata
In
INTRODUCTION
of variational
:
(JiOSi } 0
subset
( 54 )
H .
& Anderson
, 1987 ) for
approximation
Boltz
in which
a
144
MICHAELI. JORDAN ET AL.
0
f =:~~'
.
(b)
(a)
Figure 20. (a) A node Si in a Boltzmann machine with its Markov blanket . (b) The approximating mean field distribution Q is based on a graph with no edges. The mean field equations yield a deterministic relationship , represented in the figure with the dotted lines, between the variational parameters Jl,i and J.l.j for nodes j in the Markov blanket of node i .
completely factorized distribution is used to approximate P (HIE , fJ). That is, we consider the simplest possible approximating distribution ; one that is obtained by dropping all of the edges in the Boltzmann graph (see
Fig. 20). For this choice of Q(HIE , J..L), (where we now use J..L to represent the variational parameters ) , we have little choice as to the variational parameterization - to represent as large an approximating family as possible we endow each degree of freedom Si with its own variational parameter J..Li. Thus Q can be written as follows :
Q(HIE,J..L) = II J..Lfi(1- J..Li)l- Si,
(55)
iEH
wherethe product is takenoverthe hiddennodesH . Formingthe KL divergencebetweenthe fully factorizedQ distribution and the P distribution in Eq. (52), we obtain: D (QIIP) = L [JLiIn JLi+ (1 - JLi) In(l - JLi)] t -
L ijJ .LiJ -jL.". OioJLi+ In Zc, .J.(J.L t<
(56)
where the sums range across nodes in H . In deriving this result we have used the fact that , under the Q distribution , Si and Sj are independent random variables with mean values J1 ,i and J1 ,j . We now take derivatives of the KL divergencewith respectto Jli- noting that Zc is independent of J1 ,i- and set the derivative to zero to obtain the following equations:
JLi =a LJ 8ijJLj +(JiO,
(57)
145
AN INTRODUCTION TO VARIATIONAL METHODS
where a (z) = 1/ (1 + e- Z) is the logistic function and we define (}ij equal to (Jji for j < i . Eq . (57) defines a set of coupled equations known as the "mean field equations ." These equations are solved iteratively for a fixed point solution . Note that each variational parameter J1 ,i updates its value based on a sum across the variational parameters in its Markov blanket (cf . Fig . 20b) . This can be viewed as a variational form of a local message passing algorithm . The mean field approximation for Boltzmann machines can provide a reasonably good approximation to conditional distributions in dense Boltz mann machines , and is the basis of a useful approach to combinatorial opti mization known as "deterministic annealing ." There are also cases, however, in which it is known to break down . These cases include sparse Boltzmann machines and Boltzmann machines with "frustrated " interactions ; these are networks
whose potential functions embody constraints between neighboring nodes that cannot be simultaneously satisfied (see also Galland, 1993). In the case of sparse networks, exact algorithms can provide help; indeed, this observation led to the use of exact algorithms as subroutines within the "structured mean field" approach pursued by Saul and Jordan (1996).

Let us now consider how to make use of the mean field approximation for parameter estimation in Boltzmann machines. Writing out the lower bound of Eq. (47) for this case, we have:

ln P(E|θ) ≥ Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - ln Z - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ].

(58)

Taking the derivative with respect to θ_{ij} yields a gradient which has a simple "Hebbian" term μ_i μ_j as well as a contribution from the derivative of ln Z with respect to θ_{ij}. It is not hard to show that this latter derivative is ⟨S_i S_j⟩, where the brackets signify an average with respect to the (unconditional) Boltzmann distribution P(S|θ). Thus we have the following gradient algorithm for performing an approximate M step:
Δθ_{ij} ∝ ( μ_i μ_j - ⟨S_i S_j⟩ ).
(59)
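To make these two steps concrete, here is a minimal sketch (not taken from the text) of the mean field fixed-point iteration of Eq. (57) together with the approximate M step of Eq. (59). The function names, the learning rate, and the use of NumPy are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(theta, theta0, mu, n_sweeps=50):
    """Iterate mu_i = sigma(sum_j theta_ij mu_j + theta_i0), Eq. (57), to a fixed
    point.  `theta` is a symmetric coupling matrix with zero diagonal; `theta0`
    holds the linear terms (with any evidence already absorbed)."""
    mu = np.array(mu, dtype=float)
    for _ in range(n_sweeps):
        for i in range(len(mu)):
            mu[i] = sigmoid(theta[i] @ mu + theta0[i])
    return mu

def approximate_m_step(theta, mu_clamped, s_corr, lr=0.1):
    """One gradient step of Eq. (59): delta theta_ij proportional to
    (mu_i mu_j - <S_i S_j>).  `s_corr` stands in for the unconditional
    correlations <S_i S_j>, which are themselves intractable and would in
    practice be approximated (e.g. by a second, unconditional mean field
    pass), as the text goes on to discuss."""
    grad = np.outer(mu_clamped, mu_clamped) - s_corr
    np.fill_diagonal(grad, 0.0)
    return theta + lr * grad
```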
Unfortunately, however, given our assumption that calculations under the Boltzmann distribution are intractable for the graph under consideration, it is intractable to compute the unconditional average. We can once again appeal to mean field theory and compute an approximation to ⟨S_i S_j⟩, where we now use a factorized distribution on all of the nodes; however, the M step is now a difference of gradients of two different bounds and is therefore no longer guaranteed to increase the likelihood bound. There is a more serious problem, moreover, which is particularly salient in unsupervised learning problems. If the
data set of interest is a heterogeneous collection of sub-populations, such as in unsupervised classification problems, the unconditional distribution will generally be required to have multiple modes. Unfortunately the factorized mean field approximation is unimodal and is a poor approximation for a multi-modal distribution. One approach to this problem is to utilize multi-modal Q distributions within the mean-field framework; for example, Jaakkola and Jordan (this volume) discuss the use of mixture models as approximating distributions. These issues find a more satisfactory treatment in the context of directed graphs, as we see in the following section. In particular, the gradient for a directed graph (cf. Eq. (68)) does not require averages under the unconditional distribution. Finally, let us consider the relationship between the mean field approximation and the lower bounds that we obtained via a sequential algorithm in Section 5.2. In fact, if we run the latter algorithm until all nodes are eliminated from the graph, we obtain a bound that is identical to the mean field bound (Jaakkola, 1997). To see this, note that for a Boltzmann machine in which all of the nodes have been eliminated there are no quadratic and linear terms; only the constant terms remain. Recall from Section 5.2 that the constant that arises when node i is removed is μ_i θ'_{i0} + H(μ_i), where θ'_{i0} refers to the value of θ_{i0} after it has been updated to absorb the linear terms from previously eliminated nodes j < i. (Recall that the latter update is given by θ_{i0} := θ_{i0} + μ_j θ_{ij} for the removal of a particular node j.) Collecting together such updates for j < i, and summing across all nodes i, we find that the resulting constant term is given as follows:
Σ_i { θ'_{i0} μ_i + H(μ_i) } = Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ].
(60)
This differs from the lower bound in Eq. (58) only by the term ln Z, which disappears when we maximize with respect to μ_i.

6.3.2. Neural networks

As discussed in Section 3, the "sigmoid belief network" is essentially a (directed) neural network with graphical model semantics. We utilize the logistic function as the node probability function:
P(S_i = 1 | S_{π(i)}) = 1 / ( 1 + e^{ -Σ_{j∈π(i)} θ_{ij} S_j - θ_{i0} } ),
(61)
where we assume that θ_{ij} = 0 unless j is a parent of i (in particular, θ_{ij} ≠ 0 implies θ_{ji} = 0). Noting that the probabilities for both the S_i = 0 case
and the S_i = 1 case can be written in a single expression as follows:

P(S_i | S_{π(i)}) = e^{ (Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}) S_i } / ( 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ),
(62)
we obtain the following representation for the joint distribution:

P(S|θ) = ∏_i e^{ (Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}) S_i } / ( 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ).
(63)
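As a small illustration (an addition, not the chapter's code), the following sketch evaluates ln P(S|θ) for a given binary configuration by summing the logarithms of the factors in Eq. (63); the data-structure choices (parent lists, weight lists) are assumptions made for the example.

```python
import numpy as np

def log_joint_sigmoid_bn(S, parents, theta, theta0):
    """ln P(S | theta) for a sigmoid belief network, via Eq. (63).
    S: 0/1 vector over all nodes; parents[i]: list of parents pi(i);
    theta[i][k]: weight from the k-th parent of node i; theta0[i]: bias."""
    logp = 0.0
    for i, (pa, w, b) in enumerate(zip(parents, theta, theta0)):
        z_i = sum(w_k * S[j] for w_k, j in zip(w, pa)) + b  # sum_{j in pi(i)} theta_ij S_j + theta_i0
        logp += z_i * S[i] - np.logaddexp(0.0, z_i)         # log of the factor: z_i S_i - ln(1 + e^{z_i})
    return logp
```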
We wish to calculate conditional probabilities under this joint distribution. As we have seen (cf. Fig. 6), inference for general sigmoid belief networks is intractable, and thus it is sensible to consider variational approximations. Saul, Jaakkola, and Jordan (1996) and Saul and Jordan (this volume) have explored the viability of the simple completely factorized distribution. Thus once again we set:

Q(H|E, μ) = ∏_{i∈H} μ_i^{S_i} (1 - μ_i)^{1-S_i},
(64)
and attempt to find the best such approximation by varying the parameters μ_i. The computation of the KL divergence D(Q‖P) proceeds much as it does in the case of the mean field Boltzmann machine. The entropy term ⟨ln Q⟩ is the same as before. The energy term ⟨ln P⟩ is found by taking the logarithm of Eq. (63) and averaging with respect to Q. Putting these results together, we obtain:

ln P(E|θ) ≥ Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0} μ_i - Σ_i ⟨ ln[ 1 + e^{ Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0} } ] ⟩ - Σ_i [ μ_i ln μ_i + (1 - μ_i) ln(1 - μ_i) ],
(65)
where ⟨·⟩ denotes an average with respect to the Q distribution. Note that, despite the fact that Q is factorized, we are unable to calculate the average of ln[1 + e^{z_i}], where z_i denotes Σ_{j∈π(i)} θ_{ij} S_j + θ_{i0}. This is an important term which arises directly from the directed nature of the sigmoid belief network (it arises from the denominator of the sigmoid, a factor which is necessary to define the sigmoid as a local conditional probability). To
deal with this term, Saul et al. (1996) introduced an additional variational transformation, due to Seung (1995), that can be viewed as a refined form of Jensen's inequality. In particular, for each node i a new variational parameter ξ_i is introduced, and the intractable term is upper bounded as follows:

⟨ ln[ 1 + e^{z_i} ] ⟩ ≤ ξ_i ⟨z_i⟩ + ln ⟨ e^{ -ξ_i z_i } + e^{ (1 - ξ_i) z_i } ⟩.

(66)

For ξ_i = 0 this reduces to the standard bound obtained from Jensen's inequality; for other values of ξ_i the bound can be tighter (Saul et al. also show that it can be made quite tight in the limiting case in which node i has a sufficiently large number of parents, so that a central limit theorem argument can be invoked). The expectation on the right-hand side of Eq. (66) is tractable under the factorized Q distribution, because e^{a z_i} factorizes across the parents of node i. Note also that the intractable term appears with a negative sign in Eq. (65); thus an upper bound on this term yields a lower bound on the log likelihood.

Substituting Eq. (66) into Eq. (65) and differentiating with respect to μ_i, we once again obtain a set of consistency equations for the variational parameters:

μ_i = σ( Σ_j θ_{ij} μ_j + θ_{i0} + Σ_j θ_{ji} ( μ_j - ξ_j ) - Σ_j K_{ij} ),

(67)

where σ(·) is the logistic function and K_{ij} denotes the contribution obtained by differentiating the expectation in Eq. (66) for child j with respect to μ_i. The equation for node i thus involves contributions from the parents of node i, from its children, and from the co-parents of its children; that is, the update for μ_i depends on the variational parameters in the Markov blanket of node i (see Fig. 21). As in the case of the Boltzmann machine, these equations are solved iteratively and can be interpreted as a local message passing algorithm. Consistency equations for the parameters ξ_i are obtained in the same way, by differentiating the bound with respect to ξ_i; Saul et al. (1996) and Saul and Jordan (this volume) provide the update equations and discuss their implementation. Yet another variational approach to sigmoid belief networks, which utilizes a slightly different transformation and provides upper bounds on likelihoods as well as lower bounds, has been developed by Jaakkola and Jordan (1996).
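As a concrete illustration (added here, not taken from the chapter), the sketch below evaluates the right-hand side of the bound in Eq. (66) for a single node under a factorized Bernoulli Q; it relies only on the fact that ⟨e^{a z_i}⟩ factorizes over the parents of node i. The function and variable names are assumptions of the example.

```python
import numpy as np

def expected_exp(a, theta_pa, theta0, mu_pa):
    """<exp(a * z_i)> under a factorized Bernoulli Q, where
    z_i = sum_j theta_ij S_j + theta_i0 and mu_pa are the parent means."""
    val = np.exp(a * theta0)
    for th, mu in zip(theta_pa, mu_pa):
        val *= (1.0 - mu) + mu * np.exp(a * th)
    return val

def refined_jensen_bound(xi, theta_pa, theta0, mu_pa):
    """Upper bound of Eq. (66) on <ln(1 + exp(z_i))> for one node,
    with variational parameter xi in [0, 1]."""
    mean_z = np.dot(theta_pa, mu_pa) + theta0
    return xi * mean_z + np.log(expected_exp(-xi, theta_pa, theta0, mu_pa)
                                + expected_exp(1.0 - xi, theta_pa, theta0, mu_pa))
```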
Figure 21. (a) A node S_i in a sigmoid belief network with its Markov blanket. (b) The mean field equations yield a deterministic relationship, represented in the figure with the dotted lines, between the variational parameters μ_i and μ_j for nodes j in the Markov blanket of node i.

Saul and Jordan (this volume) and Saul et al. (1996) have tested this variational framework empirically for layered sigmoid belief networks (Neal, 1992), including on a problem of handwritten digit recognition. They report performance that is competitive with other supervised learning systems, including comparisons with networks trained by Gibbs sampling, and they note two further advantages of the variational approach: the ability to deal with missing data (missing pixels), with only a slight degradation in performance, and the interesting appearance of weight-decay-like regularization terms in the variational equations. Frey, Hinton, and Dayan (1996) report further comparative empirical results for related belief network architectures.

Parameter estimation can also be treated within the variational framework. Differentiating the lower bound with respect to θ_{ij} yields a gradient of the following delta-rule-like form:

Δθ_{ij} ∝ ( μ_i - ⟨σ(z_i)⟩ ) μ_j.

(68)

Note that, in contrast to the Boltzmann machine update of Eq. (59), there is no need to compute averages under the unconditional distribution P(S|θ); the gradient depends only on quantities obtained from the (approximate) conditional distribution.

6.3.3. Factorial hidden Markov models

The factorial hidden Markov model (FHMM) is a multiple-chain hidden Markov architecture (see Section 3.5). Using the notation developed there, the joint probability distribution for the FHMM is given by:
Figure 22. (a) The FHMM . (b) A variational approximation for the FHMM can be obtained by picking out a tractable substructure in the FHMM graph. Parameterizing this graph leads to a family of tractable approximating distributions .
P({X_t^{(m)}}, {Y_t} | θ) = ∏_{m=1}^{M} [ π^{(m)}(X_1^{(m)}) ∏_{t=2}^{T} A^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) ] ∏_{t=1}^{T} P(Y_t | {X_t^{(m)}}_{m=1}^{M}).

(69)

Computation under this probability distribution is generally infeasible, because, as we saw earlier, the clique size becomes unmanageably large when the FHMM chain structure is moralized and triangulated. Thus it is necessary to consider approximations. For the FHMM there is a natural substructure on which to base a variational algorithm. In particular, the chains that compose the FHMM are individually tractable. Therefore, rather than removing all of the edges, as in the naive mean field approximation discussed in the previous two sections, it would seem more reasonable to remove only as many edges as are necessary to decouple the chains. In particular, we remove the edges that link the state nodes to the output nodes (see Fig. 22(b)). Without these edges the moralization process no longer links the state nodes and no longer creates large cliques. In fact, the moralization process on the delinked graph in Fig. 22(b) is vacuous, as is the triangulation. Thus the cliques on the delinked graph are of size N^2, where N is the number of states for a single chain. Inference in the approximate graph runs in time O(M T N^2), where M is the number of chains and T is the length of the time series. Let us now consider how to express a variational approximation using the delinked graph of Fig. 22(b) as an approximation. The idea is to introduce one free parameter into the approximating probability distribution, Q, for each edge that we have dropped. These free parameters, which we denote as λ_t^{(m)}, essentially serve as surrogates for the effect of the observation at time t on state component m. When we optimize the divergence D(Q‖P) with respect to these parameters they become interdependent; this (deterministic) interdependence can be viewed as an approximation to the probabilistic dependence that is captured in an exact algorithm via the moralization process.
Referring to Fig. 22(b), we write the approximating Q distribution in the following factorized form:
Q({X_t^{(m)}} | {Y_t}, θ, λ) = ∏_{m=1}^{M} π̃^{(m)}(X_1^{(m)}) ∏_{t=2}^{T} Ã^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}),
(70)
where λ is the vector of variational parameters λ_t^{(m)}. We define the transition matrix Ã^{(m)} to be the product of the exact transition matrix A^{(m)} and the variational parameter λ_t^{(m)}:

Ã^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) = A^{(m)}(X_t^{(m)} | X_{t-1}^{(m)}) λ_t^{(m)},
(71)
and similarly for the initial state probabilities π̃^{(m)}:

π̃^{(m)}(X_1^{(m)}) = π^{(m)}(X_1^{(m)}) λ_1^{(m)}.
(72)
This family of distributions respects the conditional independence statements of the approximate graph in Fig. 22, and provides additional degrees of freedom via the variational parameters. Ghahramani and Jordan (1997) present the equations that result from minimizing the KL divergence between the approximating probability distribution (Eq. (70)) and the true probability distribution (Eq. (69)). The result can be summarized as follows. As in the other architectures that we have discussed, the equation for a variational parameter (λ_t^{(m)}) is a function of terms that are in the Markov blanket of the corresponding delinked node (i.e., Y_t). In particular, the update for λ_t^{(m)} depends on the parameters λ_t^{(n)}, for n ≠ m, thus linking the variational parameters at time t. Moreover, the update for λ_t^{(m)} depends on the expected value of the states X_t^{(m)}, where the expectation is taken under the distribution Q. Given that the chains are decoupled under Q, expectations are found by running one of the exact algorithms (for example, the forward-backward algorithm for HMMs), separately for each chain. These expectations of course depend on the current values of the parameters λ_t^{(m)} (cf. Eq. (70)), and it is this dependence that effectively couples the chains. To summarize, fitting the variational parameters for a FHMM is an iterative, two-phase procedure. In the first phase, an exact algorithm is run as a subroutine to calculate expectations for the hidden states. This is done independently for each of the M chains, making reference to the current values of the parameters λ_t^{(m)}. In the second phase, the parameters λ_t^{(m)} are updated based on the expectations computed in the first phase. The procedure then returns to the first phase and iterates.
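The two-phase procedure can be written schematically as follows; this is a sketch under stated assumptions (the chain-wise forward-backward routine and the λ update of Ghahramani and Jordan (1997) are passed in as callables, not implemented here).

```python
def fit_fhmm_variational(lam, forward_backward, update_lambda, n_iters=20):
    """Schematic two-phase loop for the structured approximation of Eq. (70).
    lam[m] holds the parameters lambda_t^(m) for chain m.
    forward_backward(m, lam): expected states for chain m under Q (Phase 1).
    update_lambda(m, expectations): refit lambda for chain m (Phase 2)."""
    M = len(lam)
    for _ in range(n_iters):
        # Phase 1: exact inference within each decoupled chain.
        expectations = [forward_backward(m, lam) for m in range(M)]
        # Phase 2: update the variational parameters given those expectations.
        for m in range(M):
            lam[m] = update_lambda(m, expectations)
    return lam
```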
Figure 24. The "forest of trees approximation" for the HMDT. Parameterizing this graph leads to an approximating family of Q distributions.
We can also consider a "forest of trees approximation " in which the horizontal links are eliminated (see Fig . 24) . Given that the decision tree is a fully connected graph , this is essentially a naive mean field approximation on a hypergraph . Finally , it is also possible to develop a variational algorithm for the HMDT that is analogous to the Viterbi algorithm for HMMs . In particular , we utilize an approximation Q that assigns probability one to a single path in the state space. The KL divergence for this Q distribution is particularly easy to evaluate , given that the entropy contribution to the KL divergence (i .e., the Q In Q term ) is zero. Moreover , the evaluation of the energy (i .e., the Q In P term ) reduces to substituting the states along the chosen path into the P distribution . The resulting algorithm involves a subroutine in which a standard Viterbi algorithm is run on a single chain , with the other chains held fixed . This subroutine is run on each chain in turn . Jordan , et ale (1997) found that performance of the HMDT on the Bach chorales was essentially the same as that of the FHMM . The advantage of the HMDT was its greater interpretability ; most of the runs resulted in a coarse-to -fine ordering of the temporal scales of the Markov processes from the top to the bottom of the tree .
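The Viterbi-like variational algorithm just described is a coordinate ascent over single-chain paths; the sketch below (an illustrative assumption, with the single-chain Viterbi subroutine left abstract) shows its overall structure.

```python
def viterbi_style_fit(paths, viterbi_single_chain, n_sweeps=10):
    """Coordinate ascent for the single-path Q distribution described above.
    paths[m]: current state path for chain m.
    viterbi_single_chain(m, paths): re-optimizes chain m's path with the
    other chains held fixed (assumed, not defined here)."""
    M = len(paths)
    for _ in range(n_sweeps):
        for m in range(M):
            paths[m] = viterbi_single_chain(m, paths)
    return paths
```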
7. Discussion

We have described a variety of applications of variational methods to problems of inference and learning in graphical models. We hope to have convinced the reader that variational methods can provide a powerful and elegant tool for graphical models, and that the algorithms that result are simple and intuitively appealing. It is important to emphasize, however, that research on variational methods for graphical models is of quite recent origin, and there are many open problems and unresolved issues. In this
section we discuss a number of these issues. We also broaden the scope of the presentation and discuss a number of related strands of research.

7.1. RELATED RESEARCH
The methods that we have discussed all involve deterministic , iterative approximation algorithms . It is of interest to discuss related approximation schemes that are either non-deterministic or non-iterative . 7.1.1. Recognition models and the Helmholtz machine All of the algorithms that we have presented have at their core a nonlinear optimization problem . In particular , after having introduced the variational parameters , whether sequentially or as a block , we are left with a bound such as that in Eq . (27) that must be optimized . Optimization of this bound is generally achieved via a fixed -point iteration or a gradient -based algorithm . This iterative optimization process induces interdependencies between the variational parameters which give us a "best" approximation to the marginal or conditional probability of interest . Consider in particular a problem in which a directed graphical model is used for unsupervised learning . A common approach in unsupervised learning is to consider graphical models that are oriented in the "generative " direction ; that is, they point from hidden variables to observables. In this case the "predictive " calculation of P (E [H ) is elementary . The calculation of P (HIE ) , on the other hand , is a "diagnostic " calculation that proceeds backwards in the graph . Diagnostic calculations are generally non-trivial and require the full power of an inference algorithm . An alternative approach to solving iteratively for an approximation to the diagnostic calculation is to learn both a generative model and a "recognition " model that approximates the diagnostic distribution P (HIE ) . Thus we associate different parameters with the generative model and the recognition model and rely on the parameter estimation process to bring these parameterizations into register . This is the basic idea behind the "Helmholtz machine " (Dayan , et al ., 1995; Hinton , et al ., 1995) . The key advantage of the recognition -model approach is that the calculation of P (HIE ) is reduced to an elementary feedforward calculation that can be performed quickly . There are some disadvantages to the approach as well . In particular , the lack of an iterative algorithm makes the Helmholtz machine unable to deal naturally with missing data , and with phenomena such as "explaining away," in which the couplings between hidden variables change as a function of the conditioning variables . Moreover , although in some cases there is a clear natural parameterization for the recognition model that is induced
from the generative model (in particular for linear models such as factor analysis ) , in general it is difficult to insure that the models are matched appropriately .l0 Some of these problems might be addressed by combining the recognition -model approach with the iterative variational approach ; essentially treating the recognition -model as a "cache" for storing good initializations for the variational parameters . 7.1.2. Sampling methods In this section we make a few remarks on the relationships between vari ational methods and stochastic methods , in particular the Gibbs sampler . In the setting of graphical models , both classes of methods rely on extensive message-passing . In Gibbs sampling , the message-passing is par ticularly simple : each node learns the current instantiation of its Markov blanket . With enough samples the node can estimate the distribution over its Markov blanket and (roughly speaking ) determine its own statistics . The advantage of this scheme is that in the limit of very many samples, it is guaranteed to converge to the correct statistics . The disadvantage is that very many samples may be required . The message-passing in variational methods is quite different . Its pur pose is to couple the variational parameters of one node to those of its Markov blanket . The messages do not come in the form of samples, but rather in the form of approximate statistics (as summarized by the varia tional parameters ) . For example , in a network of binary nodes, while the Gibbs sampler is circulating messages of binary vectors that correspond to the instantiations of Markov blankets , the variational methods are cir culating real-valued numbers that correspond to the statistics of Markov blankets . This may be one reason why variational methods often converge faster than Gibbs sampling . Of course, the disadvantage of these schemes is that they do not necessarily converge to the correct statistics . On the other hand , they can provide bounds on marginal probabilities that are quite difficult to estimate by sampling . Indeed , sampling -based methods - while well -suited to estimating the statistics of individual hidden nodes- are ill equipped to compute marginal probabilities such as P (E ) = EH P (H , E ) . An interesting direction for future research is to consider combinations of sampling methods and variational methods . Some initial work in this direction has been done by Hinton , Sallans , and Ghahramani (this volume ), who discuss brief Gibbs sampling from the point of view of variational approximation . laThe particular recognition model utilized in the Helmholtz machine is a layered graph , which makes weak conditional independence assumptions and thus makes it possible , in principle , to capture fairly general dependencies .
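To make the Gibbs-sampling comparison of Section 7.1.2 concrete, here is a minimal sketch (an added illustration, not the chapter's code) of the Gibbs update for one binary node in a Boltzmann-style network: the "message" the node receives is simply the current instantiation of its Markov blanket.

```python
import numpy as np

def gibbs_update(i, S, theta, theta0, rng):
    """Resample binary node i given the current instantiation of its Markov
    blanket.  theta is symmetric with zero diagonal, so the local field
    depends only on the blanket; use rng = np.random.default_rng()."""
    field = theta[i] @ S + theta0[i]
    p_one = 1.0 / (1.0 + np.exp(-field))
    S[i] = 1 if rng.random() < p_one else 0
    return S
```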
7.1.3. Bayesian methods

Variational inference can also be applied to the general problem of Bayesian parameter estimation. In this setting we associate probability distributions with the parameters θ of a graphical model, treating the parameters on the same footing as the hidden variables and thereby treating Bayesian inference for the parameters as just another problem of probabilistic inference. This inference problem is generally intractable, and variational approximations can be useful.

One approach, known as "ensemble learning," was originally introduced as a way of fitting an "ensemble" of neural networks to data, where each member of the ensemble can be thought of as a particular setting of the parameters (Hinton & van Camp, 1993). More recently the approach has been applied to mixture of experts architectures (Waterhouse, et al., 1996) and to hidden Markov models (MacKay, 1997a). The idea is to fit a tractable approximating distribution Q(θ|E) to the posterior P(θ|E), in particular by assuming an appropriate factorization for Q and minimizing the KL divergence:

KL(Q‖P) = ∫ Q(θ|E) ln [ Q(θ|E) / P(θ|E) ] dθ.

(73)

Following the same line of argument as in Section 6, this minimization is equivalent to maximizing a lower bound on the marginal likelihood:

ln P(E) = ln ∫ P(E|θ) P(θ) dθ ≥ ∫ Q(θ|E) ln [ P(E|θ) P(θ) / Q(θ|E) ] dθ.

(74)

The marginal likelihood is the key quantity in Bayesian model selection and model averaging (and in "Type II maximum likelihood" parameter estimation), so that the variational approach provides a generally useful lower bound for Bayesian methods. (See also Heckerman, this volume, for a general discussion of Bayesian approaches to learning with graphical models.)

In related work, Jaakkola and Jordan (1997b) have developed variational methods for Bayesian logistic regression, in which a Gaussian prior is placed on the parameters and the variational transformation yields an analytically tractable approximation to the posterior.
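When the integral in the lower bound has no closed form, it can still be estimated by drawing samples from Q; the sketch below is an added illustration of this idea, and the four callables (a sampler for Q and the three log-densities) are assumptions of the example.

```python
def ensemble_lower_bound(sample_q, log_q, log_prior, log_lik, n_samples=1000):
    """Monte Carlo estimate of the bound in Eq. (74):
    E_Q[ ln P(E|theta) + ln P(theta) - ln Q(theta) ]."""
    total = 0.0
    for _ in range(n_samples):
        theta = sample_q()
        total += log_lik(theta) + log_prior(theta) - log_q(theta)
    return total / n_samples
```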
7.1.4. Perspective and prospectives

Perhaps the key issue that faces developers of variational methods is the issue of approximation accuracy. At the current state of development of variational methods for graphical models, we have little theoretical insight
into conditions under which variational methods can be expected to be accurate and conditions under which they might be expected to be inaccurate . Moreover , there is little understanding of how to match variational transformations to architectures . One can develop an intuition for when variational methods work by examining their properties in certain well -studied cases. For mean field meth ods, a good starting point is to understand the examples in the statistical mechanics literature where this approximation gives not only good , but indeed exact , results . These are densely connected graphs with uniformly weak (but non -negative ) couplings between neighboring nodes (Parisi , 1988) . The mean field equations for these networks have a unique solution that determines the statistics of individual nodes in the limit of very large graphs . In more general graphical models , of course, the conditions for a mean field approximation may not be so favorable . Typically , this can be diagnosed by the presence of multiple solutions to the mean field equations . Roughly speaking , one can interpret each solution as corresponding to a mode of the posterior distribution ; thus , multiple solutions indicate a mul timodal posterior distribution . The simplest mean field approximations , in particular those that utilize a completely factorized approximating distri bution , are poorly designed for such situations . However , they can succeed rather well in applications where the joint distribution P (H , E ) is multimodal , but the posterior distribution P (HIE ) is not . It is worth emphasizing this distinction between joint and posterior distributions . This is what allows simple variational methods - which make rather strong assumptions of conditional independence - to be used in the learning of nontrivial graphical models . A second key issue has to do with broadening the scope of variational methods . In this paper we have presented a restricted set of variational techniques , those based on convexity transformations . For these techniques to be applicable the appropriate convexity properties need to be identified . While it is relatively easy to characterize small classes of models where these properties lead to simple approximation algorithms , such as the case in which the local conditional probabilities are log-concave generalized lin ear models , it is not generally easy to develop variational algorithms for other kinds of graphical models . A broader characterization of variational approximations is needed and a more systematic algebra is needed to match the approximations to models . Other open problems include : (1) the problem of combining variational methods with sampling methods and with search based methods , (2) the problem of making more optimal choices of node ordering in the case of sequential methods , (3) the development of upper bounds within the block framework , (4) the combination of multiple variational approximations for
the same model, (5) the development of variational methods for architectures that combine discrete and continuous random variables, and (6) further combinations of variational methods with exact and sampling-based methods, for example on pruned versions of the underlying graph. Similar open problems exist on the theoretical side; in all of these cases a large part of the difficulty lies in the fact that the accuracy of a variational approximation is contingent on the actual values of the conditional probabilities in the model rather than on the properties of the underlying graph alone, and solid theoretical foundations that account for this fact remain to be developed.
Acknowledgments
We wish to thank Peter Dayan, Brendan Frey, David Heckerman, and Uffe Kjærulff for (as always) helpful comments on the manuscript.
References
Bathe, K. J. (1996). Finite Element Procedures. Englewood Cliffs, NJ: Prentice-Hall.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164-171.
Cover, T., & Thomas, J. (1991). Elements of Information Theory. New York: John Wiley.
Cowell, R. (in press). Introduction to inference for Bayesian networks. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141-153.
Dayan, P., Hinton, G. E., Neal, R., & Zemel, R. S. (1995). The Helmholtz Machine. Neural Computation, 7, 889-904.
Dean, T., & Kanazawa, K. (1989). A model for reasoning about causality and persistence. Computational Intelligence, 5, 142-150.
Dechter, R. (in press). Bucket elimination: A unifying framework for probabilistic inference. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1-38.
Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, B57, 45-97.
Frey, B., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm learn good density estimators? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Fung, R., & Favero, B. D. (1994). Backward simulation in Bayesian networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Galland, C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355-379.
Ghahramani, Z., & Hinton, G. E. (1996). Switching state-space models. Technical Report CRG-TR-96-3, Department of Computer Science, University of Toronto.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial Hidden Markov models. Machine Learning, 29, 245-273.
Gilks, W., Thomas, A., & Spiegelhalter, D. (1994). A language and a program for complex Bayesian modelling. The Statistician, 43, 169-178.
Heckerman , D. (in press). A tutorial on learningwith Bayesiannetworks. In M. I. Jordan (Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Henrion, M. (1991). Search -basedmethodsto bound diagnosticprobabilities in very large belief nets. Uncertaintyand Artificial Intelligence: Proceedings of the Seventh Conference . San Mateo, CA: MorganKaufmann. Hinton, G. E., & Sejnowski , T. (1986). Learningandrelearningin Boltzmannmachines . In D. E. Rumelhart & J. L. McClelland, (Eds.), Parallel distributedprocessing : Volume 1, Cambridge, MA: MIT Press. Hinton, G.E. & van Camp, D. (1993). Keepingneural networkssimple by minimizing the descriptionlength of the weights. In Proceedings of the 6th Annual Workshopon ComputationalLearning Theory, pp 5-13. New York, NY: ACM Press. Hinton, G. E., Dayan, P., Frey, B., and Neal, R. M. (1995). The wake-sleepalgorithm for unsupervisedneural networks. Science , 268:1158- 1161. Hinton, G. E., Sallans, B., & Ghahramani, Z. (in press). A hierarchicalcommunity of experts. In M. I . Jordan (Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Horvitz, E. J., Suermondt, H. J., & Cooper, G.F. (1989). Boundedconditioning: Flexible inferencefor decisionsunderscarceresources . Conferenceon Uncertaintyin Artificial Intelligence: Proceedingsof the Fifth Conference . Mountain View, CA: Association for UAI . Jaakkola, T. S., & Jordan, M. I . (1996). Computingupperandlowerboundson likelihoods in intractable networks. Uncertaintyand Artificial Intelligence: Proceedingsof the Twelth Conference . SanMateo, CA: MorganKaufmann. Jaakkola, T . S. (1997). Variational methodsfor inferenceand estimation in graphical models . Unpublisheddoctoral dissertation, Massachusetts Institute of Technology . Jaakkola, T. S., & Jordan, M. I . (1997a ) . Recursivealgorithmsfor approximatingprobabilities in graphical models. In M. C. Mozer, M. I. Jordan, & T . Petsche(Eds.), Advancesin Neural Information ProcessingSystems9. Cambridge, MA: MIT Press. Jaakkola, T. S., & Jordan, M. I . (1997b). Bayesianlogistic regression : a variational approach. In D. Madigan & P. Smyth (Eds.), Proceedingsof the 1997Conferenceon Artificial Intelligenceand Statistics, Ft . Lauderdale , FL. Jaakkola, T . S., & Jordan. M. I . (1997c ) . Variationalmethodsandthe QMR-DT database . Submitted to: Journal of Artificial IntelligenceResearch . Jaakkola, T . S., & Jordan. M. I . (in press). Improvingthe meanfield approximationvia the useof mixture distributions. In M. I . Jordan(Ed.), Learningin GraphicalModels. Norwell, MA: Kluwer AcademicPublishers. Jensen, C. S., Kong, A., & Kjrerulff, U. (1995). Blocking-Gibbs samplingin very large probabilistic expert systems. International Journal of Human-ComputerStudies, 42, 647- 666. Jensen,F. V., & Jensen,F. (1994). Optimal junction trees. Uncertaintyand Artificial Intelligence: Proceedings of the Tenth Conference . SanMateo, CA: MorganKaufmann. Jensen, F. V. (1996). An Introductionto BayesianNetworks. London: UCL Press. Jordan, M. I . (1994). A statistical approachto decisiontree modeling. In M. Warmuth (Ed.), Proceedings of the SeventhAnnual ACM Conferenceon ComputationalLearning Theory. New York: ACM Press. Jordan, M. I ., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decisiontrees. In M. C. Mozer, M. I . Jordan, & T . Petsche(Eds.), Advancesin Neural Information ProcessingSystems9. Cambridge, MA: MIT Press. Kanazawa, K ., Koller, D., & Russell, S. (1995). Stocha .gtic simulation algorithms for dynamic probabilistic networks. 
Uncertaintyand Artificial Intelligence: Proceedings of the EleventhConference . SanMateo, CA: MorganKaufmann. Kj ~rulff, U. (1990). Triangulationof graphs- algorithmsgiving small total state space. ResearchReport R-90-09, Departmentof Mathematicsand ComputerScience , Aalborg University, Denmark. Kj ~rulff, U. (1994). Reduction of computational complexity in Bayesiannetworks
through removal of weak dependences. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
MacKay, D. J. C. (1997a). Ensemble learning for hidden Markov models. Unpublished manuscript, Department of Physics, University of Cambridge.
MacKay, D. J. C. (1997b). Comparison of approximate methods for handling hyperparameters. Submitted to Neural Computation.
MacKay, D. J. C. (in press). Introduction to Monte Carlo methods. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
McEliece, R. J., MacKay, D. J. C., & Cheng, J.-F. (in press). Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communication.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
Neal, R., & Hinton, G. E. (in press). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Parisi, G. (1988). Statistical Field Theory. Redwood City, CA: Addison-Wesley.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995-1019.
Rockafellar, R. (1972). Convex Analysis. Princeton, NJ: Princeton University Press.
Rustagi, J. (1976). Variational Methods in Statistics. New York: Academic Press.
Sakurai, J. (1985). Modern Quantum Mechanics. Redwood City, CA: Addison-Wesley.
Saul, L. K., & Jordan, M. I. (1994). Learning in Boltzmann trees. Neural Computation, 6, 1173-1183.
Saul, L. K., Jaakkola, T. S., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Saul, L. K., & Jordan, M. I. (in press). A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan (Ed.), Learning in Graphical Models. Norwell, MA: Kluwer Academic Publishers.
Seung, S. (1995). Annealed theories of learning. In Neural Networks: The Statistical Mechanics Perspective. Singapore: World Scientific.
Shachter, R. D., Andersen, S. K., & Szolovits, P. (1994). Global conditioning for probabilistic inference in belief networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Shenoy, P. P. (1992). Valuation-based systems for Bayesian decision analysis. Operations Research, 40, 463-484.
Shwe, M. A., Middleton, B., Heckerman, D. E., Henrion, M., Horvitz, E. J., Lehmann, H. P., & Cooper, G. F. (1991). Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Meth. Inform. Med., 30, 241-255.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227-269.
Waterhouse, S., MacKay, D. J. C., & Robinson, T. (1996). Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate
temporally distorted strings. In D. S. Touretzky, J. Elman, T. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

9. Appendix

In this section, we calculate the conjugate functions for the logarithm function and the log logistic function. For f(x) = ln x, we have:

f*(λ) = min_x { λx - ln x }.
(75)
Taking the derivative with respect to x and setting to zero yields x = λ^{-1}. Substituting back into Eq. (75) yields:
f*(λ) = ln λ + 1,
(76)
which justifies the representation of the logarithm given in Eq. (14). For the log logistic function g(x) = -ln(1 + e^{-x}), we have:
g*(λ) = min_x { λx + ln(1 + e^{-x}) }.
(77)
Taking the derivative with respect to x and setting to zero yields:

λ = e^{-x} / ( 1 + e^{-x} ),

(78)
from which we obtain:

x = ln( (1 - λ) / λ ),

(79)

and
ln(1 + e^{-x}) = -ln(1 - λ).
(80)
Plugging these expressions back into Eq. (77) yields:

g*(λ) = -λ ln λ - (1 - λ) ln(1 - λ),
(81)
which is the binary entropy function H(λ). This justifies the representation of the logistic function given in Eq. (19).
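The conjugate pair can also be checked numerically; the short script below (an added illustration, not part of the chapter) minimizes λx + ln(1 + e^{-x}) over a grid of x values and compares the result with the binary entropy H(λ) of Eq. (81).

```python
import numpy as np

def g_star_numeric(lam, xs=np.linspace(-30.0, 30.0, 200001)):
    """Numerically evaluate min_x { lam * x + ln(1 + exp(-x)) } from Eq. (77)."""
    return np.min(lam * xs + np.logaddexp(0.0, -xs))

def binary_entropy(lam):
    """H(lam) = -lam ln lam - (1 - lam) ln(1 - lam), the conjugate claimed in Eq. (81)."""
    return -lam * np.log(lam) - (1.0 - lam) * np.log(1.0 - lam)

for lam in (0.1, 0.3, 0.5, 0.9):
    print(lam, g_star_numeric(lam), binary_entropy(lam))
```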
IMPROVING THE MEAN FIELD APPROXIMATION VIA THE USE OF MIXTURE DISTRIBUTIONS
TOMMI
S . JAAKKOLA
University Santa
of Cruz
,
California CA
AND
MICHAEL I . JORDAN MassachusettsInstitute of Technology Cambridge, MA
Abstract . Mean field methods provide computationally efficient approxi mations to posterior probability distributions for graphical models . Simple mean field methods make a completely factorized approximation to the posterior , which is unlikely to be accurate when the posterior is multi modal . Indeed , if the posterior
is multi - modal , only one of the modes can
be captured . To improve the mean field approximation in such cases, we employ mixture models as posterior approximations , where each mixture component
is a factorized
distribution
. We describe efficient
methods
for
optimizing the parameters in these models .
1.
Introduction
Graphical models provide a convenient formalism in which to express and manipulate conditional independence statements . Inference algorithms for graphical models exploit these independence statements , using them to compute conditional probabilities while avoiding brute force marginaliza tion over the joint probability table . Many inference algorithms , in particu lar the algorithms that construct a junction tree , make explicit their usage of conditional independence by constructing a data structure that captures the essential Markov properties underlying the graph . That is, the algorithm groups interacting variables into clusters , such that the hypergraph of clusters has Markov properties that allow simple local algorithms to be 163
164
TOMMI S. JAAKKOLA AND MICHAEL I . JORDAN
employed for inference. In the best case, in which the original graph is sparse and without long cycles, the clusters are small and inference is efficient. In the worst case, such as the caseof a densegraph, the clusters are large and inference is inefficient (complexity scalesexponentially in the size of the largest cluster). Architectures for which the complexity is prohibitive include the QMR database (Shwe, et al., 1991), the layered sigmoid belief network (Neal, 1992), and the factorial hidden Markov model (Ghahramani & Jordan, 1996). Mean field theory (Parisi, 1988) provides an alternative perspective on the inference problem. The intuition behind mean field methods is that in dense graphs each node is subject to influences from many other nodes; thus, to the extent that each influence is weak and the total influence is roughly additive (on an appropriate scale), each node should roughly be characterized by its mean value. In particular this characterization is valid in situations in which the law of large numbers can be applied. The mean value is unknown, but it is related to the mean values of the other nodes. Mean field theory consistsof finding consistencyrelations betweenthe mean values for each of the nodes and solving the resulting equations (generally by iteration ). For graphical models these equations generally have simple graphical interpretations ; for example, Peterson and Anderson (1987) and Saul, Jaakkola, and Jordan (1996) found, in the casesof Markov networks and Bayesian networks, respectively, that the mean value of a given node is obtained additively (on a logit scale) from the mean values of the nodes in the Markov blanket of the node. Exact methods and mean field methods might be said to be complementary in the sensethat exact methods are best suited for sparsegraphs and mean field methods are best suited for densegraphs. Both classesof methods have significant limitations , however, and the gap between their respective domains of application is large. In particular , the naive mean field methods that we have referred to above are based on the approximation that each node fluctuates independently about its mean value. If there are a strong interactions in the network, i .e. if higher-order moments are important , then mean field methods will generally fail . One way in which such higher-order interactions may manifest themselvesis in the presence of multiple modes in the distribution ; naive mean field theory, by assuming independent fluctuations , effectively assumesa unimodal distribution and will generally fail if there are multiple modes. One approach to narrowing the gap between exact methods and mean field methods is the "structured mean field" methodology proposedby Saul and Jordan (1996). This approach involves deleting links in the graph and identifying a core graphical substructure that can be treated efficiently via exact methods (e.g., a forest of trees). Probabilistic dependenciesnot
IMPROVING THE MEAN FIELD APPROXIMATION
165
accounted for by the probability distribution on this core substructure are treated via a naive mean field approximation. For architectures with an obvious chain-like or tree-like substructure, such as the factorial hidden Markov model, this method is natural and successful. For architectures without such a readily identified substructure, however, such as the QMR databMe and the layered sigmoid belief network, it is not clear how to develop a useful structured mean field approximation. The current paper extends the basic mean field methodology in a different direction . Rather than basing the approximation on the assumption of a unimodal approximating distribution ,1 we build in multimodali ty by allowing the approximating distribution to take the form of a mixture distribution . The components of the mixture are assumedto be simple factorized distributions , for reasonsof computational efficiency. Thus, within a mode we assumeindependent fluctuations , and multiple modes are used to capture higher-order interactions. In the following sections we describe our mixture -based approach that extends the basic mean field method. Section 2 describesthe basicsof mean field approximation , providing enough detail so as to make the paper selfcontained. Section 3 then describeshow to extend the mean field approximation by allowing mixture distributions in the posterior approximation . The following sections develop the machinery whereby this approximation can be carried out in practice. 2.
Mean
field
approximation
Assume we have a probability model P (S, S*), where S* is the set of instantiated or observed variables (the "evidence" ) and the variables S are hidden or unobserved. We wish to find a tractable approximation to the posterior probability P (SIS*). In its simplest form mean field theory assumesthat nodes fluctuate independently around their mean values. We make this assumption explicit by expressingthe approximation in terms of a factorized distribution Qmj (SIS*): Qmf (SIS*) =
II7.Qi(SiIS *),
(1)
where the parameters Qi (SiIS * ) depend on each other and on the evidence S* (i .e., we must refit the mean field approximation for each configuration of the evidence ) . lBy "unimodal distributions" we meandistributionsthat are log-concave . Meanfield distributions that are products of exponentialfamily distributions are unimodal in this sense .
166
TOMMI S. JAAKKOLAAND MICHAELI. JORDAN
The mean field approximationcan be developedfrom severalpoints of view, but a particularly usefulperspectivefor applicationsto graphical modelsis the "variational" point of view. Within this approachwe use the KL divergenceas a measureof the goodnessof the approximationand choosevaluesof the parametersQi(SiIS*) that minimizethe KL divergence :
KL( Qmj (SIS *) IIP(Sls*) ) = ~ Qmf (SIS *)log~
.
(2)
Why do we useKL(QmfIIP) rather than KL (PIIQmf)? A pragmaticanswer is that the use of KL (QmfIIP) implies that averagesare taken using the tractable Qmf distribution rather than the intractableP distribution (cf. Eq. (2)); only in the former casedo we haveany hopeof finding the best approximationin practice. A moresatisfyinganswerarisesfrom considering the likelihood of the evidence , P(S*). Clearly calculationof the likelihood permits calculationof the conditionalprobability P(SIS*). UsingJensen 's inequality we can lower bound the likelihood (cf. for exampleSaul, et al., 1996):
logP(S*)
-
-
>
log LsP(S,s*) P(S,S*) log L Q(SIS *)(J(SIS ;)S P(S,S*) LSQ(SIS *)log (J(SIS ;)'
for arbitrary Q(SIS*), in particular for the meanfield distribution Qmj (SIS*). I t is easily verified that the difference between the left hand side and the right hand side of this inequality is the KL divergence KL (QIIP ); thus, minimizing this divergence is equivalent to maximizing a lower bound on the likelihood . For graphical models, minimization of KL (QmfIIP ) with respect to Qi (SiIS*) leads directly to an equation expressing Qi (SiIS*) in terms of the Qj (Sj IS*) associatedwith the nodes in the Markov blanket of node i (Peterson & Anderson, 1987; Saul, et al., 1996). These equations are coupled nonlinear equations and are generally solved iteratively . 3. Mixture
approximations
In this section we describe how to extend the mean field framework by utilizing mixture distributions rather than factorized distributions as the approximating distributions . For notational simplicity , we will drop the
167
IMPROVING THE MEAN FIELD APPROXIMATION
dependenceof Q(SIS*) on S* in this section and in the remainder of the paper, but it must be borne in mind that the approximation is optimized separately for each configuration of the evidence S*. We also utilize the notation F (Q) to denote the lower bound on the right hand side of Eq. (3):
:F(Q)=:LsQ(S)log :~i~ .~):l Q(~ S When
Q
( S
)
takes
the
approximation
a
corresponding
I ~_
the
as
I
OUf distribution
factorized
is
as
to
Qmf
mean
bound
proposal
form
" factorized F
utilize
( S
field
( Qmj
"
)
in
(3)
Eq
.
( 1 ) ,
we
approximation
will
refer
and
will
to
the
denote
) .
a
mixture
=
1: : :
distribution
as
the
approximating
:
Qmix
( S
)
amQmj
( Slm
) ,
( 4 )
m
where
each
butions
.
and of
In
F
( Qmix
F
( Qmf
sider
the
The
sum the
of
mixing
to
one
KL
divergence
the
remainder
)
-
1m
)
component
the
distributions
proportions
,
are
Qmf
am
additional
,
which
( 81m are
parameters
to
)
are
factorized
constrained be
fit
distri
to via
the
be
-
positive
minimization
. of
mixture
corresponding
this
section
bound to
we
the
and mixture
focus
the
on
the
relationship
factorized components
between
mean .
To
field this
bounds end
,
con
-
:
P(8,8 *) :F(Qm1 ..x)-- ~ L...SQmix (8)logQ ( mix 8) }:,Sam [Qmj (Slm )log ~ .~~.!-~ m Qmix (8:2)] }:,Sam [Qmf (8Im )log ~( S 8~)+Qmf (Slm )log 2Qmix ~l(SI ~))-] m Qmj (,Slm (S }:mam :F(Qmflm )+m }:,SamQmf (Slm )log 2Qmix ~f(SI ~)1 (S (5) }:mG :m :F(Qmflm )+I(m;S) -
-
where I (m ; S) is the mutual information between the mixture index m and the variables S in the mixture distribution2 . The first term in Eq. (5) 2Recall that the mutual
l (x ;y) = ~ LIz
,y
information . between any two random variables x and y is
P (x , y) logP (ylx )fP (y ).
168
TOMMI
is a convex
S. JAAKKOLA
combination
AND
of factorized
MICHAEL
mean
field
I . JORDAN
bounds
and
therefore
it
yields no gain over factorized mean field as far ~ the KL divergence is concerned
. It
is the
second
term , i .e ., the
mutual
information
, that
char -
acterizes the gain of using the mixture approximation .3 As a non-negative quantity , I (m ; S) incre ~ es the likelihood bound and therefore improves the approximation . We note , however, that since I (m ; S) :::; log M , where M is the number of mixture components , the KL divergence between the mixture approximation and the true posterior can decrease at most logarithmically in the number
of mixture
4 . Computing
components
the mutual
.
information
To be able to use the mixture bound of Eq. (5) we need to be able to compute the mutual information term . The difficulty with the expression M it stands is the need to average the log of the mixture distribution over all configurations of the hidden variables . Given that there are an exponential number of such configurations , this computation is in general intractable . We will make use of a refined form of Jensen's inequality to make headway. Let
us first
rewrite
the mutual
"""
information
as
Qmf (8lm )
I(m;S) = L..., O :'mQmj (Slm )logQ . (8) m
,
S
(6)
m1 .X
=EO :'mQmj (8Im )[-lOg ~ ] (7) = ~ :'mR {SIm """O:'mQmj(8Im)[- logOO :'m{8fm {Slm R))Qmj Qmix (8)) ] (8) = L O :'mQmj (Slm) logR(8Im) - LGmlogam m ,S
"""
m
( I )[ R(Slm ) Qmix (8)] (9)
+~ QmQmj 8m - log_._~~~---Qmj{8Im )
where we have introduced additional smoothing distributions R(Slm ) whose role is to improve
the convexity
bounds that we describe below .4 With
some
foresight we have also introduced am into the expression ; as will become 3Of course , this is only the potential gain to be realized from the mixture mation . We cannot rule out the possibility that it will be harder to search for approximating distribution in the mixture approximation than in the factorized particular the mixture approximation may introduce additional local minima KL divergence .
approxi the best case. In into the
4In fact , if we use Jensen's inequality directly in Eq. (6), we obtain a vacuous bound of zero , as is easily verified .
IMPROVING THE MEAN FIELD APPROXIMATION
169
apparent in the following section the forced appearance of am turns out to be convenient in updating the mixture coefficients . To avoid the need to average over the logarithm of the mixture distri bution , we make use of the following convexity bound :
- log(x) ~ - .,\ x + log("\) + 1,
(10)
replacing the - log(.) terms in Eq. (9) with linear functions, separately for eachm . These substitutions imply a lower bound on the mutual information given by:
I (m; S) ~ } : amQmf (Slm)logR(Slm) - } : amlogam
-
m,S m - } : Am}: R(Slm)Qmix (S) + } : amlogAm + 1 (11) m S m IA(m; B).
(12)
The tightness of the convexity bound is controlled by the smoothing functions R (Stm) , which can be viewed as "flattening" the logarithm function such that Jensen's inequality is tightened. In particular , if we set R(Slm ) cx Qm/ (Slm )/ Qmix(S) and maximize over the new variational parameters Amwe recoverthe mutual information exactly. Sucha choicewould not reduce the complexity of our calculations, however. To obtain an efficient algorithm we will instead assumethat the distributions R have simple factorized forms, and optimize the ensuing lower bound with respect to the parameters in these distributions . It will turn out that this assumption will permit all of the terms in our bound to be evaluated efficiently. We note finally that this bound is never vacuous. To seewhy, simply set R(Slm ) equal to a constant for all { S, m} in which casemax,\ I ,\(m; S) = 0; optimization of R can only improve the bound. 5 . Finding
the mixture
parameters
In order to find a reasonable mixture approximation we need to be able to optimize the bound in Eq . (5) with respect to the mixture model . Clearly a necessary condition for the feasibility of this optimization process is that the component factorized mean field distributions can be fit to the model tractably . That is, we must assume that the following mean field equations :
88Qj :F(Qm /))=constant (Sj
(13)
170
TOMMI S. JAAKKOLA AND MICHAEL I. JORDAN
can be solved efficiently (not necessarily in closed form) for any of the marginals Qj (Sj ) in the factorized mean field distribution Qmf .5 Examples of such mean field equations are provided by Petersonand Anderson (1987) and Saul, et all (1996). Let us first consider the updates for the factorized mean field components of the mixture model. We needto take derivatives of the lower bound: F (Qmix) ~ L amF (Qmflm ) + IA(m; S) == FA (Qmix) m
(14)
with respect to the parameters of the component factorized distributions . The important property of I A(m; S) is that it is linear in any of the marginals Qj (Sjlk ), where k indicates the mixture component (seeAppendix A). As a result, we have:
8I).(m;S) = 8Qj(Sjlk) where
the
ally
constant
is
dependent
field
on
distribution
)
8
:
F ; \
can
be
the
us
I
; \
can
now
the
( m
;
)
.
write
ak8
for
the
turn
to
same
F ;\
+
8I
; \
( m
8Qj
specific
;
the
S
( Sjlk
is
gener
)
equations
=
-
mean
:
0
(
16
)
)
marginal
by
( and
Qj
( Sjlk
iteratively
)
.
We
can
optimizing
thus
each
of
mixture
-
Ek
coefficients
ak
log
therefore
CXk
ak
,
true
these
.
We
note
coefficients
for
F
; \
first
that
appear
( Qmix
)
( see
Eq
,
linearly
.
(
14
)
)
,
apart
in
and
we
:
:
F ; \
( Qmix
)
=
L
cxk
( -
E
( k
)
k
where
that
)
)
component
.
the
is
)-
( Sjlk
( Sjlk
particular
assumption
( Qmflk
any
Qj
the
components
model
term
The
: 8Qj
mixture
entropy
B
=
our
(15)
marginal
in
from
)
fitting in
Let
)
the
marginals
follows
efficiently
best
marginals
from
It
( Sjlk
solved
the
.
of
other
( Qmix
8Qj
find
independent
the
constant
-
E
( k
)
-
is
E
the
( k
collection
)
= =
of
F
( Qmflk
terms
)
linear
+
)
-
Lk aklogak +1 in
CXk
(17 )
:
L Qmj(Slk) logR(Slk) s
+
LAmL R(Slm}Qmf (Slk} + logAk m S
(18)
5In the ordinary mean field equations the constant is zero, but for many probability models of interest this slight difference posesno additional difficulties .
IMPROVING THE MEAN FIELD APPROXIMATION
171
Now Eq. (17) has the form of a free energy, and the mixture coefficientsthat maximize the lower bound : FA(Qmix) therefore come from the Boltzmann distri bution : e- E(k) ak=}=:kl e-E(k'). ( This with
fact
respect Finally
Am and ter
is easily
verified
updates
parameters
Lagrange
multipliers
to optimize
Eq . ( 17 )
to am ) .
, we must the
using
(19)
optimize
parameters
the
bound
of the smoothing
in Appendix Am , we readily
A . Differentiating obtain
with
respect
distribution the
bound
to the
parameters
. We discuss with
respect
the lat to the
:
;\m= EkakEsRam (Slm )Qmf (Slk )'
( 20 )
where a simplified form for the expressionin the denominator is presented in the Appendix . 6. Discussion We have presenteda general methodology for using mixture distributions in approximate probabilistic calculations. We have shown that the bound on the likelihood resulting from mixture -basedapproximations is composedof two terms, one a convexcombination of simple factorized mean field bounds and the other the mutual information between the mixture labels and the hidden variables S. It is the latter term which representsthe improvement available from the use of mixture models. The basic mixture approximation utilizes parameters am (the mixing proportions ), Qi (Silm ) (the parametersof the factorized component distri butions). These parameters are fit via Eq. (19) and Eq. (16), respectively. We also introduced variational parameters Am and Ri (Si 1m) to provide tight approximations to the mutual information . Updates for these parameters were presented in Eq. (20) and Eq. (29). Bishop, et al. (1998) have presentedempirical results using the mixture approximation . In experiments with random sigmoid belief networks, they showed that the mixture approximation provides a better approximation than simple factorized mean field bounds, and moreover that the approximation improves monotonically with the number of components in the mixture . An interesting direction for further research would involve the study of hybrids that combine the use of the mixture approximation with exact methods. Much as in the caseof the structured mean field approximation of
172
TOMMI S. JAAKKOLAAND MICHAELI. JORDAN
Sauland Jordan(1996), it shouldbe possibleto utilize approximationsthat identify core substructures that are treated exactly , while the interactions between
these substructures
are treated
via the mixture
approach . This
III
would provide a flexible model for dealing with a variety of forms of high order interactions , while remaining tractable computationally .
II
References
Bishop, C., Lawrence, N ., Jaakkola, T . S., & Jordan, M . I . Approximating posterior in belief networks using mixtures . In M . I . Jordan , M . J . Kearns , & S. A . III
distributions
Solla, Advances in Neural Information ProcessingSystems 10, MIT Press, Cambridge MA (1998) . Ghahramani , Z . & Jordan , M . I . Factorial
Hidden Markov models . In D . S. Touretzky ,
II
M . C. Mozer, & M . E. Hasselmo (Eds.) , Advances in Neural Information Processing Systems 8, MIT Press, Cambridge MA (1996) . Neal , R . Connectionist
learning
of belief networks , Artificial
Intelligence , 56: 71- 113
(1992) . Parisi , G. Statistical Field Theory. Addison-Wesley: Redwood City (1988) . Peterson , C . & Anderson , J . R . A mean field theory learning algorithm
for neural net -
works. Complex Systems 1:995- 1019 (1987) . Saul , L . K ., Jaakkola , T . S., & Jordan , M . I . Mean field theory for sigmoid belief networks . Journal of Artificial Intelligence Research, 4:61- 76 ( 1996) . Saul , L . K . & Jordan , M . I . Exploiting tractable substructures in intractable networks .
In D . S. Touretzky , M . C. Mozer, & M . E. Hasselmo (Eds.) , Advances in Neural Information Processing Systems 8, MIT Press, Cambridge MA (1996). Shwe
, M . A . , Middleton
, B . , Heckerman
P., & Cooper , G . F . Probabilistic
, D . E . , Henrion
, M . , Horvitz
diagnosis using a reformulation
, E . J . , Lehmann
, H .
of the INTERNIST
-
l / QMR knowledge base. Meth. Inform . Med. 30: 241-255 (1991).
A . Optimization of the smoothing distribution Let usfirst adoptthe followingnotation: 7rR ,Q(m, m') :=
Ls R(Slm )Qmf (Slm ')
(21)
II L Ri(Silm )Qi(Silm ') i Si H(Q IIRim)
H(m)
LSQmj (81m )logR(8Im ) L L Qi(8ilm )logRi(Silm ) i Si - Lam m logam
(22) (23)
(24) (25)
IMPROVINGTHE MEANFIELDAPPROXIMATION
173
wherewe haveevaluatedthe sumsoverS by makinguseof the assumption that R(Slm) factorizes.We denotethe marginalsof thesefactorizedforms by Ri(Siim). Eq. (11) now readsas: IA(m; S) = L amH(Q II Rim ) + H (m) m - ~ Am[~ Q(m') 7rR ,Q(m, m/)] + ~ amlogAm+ ~26)
Tooptimize thebound IA(m;S) with respect toR,weneed thefollowing derivatives : ~a-Rj (Q(IISjlk Ri~_ )) (27 ) ) l = 8k ,mQj Rj((Sjlk Sjlk ,Q n ~."Si ) --{}{}7rR R SJI .,Ikm )') = 8k,mQj (Sj Im')1 .#J. L .;Ri(SiIm)Qi(SiIm') .(28 J.((m Denoting the optimal
the product
in Eq . ( 28 ) by 7r~ ,Q ( m , m ' ) , we can now characterize
R via the consistency
equations
I ,X((m ; 81} = Q k 2j a Rj Sjlk Rj ( Sjlk }) - Ak [ ~ Note
that
the
second
term
does
:
Q (m , ) ~ R. ,Q ( k , m , ) Qj ( Sjlm ' ) ] = 0 ( 29 ) not
depend
on the
smoothing
Rj ( Sjlk ) and therefore these consistency equations are easily any specific marginal . The best fitting R is found by iteratively ing each of its not necessarily
marginal solved for optimiz -
marginals . We note finally that the iterative solutions normalized . This is of no consequence , however , since
information bound of Eq . ( 11 ) - once maximized invariant to the normalization of R .
over the A parameters
are the -
is
INTRODUCTION
TO
MONTE
CARLO
METHODS
D .J .C . MACKAY
Department of Physics , Cambridge University . Cavendish Laboratory , Madingley Road, Cambridge , CB3 OHE. United Kingdom .
ABSTRACT
This chapter describes a sequence of Monte Carlo methods : impor tance sampling , rejection sampling , the Metropolis method , and Gibbs sampling . For each method , we discuss whether the method is expected to be useful for high- dimensional problems such as arise in inference with graphical models . After the methods have been described , the terminology of Markov chain Monte Carlo methods is presented . The chapter
concludes
with
a discussion
of advanced methods , includ -
ing methods for reducing random walk behaviour . For details of Monte Carlo methods , theorems and proofs and a full
list of references, the reader is directed to Neal (1993) , Gilks, Richardson and Spiegelhalter (1996) , and Tanner (1996). 1. The problems The aims of Monte
to be solved Carlo methods
are to solve one or both of the following
problems .
Problem 1: to generatesamples{x(r)}~=l from a givenprobability distribution p (x ) .l Problem 2 : to estimate expectations of functions under this distribution , for example
==/>(x))=JdNx P(x)>(x).
(1)
1Please note that I will use the word "sample" in the following sense: a sample from a distribution P (x ) is a single realization x whose probability distribution is P (x ) . This contrasts with the alternative usage in statistics , where "sample" refers to a collection of realizations { x } . 175
176 The
D.J.C. MACKAY probability
might
distribution
be
a
arising
in
model
from
data
' s
that
P
distribution
modelling
-
parameters
x
is
given
an
N
sometimes
-
will
solved
{
x
,
(
r
}
=
l
can
to
which
example
,
call
first
target
.
density
conditional
,
distribution
posterior
real
also
the
a
the
data
with
the
the
will
or
observed
the
solve
give
we
physics
spaces
on
we
~
,
vector
discrete
then
)
)
some
concentrate
it
ples
x
for
dimensional
consider
We
(
statistical
probability
We
will
of
generally
components
a
assume
Xn
,
but
we
will
.
problem
(
second
sampling
)
problem
by
,
because
using
if
the
we
have
random
sam
-
estimator
< i >
.
=
i
L
< I > (
x
(
r
)
)
.
(
2
)
r
Clearly
.if. .
tion
of
< i >
the
< 1 >
will
vectors
is
< 1 >
.
{
Also
decrease
,
as
x
(
as
~
,
is
one
of
The
the
accuracy
of
of
x
to
We
,
it
may
be
find
given
1
.
,
We
will
can
can
< I >
If
The
we
can
(
at
x
)
is
goes
as
first
is
to
that
a
that
we
)
j
>
(
x
)
x
)
~
.
-
< 1
2
dozen
expecta
-
variance
of
.
(
Carlo
methods
(
t
the
the
< / > ,
Monte
So
then
,
estimate
he
regardless
(
sam
of
3
)
.
equation
space
independent
high
pled
the
2
.
)
)
is
To
be
dimensionality
samples
{
P
x
(
(
x
obtain
typically
)
*
,
(
x
)
r
)
}
suffice
(
x
)
(
why
-
a
wish
to
multiplicative
draw
samples
constant
,
;
that
P
" is
(
x
,
)
,
we
that
)
can
=
=
P
*
we
samples
do
we
a
x
dif
from
?
which
such
other
samples
.
HARD
within
cause
independent
easy
from
to
can
Obtaining
not
density
least
*
dimensionality
.
often
P
P
x
of
as
,
function
difficult
(
(
increases
of
of
methods
the
evaluate
general
variance
Carlo
P
in
the
P
FROM
,
a
is
P
R
.
that
evaluate
2
dNx
few
SAMPLING
evaluated
from
samples
dimensionality
Carlo
assume
be
generated
Monte
he
as
P
IS
are
of
J
however
Monte
WHY
=
the
t
that
distribution
.
1
satisfactorily
later
for
=
of
< I >
will
=
0 '
2
variance
estimate
ficulties
1
the
~
number
properties
of
,
}
important
independent
precise
)
where
0 '
This
r
the
not
from
not
know
(
x
)
/
Z
.
(
easily
P
solve
(
the
x
)
?
normalizing
z ==JdNXp*(x).
problem
There
1
are
two
?
Why
difficulties
is
4
)
it
.
constant
(5)
178
D.J.C. MACKAY
N ? Let us concentrate on the initial
cost of evaluating Z . To compute
Z (equation (7)) we have to visit every point in the space. In figure Ib there are 50 uniformly spaced points in one dimension . If our system had N dimensions , N = 1000 say, then the corresponding number of points would be 501000, an unimaginable number of evaluations of P * . Even if each component xn only took two discrete values, the number of evaluations of P * would be 21000, a number that is still horribly huge, equal to the fourth power of the number of particles
in the universe .
One system with 21000states is a collection of 1000 spins , for example ,
a 30 X 30 fragment of an Ising model (or 'Boltzmann machine' or 'Markov field') (Yeomans 1992) whose probability distribution is proportional to P* (x ) -= exp [- ,sE(x )]
(9)
where Xn E { :i:1} and
E(x)==- [~LJmnXmXn +LHnxn ]. (10 ) m
, n
n
The energy function E (x ) is readily evaluated for any x . But if we wish to evaluate this function at all states x , the computer 21000 function evaluations .
time required
would be
The Ising model is a simple model which has been around for a long time ,
but the task of generating samplesfrom the distribution P (x ) = P* (x )jZ is still an active research area as evidenced by the work of Propp and Wilson
(1996). 1 .2 .
UNIFORM
SAMPLING
Having agreed that we cannot visit every location x in the state space, we
might consider trying to solve the second problem (estimating the expec-
tation of a function <J>(x)) by drawingrandomsamples{x(r)}~=l uniformly from the state spaceand evaluating P* (x ) at those points. Then we could introduce ZR , defined by
R ZR =rL=lp*(x(r)), and estimate < I>==JdN x< />(x)P(x)by p * ( ( r ) < i>==rLR < / > ( x ( r ) ) --;=l R.
(11)
(12)
MONTECARLOMETHODS Is
anything
and
wrong
with
P * ( x ) . Let
and
us
concentrate
is
often
this
on
that
the
set
Shannon
T
,
in
whose
- Gibbs
? Well ( x )
nature
concentrated
typical
strategy
assume
of
a
given
the
depends
benign
region
is
of
, it
a
P * (x ) .
small
volume
entropy
is
by
179
on
the
, smoothly
A
high
- dimensional
of
the
state
ITI
~
probability
2H
functions
<j >( x )
varying
distribution
space
(X ) ,
function
known
where
distribution
P
H
as ( X
)
its
is
the
(x ) ,
1
H
If
almost
is
a
all
the
benign
by
sampling the
the
is
the
2N
of
of
a
set
,
what
Ising H
is
H
model
max
=
N
sampling
?
the
once
is
. So
set
of
bits
so be
a
not
is
melts
temperature
set
of
is
of
of
for particles
the
study
, if
unlikely
1 .3 .
1000
in
the Ising
the to
to
Under
these for
phase Ising
model of
, <1> .
a is
at
disordered
But
N
required
high are
which
phase
roughly
samples
to
uniform
interesting
temperature to
an
tends
estimating more
critical
of
entropy conditions
Considerably
the
number
the
/ 2
.
bits
.
simply
)
to
the At
this
For
this
hit
the
order
is
about
universe
.
models
useful
~
1015
2N
- N / 2 =
, This
Thus
of
distribution be
chance
not
( 15
the
sampling
size
( x ) is
/ 2
roughly
uniform
modest P
is
2N
.
And
actually
square is
in
utterly
most
uniform
of
high
the
)
number
useless
for
- dimensional
, uniform
sampling
.
OVERVIEW
Having bution
==
of
problems is
N
a
required
distribution
and
.
as
Rmin
which
Let
space
has
samples
probability
technique
an
?
state
( 14
the
1 .
ordered
the
of
we hit
required the
sample
number
if to
- H .
,
interest
an
entropy
once
2N
order
such
distribution
typical
of
each
of
likely
are
size
uniform
of are
samples total
. SO
. So
estimate we
<jJ ( x )
principally
set
that
distribution
great
from
the
probability
~
satisfactory
of
temperatures model
2H
and
be
typical
good
many
set
will
)
order
uniform
Rmin
well are
intermediate
a
a
The
temperatures
to ,
may
temperatures
Ising
high
.
typical (x )
the
. The
size
set
the
large
, how
has
typical thus
in
giving
again
( 13
<jJ ( x ) P on
of
.
in
fdNx
takes
model
typical
At
located
~
sufficiently
times
Ising
in
tends
is =
Rmin
So
( x ) log2
chance R
of
the the
P
4> ( x )
samples
falling
typical
of
that
number of
, and
/ 2N
the
a
~
value
stand of
case
states
2H
hit
only
set
take
=
mass
the
values
number
typical
us
,
the
will
make
)
probability
function
determined
( X
established P
(x )
that
== P * ( x ) / Z
drawing is
difficult
samples even
from if
a
P * ( x ) is
high easy
- dimensional to
evaluate
distri , we
will
180
D.J.C. MACKAY
Q * (x ),' I I I " ~ ,
x
Figure 2. Functions involved in importance sampling. We wish to estimate the expectation of (x ) under P (x ) cx : P *(x ) . We can generate samples from the simpler distribution Q(x ) cx : Q*(x ). We can evaluate Q* and P * at any point .
now study a sequenceof Monte Carlo methods: importance sampling , rejection sampling , the Metropolis method , and Gibbs sampling . 2. 1mportance sam pIing Importance sampling is not a method for generating samples from P (x ) (problem 1) ; it is just a method for estimating the expectation of a func tion (x ) (problem 2) . It can be viewed as a generalization of the uniform sampling method . For illustrative purposes, let us imagine that the target distribution is a one- dimensional density P (x ) . It is assumed that we are able to evaluate this density , at least to within a multiplicative constant ; thus we can evaluate a function P * (x ) such that
P(x) = P* (x)/ Z.
(16)
But P (x ) is too complicated a function for us to be able to sample from it directly . We now assume that we have a simpler density Q (x ) which we can evaluate to within a multiplicative constant (that is, we can evaluate Q * (x ) , where Q (x ) = Q * (x ) / ZQ) , and from which we can generate samples. An example of the functions P * , Q * and is shown in figure 2. We call Q the sampler density . In importance sampling , we generate R samples { x (r)} ~=l from Q (x ) . If these points were samples from P (x ) then we could estimate ~ by equation (2) . But when we generate samples from Q , values of x where Q (x ) is greater than P (x ) will be over- represented in this estimator , and points
MONTECARLOMETHODS - 6 .2
- 6 .2
- 6 .4
- 6 .4
- 6 .6
- 6 .6
-6 .8
- 6 .8
-7
-7
- 7 .2
181
- 7 .2 10
1 00
1 000
1 0000
1 00000
1 000000
10
(a)
1 00
1 000
1 0000
1 00000
1 000000
(b)
Figure 9. Importance sampling in action: a) using a Gaussian sampler density; b) using a Cauchy sampler density. Horizontal axis showsnumber of samples on a log scale. Vertical " axis shows the estimate cI>. The horizontal line indicates the true value of
where Q (x) is less than P (x) will be under- represented. To take into account the fact that we have sampled from the wrong distribution , we introduce 'weights' - p * (x(r))
Wr= (J;{; (;)) which we use to adjust the ' importance
(17)
' of each point in our estimator
thus :
= l : r Wr(x(r)) . -
(18)
l : r Wr
If Q (x) is non- zero for all x where P (x) is non- zero, it can be proved that , . .
the estimator converges to <1>, the mean value of 4>(x ) , as R increases. A practical difficulty estimate
how
reliable
the
with importance sampling is that it is hard to "
estimator
is . The
"
variance
of is hard
to
estimate, becausethe empirical variancesof the quantities wr and wr4>(x(r)) are not necessarily a good guide to the true variances of the numerator and
denominator in equation (18). If the proposal density Q(x) is small in a region where 14 >(x) P* (x) I is large then it is quite possible, even after many points x (r ) have been generated , that none of them will have fallen in that region . This leads to an estimate of that is drastically wrong , and no indication in the empirical variance that the true variance of the estimator is large . , . .
MONTECARLOMETHODS
183
will fall outside Rp and will have weight zero.] Then we know that most samples from Q will have a value of Q that lies in the range
(217ra2 )N /2exp -/2-:N (-2 N:f:-V :").
(2)2
Thus the weights Wr = P*/ Q will typically have values in the range
(27r0 '2)N/2exp (~2:i:~2) .
(23)
So if we draw a hundred samples, what will the typical range of weights be? We can roughly estimate the ratio of the largest weight to the median weight by doubling the standard deviation in equation (23) . The largest weight and the median weight will typically be in the ratio :
wmax r ~ =exp (J2N ).
In
N
==
1000
samples
is
Thus very
likely
to
an
importance
likely
be
utterly
conclusion
,
In from
two
the
dimensions
set
the , still
of
P ,
Rejection
We
assume
again
a
function have
that
we
( within
a
multiplicative
by
a few
, we
clearly
this
may
, even
of
a
we
obtain
in
order
long
are
a typical exp
samples
time
unless in
to
set
.
vary
Q
by
, although
suffers
that
the
.
will
weights often
samples likely
weight
problem
huge
obtain
hundred
median
dimensions
to
take if
the
with
high
need
one
- dimensional
samples in
after
than
a high
samples
points of
from
. We
one
- dimensional for
us
is
lie a
good
typical large
in
set
,
factors
similar
to
,
each
( . . ; N) .
be
proposal
factor further
density
to
a simpler
ZQ
able
before
that
we
== P * ( x ) / Z
sample
density
, as
assume
P (x ) to
from
it
Q ( x ) which ) , and
know
which the
we we
value
of
that
is
directly can can
too . We
evaluate generate
a constant
c
that for
A
greater
for
those
factors
a
assume
such
with
by
times
weight
sampling
complicated
samples
and
probabilities
largest
sampling
P . Second
differ
3 .
1019
dominated
associated
because other
, the
estimate
importance . First
to
weights
roughly
sampling
approximation the
be
difficulties
typical
therefore
(24)
schematic We
proposal
picture
generate density
of
two
the
random
Q ( x ) . We
all
x , cQ * ( x ) >
two
functions numbers
then
evaluate
is . The
P * (x ) .
( 25 )
shown
in
first
, x , is
cQ * ( x ) and
figure
4a .
generated
generate
from a uniformly
the
MONTECARLOMETHODS
185
-
-4
Figure
c
5
such
.
A
that
As
a
tions
(
x
case
P
)
~
P
(
other
.
no
[
aQ
c
if
the
/
(
(
N
/
and
,
N
the
rejection
the
acceptance
volume
=
=
1000
under
c
grows
)
,
,
is
a
2
)
N
/
2
(
27r0
'
Q ~
)
N
/
2
=
create
(
a
x
1
.
scaled
up
by
a
01
,
/
we
.
The
=
the
our
the
. ]
N
c
c
?
0 ' ~
=
=
exp
(
The
Q
are
technique
x
)
,
useful
for
)
at
single
is
and
similar
of
the
10
density
P
)
~
20
,
000
.
What
:
curve
P
(
x
)
implies
is
1
/
20
,
.
Q
x
(
)
x
.
)
In
that
large
has
complex
this
property
to
the
In
the
general
,
.
method
only
and
)
since
that
000
26
will
immediate
the
for
generating
sampling
(
is
(
is
this
c
origin
one
-
samples
dimensional
from
high
.
rejection
to
is
value
.
normalized
study
a
the
there
)
under
N
whilst
Q
x
,
set
answer
volume
and
(
to
than
case
is
Q
need
two
larger
the
what
of
we
log
case
,
,
from
these
percent
not
So
with
samples
that
one
is
dimensionality
therefore
(
is
density
find
P
For
P
x
exp
of
that
c
aQ
obtain
-
one
method
)
factor
distribu
assume
this
all
P
=
of
fact
1
for
value
practical
sampling
Q
)
from
to
us
,
(
ratio
the
Metropolis
Importance
density
x
Gaussian
Let
say
bound
'
distributions
The
(
samples
if
?
-
27r0
the
,
not
1000
this
be
sampling
problems
to
x
Q
dimensional
ape
-
(
with
dimensional
.
(
exponentially
Rejection
4
is
will
4
sampling
P
=
=
-
because
upper
fur
N
value
bounds
N
~
be
cQ
rate
3
generating
ap
to
-
and
rate
acceptance
of
Imagine
in
-
cQ
rate
2
Gaussian
is
close
upper
for
1
rejection
than
C
With
.
using
larger
so
pair
)
deviation
cQ
2
0
broader
a
is
)
-1
slightly
5
are
be
that
~
a
figure
aQ
dimensionality
27ra
and
standard
must
such
)
consider
deviations
ap
x
-2
.
zero
whose
standard
)
,
mean
deviation
the
(
x
study
with
standard
1
Gaussian
cQ
-3
work
well
problems
.
if
the
it
proposal
is
difficult
-
186
D.J.C. MACKAY
r"\Q(x;X(l))
. . . . . .
. . . I . .
.
.
.
.
.
.
.
.
.
I
I
.
.
.
.
.
.
.
,
.
,
.
.
.
.
.
'
. "
.
.
"
.
.
'
.
-
-
, -
, -
-
X(l )
x
---.
Figure 6. Metropolis method in one dimension. The proposal distribution Q(x ' ; x ) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.
The
Metropolis
which
algorithm
depends
the
on
simplest
the
case
current
is
x
not
of
a
Q
( x
.
for
,
x
whether
'
to
we
( t ) )
in
new
a
~
1
then
the
Otherwise
If
the
step
set
x
is
( t +
l
)
sampling
= =
,
samples
to
,
{
be
label
current
is
a
point
of
( t +
then
( x
' ;
a
in
in
( x
on
density
)
.
An
shows
.
T
' )
It
example
the
density
)
for
any
Q
( x
x
' ;
x
.
A
( t ) )
tentative
.
To
decide
quantity
' )
l
)
= =
x
' .
,
a
Markov
( t ) )
the
no
.
factor
.
.
.
the
.
( x
we
rejection
the
list
current
It
R
to
( t )
label
of
state
is
important
to
T
are
;
to
x
unity
' )
/
=
1
note
Q
( x
as
a
and
x
to
( t ) )
. T
that
.
compute
If
the
Gaussian
the
.
.
able
' ;
.
independent
correlated
be
,
that
t
produce
such
points
superscript
need
is
)
then
in
on
the
samples
we
Q
,
:
influence
not
The
and
1
and
chain
(
.
= =
does
.
rejected
causes
density
latter
is
sampling
time
,
a
step
have
distribution
( x
probability
the
r
)
.
rejection
iterations
P
)
28
If
another
a
( t
rejection
and
Here
P
/
x
.
probability
( x
( x
with
from
symmetrical
,
Q
X
accepted
distribution
P
)
superscript
acceptance
simple
( t ) ;
accepted
points
the
of
ratios
density
( t
is
.
states
target
the
probability
the
of
the
compute
( x
x
from
sequence
*
the
( x
Q
might
.
P
Q
( t ) )
fixed
P
figure
( 2 )
x
centred
any
to
density
discarded
used
simulation
from
posal
list
have
samples
the
*
collected
the
I
' )
difference
are
we
onto
:
Metropolis
To
on
that
written
samples
the
}
( X
be
this
x
compute
is
set
the
points
)
independent
to
a
( r
*
state
we
Note
rejected
x
Notation
are
,
( t ) .
can
' ;
Gaussian
similar
evaluate
we
state
new
accepted
x
)
a
density
( x
( 27 p
new
the
x
proposal
,
proposal
Q
a5
;
and
a
-
If
' ;
6
)
can
P a
( l
the
state
of
density
all
figure
we
from
( x
at
x
that
the
The
such
look
states
assume
use
.
Q
to
shown
generated
accept
( t )
density
x
different
is
x
distribution
' ;
is
two
before
state
simple
( x
density
( t ) )
As
new
Q
makes
state
proposal
for
proposal
x
a
The
necessary
' ;
current
be
( t )
instead
the
centred
Metropolis
pro
-
Q (
MONTECARLOMETHODS
187
Figure 7. Metropolis method in two dimensions, showing a traditional proposal density that has a sufficiently small step size ~ that the acceptance frequency will be about 0.5.
method simply involves comparing the value of the target density at the two points . The general algorithm for asymmetric Q , given above, is often called the Metropolis - Hastings algorithm . It can be shown that for any positive Q (that is, any Q such that Q (x ' ; x ) > 0 for all x , x') , as t - + 00, the probability distribution of x (t) tends to P (x ) == P * (x ) / Z . [This statement should not be seen as implying that Q has to assign positive probability to every point x ' - we will discuss examples later where Q (x' ; x ) == 0 for some x , x' ; notice also that we have said nothing about how rapidly the convergence to P (x ) takes place.] The Metropolis method is an example of a ' Markov chain Monte Carlo ' method (abbreviated MCMC ) . In contrast to rejection sampling where the accepted points { x (r)} are independent samples from the desired distribution , Markov chain Monte Carlo methods involve a Markov process in which a sequence of states { x (t)} is generated , each sample x (t) having a probability distribution that depends on the previous value, x (t - 1). Since successive samples are correlated with each other , the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from P . Just as it was difficult to estimate the variance of an importance sampling estimator , so it is difficult to assess whether a Markov chain Monte Carlo method has 'converged ' , and to quantify how long one has to wait to obtain samples that are effectively independent samples from P .
188
D.J.C. MACKAY
4.1. DEMONSTRATION OF THE METROPOLIS METHOD The Metropolis method is widely used for high- dimensional problems . Many implementations of the Metropolis method employ a proposal distribution with a length scale f that is short relative to the length scale L of the prob able region (figure 7) . A reason for choosing a small length scale is that for most high - dimensional problems , a large random step from a typical point (that is, a sample from P (x )) is very likely to end in a state which hM very low probability ; such steps are unlikely to be accepted. If f is large , movement around the state space will only occur when a transition to a state which has very low probability is actually accepted, or when a large random step chances to land in another probable state . So the rate of progress will be slow , unless small steps are used. The disadvantage of small steps, on the other hand , is that the Metropo lis method will explore the probability distribution by a random walk , and random walks take a long time to get anywhere . Consider a one- dimensional random walk , for example , on each step of which the state moves randomly to the left or to the right with equal probability . After T steps of size f , the state is only likely to have moved a distance about . ; Tf . Recall that the first aim of Monte Carlo sampling is to generate a number of inde pendent samples from the given distribution (a dozen, say) . If the largest length scale of the state space is L , then we have to simulate a random walk Metropolis method for a time T ~ (L / f ) 2 before we can expect to get a sam pIe that is roughly independent of the initial condition - and that 's assuming that every step is accepted: if only a fraction / of the steps are accepted on average, then this time is increased by a factor 1/ / . Rule of thumb : lower bound on number of iterations ora Metropo lis method . If the largest length scale of the space of probable states is L , a Metropolis method whose proposal distribution generates a random walk with step size f must be run for at least T ~ (L / f )2 iterations to obtain an independent sample . This rule of thumb only gives a lower bound ; the situation may be much worse, if , for example , the probability distribution consists of several islands of high probability separated by regions of low probability . To illustrate how slow the exploration of a state space by random walk is, figure 8 shows a simulation of a Metropolis algorithm for generating samples from the distribution :
1 21 x E { a, 1, 2 . . . , 20} P(x) = { 0 otherwise
(29)
MONTECARLOMETHODS
189
~~~III ~ . . - III.-~
(a)
I111111111111111111111 1111 !!1
(b) Metropolis
=1!!llliiilll -
Figure 8. Metropolis method for a toy problem . (a) The state sequence for t = 1 . . . 600 . Horizontal direction = states from 0 to 20; vertical direction = time from 1 to 600 ; the cross bars mark time intervals of duration 50. (b ) Histogram of occupancy of the states after 100, 400 and 1200 iterations . (c) For comparison , histograms resulting when successive points are drawn independently from the target distribution .
D.J.C. MACKAY
190 The proposal distribution is
X '
Q(x';x) = { ~ Because
the
when
0
and
state
end
state
end
x
states
is
in
4 .2 .
METROPOLIS
The
rule
{ O , 1 , 2 , . . . 20 steps
of
thumb
is of
{ un
,
}
in
the
such
that
each
variable
dom
walk
the
directions
with
step
least
need
T T
~
Now comes
( L
how ,
fall
~
from
but sharply
/ /
big if
f
the
) 2
;
iterations
to
.
random of
same
a
lower
step
the sizes
,
case
of
equal
and us
,
to all
umin
be
to will
an
section
a
largest
this
is
and
assumption
, a
by
where
independent
ran
-
effectively
controlled ,
-
adjusted
executing
generate be
loss dis
deviations
f
,
-
spherical
Without
the
Under others
distri
separable
that
taken
previous
obtain
standard
1 . the
time
the
has
to
target
. a
the
applies
is
to
assume
close
a
is
it
on
also
that
that
Let
The
bound
method
distribution in
abolish
the
distribution
of
as
other
iterations
to
using
giving
umax
is
target
just
to
the
we
sample
the
needed ,
here
we
f ) 2 . can
is .
Let
.
required
independent
try
simplest
and
.
about
amax
( amax
,
independently sizes
lengthscale
}
.
both
exploration
direction
probability
evolves
,
target
deviations
acceptance
samples
largest
n
standard
xn
each
{ xn
,
an
visit
hundred
distribution
the
axes
to
an
into
!
proposal
that
the
it
the
in
assume
with
these
independent
will
can
take
effectively
Metropolis
Consider
deviation
we
step
with
systematic
above
a
first
are
four to
A
above reach
DIMENSIONS
walk
and
Thus
states
to
it
is
end thumb
encounter
about
hundred
discussed
.
iterations
only
evolution
the of
The
400
around
HIGH
random
,
different
of
a
Gaussian
aligned
smallest
at
a
standard
generality
tribution
we of
.
four
.
important
get
of
problems
that
Gaussian
could
IN
that
dimensional
bution
}
instead
iterations
is
its of
iterations
does
.
methods
and
rule
100
occur
.
one
first
for it
10
long
the
iteration
21
the
~
How
will
=
=
steps
about
that
'
example
indeed
Carlo
rejections
reach
T
simulating
shows Monte
10
.
540th by
(30)
x
Xo
present
And
METHOD
of
higher
the
example
twenty
number
on
or
to
is
predicts .
generated
space
about
thumb
1
1
,
-
time
iteration
space
place
only
simple
state
of state
behaviour
toy
178th
=
take
a
the
:!:
state
it
distance
in
X
uniform '
the
take
the
rule
takes
This walk
the
is x
does
confirmed on
The
are
Since
)
to in
long
typically
whole
state
samples
?
( x
state
started
How
will
occurs ?
.
20
This
the
end
of
= it
.
traverse
in
8a
that
end
the
was
figure
predicts
P
takes
simulation in
=
distribution
proposal
The shown x
target
the
=
otherwise
too It
be big seems
? -
The
bigger bigger
plausible
it than that
is
,
amin the
the
smaller -
optimal
then
this the must
number
T
acceptance be
be
-
rate similar
to
MONTECARLOMETHODS
X2
191
X2
P(x) ;:.::.::-. -
(a)
Xl
(b) X2
X2
P (x2IXI)
(c)
;:
Xl
(d)
.............. t))
tX X ( + 2 ) (t+ l)X (t) Xl
Xl
Figure 9. Gibbs sampling. (a) The joint density P (x ) from which samples are required. (b) Starting from a state x (t), Xl is sampled from the conditional density P (xllx ~t)). (c) A sample is then made from the conditional density P (x2IxI ) . (d) A couple of iterations of Gibbs sampling.
umin. Strictly , this may not be true ; in special cases where the second small est an is significantly greater than amin, the optimal f may be closer to that second smallest an . But our rough conclusion is this : where simple spherical proposal distributions are used, we will need at least T ~ (am.ax/ umin)2 iterations to obtain an independent sample , where umax and amln are the longest and shortest lengthscales of the target distribution . This is good news and bad news. It is good news because, unlike the cases of rejection sampling and importance sampling , there is no cat &Strophic dependence on the dimensionality N . But it is bad news in that all the same, this quadratic dependence on the lengthscale ratio may force us to make very lengthy simulations . Fortunately , there are methods for suppressing random walks in Monte Carlo simulations , which we will discuss later .
192
D.J.C. MACKAY
5. Gibbs sampling We introduced importance sampling , rejection sampling and the Metropo lis method using one- dimensional examples . Gibbs sampling , also known as the heat bath method, is a method for sampling from distributions over at least two dimensions . It can be viewed as a Metropolis method in which the proposal distri bu tion Q is defined in terms of the conditional distri butions of the joint distribution P (x ) . It is assumed that whilst P (x ) is too complex to draw samples from directly , its conditional distributions P (Xi I{ Xj } jli ) are tractable to work with . For many graphical models (but not all ) these one- dimensional conditional distributions are straightforward to sample from . Conditional distributions that are not of standard form may still be sampled from by adaptive rejection sampling if the conditional distribution satisfies certain convexity properties (Gilks and Wild 1992) . Gibbs sampling is illustrated for a cage with two variables (Xl , X2) = x in figure 9. On each iteration , we start from the current state x (t), and Xl is sampled from the conditional density P (xllx2 ) ' with X2 fixed to x ~t). A sample x2 is then made from the conditional density P (x2IxI ) , using the new value of Xl . This brings us to the new state X(t+l ), and completes the iteration . In the general case of a system with I( variables , a single iteration involves sampling one parameter at a time :
X(t+l) 1 X(t+l) 2 X3 (t+l)
r.....I
(t),X3 (t),...XK (t)} P(XIIX2 I (t+l),X3 (t),...XK (t)} P(X2IXl IXl (t+l),X2 (t+l)'...XK (t)},etc P(X3 .
erty that every proposal is always accepted . Because Gibbs sampling is a Metropolis method , the probability distribution of x (t) tends to P (x ) as t - + 00, as long as P (x ) does not have pathological properties .
5.1. GIBBSSAMPLING IN HIGHDIMENSIONS Gibbs sampling suffers from the same defect as simple Metropolis algorithms - the state space is explored by a random walk , unless a fortuitous parameterization has been chosen which makes the probability distribution P (x ) separable . If , say, two variables x 1 and X2 are strongly correlated , having marginal densities of width L and conditional densities of width f , then it will take at least about (L / f ) 2 iterations to generate an indepen dent sample from the target density . However Gibbs sampling involves no adjustable parameters , so it is an attractive strategy when one wants to get
MONTECARLOMETHODS
193
a model running quickly . An excellent software package, BUGS, is available which makes it easy to set up almost arbitrary probabilistic models and simulate them by Gibbs sampling (Thomas , Spiegelhalter and Gilks 1992) . 6
.
Terminology
We
for
now
spend
method
A
p
a
and
( O )
( x
)
Mar
few
a
transition
is
given
.
construct
The
A
( t +
l
)
( X
.
distribution
7r
chain
is
must
such
( x
)
is
ergodic
the
MetropoliR
initial
( x
' ;
state
probability
x
)
distribution
.
at
the
( t
+
l
)
th
iteration
of
the
T
( x
' ;
x
) p
( t ) ( x
)
.
( 34
)
:
( x
)
is
the
invariant
distribution
of
the
often
( t )
)
convenient
all
density
-
t
= =
f
dNx
7r
( x
)
as
( x
' )
)
.
of
( x
that
t
-
T
which
( x
T
,
construct
of
P
' )
distribution
ergodic
to
B
desired
( X
be
( x
invariant
t
00
by
' ;
x
is
,
,
for
)
7r
( x
)
any
or
( x
' ;
x
)
if
.
p
mixing
T
( O )
( x
)
.
concatenating
( 35
)
( 36
)
( 37
)
simple
satisfy
=
JdNX
B
These
base
( x
' ;
x
)
P
( x
transitions
)
,
need
not
be
individually
.
Many
erty
the
that
an
also
transitions
the
an
JdNX
P
P
for
= =
distribution
p
base
which
.
The
It
by
T
' )
chain
7r
2
specified
of
on
by
the
desired
chain
theory
methods
.
probability
p
1
based
distribution
chain
We
be
Carlo
the
are
can
probability
kov
Monte
sketching
sampling
chain
and
chain
moments
Gibbs
Markov
The
Markov
useful
transition
probabilities
satisfy
the
detailed
balance
prop
-
:
T
( x
' ;
x
)
P
( x
)
=
T
( x
;
x
' )
P
( x
' )
,
for
all
x
and
x
' .
( 38
)
This equation says that if we pick a state from the target density P and make a transition under T to another state , it is just as likely that we will pick x and go from x to x ' as it is that we will pick x ' and go from x ' to x . Markov chains that satisfy detailed balance are also called reversible Markov chains . The reason why the detailed balance property is of interest is that detailed balance implies invariance of the distribution P (x ) under
194
D.J.C. MACKAY
the Markov chain T (the proof of this is left as an exercise for the reader ) . Proving that detailed balance holds is often a key step when proving that a Markov chain Monte Carlo simulation will converge to the desired distri bution . The Metropolis method and Gibbs sampling method both satisfy detailed balance , for example . Detailed balance is not an essential condi tion , however , and we will see later that irreversible Markov chains can be useful in practice .
7. Practicalities Can we predict how long a Markov chain Monte Carlo simulation will take to equilibrate ? By considering the random walks involved in a Markov chain Monte Carlo simulation we can obtain simple lower bounds on the time required for convergence. But predicting this time more precisely is a difficult problem , and most of the theoretical results are of little practical use. Can we diagnose or detect convergence in a running simulation ? This is also a difficult problem . There are a few practical tools available , but none of them is perfect (Cowles and Carlin 1996) . Can we speed up the convergence time and time between inde pendent samples of a Markov chain Monte Carlo method ? Here, there is good news.
7.1. SPEEDINGUP MONTECARLOMETHODS 7 .1 .1 . The method
Reducing
Monte
applicable
information For
random
hybrid
to many
to reduce systems
walk Carlo
behaviour
state
random
Metropolis
reviewed
continuous
, the
in
method
walk
in
spaces
which
behaviour
probability
P
(x )
methods
Neal
( 1993 makes
) is use
a
Metropolis of
gradient
. can
be
written
in
the
form
e - E (X ) P
(x )
==
( 39
)
Z
where not only E (x ), but also its gradient with respect to x can be readily evaluated . It seems wasteful to use a simple random - walk Metropolis method when this gradient is available - the gradient indicates which di rection one should go in to find states with higher probability ! In the hybrid Monte Carlo method , the state space x is augmented by momentum variables p , and there is an alternation of two types of proposal . The first proposal randomizes the momentum variable , leaving the state x unchanged . The second proposal changes both x and p using simulated Hamiltonian dynamics as defined by the Hamiltonian
H (x , p) = E (x ) + K (p) ,
(40)
MONTECARLOMETHODS g = gradE ( x ) E = findE ( x )
.
# set gradient # set objective
, .
,
for 1 = 1: L p = randn ( size (x) ) H = p' * p / 2 + E ;
195
using initial x function too
# loop L times # initial momentumis Normal(O, l ) # evaluate H(x ,p)
xnew = x gnew = g ; for tau = 1 : Tau
# make Tau ' leapfrog ' steps
p = p - epsilon * gnew / 2 ; # make half - step in xnew = xnew + epsilon * p ; # make step in x
p
gnew = gradE ( xnew ) ; # find new gradient p = p - epsilon * gnew/ 2 ; # makehalf - step in p endfor # find new value of H Enew = findE ( xnew ) ; Hnew = p ' * p / 2 + Enew ; dH = Hnew - H ; # Decide whether to accept if ( dH < 0 ) accept elseif ( rand ( ) < exp ( - dH) ) accept else accept endif ( accept ) g = gnew ; endif endfor
= 1 ; = 1 ; = 0 ;
if
Figure 10.
where als
used
to
This
density desired
E = Enew ;
Octave source code for the hybrid Monte Carlo method .
K ( p ) is a ' kinetic
are
PH ( x , p ) =
the
x = xnew ;
create
~ZH
' such
exp [ - H ( x , p ) ] =
is separable distribution
energy
( asymptotically
, so it is clear exp [ - E ( x ) ] jZ
as [
p ) =
) samples
pTp / 2 . These from
the
joint
two
propos
ZI H exp [ - E ( x ) ] exp [ - K ( p ) ] . that
the
marginal
. So , simply
distribution
discarding
-
density
the
( 41 ) of x is mom
en -
196
D.J.C. MACKAY
1
1
' " ", I'
(a)
" 1' 1' ,
0.5
' "
, ,
1' ' "
,
,. I'
"
1'
(b) 0.5
1' ..
", '
:
.. ,
' "
' " I'
1'
I ' ' "
0
I '
0 - 0 .5
-0.5
.
-1 -1
- 1 .5 -1
- 0 .5
0
0 .5
1
- 1 .5
1
(c)
0.5
-1
- 0 .5
0
0 .5
1
; ' ;' ;'
.
(d)
1
;' ;' ; ' ;' "
0 .5
:
of " "
.
/
: /
0
"
;'
"
",,; ,,;, ; ,
" "
/
"
... "
- 0 .5
.
-1 -1
- 0 .5
0
0 .5
1
Figure 11. (a,b) Hybrid Monte Carlo used to generate samples from a bivariate Gaussian with correlation p = 0.998. (c,d) Random- walk Metropolis method for comparison. (a) Starting from the state indicated by the arrow, the continuous line represents two successivetrajectories generated by the Hamiltonian dynamics. The squares show the endpoints of these two trajectories . Each trajectory consists of Tau = 19 'leapfrog' steps with epsilon = 0.055. After each trajectory , the momentum is randomized. Here, both trajectories are accepted; the errors in the Hamiltonian were + 0.016 and - 0.06 respectively . (b) The second figure shows how a sequenceof four trajectories converges from an initial condition , indicated by the arrow, that is not close to the typical set of the target distribution . The trajectory parameters Tau and epsilon were randomized for each trajectory using uniform distributions with means 19 and 0.055 respectively. The first trajectory takes us to a new state, (- 1.5, - 0.5) , similar in energy to the first state. The second trajectory happens to end in a state nearer the bottom of the energy landscape. Here, since the potential energy E is smaller, the kinetic energy ]( = p2/ 2 is necessarily larger than it was at the start . When the momentum is randomized for the third trajectory , its magnitude becomes much smaller. After the fourth trajectory has been simulated, the state appears to have become typical of the target density. (c) A random- walk Metropolis method using a Gaussian proposal density with radius such that the acceptance rate was 58% in this simulation . The number of proposals was 38 so the total amount of computer time used was similar to that in (a). The distance moved is small because of random walk behaviour. (d) A random- walk Metropolis method given a similar amount of computer time to (b).
MONTECARLOMETHODS
197
turn variables , we will obtain a sequence of samples { x (t)} which asymptotically come from P (x ) e The first proposal draws a new momentum from the Gaussian density exp [- K (p )]/ ZKe During the second, dynamical proposal , the momentum variable determines where the state x goes, and the gradient of E (x ) determines how the momentum p changes, in accordance with the equations
x = p
(42)
i> = - ~~~ax ~~~l .
(43)
Becauseof the persistentmotion of x in the direction of the momentum p, during eachdynamicalproposal, the state of the systemtendsto move a distancethat goeslinearly with the computertime, rather than as the square root. If the simulation of the Hamiltoniandynamicsis numericallyperfect then the proposalsareacceptedeverytime, because the total energyH (x , p) is a constantof the motion and so a in equation(27) is equalto one. If the simulationis imperfect, becauseof finite stepsizesfor example, then some of the dynamicalproposalswill be rejected. The rejectionrule makesuseof the changein H (x , p), which is zeroif the simulationis perfect. The occasional rejectionsensurethat asymptotically , we obtain samples(x (t), p(t)) from the requiredjoint densityPH(X, p). The sourcecode in figure 10 describesa hybrid Monte Carlo method whichusesthe 'leapfrog' algorithmto simulatethe dynamicson the function findE (x) , whosegradient is found by the function gradE(x) . Figure 11 showsthis algorithm generatingsamplesfrom a bivariateGaussianwhose energyfunction is E (x) == ~XTAx with A=
- 250 249.25 .75 - 249 250.75 .25 ] .
(44)
7.1.2. Overrelaxation The method of 'overrelaxation ' is a similar method for reducing random walk behaviour in Gibbs sampling . Overrelaxation was originally introduced for systems in which all the conditional distributions are Gaussian . (There are joint distributions that are not Gaussian whose conditional distributions are all Gaussian , for example , P (x , y) = exp (- x2y2)jZ .) In ordinary Gibbs sampling , one draws the new value x~t+l ) of the cur rent variable Xi from its conditional distribution , ignoring the old value x ~t). This leads to lengthy random walks in cages where the variables are strongly correlated , as illustrated in the left hand panel of figure 12.
198
D.J.C. MACKAY Gibbs sampling
Overrelaxation
1
(a)
1
-0.5 -1
-1
(b)
-1
Figure 12. Overrelaxation contrasted with Gibbs sampling for a bivariate Gaussian with correlation p = 0.998. (a) The state sequencefor 40 iterations , each iteration involving one update of both variables. The overrelaxation method had Q ' = - 0.98. (This excessively large value is chosen to make it easy to see how the overrelaxation method reduces random walk behaviour.) The dotted line shows the contour xT}::- lX = 1. (b) Detail of (a), showing the two steps making up each iteration . (After Neal ( 1995).)
In from
Adler a
's
( 1981
Gaussian
tribution
.
current
value
If
)
overrelaxation
that the of
is
is
v
r-..J Normal
to
the
,
=
J. L +
( O , 1 ) and
Adler
a:
is
a
Xi
method
J. L)
parameter
+
instead
side of
's
a: ( x ~t ) -
one
opposite
distribution
x ~t ) , then
x ~t + l )
where
biased
conditional Xi
method
( 1 -
is
of
the
conditional
Normal
( J. L, 0- 2 )
sets
Xi
a2
) 1 / 2uv
between
Xt(t+l)
samples
dis and
-
the
to
-
,
1
( 45
and
)
1 , commonly
set to a negative value . The transition matrix T (x ' ; x ) defined by this procedure does not satisfy detailed balance . The individual transitions for the individual coordinates just described do satisfy detailed balance, but when we form a chain by applying them in a fixed sequence, the overall chain is not reversible . If , say, two variables are positively correlated , then they will (on a short timescale )
MONTECARLOMETHODS
199
evolve in a directed manner instead of by random walk , as shown in figure 12. This may significantly reduce the time required to obtain effectively independent samples. This method is still a valid sampling strategy - it converges to the target density P (x ) - because it is made up of transitions that satisfy detailed balance . The overrelaxation method has been generalized by Neal (1995, and this volume ) whose 'ordered overrelaxation ' method is applicable to any system where Gibbs sampling is used. For practical purposes this method may speed up a simulation by a factor of ten or twenty . 7.1.3. Simulated annealing A third technique for speeding convergence is simulated annealing . In simulated annealing , a 'temperature ' parameter is introduced which , when large , allows the system to make transitions which would be improbable at temperature 1. The temperature may be initially set to a large value and reduced gradually to 1. It is hoped that this procedure reduces the chance of the simulation 's becoming stuck in an unrepresentative probability island . We asssume that we wish to sample from a distribution of the form
p(x) = .~:..:=.z~~~.~~.
(46)
where E (x ) can be evaluated . In the simplest simulated annealing method , we instead sample from the distribution ~ 1~
PT(X) ==ztT>e- T
(47)
and decreage T gradually to 1. Often the energy function can be separated into two terms ,
E (x ) == Eo(x ) + El (X) ,
(48)
of which the first term is ' nice' (for example , a separable function of x ) and the second is ' nasty ' . In these cases, a better simulated annealing method might make use of the distribution ~ l_i~
PT(X) = ~ e-EO (X)- T
(49)
with T gradually decreasing to 1. In this way, the distribution at high temperatures reverts to a well- behaved distribution defined by Eo. Simulated annealing is often used as an optimization method , where the
aim is to find an x that minimizes E (x ), in which case the temperature is decreased
to zero
rather
than
to
1 . As a Monte
Carlo
method
, simulated
200
D.J.C. MACKAY
annealing as described above doesn't sample exactly from the right distribution ; the closely related 'simulated tempering' methods (Marinari and Parisi 1992) correct the biasesintroduced by the annealing processby making the temperature itself a random variable that is updated in Metropolis fashion during the simulation. 7.2. CAN THE NORMALIZING CONSTANTBE EVALUATED? If the target density P (x ) is given in the form of an unnormalized density P* (x ) with P (x ) == 1y ;p * (x ) , the value of Z may well be of interest. Monte Carlo methods do not readily yield an estimate of this quantity , and it is an area of active researchto find ways of evaluating it . Techniquesfor evaluating Z include: 1 .
Importance
2 .
' Thermodynamic
sampling
( reviewed integration
by '
during
Neal
( 1993 simulated
) ) . annealing
,
the
' accep
-
tance ratio ' method , and ' umbrella sampling ' (reviewed by Neal (1993) ) . 3. ' Reversible jump Markov chain Monte Carlo ' (Green 1995) . Perhaps the best way of dealing with Z , however, is to find a solution to one's task that does not require that Z be evaluated . In Bayesian data mod elling one can avoid the need to evaluate Z - which would be important for model comparison - by not having more than one model . Instead of using several models (differing in complexity , for example ) and evaluating their relative posterior probabilities , one can make a single hierarchical model having , for example , various continuous hyperparameters which play a role similar to that played by the distinct models (Neal 1996) . 7.3. THE METROPOLIS METHOD FOR BIG MODELS Our original description of the Metropolis method involved a joint updating of all the variables using a proposal density Q (x ' ; x ) . For big problems it may be more efficient to use several proposal distributions Q (b) (x ' ; x ) , each of which updates only some of the components of x . Each proposal is indi vidually accepted or rejected , and the proposal distributions are repeatedly run through in sequence. In the Metropolis method , the proposal density Q (x /; x ) typically has a number of parameters that control , for example , its 'width ' . These parameters are usually set by trial and error with the rule of thumb being that one aims for a rejection frequency of about 0.5. It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation . Such a modification of the proposal density would violate the detailed balance condition which guarantees that the Markov chain hag the correct invariant distribution .
MONTECARLOMETHODS
201
7.4. GIBBSSAMPLING IN BIGMODELS Our description of Gibbs sampling involved sampling one parameter at a time , as described in equations (31- 33) . For big problems it may be more efficient to sample groups of variables jointly , that is to use several proposal distributions :
x~t+1.)..X~t+1) fiJP(X1...XaIX~t~1...X~) (t+ l ) (t+ l ) Xa+ l . . . Xb
(50)
( I (t+l.). .Xa (t+l ),Xb (t) (t) etc.. (51) rv PXa+l . . .XbXl +l . . .XK)'
7.5. HOWMANYSAMPLES ARENEEDED ? At A the start of this chapter, we observed that the variance of an estimator depends only on the number
of independent
samples R and the value of
0-2:=J dNXP(x) j>(x) - <1 2,
(52)
We have now discussed a variety of methods for generating samples from P (x ) . How many independent samples R should we aim for ? In many problems , we really only need about twelve independent samples from P (x ) . Imagine that x is an unknown vector such aB the amount of corrosion present in each of 10,000 underground pipelines around Sicily , and >(x ) is the total cost of repairing those pipelines . The distribution P (x ) describes the probability of a state x given the tests that have been carried out on some pipelines and the assumptions about the physics of corrosion . The quantity is the expected cost of the repairs . The quantity 0' 2 is the variance of the cost - 0' measures by how much we should expect the actual cost to differ from the expectation <1>. N ow , how accurately would a manager like to know ? I would suggest there is little point in knowing to a precision finer than about 0-/ 3. After all , the true cost is likely to differ by :f:0' from
7.6. ALLOCATION OFRESOURCES Assuming we have decided how many independent samples R are required , an important question is how one should make use of one's limited computer resources to obtain these samples. A typical Markov chain Monte Carlo experiment involves an initial period in which control parameters of the simulation such as step sizes may be adjusted . This is followed by a ' burn in ' period during which we hope the
202
D.J.C. MACKAY ( 1)
(
2
:-
)
-
: -
: -
::::J
: -
= -
= J
: -
=-
= J
: -
=-
=J
--
:-
:-
:-
:-
:-
:-
:-
:-
:-
_J
( 3 ) : : : ::: ~ :J :J :::) :) :) -) :) :) :) :)
Figure 13. Three possible Markov Chain Monte Carlo strategies for obtaining twelve samples using a fixed amount of computer time . Computer time is represented by horizontal lines; samples by white circles. (1) A single run consisting of one long 'burn in ' period followed by a sampling period . (2) Four medium- length runs with different initial conditions and a medium- length burn in period. (3) Twelve short runs.
simulation 'converges' to the desired distribution . Finally , as the simulation continues , we record the state vector occasionally so as to create a list of states { X(r )} ~ l that we hope are roughly independent samples from P (x ) . There are several possible strategies (figure 13) . 1. Make one long run , obtaining all 2. Make a few medium length runs taining some samples from each. 3. Make R short runs , each starting dition , with the only state that is simulation .
R samples from it . with different initial conditions , obfrom a different random initial conrecorded being the final state of each
The first strategy has the best chance of attaining 'convergence' . The last strategy may have the advantage that the correlations between the recorded samples are smaller . The middle path appears to be popular with Markov chain Monte Carlo experts because it avoids the inefficiency of discarding burn - in iterations in many runs , while still allowing one to detect problems with lack of convergence that would not be apparent from a single run .
MONTECARLOMETHODS
203
7.7. PHILOSOPHY One curious defect of these Monte Carlo methods - which are widely used by Bayesian statisticians - is that they are all non- Bayesian . They involve computer experiments from which estimators of quantities of interest are derived . These estimators depend on the sampling distributions that were used to generate the samples. In contrMt , an alternative Bayesian approach to the problem would use the results of our computer experiments to infer the properties of the target function P (x ) and generate predictive distribu tions for quantities of interest such M <1>. This approach would give answers which would depend only on the computed values of P * (x (r )) at the points { x (r )} ; the answers would not depend on how those points were chosen. It remains an open problem to create a Bayesian version of Monte Carlo methods . 8 . Summary - Monte Carlo methods are a powerful tool that allow one to implement any probability distribution that can be expressed in the form P (x ) = lzP * (X) . - Monte Carlo methods can answer virtually any query related to P (x ) by putting the query in the form
J cf >(x)P(x) ~ ~ Lr (x(r)).
(53)
- In high - dimensional problems the only satisfactory methods are those based on Markov chain Monte Carlo : the Metropolis method and Gibbs sampling . - Simple Metropolis algorithms , although widely used, perform poorly because they explore the space by a slow random walk . More sophisti cated Metropolis algorithms such as hybrid Monte Carlo make use of proposal densities that give faster movement through the state space. The efficiency of Gibbs sampling is also troubled by random walks . The method of ordered overrelaxation is a general purpose technique for suppressing them .
ACKNOWLEDGEMENTS This presentation of Monte Carlo methods owes a great deal to Wally Gilks and David Spiegelhalter . I thank Radford Neal for teaching me about Monte Carlo methods and for giving helpful comments on the manuscript .
204
D.J.C. MACKAY
References

Adler, S. L.: 1981, Over-relaxation method for the Monte-Carlo evaluation of the partition function for multiquadratic actions, Physical Review D - Particles and Fields 23(12), 2901-2904.
Cowles, M. K. and Carlin, B. P.: 1996, Markov-chain Monte-Carlo convergence diagnostics - a comparative review, Journal of the American Statistical Association 91(434), 883-904.
Gilks, W. and Wild, P.: 1992, Adaptive rejection sampling for Gibbs sampling, Applied Statistics 41, 337-348.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J.: 1996, Markov Chain Monte Carlo in Practice, Chapman and Hall.
Green, P. J.: 1995, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82, 711-732.
Marinari, E. and Parisi, G.: 1992, Simulated tempering - a new Monte-Carlo scheme, Europhysics Letters 19(6), 451-458.
Neal, R. M.: 1993, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.
Neal, R. M.: 1995, Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation, Technical Report 9508, Dept. of Statistics, University of Toronto.
Neal, R. M.: 1996, Bayesian Learning for Neural Networks, number 118 in Lecture Notes in Statistics, Springer, New York.
Propp, J. G. and Wilson, D. B.: 1996, Exact sampling with coupled Markov chains and applications to statistical mechanics, Random Structures and Algorithms 9(1-2), 223-252.
Tanner, M. A.: 1996, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics, 3rd edn, Springer-Verlag.
Thomas, A., Spiegelhalter, D. J. and Gilks, W. R.: 1992, BUGS: A program to perform Bayesian inference using Gibbs sampling, in J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (eds), Bayesian Statistics 4, Clarendon Press, Oxford, pp. 837-842.
Yeomans, J.: 1992, Statistical Mechanics of Phase Transitions, Clarendon Press, Oxford.

For a full bibliography and a more thorough review of Monte Carlo methods, the reader is encouraged to consult Neal (1993), Gilks et al. (1996), and Tanner (1996).
SUPPRESSING RANDOM WALKS IN MARKOV CHAIN MONTE CARLO USING ORDERED OVERRELAXATION

RADFORD M. NEAL
Dept. of Statistics and Dept. of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~radford/
Abstract. Markov chain Monte Carlo methods such as Gibbs sampling and simple forms of the Metropolis algorithm typically move about the distribution being sampled via a random walk. For the complex, high-dimensional distributions commonly encountered in Bayesian inference and statistical physics, the distance moved in each iteration of these algorithms will usually be small, because it is difficult or impossible to transform the problem to eliminate dependencies between variables. The inefficiency inherent in taking such small steps is greatly exacerbated when the algorithm operates via a random walk, as in such a case moving to a point n steps away will typically take around n² iterations. Such random walks can sometimes be suppressed using "overrelaxed" variants of Gibbs sampling (a.k.a. the heatbath algorithm), but such methods have hitherto been largely restricted to problems where all the full conditional distributions are Gaussian. I present an overrelaxed Markov chain Monte Carlo algorithm based on order statistics that is more widely applicable. In particular, the algorithm can be applied whenever the full conditional distributions are such that their cumulative distribution functions and inverse cumulative distribution functions can be efficiently computed. The method is demonstrated on an inference problem for a simple hierarchical Bayesian model.
1. Introduction

Markov chain Monte Carlo methods are used to estimate the expectations of various functions of a state, x = (x_1, ..., x_N), with respect to a distribution given by some density function, π(x). Typically, the dimensionality, N, is large, and the density π(x) is of a complex form, in which the components of x are highly dependent. The estimates are based on a (dependent) sample of states obtained by simulating an ergodic Markov chain that has π(x) as its equilibrium distribution. Starting with the work of Metropolis et al. (1953), Markov chain Monte Carlo methods have been widely used to solve problems in statistical physics and, more recently, Bayesian statistical inference. It is often the only approach known that is computationally feasible. Various Markov chain Monte Carlo methods and their applications are reviewed by Toussaint (1989), Neal (1993), and Smith and Roberts (1993).

For the difficult problems that are their primary domain, Markov chain Monte Carlo methods are limited in their efficiency by strong dependencies between components of the state, which force the Markov chain to move about the distribution in small steps. In the widely-used Gibbs sampling method (known to physicists as the heatbath method), the Markov chain operates by successively replacing each component of the state, x_i, by a value randomly chosen from its conditional distribution given the current values of the other components, π(x_i | {x_j}_{j≠i}). When dependencies between variables are strong, these conditional distributions will be much narrower than the corresponding marginal distributions, π(x_i), and many iterations of the Markov chain will be necessary for the state to visit the full range of the distribution defined by π(x). Similar behaviour is typical when the Metropolis algorithm is used to update each component of the state in turn, and also when the Metropolis algorithm is used with a simple proposal distribution that changes all components of the state simultaneously.

This inefficiency due to dependencies between components is to a certain extent unavoidable. We might hope to eliminate the problem by transforming to a parameterization in which the components of the state are no longer dependent. If this can easily be done, it is certainly the preferred solution. Typically, however, finding and applying such a transformation is difficult or impossible. Even for a distribution as simple as a multivariate Gaussian, eliminating dependencies will not be easy if the state has millions of components, as it might for a problem in statistical physics or image processing.

However, in the Markov chain Monte Carlo methods that are most commonly used, this inherent inefficiency is greatly exacerbated by the random walk nature of the algorithm. Not only is the distribution explored by taking small steps, the direction of these steps is randomized in each iteration, with the result that on average it takes about n² steps to move to a point n steps away. This can greatly increase both the number of iterations required before equilibrium is approached, and the number of subsequent iterations that are needed to gather a sample of states from which accurate estimates for the quantities of interest can be obtained. In the physics literature, this problem has been addressed in two ways
- by "overrelaxation" methods, introduced by Adler (1981), which are the main subject of this paper, and by dynamical methods, such as "hybrid Monte Carlo", which I briefly describe next.

The hybrid Monte Carlo method, due to Duane, Kennedy, Pendleton, and Roweth (1987), can be seen as an elaborate form of the Metropolis algorithm (in an extended state space) in which candidate states are found by simulating a trajectory defined by Hamiltonian dynamics. These trajectories will proceed in a consistent direction, until such time as they reach a region of low probability. By using states proposed by this deterministic process, random walk effects can be largely eliminated. In Bayesian inference problems for complex models based on neural networks, I have found (Neal 1995) that the hybrid Monte Carlo method can be hundreds or thousands of times faster than simple versions of the Metropolis algorithm.

Hybrid Monte Carlo can be applied to a wide variety of problems where the state variables are continuous, and derivatives of the probability density can be efficiently computed. The method does, however, require that careful choices be made both for the length of the trajectories and for the stepsize used in the discretization of the dynamics. Using too large a stepsize will cause the dynamics to become unstable, resulting in an extremely high rejection rate. This need to carefully select the stepsize in the hybrid Monte Carlo method is similar to the need to carefully select the width of the proposal distribution in simple forms of the Metropolis algorithm. (For example, if a candidate state is drawn from a Gaussian distribution centred at the current state, one must somehow decide what the standard deviation of this distribution should be.) Gibbs sampling does not require that the user set such parameters. A Markov chain Monte Carlo method that shared this advantage while also suppressing random walk behaviour would therefore be of interest.

Markov chain methods based on "overrelaxation" show promise in this regard. The original overrelaxation method of Adler (1981) is similar to Gibbs sampling, except that the new value chosen for a component of the state is negatively correlated with the old value. In many circumstances, successive overrelaxation improves sampling efficiency by suppressing random walk behaviour. Like Gibbs sampling, Adler's overrelaxation method does not require that the user select a suitable value for a stepsize parameter. It is therefore significantly easier to use than hybrid Monte Carlo (although one does still need to set a parameter that plays a role analogous to the trajectory length in hybrid Monte Carlo). Overrelaxation methods also do not suffer from the growth in computation time with system size that results from the use of a global acceptance test in hybrid Monte Carlo.
(On the other hand, although overrelaxation has been found to greatly improve sampling in a number of problems, there are distributions for which overrelaxation is ineffective.)

Unfortunately, Adler's overrelaxation method is applicable only to problems in which all of the full conditional distributions are Gaussian. Several more general overrelaxation methods have been proposed in the literature, but as we will see, these methods employ occasional "rejections" in order to ensure that the correct distribution is left invariant, and such rejections can undermine the ability of overrelaxation to suppress random walks.

In this paper, I present a new form of overrelaxation based on order statistics, which I call "ordered overrelaxation". In principle, the method is applicable to any distribution for which Gibbs sampling would produce an ergodic Markov chain. It can be implemented efficiently for problems in which the cumulative distribution functions and the inverse cumulative distribution functions of the full conditional distributions can be efficiently computed; I also discuss several other implementation strategies, which might further widen the range of problems for which ordered overrelaxation is useful in practice.

The paper is organized as follows. In Section 2, I review Adler's Gaussian overrelaxation method and how it can suppress random walks. Previous proposals for more general overrelaxation methods, which employ rejections, are discussed in Section 3. In Section 4, I introduce the ordered overrelaxation method itself, and show that it leaves the desired distribution invariant. Strategies for implementing ordered overrelaxation efficiently are presented in Section 5. In Section 6, I demonstrate how the method performs on a Bayesian inference problem for a simple hierarchical model. I conclude by discussing the range of problems to which ordered overrelaxation might be applied, and possibilities for future work.

2. Overrelaxation

Overrelaxation was introduced by Adler (1981) in the context of problems in statistical physics whose conditional distributions are Gaussian. The method is related to the "successive overrelaxation" method that has long been used to speed the convergence of iterative methods for solving systems of linear equations and for function minimization (Young 1971). Adler's method was later studied by Whitmer (1984) and by Barone and Frigessi (1990), and a more general overrelaxation method based on Gaussian proposals was introduced by Green and Han (1992); this work is discussed further in Section 3.
2.1. ADLER'S GAUSSIAN OVERRELAXATION METHOD

Adler's (1981) overrelaxation method is applicable when all of the full conditional densities, π(x_i | {x_j}_{j≠i}), are Gaussian. (Note that this class of distributions includes not only the multivariate Gaussians, but also some other distributions, such as that with density π(x_1, x_2) ∝ exp(−(1 + x_1²)(1 + x_2²)), whose log density is "multiquadratic" in Adler's terminology.)

In Adler's method, as in Gibbs sampling, the components of the state, x = (x_1, ..., x_N), are updated in turn, in some fixed ordering. The old value, x_i, of the component being updated is replaced by a new value, x_i', chosen from a distribution that depends on the old value and on the mean, μ_i, and variance, σ_i², of the conditional distribution for x_i given the current values of the other components, x_j for j ≠ i:

    x_i' = μ_i + α(x_i − μ_i) + σ_i (1 − α²)^{1/2} n                (1)
where n is a Gaussian random variate with mean zero and variance one. The parameter α controls the degree of overrelaxation (or underrelaxation); for the method to be valid, we must have −1 ≤ α ≤ +1. Overrelaxation to the other side of the mean occurs when α is negative. When α is zero, the method is equivalent to Gibbs sampling. (In the literature, the method is often parameterized in terms of ω = 1 − α. I have not followed this convention, as it appears to me to make all the equations harder to understand.)

One can easily confirm that Adler's method leaves the desired distribution invariant - that is, if x_i has the desired distribution (Gaussian with mean μ_i and variance σ_i²), then x_i' also has this distribution. Furthermore, it is clear that overrelaxed updates with −1 < α < +1 produce an ergodic chain. When α = −1 the method is not ergodic, though updates with α = −1 can form part of an ergodic scheme in which other updates are performed as well, as in the "hybrid overrelaxation" method discussed by Wolff (1992).

2.2. HOW OVERRELAXATION CAN SUPPRESS RANDOM WALKS

The effect of overrelaxation is illustrated in Figure 1, in which both Gibbs sampling and the overrelaxation method are shown sampling from a bivariate Gaussian distribution with high correlation. Gibbs sampling undertakes a random walk, and in the 40 iterations shown (each consisting of an update of both variables) succeeds in moving only a small way along the long axis of the distribution. In the same number of iterations, Adler's Gaussian overrelaxation method with α = −0.98 covers a greater portion of the distribution, since it tends to move consistently in one direction (subject to some random variation, and to "reflection" from the end of the distribution). The manner in which overrelaxation avoids doing a random walk when sampling from this distribution is illustrated in the close-up view in Figure 1.
Figure 1. Gibbs sampling and Adler's overrelaxation method applied to a bivariate Gaussian with correlation 0.998 (whose one-standard-deviation contour is plotted). The top left shows the progress of 40 Gibbs sampling iterations (each consisting of one update for each variable). The top right shows 40 overrelaxed iterations, with α = −0.98. The close-up on the right shows how successive overrelaxed updates operate to avoid a random walk.
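The comparison in Figure 1 is easy to reproduce. The sketch below is a minimal illustration (assuming NumPy; it is not the code used to produce the figure) of Gibbs sampling versus Adler's overrelaxed update of equation (1) on a bivariate Gaussian with correlation 0.998.

```python
import numpy as np

rho = 0.998          # correlation of the bivariate Gaussian
alpha = -0.98        # overrelaxation parameter (alpha = 0 gives Gibbs sampling)
rng = np.random.default_rng(1)

def update(x, alpha):
    """One iteration: overrelax each coordinate given the other (equation 1)."""
    for i in (0, 1):
        mu_i = rho * x[1 - i]                   # conditional mean
        sigma_i = np.sqrt(1.0 - rho ** 2)       # conditional std. dev.
        x[i] = mu_i + alpha * (x[i] - mu_i) \
               + sigma_i * np.sqrt(1.0 - alpha ** 2) * rng.standard_normal()
    return x

gibbs, over = np.zeros(2), np.zeros(2)
gibbs_path, over_path = [], []
for _ in range(40):
    gibbs_path.append(update(gibbs, alpha=0.0).copy())
    over_path.append(update(over, alpha=alpha).copy())

# The overrelaxed chain moves much further along the long axis of the
# distribution than the Gibbs chain does in the same number of iterations.
print("Gibbs range of x1:      ", np.ptp([p[0] for p in gibbs_path]))
print("Overrelaxed range of x1:", np.ptp([p[0] for p in over_path]))
```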
In the close-up view, the motion of the state during overrelaxation can be visualized as follows. When a single variable is updated, the state moves to a point on the other side of the conditional mean, which with high probability is close to the elliptic contour of the bivariate density on which the state currently lies. The combined effect of successive overrelaxed updates is therefore to move the state in a consistent direction along such a contour, with only a small amount of randomness introduced by each update. This consistent motion continues until the state reaches one end of the distribution, at which time the direction of motion reverses. When α is close to −1, the state stays close to a single contour of the density, and these occasional reversals provide most of the randomization of the direction of motion - quite different from Gibbs sampling, in which the direction of each move is chosen anew in every update.

As the correlation of the bivariate Gaussian approaches ±1, the optimal value of α approaches −1, and the benefit from using the optimal α rather than α = 0 (Gibbs sampling) becomes arbitrarily large. This comes about because the number, n, of typical steps required to move from one end of the distribution to the other is proportional to the square root of the ratio of eigenvalues of the correlation matrix, which goes to infinity as the correlation goes to ±1. The gain from moving to a nearly independent point in n steps, rather than the n² steps required with a random walk, therefore also goes to infinity.
2.3. THE BENEFIT FROM OVERRELAXATION

Figure 2 shows the benefit of overrelaxation in sampling from the bivariate Gaussian with ρ = 0.998, in terms of reduced autocorrelations for two functions of the state, x_1 and x_1². Here, a value of α = −0.89 was used, which is close to optimal in terms of speed of convergence for this distribution. (The value of α = −0.98 in Figure 1 was chosen to make the suppression of random walks visually clearer, but it is in fact somewhat too extreme for this value of ρ.)

The asymptotic efficiency of a Markov chain sampling method in estimating the expectation of a function of state is given by its "autocorrelation time" - the sum of the autocorrelations for that function of state at all lags, positive and negative (Hastings 1970). I obtained numerical estimates of the autocorrelation times for the quantities plotted in Figure 2 (using a series of 10,000 points, with truncation at the lags past which the estimated autocorrelations appeared to be approximately zero). These estimates show that the efficiency of estimation of E[x_1] is a factor of about 22 better when using overrelaxation with α = −0.89 than when using Gibbs sampling. For estimation of E[x_1²], the benefit from overrelaxation is a factor of about 16.

In comparison with Gibbs sampling, using overrelaxation will reduce by these factors the variance of an estimate that is based on a run of given length, or alternatively, it will reduce by the same factors the length of run that is required to reduce the variance to some desired level.

Overrelaxation is not always beneficial, however. Some research has been done into when overrelaxation produces an improvement, but the results so far do not provide a complete answer, and in some cases appear to have been mis-interpreted. Work in the physics literature has concentrated on systems of physical interest, and has primarily been concerned with scaling behaviour in the vicinity of a critical point. Two recent papers have addressed the question in the context of more general statistical applications. Barone and Frigessi (1990) look at overrelaxation applied to multivariate Gaussian distributions, finding the rate of convergence in a number of interesting cases.
Plot of x_1 during Gibbs sampling run
Plot of x_1 during overrelaxed run with α = −0.89

Figure 2. Sampling from a bivariate Gaussian with ρ = 0.998 using Gibbs sampling and Adler's overrelaxation method with α = −0.89. The plots show the values of the first coordinate and of its square during 2000 iterations of the samplers (each iteration consisting of one update for each coordinate).
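The autocorrelation-time comparison described above can be approximated with a short computation. The sketch below (an illustration under the same assumptions as the previous one; the run length of 10,000 matches the text, but the truncation rule here is a crude automatic stand-in for the by-eye truncation described above) estimates autocorrelation times for x_1 under Gibbs sampling and under overrelaxation with α = −0.89.

```python
import numpy as np

def autocorr_time(series, max_lag=1000):
    """Estimate 1 + 2 * sum of autocorrelations, truncating at the first negative estimate."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = x.var()
    tau = 1.0
    for lag in range(1, max_lag):
        rho_lag = np.mean(x[:-lag] * x[lag:]) / var
        if rho_lag < 0.0:            # crude truncation once estimates hit noise level
            break
        tau += 2.0 * rho_lag
    return tau

rho, rng = 0.998, np.random.default_rng(2)

def run(alpha, n=10000):
    x, path = np.zeros(2), np.empty(n)
    for t in range(n):
        for i in (0, 1):
            mu = rho * x[1 - i]
            sd = np.sqrt(1 - rho ** 2)
            x[i] = mu + alpha * (x[i] - mu) + sd * np.sqrt(1 - alpha ** 2) * rng.standard_normal()
        path[t] = x[0]
    return path

print("tau for Gibbs sampling:", autocorr_time(run(alpha=0.0)))
print("tau for alpha = -0.89: ", autocorr_time(run(alpha=-0.89)))
```

The ratio of the two estimates gives a rough idea of the factor-of-twenty-odd advantage reported in the text, though with a run of this length the estimates are themselves noisy.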
In interpreting these results, however, one should keep in mind that if a method converges geometrically, at rate ρ, the time required to reach some given level of accuracy is approximately proportional to −1/log(ρ). Hence, for rates near one, the relative advantage of one method over another is better judged by the ratio (1 − ρ₂)/(1 − ρ₁) than by ρ₁/ρ₂, which will always be close to one.

Green and Han (1992) look at overrelaxation methods for distributions with Gaussian conditional distributions, employing proposals of the form of equation (1). They confirm that such updates leave the desired distribution invariant, and they judge performance by the asymptotic variance of estimates at equilibrium. In particular, they note that when proposals move the state approximately along a contour of the joint probability density, the resulting negative correlations can drive the asymptotic variance of an estimate for the expectation of a linear function of state towards zero. This is not always the relevant criterion in practice, however. We generally require accurate estimates for the expectations of several functions of state, and for some of these it is unrealistic to hope that negative autocorrelations will produce an estimation efficiency greater than would be obtained from a sample of independent states. As remarked above, the benefits of overrelaxation for problems of interest need not depend on such locally-antithetic effects; they can come rather from the faster movement about the distribution that results when random walks are suppressed - that is, from the Markov chain moving more rapidly to a nearly independent point. Overrelaxed chains might also be used only during an initial period, to speed convergence to equilibrium, with some other method used during the subsequent generation of the sample. In practice, however, we are usually concerned both with the speed of convergence and with the variance of the estimates obtained once equilibrium is reached.

3. Previous proposals for more general overrelaxation methods

Adler's overrelaxation method can be applied only to distributions for which all of the full conditional distributions are Gaussian. Although such distributions
exist, in both statistical physics and statistical inference, most problems to which Markov chain Monte Carlo methods are currently applied do not satisfy this constraint. A number of proposals have been made for more general overrelaxation methods, which I will review here before presenting the "ordered overrelaxation" method in the next section.

Brown and Woch (1987) make a rather direct proposal: To perform an overrelaxed update of a variable whose conditional distribution is not Gaussian, transform to a new parameterization in which the conditional distribution of this variable is Gaussian, do the update by Adler's method, and then transform back. This may sometimes be an effective strategy, but for many problems the required computations will be costly or infeasible.

A second proposal by Brown and Woch (1987), also made by Creutz (1987), is based on the Metropolis algorithm. To update component i, we first find a point, x̂_i, which is near the centre of the conditional distribution, π(x_i | {x_j}_{j≠i}). We might, for example, choose x̂_i to be an approximation to the mode, though other choices are also valid, as long as they do not depend on the current x_i. We then take x_i' = x̂_i − (x_i − x̂_i) as a candidate for the next state, which, in the usual Metropolis fashion, we accept with probability min[1, π(x_i' | {x_j}_{j≠i}) / π(x_i | {x_j}_{j≠i})]. If x_i' is not accepted, the new state is the same as the old state.

If the conditional distribution is Gaussian, and x̂_i is chosen to be the exact mode, the state proposed with this method will always be accepted, since the Gaussian distribution is symmetrical. The result is then identical to Adler's method with α = −1. Such a method can be combined with other updates to produce an ergodic chain. Alternatively, ergodicity can be ensured by adding some amount of random noise to the proposed states.

Green and Han (1992) propose a somewhat similar, but more general, method. To update component i, they find a Gaussian approximation to the conditional distribution, π(x_i | {x_j}_{j≠i}), that does not depend on the current x_i. They then find a candidate state x_i' by overrelaxing from the current state according to equation (1), using the μ_i and σ_i that characterize this Gaussian approximation, along with some judiciously chosen α. This candidate state is then accepted or rejected using Hastings' (1970) generalization of the Metropolis algorithm, which allows for non-symmetric proposal distributions.

Fodor and Jansen (1994) propose a method that is applicable when the conditional distribution is unimodal, in which the candidate state is the point on the other side of the mode whose probability density is the same as that of the current state. This candidate state is accepted or rejected based on the derivative of the mapping from current state to candidate state. Ergodicity may again be ensured by mixing in other transitions, such as standard Metropolis updates.
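To make the second of these proposals concrete, here is a minimal sketch (assuming NumPy; the choice of conditional density and of the approximate centre are purely illustrative and are not taken from the text) of an overrelaxed Metropolis update that reflects the current value through an approximation to the mode of the conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def reflect_update(x_i, log_cond, x_hat):
    """Overrelaxed Metropolis update of one component, in the style of
    Brown and Woch (1987) and Creutz (1987).

    x_hat must not depend on the current x_i; here it is an approximation to
    the mode of the conditional density, whose log (up to a constant) is log_cond."""
    candidate = x_hat - (x_i - x_hat)                  # reflect through x_hat
    log_accept = log_cond(candidate) - log_cond(x_i)
    if np.log(rng.uniform()) < min(0.0, log_accept):
        return candidate                               # accepted
    return x_i                                         # rejected: state unchanged

# Illustration: a skewed (hence non-Gaussian) conditional, log pi(x) = -x - exp(-x),
# whose mode is at x = 0.
log_cond = lambda x: -x - np.exp(-x)
x = 2.5
for _ in range(5):
    x = reflect_update(x, log_cond, x_hat=0.0)
    print(round(float(x), 3))
```

Because the density is not symmetric about x_hat, some proposals are rejected, which is exactly the behaviour criticized in the next paragraph of the text.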
The generalizations of overrelaxation described above all employ accept-reject decisions in order to ensure detailed balance, and they all suffer from a potentially serious flaw: the rejection rate is determined by characteristics of the conditional distributions, and if it is too high, there is no obvious way of reducing it. Moreover, even a quite small rejection rate may be too high. This point seems not to have been appreciated in the literature, but it should be apparent from the discussion in Section 2. When sampling from a bivariate Gaussian in which the correlation between the two variables is high, it is easy to see that the effect of a rejection during an overrelaxed update of one of the two variables is to reverse the direction of motion along the long axis of the distribution. Effective suppression of random walks therefore requires that the interval between rejections be at least comparable to the time required to move from one end of the distribution to the other, which can be arbitrarily long, depending on the degree of correlation between the variables.

4. Overrelaxation based on order statistics

In this section, I present a new form of overrelaxation based on order statistics, which can be applied (in theory) to any distribution over states with real-valued components, and in which changes are never rejected, thereby preserving the suppression of random walks even in distributions where dependencies are strong.

4.1. THE ORDERED OVERRELAXATION METHOD

As before, we aim to sample from a distribution over x = (x_1, ..., x_N) with density π(x), and we will proceed by updating the components, x_i, repeatedly, in turn, based on their full conditional distributions, whose densities are π(x_i | {x_j}_{j≠i}). In the new method, the old value, x_i, for component i is replaced by a new value, x_i', obtained as follows:

1) Generate K random values, independently, from the conditional distribution π(x_i | {x_j}_{j≠i}).

2) Arrange these K generated values plus the old value, x_i, in non-decreasing order, labeling them as follows:

    x_i^(0) ≤ x_i^(1) ≤ ... ≤ x_i^(r) = x_i ≤ ... ≤ x_i^(K)                (2)

with r being the index of the old value in this ordering. (If several of the values are equal to the old x_i, break the tie randomly.)

3) Let the new value for component i be x_i' = x_i^(K−r).

Here, K is a parameter of the method, which plays a role analogous to that of α in Adler's Gaussian overrelaxation method. When K is one, the method is equivalent to Gibbs sampling; the behaviour as K → ∞ is analogous to Gaussian overrelaxation with α = −1.
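The following sketch is a direct, time-proportional-to-K implementation of steps (1)-(3) for a single component (assuming NumPy; the Gaussian conditional at the end is only an example - the point of the method is that any means of sampling from the conditional will do).

```python
import numpy as np

rng = np.random.default_rng(4)

def ordered_overrelax(x_i, sample_conditional, K=20):
    """One ordered overrelaxation update of a single component.

    sample_conditional(n) must return n independent draws from the current
    full conditional distribution of this component."""
    values = np.sort(np.append(sample_conditional(K), x_i))  # step (2): K draws plus the old value
    r = int(np.searchsorted(values, x_i))                    # index of the old value in the ordering
    # (ties, which have probability zero for continuous conditionals, are not handled here)
    return values[K - r]                                     # step (3): mirrored order statistic

# Example: conditional distribution N(mu, sigma^2).  With K large the new value
# tends to land on the opposite side of mu, much like Adler's method with alpha near -1.
mu, sigma = 1.0, 0.5
draw = lambda n: rng.normal(mu, sigma, size=n)
print(ordered_overrelax(2.0, draw, K=20))
```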
As presented above, each step of this "ordered overrelaxation" method would appear to require computation time proportional to K. As discussed below, the method will provide a practical improvement in sampling efficiency only if an equivalent effect can be obtained using much less time. Strategies for accomplishing this are discussed in Section 5. First, however, I will show that the method is valid - that the update described above leaves the distribution π(x) invariant - and that its behaviour is similar to that of Adler's method for Gaussian distributions.

4.2. VALIDITY OF ORDERED OVERRELAXATION
To show that ordered overrelaxation leaves π(x) invariant, it suffices to show that each update for a component, i, satisfies "detailed balance" - i.e., that the probability density for such an update replacing x_i by x_i' is the same as the probability density for x_i' being replaced by x_i, assuming that the starting state is distributed according to π(x). It is well known that the detailed balance condition (also known as "reversibility") implies invariance of π(x), and that invariance for each component update implies invariance for transitions in which each component is updated in turn. (Note, however, that the resulting sequential update procedure, considered as a whole, need not satisfy detailed balance; indeed, if random walks are to be suppressed as we wish, it must not.)

To see that detailed balance holds, consider the probability density that component i has a given value, x_i, to start, that x_i is in the end replaced by some given different value, x_i', and that along the way, a particular set of K − 1 other values (along with x_i') are generated in step (1) of the update procedure. Assuming there are no tied values, this probability density is

    π(x_i | {x_j}_{j≠i}) · K! · π(x_i' | {x_j}_{j≠i}) · ∏_{t≠r,s} π(x_i^(t) | {x_j}_{j≠i}) · I[s = K − r]                (3)

where r is the index of the old value, x_i, in the ordering found in step (2), and s is the index of the new value, x_i'. The final factor is zero or one, depending on whether the transition in question would actually occur with the particular set of K − 1 other values being considered. The probability density for the reverse transition, from x_i' to x_i, with the same set of K − 1 other values being involved, is readily seen to be identical to the above. Integrating over all possible sets of other values, we conclude that the probability density for a transition from x_i to x_i', involving any set of other values, is the same as the probability density for the reverse transition from x_i' to x_i. Allowing for the possibility of ties yields the same result, after a more detailed accounting.
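The invariance claim is also easy to check empirically. The sketch below (illustration only, assuming NumPy and a fixed Gaussian conditional, so that the stationary distribution of repeated updates of a single component is simply that conditional) applies many ordered overrelaxation updates to one value and compares the long-run sample moments with those of the conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, K = 1.0, 0.5, 11

def update(x):
    values = np.sort(np.append(rng.normal(mu, sigma, size=K), x))
    r = int(np.searchsorted(values, x))
    return values[K - r]

x, draws = rng.normal(mu, sigma), []
for _ in range(50000):
    x = update(x)
    draws.append(x)
draws = np.array(draws)

# If the update leaves the conditional distribution invariant, the long-run
# mean and variance of the chain should match mu and sigma^2.
print("sample mean, variance:", draws.mean().round(3), draws.var().round(3))
print("target mean, variance:", mu, sigma ** 2)
```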
4.3. BEHAVIOUR OF ORDERED OVERRELAXATION

In analysing ordered overrelaxation, it can be helpful to view it from a perspective in which the overrelaxation is done with respect to a uniform distribution. Let F(x) be the cumulative distribution function for the conditional distribution π(x_i | {x_j}_{j≠i}) (here assumed to be continuous), and let F^{-1}(x) be the inverse of F(x). Ordered overrelaxation for x_i is equivalent to the following procedure: First transform the current value to u_i = F(x_i), then perform ordered overrelaxation for u_i, whose distribution is uniform over [0, 1], yielding a new state u_i', and finally transform back to x_i' = F^{-1}(u_i').

Overrelaxation for a uniform distribution, starting from u, may be analysed as follows. When K independent uniform variates are generated in step (1) of the procedure, the number of them that are less than u will be binomially distributed with mean Ku and variance Ku(1 − u). This number is the index, r, of u = u^(r) found in step (2) of the procedure. Conditional on a value for r, which let us suppose is greater than K/2, the distribution of the new state, u' = u^(K−r), will be that of the K − r + 1 order statistic of a sample of size r from a uniform distribution over [0, u]. As is well known (e.g., David 1970, p. 11), the k'th order statistic of a sample of size n from a uniform distribution over [0, 1] has a beta(k, n − k + 1) distribution, with density proportional to u^{k−1}(1 − u)^{n−k}, mean k/(n + 1), and variance k(n − k + 1)/((n + 2)(n + 1)²). Applying this result, u' for a given r > K/2 will have a rescaled beta(K − r + 1, 2r − K) distribution, with mean μ(r) = u(K − r + 1)/(r + 1) and variance σ²(r) = u²(K − r + 1)(2r − K)/((r + 2)(r + 1)²).

When K is large, we can get a rough idea of the behaviour of overrelaxation for a uniform distribution by considering the case where u (and hence likely r/K) is significantly greater than 1/2. Behaviour when u is less than 1/2 will of course be symmetrical, and we expect behaviour to smoothly interpolate between these regimes when u is within about 1/√K of 1/2 (for which r/K might be either greater or less than 1/2). When u ≫ 1/2, we can use the Taylor expansion

    μ(Ku + δ) = (Ku − Ku² + u)/(Ku + 1) − ((Ku + 2u)/(Ku + 1)²) δ + ((Ku + 2u)/(Ku + 1)³) δ² + ...                (4)

to conclude that for large K, the expected value of u', averaging over possible values for r = Ku + δ, with δ having mean zero and variance Ku(1 − u), is approximately

    (Ku − Ku² + u)/(Ku + 1) + ((Ku + 2u)/(Ku + 1)³) Ku(1 − u)  ≈  (1 − u) + 1/K                (5)

For u ≪ 1/2, the bias will of course be opposite, with the expected value of u' being about (1 − u) − 1/K, and for u ≈ 1/2, the expected value of u' will be approximately u.
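This approximation is easy to check by simulation. The sketch below (assuming NumPy; K and u are arbitrary illustrative values) performs ordered overrelaxation for the uniform distribution many times from a fixed starting value u and compares the average result with (1 − u) + 1/K.

```python
import numpy as np

rng = np.random.default_rng(6)
K, u, trials = 100, 0.8, 200000

new_values = np.empty(trials)
for t in range(trials):
    values = np.sort(np.append(rng.uniform(size=K), u))    # K uniforms plus the old value
    r = int(np.searchsorted(values, u))                     # rank of the old value
    new_values[t] = values[K - r]                           # mirrored order statistic

print("simulated E[u']:", new_values.mean().round(4))
print("predicted E[u']:", round((1 - u) + 1 / K, 4))
```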
Figure 3. Points representing 5000 ordered overrelaxation updates. The plot on the left shows ordered overrelaxation for a uniform distribution. The horizontal axis gives the starting point, drawn uniformly from [0, 1]; the vertical axis, the point found by ordered overrelaxation with K = 100 from that starting point. The plot on the right shows ordered overrelaxation for a Gaussian distribution. The points correspond to those on the left, but transformed by the inverse Gaussian cumulative distribution function.
The variance of u' will be, to order 1/K, approximately

    σ²(Ku) + Ku(1 − u)[μ'(Ku)]²  ≈  2(1 − u)/K                (6)
By symmetry, the variance of u' when u ≪ 1/2 will be approximately 2u/K. (Incidentally, the fact that u' has greater variance when u is near 1/2 than when u is near 0 or 1 explains how it is possible for the method to leave the uniform distribution invariant even though u' is biased to be closer to 1/2 than u is.)

The joint distribution for u and u' is illustrated on the left in Figure 3. The right of the figure shows how this translates to the joint distribution for the old and new state when ordered overrelaxation is applied to a Gaussian distribution.

4.4. COMPARISONS WITH ADLER'S METHOD AND GIBBS SAMPLING

For Gaussian overrelaxation by Adler's method, the joint distribution of the old and new state is Gaussian. As seen in Figure 3, this is clearly not the case for ordered overrelaxation. One notable difference is the way the tails of the joint distribution flare out with ordered overrelaxation, a reflection of the fact that if the old state is very far out in the tail, the new state will likely be much closer in. This effect is perhaps an advantage of the
ordered overrelaxation method, as one might therefore expect convergence from a bad starting point to be faster with ordered overrelaxation than with Adler's method. (This is certainly true in the trivial case where the state consists of a single variable; further analysis is needed to establish whether it is true in interesting cases.)

Although there is no exact equivalence between Adler's Gaussian overrelaxation method and ordered overrelaxation, it is of some interest to find a value of K for which ordered overrelaxation applied to a Gaussian distribution corresponds roughly to Adler's method with a given α < 0. Specifically, we can try to equate the mean and variance of the new state, x', that results from an overrelaxed update of an old state, x, when x is one standard deviation away from its mean. Supposing without loss of generality that the mean is zero and the variance is one, we see from equations (5) and (6) that when x = 1, the expected value of x' using ordered overrelaxation is Φ^{-1}(Φ(−1) + 1/K) ≈ −1 + (1/K)/φ(−1) ≈ −1 + 4.13/K, and the variance of x' is 2Φ(−1)/(K φ(−1)²) ≈ 5.42/K, where Φ(x) is the Gaussian cumulative distribution function, and φ(x) the Gaussian density function. Since the corresponding values for Adler's method are a mean of α and a variance of 1 − α², we can get a rough correspondence by setting K ≈ 3.5/(1 + α).

For the example of Figure 2, showing overrelaxation by Adler's method with α = −0.89, applied to a bivariate Gaussian with correlation 0.998, ordered overrelaxation should be roughly equivalent when K = 32. Figure 4 shows visually that this is indeed the case. Numerical estimates of autocorrelation times indicate that ordered overrelaxation with K = 32 is about a factor of 22 more efficient, in terms of the number of iterations required for a given level of accuracy, than is Gibbs sampling, when used to estimate E[x_1]. When used to estimate E[x_1²], ordered overrelaxation is about a factor of 14 more efficient. Measured by numbers of iterations, these efficiency advantages are virtually identical to those reported in Section 2.3 for Adler's method.

Plot of x_1 during ordered overrelaxation run with K = 32

Figure 4. Sampling from a bivariate Gaussian with ρ = 0.998 using ordered overrelaxation with K = 32. Compare with the results using Gibbs sampling and Adler's method shown in Figure 2.

Of course, if it were implemented in the most obvious way, with K random variates being explicitly generated in step (1) of the procedure, ordered overrelaxation with K = 32 would require a factor of about 32 more computation time per iteration than would either Adler's overrelaxation method or Gibbs sampling. Adler's method would clearly be preferred in comparison to such an implementation of ordered overrelaxation. Interestingly, however, even with such a naive implementation, the computational efficiency of ordered overrelaxation is comparable to that of Gibbs sampling - the factor of about 32 slowdown per iteration being nearly cancelled by the factor of about 22 improvement from the elimination of random walks. This near equality of costs holds for smaller values of K as well - the improvement in efficiency (in terms of iterations) using ordered overrelaxation
with K = 16 is about a factor of 12 for E[x_1] and 11 for E[x_1²], and with K = 8, the improvement is about a factor of 8 for E[x_1] and 7 for E[x_1²]. We therefore see that any implementation of ordered overrelaxation whose computational cost is substantially less than that of the naive approach of explicitly generating K random variates will yield a method whose computational efficiency is greater than that of Gibbs sampling, when used with any value for K up to that which is optimal in terms of the number of iterations required for a given level of accuracy. Or rather, we see this for the case of a bivariate Gaussian distribution, and we may hope that it is true for many other distributions of interest as well, including those whose conditional distributions are non-Gaussian, for which Adler's overrelaxation method is not applicable.

5. Strategies for implementing ordered overrelaxation
In this section, I describe several approaches to implementing ordered overrelaxation, which are each applicable to some interesting class of distributions, and are more efficient than the obvious method of explicitly generating K random variates. In some cases, there is a bound on the time required for an overrelaxed update that is independent of K; in others the reduction in time is less dramatic (perhaps only a constant factor). As was seen in Section 4.4, any substantial reduction in time compared to the naive implementation will potentially provide an improvement over Gibbs sampling.

There will of course be some distributions for which none of these implementations is feasible; this will certainly be the case when Gibbs sampling itself is not feasible. Such distributions include, for example, the complex posterior distributions that arise with neural network models (Neal 1995). Hybrid Monte Carlo will likely remain the most efficient sampling method for such problems.

5.1. USING THE CUMULATIVE DISTRIBUTION FUNCTION

The most direct method for implementing ordered overrelaxation in bounded time (independently of K) is to transform the problem to one of performing overrelaxation for a uniform distribution on [0, 1], as was done in the analysis of Section 4.3. This approach requires that we be able to efficiently compute the cumulative distribution function and its inverse for each of the conditional distributions for which overrelaxation is to be done. This requirement is somewhat restrictive, but reasonably fast methods for computing these functions are known for many standard distributions (Kennedy and Gentle 1980).

This implementation of ordered overrelaxation produces exactly the same effect as would a direct implementation of the steps in Section 4.1. As there, we aim to replace the current value, x_i, of component i, by a new value, x_i'. The conditional distribution for component i, π(x_i | {x_j}_{j≠i}), is here assumed to be continuous, with cumulative distribution function F(x), whose inverse is F^{-1}(x). We proceed as follows:

1) Compute u = F(x_i), which will lie in [0, 1].

2) Draw an integer r from the binomial(K, u) distribution. This r has the same distribution as the r in the direct procedure of Section 4.1.

3) If r > K − r, randomly generate v from the beta(K − r + 1, 2r − K) distribution, and let u' = uv.
   If r < K − r, randomly generate v from the beta(r + 1, K − 2r) distribution, and let u' = 1 − (1 − u)v.
   If r = K − r, let u' = u.
   Note that u' is the result of overrelaxing u with respect to the uniform distribution on [0, 1].

4) Let the new value for component i be x_i' = F^{-1}(u').
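A minimal sketch of this procedure (assuming NumPy and SciPy; the Gaussian conditional at the end is only an example of a distribution whose CDF and inverse CDF are cheap to evaluate) is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def ordered_overrelax_cdf(x_i, cdf, inv_cdf, K=20):
    """Ordered overrelaxation via the CDF of the full conditional (steps 1-4 above).

    The cost is bounded independently of K: one binomial draw and at most one beta draw."""
    u = cdf(x_i)                                   # step 1
    r = rng.binomial(K, u)                         # step 2
    if r > K - r:                                  # step 3
        u_new = u * rng.beta(K - r + 1, 2 * r - K)
    elif r < K - r:
        u_new = 1.0 - (1.0 - u) * rng.beta(r + 1, K - 2 * r)
    else:
        u_new = u
    return inv_cdf(u_new)                          # step 4

# Example: conditional distribution N(1, 0.5^2).
cond = stats.norm(loc=1.0, scale=0.5)
print(ordered_overrelax_cdf(2.0, cond.cdf, cond.ppf, K=20))
```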
Step (3) is based on the fact (David 1970, p. 11) that the k'th order statistic in a sample of size n from the uniform distribution on [0, 1] has a beta(k, n − k + 1) distribution. Efficient methods are known for generating binomial and beta random variates in expected time that is bounded independently of the values of their parameters (Devroye 1986, Sections IX.4 and X.4). The computation time for an ordered overrelaxation update implemented in this way is therefore bounded, however large K may be, though the time required to compute the cumulative distribution function and its inverse may exceed the time required for a Gibbs sampling update. When the conditional distribution is Gaussian, this implementation allows ordered overrelaxation to be performed in time comparable to that of Adler's method, and in the same spirit as the suggestion of Brown and Woch (1987), a simple transformation may allow it to be applied to some other distributions as well.

5.2. ECONOMIES OF SCALE

In some situations, an ordered overrelaxation update done as in the direct procedure of Section 4.1, by generating K values from the conditional distribution, will itself take less than K times as long as a Gibbs sampling update, because of "economies of scale" in generating K random variates rather than one. Since these K values are all drawn from the same conditional distribution, whatever "setup" computations this distribution requires need be performed only once per update.

One important way in which this can occur is that the dominant contribution to the cost of a Gibbs sampling update may come from computing the parameters of the conditional distribution - which depend on the current values of the other components, and hence must be re-computed whenever another component changes - rather than from generating a single value from this distribution once its parameters are known. If this is so, generating K values rather than one will add relatively little to the total cost of the update, and the dependence of the computation time on K will be much less than proportional.

Other economies of scale can occur whenever a method for generating random values from the conditional distribution becomes more efficient as a succession of values is generated. For example, the adaptive rejection sampling method of Gilks and Wild (1992), which is widely used to implement Gibbs sampling whenever the conditional density is log-concave, operates by constructing a piecewise approximation to the density that is successively refined whenever a rejection occurs. When this scheme is used to generate the K values required for an ordered overrelaxation update, the approximation is refined as the update proceeds, so that values generated later in the update will
take much less time to generate than earlier values. Further time savings can be obtained by noting that the exact numerical values of most of the K values generated are not needed. All that is required is that the number, r, of these values that are less than the current x_i be somehow determined, and that the single value x_i^(K−r) = x_i' be found. In particular, the adaptive rejection sampling method can be modified in such a way that large groups of values are "generated" only to the extent that they are localized to regions where their exact values can be seen to be irrelevant. The cost of ordered overrelaxation can then be much less than K times the cost of a Gibbs sampling update. This is a somewhat complex procedure, however, which I will not present in detail here.

6. Demonstration: Inference for a hierarchical Bayesian model
In this section, I demonstrate the advantages of ordered overrelaxation over Gibbs sampling when both are applied to Bayesian inference for a simple hierarchical model. In this problem, the conditional distributions are non-Gaussian, so Adler's method cannot be applied. The implementation of ordered overrelaxation used is that based on the cumulative distribution function, described in Section 5.1.

For this demonstration, I used one of the models Gelfand and Smith (1990) use to illustrate Gibbs sampling. The data consist of p counts, s_1, ..., s_p. Conditional on a set of unknown parameters, λ_1, ..., λ_p, these counts are assumed to have independent Poisson distributions, with means of λ_i t_i, where the t_i are known quantities associated with the counts s_i. For example, s_i might be the number of failures of a device that has a failure rate of λ_i and that has been observed for a period of time t_i. At the next level, a common hyperparameter β is introduced. Conditional on a value for β, the λ_i are assumed to be independently generated from a gamma distribution with a known shape parameter, α, and the scale factor β. The hyperparameter β is assumed to have an inverse gamma distribution with a known shape parameter, γ, and a known scale factor, δ. The problem is to sample from the conditional distribution for β and the λ_i given the observed s_1, ..., s_p. The joint density of all unknowns is given by the following proportionality:
    P(β, λ_1, ..., λ_p | s_1, ..., s_p)  ∝  P(β) P(λ_1, ..., λ_p | β) P(s_1, ..., s_p | λ_1, ..., λ_p)                (7)

                                          ∝  β^{−γ−1} e^{−δ/β} · ∏_{i=1}^{p} β^{−α} λ_i^{α−1} e^{−λ_i/β} · ∏_{i=1}^{p} λ_i^{s_i} e^{−λ_i t_i}                (8)
The conditional distribution for β given the other variables is thus inverse gamma:
    P(β | λ_1, ..., λ_p, s_1, ..., s_p)  ∝  β^{−pα−γ−1} e^{−(δ + Σ_i λ_i)/β}                (9)
However, I found it more convenient to work in terms of τ = 1/β, whose conditional density is gamma:

    P(τ | λ_1, ..., λ_p, s_1, ..., s_p)  ∝  τ^{pα+γ−1} e^{−τ(δ + Σ_i λ_i)}                (10)
The conditional distributions for the λ_i are also gamma:

    P(λ_i | {λ_j}_{j≠i}, τ, s_1, ..., s_p)  ∝  λ_i^{s_i+α−1} e^{−λ_i(t_i+τ)}                (11)
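As a rough illustration of how these conditionals are used, here is a minimal sketch (assuming NumPy; the synthetic data and initialization follow the description in the next paragraphs, and this is not the S-Plus code used for the experiments) of the Gibbs sampler for this model. An ordered overrelaxation version would simply replace each gamma draw with the CDF-based update of Section 5.1, using the gamma CDF and its inverse.

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic data as described below: p = 100, alpha = 20, delta = 1, gamma = 0.1, true tau = 5.
p, alpha, delta, gamma_shape = 100, 20.0, 1.0, 0.1
t = np.arange(1, p + 1) / p                                   # t_i = i / p
true_lam = rng.gamma(shape=alpha, scale=1.0 / 5.0, size=p)    # lambda_i ~ Gamma(alpha, beta = 0.2)
s = rng.poisson(true_lam * t)                                 # s_i ~ Poisson(lambda_i * t_i)

def gibbs_iteration(tau):
    """One full iteration: update each lambda_i from (11), then tau from (10)."""
    lam = rng.gamma(shape=s + alpha, scale=1.0 / (t + tau))                    # rate t_i + tau
    tau = rng.gamma(shape=p * alpha + gamma_shape, scale=1.0 / (delta + lam.sum()))
    return lam, tau

lam = s / t                                        # initialization described in the text
tau = alpha / np.mean(lam)
for _ in range(600):
    lam, tau = gibbs_iteration(tau)
print("final tau:", round(float(tau), 3))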
In each full iteration of Gibbs sampling or of ordered overrelaxation, these conditional distributions are used to update first the λ_i and then τ. Gelfand and Smith (1990, Section 4.2) apply this model to a small data set concerning failures in ten pump systems, and find that Gibbs sampling essentially converges within ten iterations. Such rapid convergence does not always occur with this model, however. The λ_i and τ are mutually dependent, to a degree that increases as α and p increase. By adjusting α and p, one can arrange for Gibbs sampling to require arbitrarily many iterations to converge.

For the tests reported here, I set p = 100, α = 20, δ = 1, and γ = 0.1. The true value of τ was set to 5 (i.e., β = 0.2). For each i from 1 to p, t_i was set to i/p, a value for λ_i was randomly generated from the gamma distribution with parameters α and β, and finally a synthetic observation, s_i, was randomly generated from the Poisson distribution with mean λ_i t_i. A single such set of 100 observations was used for all the tests, during which the true values of τ and the λ_i used to generate the data were of course ignored.

Figure 5 shows values of τ sampled from the posterior distribution by successive iterations of Gibbs sampling, and of ordered overrelaxation with K = 5, K = 11, and K = 21. Each of these methods was initialized with the λ_i set to s_i/t_i and τ set to α divided by the average of the initial λ_i. The ordered overrelaxation iterations took about 1.7 times as long as the Gibbs sampling iterations. (Although approximately in line with expectations, this timing figure should not be taken too seriously - since the methods were implemented in S-Plus, the times likely reflect interpretative overhead, rather than intrinsic computational difficulty.)

The figure clearly shows the reduction in autocorrelation for τ that can be achieved by using ordered overrelaxation rather than Gibbs sampling. Numerical estimates of the autocorrelations (with the first 50 points discarded) show that for Gibbs sampling, the autocorrelations do not approach
Plot of τ during Gibbs sampling run
Plot of τ during ordered overrelaxation run with K = 5
Plot of τ during ordered overrelaxation run with K = 11
Plot of τ during ordered overrelaxation run with K = 21

Figure 5. Sampling from the posterior distribution for τ using Gibbs sampling and ordered overrelaxation with K = 5, K = 11, and K = 21. The plots show the progress of τ = 1/β during runs of 600 full iterations (in which the λ_i and τ are each updated once).
zero until around lag 28, whereas for ordered overrelaxation with K = 5, the autocorrelation is near zero by lag 11, and for K = 11, by lag 4. For ordered overrelaxation with K = 21, substantial negative autocorrelations are seen, which would increase the efficiency of estimation for the expected value of τ itself, but could be disadvantageous when estimating the expectations of other functions of state. The value K = 11 seems close to optimal in terms of speed of convergence.
7. Discussion

The results in this paper show that ordered overrelaxation should be able to speed the convergence of Markov chain Monte Carlo in a wide range of circumstances. Unlike the original overrelaxation method of Adler (1981), it is applicable when the conditional distributions are not Gaussian, and it avoids the rejections that can undermine the performance of other generalized overrelaxation methods. Compared to the alternative of suppressing random walks using hybrid Monte Carlo (Duane et al. 1987), overrelaxation has the advantage that it does not require the setting of a stepsize parameter, making it potentially easier to apply on a routine basis.

An implementation of ordered overrelaxation based on the cumulative distribution function was described in Section 5.1, and used for the demonstration in Section 6. This implementation can be used for many problems, but it is not as widely applicable as Gibbs sampling. Natural economies of scale will allow ordered overrelaxation to provide at least some benefit in many other contexts, without any special effort. By modifying adaptive rejection sampling (Gilks and Wild 1992) to rapidly perform ordered overrelaxation, I believe that quite a wide range of problems will be able to benefit from ordered overrelaxation, which should often provide an order of magnitude or more speedup, with little effort on the part of the user.

To use overrelaxation, it is necessary for the user to set a time-constant parameter - α for Adler's method, K for ordered overrelaxation - which, roughly speaking, controls the number of iterations for which random walks are suppressed. Ideally, this parameter should be set so that random walks are suppressed over the time scale required for the whole distribution to be traversed, but no longer. Short trial runs could be used to select a value for this parameter; finding a precisely optimal value is not crucial. In favourable cases, an efficient implementation of ordered overrelaxation used with any value of K less than the optimal value will produce an advantage over Gibbs sampling of about a factor of K. Using a value of K that is greater than the optimum will still produce an advantage over Gibbs sampling, up to around the point where K is the square of the optimal value. For routine use, a policy of simply setting K to around 20 may be
reasonable. For problems with a high degree of dependency, this may give around an order of magnitude improvement in performance over Gibbs sampling, with no effort by the user. For problems with little dependency between variables, for which this value of K is too large, the result could be a slowdown compared with Gibbs sampling, but such problems are sufficiently easy anyway that this may cause little inconvenience. Of course, when convergence is very slow, or when many similar problems are to be solved, it will be well worthwhile to search for the optimal value of K.

There are problems for which overrelaxation (of whatever sort) is not advantageous, as can happen when variables are negatively correlated. Further research is needed to clarify when this occurs, and to determine how these situations are best handled. It can in fact be beneficial to underrelax in such a situation - e.g., to use Adler's method with α > 0 in equation (1). It is natural to ask whether there is an "ordered underrelaxation" method that could be used when the conditional distributions are non-Gaussian. I believe that there is. In the ordered overrelaxation method of Section 4.1, step (3) could be modified to randomly set x_i' to either x_i^(r+1) or x_i^(r−1) (with the change being rejected if the chosen r ± 1 is out of range). This is a valid update (satisfying detailed balance), and should produce effects similar to those of Adler's method with α > 0.

Acknowledgements. I thank David MacKay for comments on the manuscript. This work was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Adler, S. L. (1981) "Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions", Physical Review D, vol. 23, pp. 2901-2904.
Barone, P. and Frigessi, A. (1990) "Improving stochastic relaxation for Gaussian random fields", Probability in the Engineering and Informational Sciences, vol. 4, pp. 369-389.
Brown, F. R. and Woch, T. J. (1987) "Overrelaxed heat-bath and Metropolis algorithms for accelerating pure gauge Monte Carlo calculations", Physical Review Letters, vol. 58, pp. 2394-2396.
Creutz, M. (1987) "Overrelaxation and Monte Carlo simulation", Physical Review D, vol. 36, pp. 515-519.
David, H. A. (1970) Order Statistics, New York: John Wiley & Sons.
Devroye, L. (1986) Non-uniform Random Variate Generation, New York: Springer-Verlag.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987) "Hybrid Monte Carlo", Physics Letters B, vol. 195, pp. 216-222.
Fodor, Z. and Jansen, K. (1994) "Overrelaxation algorithm for coupled Gauge-Higgs systems", Physics Letters B, vol. 331, pp. 119-123.
Gilks, W. R. and Wild, P. (1992) "Adaptive rejection sampling for Gibbs sampling", Applied Statistics, vol. 41, pp. 337-348.
Gelfand, A. E. and Smith, A. F. M. (1990) "Sampling-based approaches to calculating marginal densities", Journal of the American Statistical Association, vol. 85, pp. 398-409.
Green, P. J. and Han, X. (1992) "Metropolis methods, Gaussian proposals and antithetic variables", in P. Barone, et al. (editors) Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics, Berlin: Springer-Verlag.
Hastings, W. K. (1970) "Monte Carlo sampling methods using Markov chains and their applications", Biometrika, vol. 57, pp. 97-109.
Kennedy, W. J. and Gentle, J. E. (1980) Statistical Computing, New York: Marcel Dekker.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) "Equation of state calculations by fast computing machines", Journal of Chemical Physics, vol. 21, pp. 1087-1092.
Neal, R. M. (1993) "Probabilistic inference using Markov chain Monte Carlo methods", Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. Obtainable in compressed Postscript by anonymous ftp to ftp.cs.toronto.edu, directory pub/radford, file review.ps.Z.
Neal, R. M. (1995) Bayesian Learning for Neural Networks, Ph.D. thesis, Dept. of Computer Science, University of Toronto. Obtainable in compressed Postscript by anonymous ftp to ftp.cs.toronto.edu, directory pub/radford, file thesis.ps.Z.
Smith, A. F. M. and Roberts, G. O. (1993) "Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods", Journal of the Royal Statistical Society B, vol. 55, pp. 3-23. (See also the other papers and discussion in the same issue.)
Toussaint, D. (1989) "Introduction to algorithms for Monte Carlo simulations and their application to QCD", Computer Physics Communications, vol. 56, pp. 69-92.
Whitmer, C. (1984) "Over-relaxation methods for Monte Carlo simulations of quadratic and multiquadratic actions", Physical Review D, vol. 29, pp. 306-311.
Wolff, U. (1992) "Dynamics of hybrid overrelaxation in the gaussian model", Physics Letters B, vol. 288, pp. 166-170.
Young, D. M. (1971) Iterative Solution of Large Linear Systems, New York: Academic Press.
CHAIN GRAPHS AND SYMMETRIC ASSOCIATIONS
THOMAS
S . RICHARDSON
Statistics Department University of Washington tsr @stat . washington . edu
Abstract . Graphical models based on chain graphs , which admit both directed and undirected edges, were introduced by by Lauritzen , Wermuth and Frydenberg as a generalization of graphical models based on undirected graphs , and acyclic directed graphs . More recently Andersson , Madigan and Perlman have given an alternative Markov property for chain graphs . This raises two questions : How are the two types of chain graphs to be interpreted ? In which situations should chain graph models be used and with which Markov property ? The undirected edges in a chain graph are often said to represent 'symmetric ' relations . Several different symmetric structures are considered , and it is shown that although each leads to a different set of conditional indepen dences, none of those considered corresponds to either of the chain graph Mar kov properties . The Markov properties of undirected graphs , and directed graphs , in cluding latent variables and selection variables , are compared to those that have been proposed for chain graphs . It is shown that there are qualita tive differences between these Markov properties . As a corollary , it is proved that there are chain graphs which do not correspond to any cyclic or acyclic directed graph , even with latent or selection variables .
1. Introduction The use of acyclic directed graphs (often called 'DAG 's) to simultaneously represent causal hypotheses and to encode independence and conditional in dependence constraints associated with those hypotheses has proved fruit ful in the construction of expert systems, in the development of efficient updating algorithms (Pearl [22]; Lauritzen and Spiegelhalter [19]) , and in 231
232
THOMASS. RICHARDSON
inferring causal structure (Pearl and Verma [25]; Cooper and Herskovits [5]; Spirtes, Glymour and Scheines[31]). Likewise , graphical models based on undirected graphs , also known as Mar kov random fields , have been used in spatial statistics to analyze data from field trials , image processing , and a host of other applications (Ham -
mersley and Clifford [13]; Besag [4]; Speed [29]; Darroch et ala [8]). More recently , chain graphs , which admit both directed and undirected edges have been proposed as a natural generalization of both undirected graphs
and acyclic directed graphs (Lauritzen and Wermuth [20]; Frydenberg [11]). Since acyclic directed graphs and undirected graphs can both be regarded as special cases of chain graphs it is undeniable that chain graphs are a generalization in this sense. The introduction of chain graphs has been justified on the grounds that
this admits the modelling of 'simultaneous responses' (Frydenberg [11]), 'symmetric associations' (Lauritzen and Wermuth [20]) or simply 'associative relations ' , as distinct from causal relations (Andersson , Madigan and
Perlman [1)). The existence of two different Markov properties for chain graphs raises the question of what sort of symmetric
relation
is represented
by a chain graph under a given Markov property , since the two properties are clearly
different . A second related
question
concerns
whether
or not
there are modelling applications for which chain graphs are particularly well suited , and if there are , which Markov
property
is most appropriate
.
One possible approach to clarifying this issue is to begin by considering causal systems , or data generating processes, which have a symmetric structure . Three simple , though distinct , ways in which two variables , X
and Y , could be related symmetrically are: (a) there is an unmeasured, ' confounding
' , or ' latent ' variable
that
is a common
cause of both X and
Y ; (b ) X and Yare both causes of some 'selection ' variable (conditioned on in the sample ) ; (c) there is feedback between X and Y , so that X is a cause of Y , and Y is a cause of X . In fact situations (a) and (b ) can easily be represented by DAGs through appropriate extensions of the formalism
(Spirtes, Glymour and Scheines[31]; Cox and Wermuth [7]; Spirtes, Meek and Richardson [32)). In addition , certain kinds of linear feedbackcan also be modelled with directed cyclic graphs (Spirtes [30]; Koster [16); Richardson [26, 27, 28]; Pearl and Dechter [24]). Each of these situations leads to a different set of conditional independences . However , perhaps surprisingly , none of these situations , nor any combination of them , lead in general to either of the Markov properties associated with chain graphs . The remainder of the paper is organized as follows : Section 2 contains definitions of the various graphs considered and their associated Markov properties . Section 3 considers two simple chain graphs , under both the ori ginal Markov property proposed by Lauritzen , Wermuth and Frydenberg ,
CHAIN GRAPHS AND SYMMETRIC ASSOCIATIONS
233
and the alternative given by Andersson , Madigan and Perlman . These are compared to the corresponding directed graphs obtained by replacing the undirected edges with directed edges in accordance with situations (a) , (b) and (c) above. Section 4 generalizes the results of the previous section : two properties are presented , motivated by causal and spatial intuitions , that the set of conditional independences entailed by a graphical model might satisfy . It is shown that the sets of independences entailed by (i ) an undirected graph via separation , and (ii ) a (cyclic or acyclic ) directed graph (possibly with latent and ! or selection variables ) via d-separation , satisfy both pro perties . By contrast neither of these properties , in general , will hold in a chain graph under the Lauritzen - Wermuth -Frydenberg (LWF ) interpreta tion . One property holds for chain graphs under the Andersson -Madigan Perlman (AMP ) interpretation , the other does not . Section 5 contains a discussion of data -generating processes associated with different graphical models , together with a brief sketch of the causal intervention theory that has been developed for directed graphs . Section 6 is the conclusion , while proofs not contained in the main text are given in Section 7.
2. Graphs and Probability Distributions Thissectionintroduces thevariouskindsof graphconsidered in this paper, togetherwith their associated Markovproperties . 2.1. UNDIRECTED ANDDIRECTED GRAPHS An undirected graph , UG , is an ordered pair (V , U ) , where V is a set of vertices and U is a set of undirected edges X - Y between vertices .l Similarly , a directed graph , DG , is an ordered pair (V , D ) where D is a set of directed edges X - + Y between vertices in V . A directed cycle consists of a sequence of n distinct edges Xl - + X2 - + . . . - + Xn - + Xl (n ~ 2) . If a directed graph , DG , contains no directed cycles it is said to be acyclic , otherwise it is cyclic . An edge X - + Y is said to be out of X and into Y ; X and Yare the endpoints of the edge. Note that if cycles are permitted there may be more than one edge between a given pair of vertices e.g~ X t - Y t - X . Figure 1 gives examples of undirected and directed graphs . 2.2. DIRECTED GRAPHS WITH LATENT VARIABLES AND SELECTION
VARIABLES
Cox and Wermuth [7] and Spirtes et al. [32] introduce directed graphs in which V is partitioned into three disjoint sets 0 (Observed), S (Selection) lBold face (X ) denote sets; italics (X ) denote individual vertices; greek letters (7r) denote paths.
234
THOMASS. RICHARDSON UG1
A -
A
C
I B - - D
UG2
B-
(a)
I
t - D
DGI
-t-C A ~C A A DG : \ J --DB-C~-o 1-~ B DG2
(b)
(c)
Figure1. (a) undirected graphs ; (b) a cyclicdirectedgraph; (c) acyclicdirectedgraphs
and L (Latent ), written DG (O, S, L ) (where DG may be cyclic). The interpretation of this definition is that DG representsa causalor data-generating mechanism; 0 representsthe subset of the variables that are observed; S represents a set of selection variables which, due to the nature of the mechanism selecting the sample, are conditioned on in the subpopulation from which the sample is drawn; the variables in L are not observedand for this reason are called latent.2 Example.. Randomized Trial of an Ineffective Drug with Unpleasant Side-Effects3 A simple causal mechanism containing latent and selection variables is given in Figure 2. The graph represents a randomized trial of an ineffective drug with unpleasant side-effects. Patients are randomly assigned to the treat ment or control group (A ). Those in the treatment group suffer unpleasant side-effects , the severity of which is influenced by the patient 's general level of health (H ) , with sicker patients suffering worse side-effects. Those patients who suffer sufficiently severe side-effects are likely to drop out of the study . The selection variable (Sel ) records whether or not a patient remains in the study , thus for all those remaining in the study S el = Stay In . Since unhealthy patients who are taking the drug are more likely to drop out , those patients in the treatment group who remain in the study tend to be healthier than those in the control group . Finally health status (H ) influences how rapidly the patient recovers. This example is of interest because, as should be intuitively clear , a simple comparison of the recovery time of the patients still in the treatment and control groups at the end of the study will indicate faster recovery among those in the treatment group . This comparison falsely indicates that the drug has a beneficial effect , whereas in fact , this difference is due entirely to the side-effects causing the sicker patients in the treatment group to drop out of the study .4 (The only difference between the two graphs in Figure 2 is that in DG1 (01 , 81 , L :J.) 2Note that the terms variable and vertex are used interchangeably. 31 am indebted to Chris Meek for this example. 4For precisely these reasons, in real drug trials investigators often go to great lengths to find out why patients dropped out of the study.
235 R
I
AssignmentA
H
(Treatment /COntrol )" " SideEffects
...............
Selection (StayIn/ Drop
DGl(Ot,SiL.)
01={A,Ef,H,R} 81={Sel} L1=0
02, ={A.Ef,R} 82 , ={Sel} L2,={H}
DG2 (02,sz,Lz)
Figure 2. Randomizedtrial of an ineffectivedrug with unpleasantside effectsleading to drop out. In DG1(Ol , Sl ,Ll ), H E 01 , and is observed , while in DG2(O2, S2, L2) H E L2 and is unobserved(variablesin L are circled; variablesin S are boxed; variables in 0 are not marked) .
A .1/.-C A ..-C IB I -..D B
A - . . c
CG1
A
B- J - D
B - - D
(a)
Figure 3.
CG2
I
(b)
(a) mixed graphs containing partially directed cycles; (b) chain graphs.
health status (H ) is observed so H E 01 , while in DG2 (O2 , 82 , L2 ) it is not observedso H E L2 .) 2 .3 .
MIXED
GRAPHS
AND
CHAIN
GRAPHS
In a mixed graph a pair of vertices may be connected by a directed edge or an undirected edge (but not both ) . A partially directed cycle in a mixed graph G is a sequence of n distinct edges (El " ' " En ) , (n ~ 3) , with endpoints Xi , Xi +l respectively , such that : (a) Xl == Xn + l ,
(b) ~i (1 ~ i ~ n) either Xi - Xi +l or Xi - * Xi +l , and (c) ~j (1 ~ j ::; n) such that Xj - tXj +l . A chain graph C G is a mixed graph in which there are no partially direc -
ted cycles (see Figure 3). Koster [16] considersclassesof reciprocal graphs containing directed and undirected edges in which partially directed cycles are allowed . Such graphs are not considered separately here, though many of the comments which apply to LWF chain graphs also apply to reciprocal graphs since the former are a subclass of the latter . To make clear which kind of graph is being referred to UG will denote undirected graphs , DG directed graphs , CG chain graphs , and G a graph
236
THOMASS. RICHARDSON
which
may
be
whatever
exists
a
has
)
Xi
Y
Xi
the
is
. 4
l
Xi
(
is
path
of
THE
E
X
,
,
to
X
no
( X
1
~
i
~
n
.
.
.
-
,
Xl
i . e
+
is
Y
,
,
.
.
.
.
Ei
occurs
it
+
)
and
( El
: =
vertex
.
.
.
,
,
is
in
En
Xn
-
graph
G
such
l
that
: =
Xi
than
A
a
)
+
Xi
more
cyclic
Y
.
Y
+
)
l
,
once
directed
( of
there
where
Xi
-
on
Ei
+
Xi
+
the
path
l
,
path
from
X
to
.
ASSOCIATED
a
UG2
WITH
I
,
and
{
'
than
X
is
I
C
AlLB
I
C
AlLB
I
C
most
terms
a of
the
I
XJL
Y
Z
instead
Y
{
of
;
;
(
. lL
Z
B
{
V
D
{
is
I
( a
)
sets
from
a
-
of
variable
variable
separation
e
}
a
~
(
Z
tail
in
UG
the
;
I
}
{
C
,
in
Fu
Z
,
)
. 7
following
conditio
-
C
'
f
{
A
,
D
,
C
}
;
;
AlLD
I
{
B
}
;
are
independence
listed
.
relations
For
instance
,
note
.
rather
than
define
a
vertices
.
vertices
vertices
unique
path
( Note
that
in
;
in
,
since
a
a
directed
cyclic
there
chain
may
graph
be there
. )
introduced .
BlLC
.
}
,
;
}
I
)
a
D
D
AlLD
empty
{
of
of
BlLC
' elementary
edges
pair
;
I
general
as
Here
the
a
means
for
global
deriving
property
the
is
conse
defined
-
directly
.
independent
of braces
Z
,
I
criterion
Y
disjoint
path
some
by
}
D
A
}
of
a
,
conditions
' X
JL
C
be
,
are
convenient
}
for
re
.
AlLE
only
pair
Markov
{
I
in
given
,
;
may
{
not
graphical
of
D
sequence
each
When
,
no
include
Z
l
B
D
and
}
that .
.
all
Z
{
I
conditions
means
used
I
B1l
a
relevant
'
not
Figure
AlLB
does
local
is
separated
I
I
between
UG
there
;
Yare
AlLB
between
if
by
AlLD
a
independence
:
A
as
,
does
in
;
C
}
Markov
Z is
{
here
set
Y
I
vertices
edge
7 ' XJL
I
. lL
edge
one
of
D
conditional
Property
and
,
of
global
quences
.
defined
one
that
separation
entails
sequence
6aften
X
AlLD
}
also
5 ' Path
more
if
throughout
form
UGI
,
)
separated
graphs
B1l
Here
Z
of
graph
empty
Markov
undirected
Fu
Y
be
set
undirected
be
E
to
a
an
may
Y
Y
Fu
a
In
Z
via
UG1
graph
. 6
(
said
XJl
the
the
,
variable
Yare
Fu
that
associates
Z
independences
Y
(
If
-
Global
Thus
VJL
l
X
edges
PROPERTY
G
and
and
UG
tion
between
of
otherwise
X
graph
Y
Undirected
in
,
property
a
X
at
. 5
MARKOV
with
then
is
)
form
Markov
lations
of
path
GRAPHS
vertices
nal
+
n
acyclic
GLOBAL
global
X
~
A
sequence
vertices
Xi
i
the
UNDIRECTED
A
~
.
a
distinct
and
1
path
a
.
+
these
of
of
,
~
then
of
consists
sequence
endpoints
or
2
anyone
type
are
Y omitted
given
Z from
' ;
if
Z singleton
=
0
,
the sets
abbrevia {
V
}
,
e
. g
.
2 .5 .
THE
CHAIN
GRAPHS
GLOBAL
MARKOV
DIRECTED
AND
SYMMETRIC
PROPERTY
237
ASSOCIATIONS
ASSOCIATED
WITH
GRAPHS
In a directed graph DG , X is a parent of Y , (and Y is a child of X ) if there
is a directed edgeX -)- Y in G. X is an ancestorof Y (and Y is a descendant of X ) if there is a directed path X -)- . . . -)-Y from X to Y , or X := Y . Thus
'ancestor' ('descendant') is the transitive , reflexive closure of the 'parent' ('child ') relation . A pair of consecutive edges on a path 7r in DG are said to collide at vertex A if both edges are into A , i .e. - +A ~ , in this case A is called
a collider
on 7r, otherwise
A is a non - collider
on 7r. Thus
every
vertex on a path in a directed graph is either a collider , a non-collider , or
an endpoint. For distinct vertices X and Y , and set Z <; V \ { X , Y } , a path 7r between X and Y is said to d-connect X and Y given Z if every collider on
7r is an
disjoint
ancestor
of a vertex
in
Z , and
no
non - collider
on
7r is in
sets X , Y , Z , if there is an X E X , and Y E Y , such that
Z . For
there
is a path which d-connects X and Y given Z then X and Yare said to be d-connected given Z . If no such path exists then X and Yare said to be
d-separatedgiven Z (seePearl [22]). Directed
Global
Markov
Property
; d - separation
(I =DS)
DG I=DSXli Y I Z if X and Yare d-separatedby Z in DG . Thus the directed graphs in Figure 1(b,c) entail the following conditional independences
via d- separation :
DG1
I=DS BJlC I { A , D } ;
DG2
I=DS BJlC I A ; AlLD I { B , C} ;
DG3
I=DS AJlB I C; AlLB I { a , D } ; AlLD I C; AJlD I { B , a } ; BJlD I C; BlLD I { A , C } .
Note that the conditional independences entailed by DG3 under d-separation 2 .6 .
THE
DIRECTED
are precisely GLOBAL GRAPHS
those entailed MARKOV WITH
by U G 2 under separation .
PROPERTY LATENT
AND
ASSOCIATED SELECTION
WITH VARIABLES
The global Markov property for a directed graph with latent and/ or selection variables is a natural extension of the global Markov property for
directed graphs. For DG (O , S, L ), and XUY (JZ ~ 0 define:
DG(O, S, L) FDSXlL Y I Z if andonly if DG FDSXJl Y I Z U S.
238
THOMASS. RICHARDSON
In
by
other
DG
led
words
(
by
O
,
S
the
,
L
)
is
,
are
under
in
0
in
patients
observed
in
S
will
S
are
See
Spirtes
and
the
DGI
.
)
= =
,
DS
the
health
[
,
81
,
(
. 7
.
O2
,
Ll
AJlR
I
DG2
is
]
,
L2
)
GLOBAL
)
,
the
)
{
H
,
Spirtes
,
S
,
}
,
;
unobserved
,
,
by
)
Meek
(
Thus
. 2
,
the
,
81
a
all
the
of
(
O
conditio
I
. .
llR
)
-
S
Richardson
L1
only
variables
set
P
,
not
relation
entails
and
O1
. . llR
I
AJlR
82
L
.
=
[ 32
,
shown
I
{
in
In
]
;
)
.
Cox
Figure
2
,
:
A
O2
.
are
independence
O
,
L
which
2
distribution
DG1
Sel
(
Section
In
conditional
FDS
in
in
Stay
S
variables
variables
=
the
in
observed
in
-
and
variables
only
example
(
,
subpopulation
observed
;
selection
the
a
DG
the
33
the
L
Sel
graph
entailed
82
all
,
independences
graph
status
in
every
in
the
.
Hence
hold
independences
DG2
.
Thus
OI
S
entai
occur
involving
which
conditional
(
I
However
2
]
. g
in
sample
following
DG1
tioned
7
,
variables
from
e
for
Richardson
[
the
since
those
,
upon
in
O
drawn
on
conditioned
hold
Wermuth
since
are
were
)
(
relations
latent
implicitly
DG
entailed
independence
no
relations
samples
which
entails
(
relations
those
which
of
conditioned
independences
and
of
in
independence
to
nal
,
includes
,
be
observed
independence
subset
DG
interpretation
Similarly
variables
conditional
the
conditional
.
of
exactly
always
the
,
observed
set
graph
set
Since
the
directed
conditioning
(
,
I
L2
)
H
{
;
H
O2
graph
A
,
does
tt
the
H
Ef
,
not
,
Sel
}
(
Ef
any
neither
DG1
,
,
,
independences
of
OI
}
.
entail
so
H
81
,
the
,
L1
above
)
is
men
-
entailed
by
.
MARKOV
PROPERTIES
ASSOCIATED
WITH
CHAIN
GRAPHS
There
are
for
2
.
7
A
.
1
The
path
V
I
is
in
from
(
V
a
if
any
graph
are
that
}
both
of
CG
from
in
Markov
be
it
see
V
\
to
Studeny
and
original
Bouckaert
set
W
on
induced
all
have
the
and
W
if
(
1r
,
there
X
-
with
Ant
re
-
formulated
chain
graph
[ 36
]
,
Andersson
on
W
)
of
CG
=
endpoint
in
,
)
(
an
is
tY
subgraph
edges
been
is
property
edges
the
Wand
a
directed
X
denote
properties
applied
(
)
proposed
Markov
to
all
between
W
been
relation
graph
anterior
which
is
(
vertices
may
be
in
Y
Let
these
that
derived
.
to
W
have
independence
chain
said
E
which
conditional
Frydenberg
is
such
all
,
-
W
W
removing
criteria
a
Wermuth
some
to
recently
properties
definitions
graph
to
)
Markov
both
chain
anterior
separation
undirected
In
-
V
by
8More
global
.
Lauritzen
V
7r
obtained
a
.
path
the
different
graphs
vertex
a
{
two
chain
terms
rather
of
than
et
an
ale
[ 2
]
)
.
CHAINGRAPHSAND SYMMETRICASSOCIATIONS
II B - -- D
AI XI B .:_-- : n
(a)
(b)
I
B -
C -
/ 1""
D
B -
C -
(c)
239
D
(d)
Figure 4. Contrast between the LWF and AMP Markov properties. Undirected graphs used to test AJLD I { B , C } in CGl under (a) the LWF property , (b) the AMP property . Undirected graphs used to test BJLD I C in CG2 under (c) the LWF property , (d) the AMP property .
in V \ W . A complex in CG is an induced subgraph with the following form: X - + Vl - . . . - Vn ~ y (n ~ 1). A complex is moralized by adding the undirected edge X - Yo Moral (CG) is the undirected graph formed by moralizing all complexesin CG, and then replacing all directed edgeswith undirected edges. LWF
Global
CG
Markov
FLWF
XlL
graph
Moral
Hence
the
Y
I
( CG
Z
if
( Ant
chain
independences
Property
AJLB
CG2
FLWF
AJLB
;
that
I
,
2 . 7 .2 .
The
In
a
and
chain
has
an
between
all X
and
the
( W
a
) CG
set
on
the
W
on
}
{ a
I
{ A
same
entail
;
BJlC
, D
}
, a
the
I
;
}
and
( W if
path ' !r . g
there are
( See
A is
to
path ( X 5 .)
conditional
;
;
AlLD
I
by
graph be
{ B
CG2
, a
}
;
under
DG3
the
under
Now
Markov
connected
and
d - sep
-
,
Con
subgraph edges V
in
' !r
from
- t
Y
a
in
and
there ( W ,
chain V
)
property if
W
directed
directed Figure
undirected
1 ) .
vertex a
C
}
by
extended all
) ) .
, D
entailed
V
The
contains
( Con W
said
.
the
.
between }
{ A
I
chain
Ware
W
in
following
AJLD
Figure
- Perlman
E
Z
)
:
those
( see
and
by
entailed
as
edges W
in of
edges
I
BJLD
V
some Con
, a
separation
vertices
edges ancestor
;
)
property
AiLB
undirected
set
3 ( b
{ B
- Madigan
to
vertex
which
under
only
undirected be
G2
graph
connected
;
Y
( PLWF
) ) ) .
independences
are
Andersson
containing is
U
C
I
Graphs
from
Z
Markov
conditional
property
aration
C
I
the
Markov
U
Figure
AJLD
Chain
separated
Y
LWF
BJLD
Notice
U
in
the
! = LWF
is
( X
graphs
under
GCl
LWF
X
for
is )
=
Ext
CG
( W
some are
such
path I
( CG
graph to
a { V
V
, W
) ,
) ,
and
all
is
said
to
W
E that
W
in Y
is
let
9Note that other authors, e.g. Lauritzen [17], have used 'ancestral' to refer to the set named ' anterior ) in Section 3.
240
THOMASS. RICHARDSON
A ,B-D -e",oE -. -X
A
A
B -... E -....X
B
I
x
- x
I C- eF
(b)
(a)
(e)
(c)
Figure 5. Constructing an augmented and extended chain graph: (a) a chain graph CG ; (h) directed edgesin Anc ({ A , E , X } ); (c) undirected edgesin Con(Anc ({ A , E , X } )) (d) Ext (CG , Anc ({ A , E , X } )); (e) Aug (Ext (CG , Anc ({ A , E , X } ))). X
Z
J/
X
i/
Z
X
!/
Z
X-
Z
J/
(a)
(b)
X --.. .A
X-
y- !
t~ ~
(c)
A
(d)
Figure 6. (a) Triplexes (X , Y, Z ) and (b) the corresponding augmented triplex . (c) A chain graph with a hi-flag (X , A , B , Y ) and two triplexes (A , B , Y ), (X , A , B ) ; (d) the corresponding
augmented chain graph .
Anc (W ) = { V I V is an ancestor of some W E W } . A triple of vertices (X , Y, Z ) is said to form a triplex in CG if the
induced subgraph CG ({ X , Y, Z } ) is either X - + Y - Z , X - + y ~ Z , or X - Y .(- Z . A triplex is augmented by adding the X - Z edge. A set of four vertices (X , A , B , Y ) is said to form a bi-flag if the edges X - + A , Y - + B ,
and A - B are present in the induced subgraph over { X , A , B , Y } . A bi-flag is augmented by adding the edge X - Yo Aug (CG ) is the undirected graph formed by augmenting all triplexes and bi -flags in CG and replacing all
directed edgeswith undirected edges(seeFigure 6). Now let Aug[CG ; X , Y , Z] = Aug(Ext (CG , Anc(X U Y U Z))). AMP
Global Markov Property
(FAMP)
GG FA MP XJl Y I Z if X is separated from Y by Z in the undirected graph Aug [GG; X , Y , Z]. Hence the conditional
independence
relations
associated with the chain
graphs in Figure 3(b) under the AMP global Markov property are: GGl
FAMP AJlB ; AlLB I C; AJlB I D ; AlLD ; AlLD I B ; BlLC ; BJlC I A ;
~
18
CG2 FAMP AlLB ; AlLB I D ; AlLD ; AlLD I B ; BlLD , { A , a } .
241
:C ASSOCIATIONS CHAINGRAPHS ANDS"YrMMETRJ A- - C I B- - D
A ')[] B--c -eIJ
A - - C ' (f) B - eIJ " """-'
DG1u (Ola,Sla,Lla)
LWF
the
and
special
graph
,
,
directed
. 8
.
X
- -
graphs
said
to
XlL
Y
,
G1
Y
global
XlL
I
Z
,
and
said
.
I
weakly
is
R
is
P
in
said
be
G
R
be
Markov
for
]
;
all
I
3 . Directed
in
.
strongly
FR
X
Y
I
here
[ 12
] ;
and
Graphs
if
Z
if
are
with
Y
[ 36
vertex
and
in
]
dis
-
dashed
]
to
[ 11
] ;
;
Andersson
Symmetric
such
Spirtes
be
V
,
,
a
Y
G
Z
Z
Y
Z
strongly
in
( and
] ;
et
al
G
FR
to
RX1L
Y
be
I
Z
property
[ 2
P
[ 21
] )
.
hence
Meek
.
-
distribution
I
[ 30
,
said
The
- MarkovianR
XlL
sep
given
distribution
~
.
-
a
and
G
I
FR2
d
For
is
Y
G2
under
.
that
XR
a
if
if
DG3
X
,
are
only
property
Z
is
respectively
set
which
only
known
R2
and
,
Markov
there
and
Frydenberg
Bouckaert
,
,
if
subsets
P
complete
XlL
with
global
sets
Z
equivalent
disjoint
A
I
Markov
G
for
P
disjoint
properties
Studeny
Z
using
.
property
all
graph
distribution
( Geiger
[ 31
Y
Y
Markov
are
and
both
[ 7
by
R1
XlL
LWF
if
- MarkovianR
G
,
Wermuth
property
)
of
and
for
directed
generalization
Cox
properties
FRI
separation
XJl
if
G
to
complete
the
)
,
COMPLETENESS
G1
- MarkovianR
implies
which
global
under
under
a
properties
AMP
Markov
if
property
Z
a
global
CG2
UG2
the
AND
under
.
- separation
( acyclic
are
Markov
under
( d
undirected
graphs
AMP
equivalent
complete
there
DG)(,{ Olc,slc,Llc )
separation
an
property
undirected
and
graphs
G2
Thus
to
Y
,
Markov
Markov
is
and
LWF
chain
is
either
EQUIVALENCE
be
aration
in
with
which
with
the
MARKOV
Two
graph
graphs
- -
coincide
chain
graphs
between
lines
ale
a
chain
tinguish
P
properties
of
Thus
acyclic
2
AMP
case
.
lc ={A,B,C,D} SIc=0 Ltc =0
A chain graph and directed graphs in which C and D are symmetrically
Figure 7. related.
Both
B- Jt
tb ={A,B,C,D} SIb={S} Lib =0 DG1h (Olb,slb,Llb)
la ={A,B,C,D} Sla =0 Lla ={ 71
CG )
A - - C
All
of
the
weakly
] ;
Spirtes
)
et
.
Relations
In this section the Markov properties of simple directed graphs with symmetrically related variables are compared to those of the corresponding chain graphs . In particular , the following symmetric relations between variables X and Yare considered : (a) X and Y have a latent common cause; (b ) X and Yare both causes of some selection variable ; (c) X is a cause of Y , and Y is a cause of X , as occurs in a feedback system . The conditional independences relations entailed by the directed graphs
242
THOMAS S. RICHARDSON
A B J D ~~
A B- J - D
CG2
A J B ~~
A B- l ~ D
Z8={A,B,C,D} S2 ,a=0 LZ8={ Tl,TZ}
Zbs(A.B.C.D} SZb s(SI.S2} LZb=0
Ok ={A,B,C,D} ~ =0 L:zc=0
DG2a ( 1a,s2a , L2a )
DG2b ( 2b,s2b , L2~
DG~ 2c,8::zc , L2 .c)
Figure 8. A chain graph and directed graphs in which the pairs of vertices B and C , and C and D , are symmetrically related.
in Figure 7 are:
DG1a(Ola, Sla, L1a) FDSAlLB ; AlLB I C; AiLB I D; AlLD ; AJlD I B; BJlC ; BJlC I A;
DG1b (OIb,SIb,LIb) FnsAJlB I C; AJlB I D; BJlC I D; AJlD I C; AJlB I {a,D};AJlD I {B,a};BJlC I {A,D}; DG1c (OIc , SIc , LIc ) FDS AJlB ; AJlB
I {a ,D }.
It follows that none of these directed graphs is Markov equivalent to Gal under the LWF Markov property . However , DGla (Ola , Sla , L1a ) is Markov equivalent to GGl under the AMP Markov property . Turning now to the directed graphs shown in Figure 8, the following conditional independence relations
are entailed
:
DG2a(O2a,S2a,L2a) FDS AlLB ; AJLB I D; AlLD; AlLD I B; BJLD; BJlD I A; DG2b(O2b, S2b, L2b) FDS ARB I C; ARB I {a , D} ; AlLD I C; ARD I {B , a } ; BlLD I C; BlLD I {A, a } ; DG2c(O2c, S2c, L2c) doesnot entail any conditionalindependences . I t follows that none of these directed graphs is Markov equivalent to CG2 under the AMP Markov property. However, DG2b(O2b, S2b, L2b) is Markov equivalent to CG2 under the LWF Markov property. Further , note that DG2b(O2b, S2b, L2b) is also Markov equivalent to UG2 (under separation ) and DG3 (under d-separation) in Figure l (a). There are two other simple symmetric relations that might be considered: (d) X and Y have a common child that is a latent variable; (e) X and Y have a common parent that is a selection variable. However, without additional edgesX and Y are entailed to be independent (given S)
CHAIN
GRAPHS
AND
SYMMETRIC
243
ASSOCIATIONS
in these configurations , whereas this is clearly not the case if there is an edge between X and Y in a chain graph . Hence none of the simple directed graphs with symmetric relations corresponding to CG1 are Markov equivalent to CG1 under the LWF Markov property , and likewise none of those corresponding to CG2 are Markov equivalent to CG2 under the AMP Markov property . In the next section a stronger result is proved : in fact there are no directed graphs , however complicated , with or without latent and selection variables , that are Markov equivalent to CG1 and CG2 under the LWF and AMP Markov properties , respecti vely . 4 . Inseparability
and Related
Properties
In this section two Markov properties , motivated by spatial and causal intuitions
, are introduced
. It is then shown that
these Markov
properties
hold for all undirected graphs , and all directed graphs (under d-separation ) possibly
with
latent
and selection variables . Distinct
vertices X and Yare
inseparableR in G under Markov Property R if there is no set W such that
G FR XJlY
I W . If X and Yare not inseparableR , they are separableRo Let
[G]knsbe the undirectedgraph in whichthere is an edgeX - Y if and only if X and Yare
inseparable R in Gunder
R . Note that in accord with the
definition of Fvs for DG (O, S, L ), only vertices X , Y E 0 are separableDs
or inseparableDs , thus [DG(O, S, L)]~: is definedto havevertex set 00 For an undirectedgraph model [UG]1nsis just the undirectedgraph U G . For an acyclic , directed graph (without latent or selection variables )
under d-separation , or a chaingraph undereither Markov property [G]~S is simply the undirected graph formed by replacing all directed edges with
undirectededges , hencefor any chaingraph CG, [CG]~n~p = [CGJln ; Fo In any graphical model , if there is an edge (directed or undirected ) bet ween a pair of variables
then those variables
are inseparable R. For undirec -
ted graphs , acyclic directed graphs , and chain graphs (under either Markov
property ), inseparability R is both a necessaryand a sufficient condition for the existence of an edge between a pair of variables . However , in a direc ted graph with cycles, or in a (cyclic or acyclic ) directed graph with latent and / or selection variables , inseparability DSis not a sufficient condition for there to be an edge between a pair of variables (recall that in DG (O , S, L ), the entailed
conditional
independences
are restricted
to those that are ob -
servable) .
An inducing path betweenX and Y in DG (O , S, L ) is a path 7rbetween X and Y on which (i) every vertex in 0 U S is a collider on 7[, and (ii ) every collider is an ancestor of X , Y or S.10 In a directed graph , DG (O , S, L ), loThe notion of an inducing path was first introduced
for acyclic directed graphs with
244
THOMAS S. RICHARDSON A .....- B
A-...c~1B
A ---..0 - - ' B
A ~ 1WB
Figure 9. Examples of directed graphs DG (O , S, L ) in which A and B are inseparableDs.
variables X , Y EO , are inseparable DSif and only if there is an inducing
path between X and Y in DG (O, S, L ).ll For example, C and D were inseparableDs in DGla (Ola , Sla , LIa ) , and DG1b(Olb , SIb , LIb ), while in DG1c (Olc , SIc , L1c ) A and B were the only separableDs variables . Figure 9 contains further examples of graphs in which vertices are inseparable DS. 4 .1 .
' BETWEEN
A vertex
SEPARATED
' MODELS
B will be said to be betweenR X and Y in G under Markov
pro -
perty R , if and only if there exists a sequence of distinct vertices (X ==
Xo, Xl , . . . ,Xn := B , Xn+l , . . . , Xn+m := Y) in [G]~Ssuchthat eachconse cutive
pair of vertices
Xi , Xi + l in the sequence are inseparableR
in G un -
der R . Clearly B will be betweenR X and Y in G if and only if B lies
on a path betweenX and Y in [G]~ns. The set of verticesbetweenX and Y under property R in graph G is denoted BetweenR(G; X , Y ), abbreviated to Between R(X , Y ), when G is clear from context. Note that for any chain graph CG , BetweenLWF (CG ; X , Y ) = BetweenAMP (CG ; X , Y ), for all vertices
X
and
BetweenR
Y .
Separated
Models
A model G is betweenR separated, if for all pairs of vertices X , Y and sets W (X , Y Et W ) :
G FR XlLY I W =~ G Fn XlLY I W n Betweenn(G; X , Y )
(where{X , Y} U W is a subsetof the verticesin [G]~ns). It follows that if G is betweenR separated , then in order to make some
(separable ) pair of vertices X and Y conditionally independent , it is always sufficient to condition on a subset (possibly empty ) of the vertices that lie on paths between X and Y . The intuition that only vertices on paths between X and Yare relevant to making X and Y independent is related to the idea, fundamental to much latent variables in Verma and Pearl [37]; it was subsequently extended to include selection variables in Spirtes, Meek and Richardson [32]. 11Inseparability
DS is a necessary and sufficient condition for there to be an edge between
a pair of variables in a Partial Ancestral Graph (PAG), (Richardson [26, 27]; Spirtes et ale [32, 31]) , which represents structural features common to a Markov equivalence class of directed graphs .
CHAIN
[
G
]
GRAPHS
AND
InS
SYMMETRIC
P
R
S
.
ASSOCIATIONS
X
A
t
\
-
BCY
Figure
10
,
D
of
,
E
.
,
F
BetweenR
,
Q
,
T
graphical
in
the
in
then
)
are
All
:
i
,
)
All
/
Suppose
a
(
X
Y
WnBetweens
lows
is
,
there
on
Xn
1r
+
l
,
arables
V
,
is
then
.
(
E
.
.
,
Xn
+
,
separated
.
m
X
,
,
d
-
S
X
,
A
and
)
.
Y
Y
)
,
the
which
are
shown
by
12Where
' A
of
A
,
or
and
they
B
are
share
,
]
)
)
=
{
A
,
should
be
,
and
dependent
.
.
here
.
It
is
easy
without
'
to
see
selection
separated
'
for
or
by
directed
'
d
-
separ
graphs
-
with
.
I
1r
in
does
W
U
E
,
G
not
V
\
UG
~
s
given
=
pairs
=
XO
of
pair
I
,
X
,
X1
,
Y
,
)
.
vertices
of
and
W
(
X
XlLY
X
Betweens
(
contradiction
but
connecting
connect
W
vertices
it
.
.
.
)
.
fol
,
if
Xn
=
insep
separated
in
I
B
,
BetweenLWF
LWF
(
AlLD
I
or
Figure
{
{
=
-
Hence
0
betweenLWF
AlLD
-
But
are
variables
Y
.
CG2
,
present
directly
separated
each
and
B
12
given
consecutive
but
.
XlLY
of
Y
are
appendix
and
,
)
3
G
}
}
:
,
CGl
G
betweenAMP
;
A
,
D
)
=
{
G
}
and
.
that
FAMP
causally
some
14
proof
path
X
Y
quantities
replacing
s
;
,
separated
The
=
G
X
they
betweens
path
1r
not
~
GG2
cause
I
a
FLWF
note
)
.
;
,
graphs
the
UG
CG1
'
property
0
G
interact
two
between
a
(
(
then
is
sequence
is
CoConR
correspondence
directed
in
edge
,
betweenDS
=
that
an
separableLwF
AMP
are
this
GGl
For
)
on
}
natural
[
is
such
is
)
a
if
'
a
=
L
L
V
Q
CoConR
models
=
Since
graphs
is
Dare
,
there
Y
,
in
Hare
is
Then
F
Hausman
connected
GGl
so
S
,
.
,
regions
to
vertex
=
chain
This
0
'
E
that
directly
=
there
general
,
constitutes
because
(
O
T
dependent
is
graphs
(
,
contiguous
(
by
some
7r
Betweens
In
)
(
that
V
V
D
not
state
contradiction
,
,
are
variables
for
Betweens
given
.
C
graph
selection
,
n
. e
,
This
only
over
i
B
vertices
undirected
'
or
,
.
DG
connected
/
if
"
DF
vertices
undirected
for
(
and
A
are
connected
carries
'
{
,
which
proof
and
=
S
that
variables
'
)
that
graphs
proof
latent
V
(
Y
graphically
causally
The
the
latent
,
,
directed
Proof
that
X
Rand
principles
1
W
;
,
intuition
Theorem
ated
G
P
way
causal
they
ii
while
some
spatial
also
(
,
modelling
connected
in
(
}
Q
/ E
C
245
BJlD
connected
common
I
'
cause
{
means
A
that
(
or
some
,
G
}
either
combination
,
A
is
a
cause
of
of
these
B
)
.
,
B
is
a
THOMASS. RICHARDSON
246
but BetweenAMP (CG2; B , D ) = { a } , and yet CG2 ~ AMPBJlD I { a } .
4 .2 .
A
' CO - CONNECTION
vertex
W
Ware
There
is
does
It
is
easy
separated X
X
Y
BetweenR
Y
is
' in
Co A
model vertices
G
G
FR
principle
Theorem
I W
be
( i )
Directed
)
that
: Again graphs
in
the
FR
the
X
, Yand
[ G ] kns
in
the
which
sequence
[ G ] ~ns
in
which
the
sequence
and
in
[ G ] ~ns } .
X
Y
and
Y
in
[ G ] ~ns , and
for
any
CoConAMP
chain
( CG
(b ) graph
betweenR sets
of
if is
CG
and not
, and
; X , Y ) .
( G ; X , Y ) , so being being
G B
co - connectedR X
vertices
and
Y . Both
which
are
to -
[ G ] ~ns .
tt
XJlY
I W of
the
inclusion is
W
, if
for
all
pairs
): n
COConR
vertices
or
W
determined
W
exclusion
irrelevant
(G ; X , Y )
in
[ G ] ~ns ) .
of to
vertices
whether
that X
are
not
Y
are
and
. models
with
are
latent
co - connections and
/ or
determined
selection
variables
. , are
.
for AMP
) in
, Y ) in
X
to in
co - connectionR
graph
are
proof and
that
(X , Y
set
, possibly
determined graphs
be
a subset
Undirected
if
Models
given
graphs
Chain
G
G
variables
to
X
than
W
some
independent
co - connectionvs ( iii
is
by
CoConR
Y
to sets
* =>
, W
, . . . , Bm of
( G ; X , Y ) are
and
said
and
states
2
( ii )
be
, Y
( X , Y ) from to
Y
; X , Y ) =
CoConR
{ X , Y } U W
CoConR
entailed
will X
, B2
co - connectedR
Determined
XlLY
( where
be
[ G ] kns . Note
' X
in
variables
co - connectedR
requirement
- ConnectionR
of
is
(G ; X , Y ) ~
between
of
pairs
from
( CG
a weaker
Y
R .
will
in
and
, . . . , An
pairs
( W , B1
I V
B
( G ; X , Y ) and
pologically
Proof
by
, A2
consecutive
separated
BetweenR
and
directed
that
not
vertices
{ V
X
R .
Gunder
, Y , CoConLWF
Clearly
This
see
is
from
vertices
X
to B
( X , AI
consecutive
and
R in
to
:
vertices
and
(G ; X , Y ) =
(a )
co - connectedR
Gunder
X
inseparable
if
be
of
contain
CoConR
only
Y R in
a sequence
not
are
to
of
contain
inseparable
There
Let
said
' MODELS
[ G ] ~ns satisfying
a sequence
not
are ( ii )
in
is
does
in
be
vertices
(i )
to
will
DETERMINED
co - connectionAMP undirected chain
graphs
determined graphs
is
given
are
given
in
. here the
. The
appendix
proofs .
for
CHAINGRAPHSAND SYMMETRICASSOCIATIONS
247
SinceBetweens (X , Y) ~ CoCons(X , Y), an argumentsimilar to that usedin the proof of Theorem1 (replacing'Betweens ' with 'CoCons') shows that if UG Fs XJLY I W then UG Fs XJlY I W n CoCons(X , Y). Conversely , if UG Fs XlLY I W n CoCons(X , Y ) then X and Yare separatedby W n CoCons(X , Y) in UG. SinceW n CoCons(X , Y ) <; W , it followsthat X and Yare separatedby W in UG.13 0 For undirectedgraphsUG Fs XJlY I W =~ UG Fs XJlY I W n Betweens (X , Y), i.e. undirectedgraphscouldbe saidto be betweensdetermined. Chain graphsare not co-connectionLwF determined.In CG1 Band Care separableLwF , sinceCG1 FLWFBJlC I {A, D} , but CG1~ LWF BlLC I {D } and CoConLWF (CG1; B , C) = {D} . In contrast, chaingraphsareco-connectionAMP determined. 5.
Discussion
The two Markov properties presented in the previous section are based on the intuition that only vertices which , in some sense, come 'between ' X and Y should be relevant as to whether or not X and Yare entailed to be inde pendent . Both of these properties are satisfied by undirected graphs and by all forms of directed graph model . Since chain graphs are not betweenR separated under either Markov property , this captures a qualitative difference between undirected and directed graphs , and chain graphs . On the other hand since chain graphs are co-connectionAMP determined , in this respect , at least , AMP chain graphs are more similar to directed and undirected graphs .
5.1. DATAGENERATING PROCESSES Since the pioneering work of Sewall Wright [38] in genetics, statistical models based on directed graphs have been used to model causal relations , and data generating processes. Models allowing directed graphs with cycles have been used for over 50 years in econometrics , and allow the possibility of representing linear feedback systems which reach a deterministic equilibrium subject to stochastic boundary conditions (Fisher [10] ; Richardson [27]) . Besag [3] gives several spatial -temporal data generating processes whose limiting spatial distributions satisfy the Markov property with respect to a naturally associated undirected graph . These data generating processes are time - reversible and temporally stationary . Thus there are data generating mechanisms known to give rise to the distributions described by undirected and directed graphs . 13This is the ' Strong Union Property ' of separation in undirected
graphs (Pearl [22]) .
248
THOMAS S. RICHARDSON
Cox [6] states that chain graphs under the LWF Markov property "do not satisfy the requirement of specifying a direct mode of data generation ." However , Lauritzen14 has recently sketched out , via an example , a dynamic data generating process for LWF chain graphs in which a pair of vertices joined by an undirected edge, X - Y , arrive at a stochastic equilibrium , as t - + 00; the equilibrium distribution being determined by the parents of X and Y in the chain graph . A data generation process corresponding to a Gaussian AMP chain graph may be constructed via a set of linear equations with correlated
errors (Andersson et at. [1]). Each variable is given as a linear function of its parents in the chain graph , together with an error term . The distribution over the error terms is given by the undirected edges in the graph , as in a Gaussian undirected graphical model or 'covariance selection model '
(Dempster [9]), for which Besag [3] specifies a data generating process. The
linear
model
constructed
in this
way
differs
from
a standard
linear
structural equation model (SEM): a SEM model usually specifieszeroesin the
covariance
model
sets
to
matrix zero
for
elements
the
error
of the
terms , while inverse
error
the
covariance
covariance
matrix
selection .
The existence of a data generating process for a particular chain graph
(under either Markov property ) is important since it provides a full justifi cation for using this structure . As hM been shown in this paper , the mere fact that
two variables
are 'symmetrically
related ' does not , on its own ,
justify the use of a chain graph model . 5 .2 .
A THEORY
OF
INTERVENTION
IN
DIRECTED
GRAPHS
Strotz and Wold [35], Spirtes et at. [31] and Pearl [23] develop a theory of causal intervention for directed graph models which makes it sometimes possible to calculate
the effect of an ideal intervention
in a causal system .
Space does not permit a detailed account of the theory here, however, the central idea is very simple : manipulating a variable , say X , modifies the structure of the graph , removing the edges between X and its parents , and instead making a 'policy ' variable the sole parent of X . The relationships between
all other variables
and their parents
are not affected ; it is in this
sense that the intervention is 'ideal ' , only one variable is directly affected .IS Example .- Returning to the example considered in section 2.2, hypotheti cally a researcher could intervene to directly determine whether or not the patient suffers the side-effects, e.g. by giving all of the patients (in both the 14Personal communication
.
15It should also be noted that for obvious physical reasons it may not make sense to speak of manipulating certain variables , e.g. the age or sex of an individual .
CHAINGRAPHS ANDSYMMETRIC ASSOCIATIONS R
249
R
Recovery
Assignment A f
Policy t
A
A ~>
(Treatment /contro !)", IH Health
~f
Side Effects ~ f
DG
DGManip (Ef)
DGManip (H) (c)
(b)
(a)
Figure 11. Intervening in a causal system: (a) before intervention ; (b) intervening to directly control side-effects; (c) intervening to directly control health status.
side
-
to
.
initially
suffers
side
the
result
(
treatment
-
effects
of
One
.
The
yet
the
ferent
be
the
.
certain
is
on
true
tain
can
it
is
also
often
may
be
34
]
;
case
present
data
was
even
(
if
a
,
present
which
(
In
the
an
may
be
Richardson
the
sense
of
[ 27
absence
observation
is
(
of
of
that
a
they
is
,
theory
basis
2
]
;
)
of
for
]
)
may
at
.
same
slogan
11
set
of
"
]
a
in
;
va
certain
and
Verma
that
the
equivalence
. class
to
Pearl
[ 23
data
the
predict
]
)
.
where
graphs
the
A
is
Causation
theory
another
feedback
,
is
a
.
not
,
generating
system
chain
-
variables
Spirtes
causal
is
-
.
knowing
distributions
Correlation
-
cer
two
selection
settings
for
thus
share
or
]
;
dif
control
than
enough
[ 31
of
intervention
the
be
to
17
knowledge
. kov
of
.
,
,
Thus
be
et
part
the
/
.
a
hence
clearly
models
behaviour
.
makes
able
more
and
33
shows
and
background
Mar
,
will
are
particular
Spirtes
it
16
)
data
often
[
unknown
importance
.
of
there
Frydenberg
dynamic
represent
the
basis
important
the
that
c
.
experiments
[
a
see
great
Ch
A
latent
;
patient
(
observational
are
when
an
equivalent
equivalent
]
in
model
specification
the
[ 37
directly
statistically
manipulate
Richardson
model
of
are
scientists
when
and
status
,
basis
,
Markov
Pearl
Spirtes
constitutes
;
element
all
even
interventions
intervention
on
:
,
particular
certain
mechansim
17This
;
by
which
of
16In
]
B
such
the
11
is
controlled
out
and
[ 28
the
-
perform
that
Verma
f
:
simplistic
common
generated
results
of
in
to
-
health
-
assigned
Figure
equivalent
on
A
of
whether
in
s
precipi
was
theory
point
ruled
of
statistically
and
result
patient
graph
'
or
the
the
The
to
and
over
the
Richardson
B
the
,
features
t
shows
intervention
are
intervening
be
is
structural
this
misses
often
objection
riables
of
but
directly
models
This
-
)
patient
purely
B
,
variables
the
differentiated
effect
This
.
control
A
b
independent
that
graphs
(
prevents
group
expected
to
either
11
which
becomes
be
models
not
example
and
)
would
objection
could
For
Figure
,
control
as
between
which
in
to
common
which
intervention
!
,
something
graph
the
intervening
distinction
[
effects
After
)
_
the
intervention
groups
_
tates
control
I
and
II1I1I1II
treatment
. "
resear
-
250
THOMASS. RICHARDSON
cher would be unable to answer questions concerning the consequences of intervening in a system with the structure of a chain graph. However, Lauritzen 18has recently given, in outline, a theory of intervention for LWF chain graphs, which is compatible with the data generating processhe has proposed. Such an intervention theory would appear to be of considerable use in applied settings when the substantive researchhypothesesare causal in nature. 6.
Concl usion
The
examples
which
given
a pair
symmetric perties between
Markov
with
chain
graphs
not , in
described
directed
this
general
' weaker
, will
lead
shown
quite are
with
a directed , the
to
graph
any
of an
chain
different
of those
structure
of a chain
relationship
that or
edge , rather
, should
the
pro -
directed
either
marginalizing
model
, than
in
differences and
) , and
Markov
via
ways , different
Markov
undirected
undirected
graph
' , substantively
many
qualitative
symmetric
model
inclusion
are
. Further
variables , the
, correspond
' safer
to
, there
. Consequently
by
there related
associated
reason
' or
that
and ! or selection
edge , in a hypothesized
as being
clear
symmetrically
been
properties latent
. For
make
be
has
with
does be
, ~
( possibly
associated
tioning
paper may
, in general
particular
the
graphs
can
this
relationships . In
graph
in
of variables
not
inclusion
condi
-
than
a
be regarded of
a directed
edge . This not
paper
has
correspond
question full
of
answer
data
Acknow
ledgements like
Peter
Spirtes
ful
on
an
this
England
the
topic
. I am
and Isaac
, UK
chain
also
the
18Personalcommunication .
symmetric
, this
, and
( under
grateful
to
. Finally
correspond
Markov
do
to . A , of a
property
),
.
Glymour
Perlman
for
helpful
anonymous
, I would
Institute
for
Mathematical
revised
version
of this
like
, Steffen
, Richard
Wermuth
three
which interesting
, in general
a given
Cox , Clark
, Michael Nanny
do
the
specification
of intervention
, David
open
graphs
the
graphs
relations
leaves
chain
involve
Meek
suggestions
Newton
, where
Besag
, Chris Studeny
many
relations
theory
Julian
, Milan
comments
ledge
for
Madigan
are
. However
would
associated
to thank
, David
tions
question
process
with
ritzen
there
graphs
symmetric
this
generating
that
chain
which to
together
I would
shown
to
reviewers
to
gratefully
Sciences paper
was
Lau -
Scheines
,
conversa
-
for
acknow
, Cambridge
prepared
use -
.
,
251
CHAIN GRAPHSAND SYMMETRIC ASSOCIATIONS 7 . Proofs
In DG (O , S, L ) supposethat J1 , is a path that d-connects X and Y given Z US , C is a collider on Ji" and C is not an ancestor of S. Let length (C, Z) be 0 if C is a member of Z ; otherwise it is the length of a shortest directed path <5 from C to a member of Z. Let Coll (Jl,) = { C I C is a collider on IL, and C is not an ancestor of S} . Then
let
size(J,L,Z) = IColl(J,L)I +
L
length (C, Z)
CEColl (J-L)
where
d
I Coll
( J. l , )
- connecting
X
and
X
Y
and
d
d
-
Z
connects
tices
and
1
given
Z
that
vertex
j
Proof
a
S
is
not
in
Z
)
..
t5i
S
.
that
contrary
to
Form
than
Ci
Jl "
Suppose
that
19It
X
and
is
J. L '
Z
)
not
Y
<
,
then
Y
B
A
)
path
US
JJ
,
if
Z
)
size
is
given
Z
a
is
( J1 "
at
minimal
acyclic
acyclic
path
<
denotes
is
J. L
acyclic
,
there
JiI
Z
)
least
,
'
.
that
If
d
- connects
d
- connects
there
one
is
a
minimal
path
acyclic
. 19
the
d
a
subpath
of
J. L
between
ver
-
,
Y
there
-
connecting
}
~
is
J1 ,
a
is
point
of
then
Ci
,
in
S
t5i
( } i
and
and
collider
Ci
do
Y
Ci
from
t5 j
from
of
does
X
each
to
not
on
some
intersect
.
path
t5i
between
for
path
at
ancestor
that
,
directed
directed
an
path
0
only
t5i
not
prove
in
the
S
not
intersection
following
be
the
,
a
and
collider
Ci
hence
no
intersect
exists
the
way
vertex
vertex
loss
of
J. L ( X
,
)
)
( Wx
and
because
that
there
c5i
in
is
a
an
8i
jj
,
then
on
J1 ,
vertex
except
J. L
,
Wy
Y
given
contains
DG
acyclic
,
if
Y
X
x
)
,
J. L
is
U
S
is
path
d
a
.
not
to
on
at
is
path
a
t5i
Ci
by
minimal
,
)
8i
.
It
Figure
( } i
t5i
.
Let
J. L '
now
12
and
acyclic
Y
)
given
J1 , .
be
the
easy
. )
and
and
to
Moreover
than
or
X
both
is
colliders
( cyclic
other
both
on
Y
vertex
on
on
y
,
at
is
is
( See
more
- connecting
that
W
J . L ( Wy
no
there
J1 ,
that
after
and
Z
J. L
on
on
W
J. . L'
intersects
to
that
,
if
to
generality
Wx
:
closest
closest
X
Z
prove
then
,
.
X
to
Z
X
acyclic
such
( Jl "
,
{
path
is
- connects
size
d
U
S
any
Ci
W
Z
intersects
shortest
be
hard
given
)
of
be
let
Wy
of
( J. L ' ,
L
t5i
Jl , '
concatenation
size
,
assumption
without
show
other
acyclic
S
now
path
let
,
where
if
then
and
,
.
Z
( J1 , '
and
( A
minimal
on
a
,
the
a
a
vertex
will
showing
X
that
Z
We
no
Z
JiI
( J. l , )
given
size
given
ancestor
be
of
in
is
that
Y
( O
an
no
Let
is
DG
such
and
member
is
and
Y
there
such
between
J. L
in
,
and
Coll
and
.
If
U
,
proofs
B
Lemma
of
X
S
following
A
#
S
U
X
the
cardinality
U
Z
path
In
( i
the
between
given
- connecting
J. L
is
given
Y
that
I
path
J. . L
d
- connecting
Z
.
and
a
252
THOMASS. RICHARDSON , Jl
Figure
12
.
Finding
intersects
a
with
x
a
-
- connecting
path
path
- - - . . . . . C ;
4
-
"
-
d
directed
"
'
tSi
-
- . . . . . Cj
T
Jl
j . L'
from
a
4
of
smaller
collider
-
size
Ci
- Y
X
-
to
-
a
.
C ;
/
than
-
Jl
in
.
-
"
"
j .L,
vertex
-
"
Figure
13
.
Finding
paths
shortest
Jl ,
a
,
hi
T
not
,
remains
.
Let
the
that
because
Ci
not
be
Z
J1 ,
is
Y
in
a
)
(
Proof
0
O
,
..
not
on
path
Jl ,
( O
,
J1 ,
such
,
L
,
L
J1 ,
C
only
on
some
( Ci
Jl ,
colliders
the
length
-
. Y
than
IJ . ,
in
the
case
where
two
T
)
d
a
Xi
-
)
Z
J1 , ' ,
a
uS
+
only
di
,
( 0
: : ;
.
.
Hence
i
( i
#
j
)
do
.
Y
)
Let
.
It
Z
)
Jl , '
is
<
on
from
to
to
a
the
e
size
Jl , '
T
Ci
be
now
~
( Jl "
that
a
y
Z
)
may
member
member
of
Z
.
,
{
Xn
<
X
n
+
X
,
= =
Y
,
B
m
.
0
between
ZU
.
T
,
( Jl , ' ,
path
.
t5j
collider
from
and
Xl
be
size
assumption
JI "
,
l
and
the
and
Jl , ( Cj
path
on
t5j
and
shortest
the
t5i
on
,
path
Xo
than
. )
also
OJ
connecting
= =
Xi
,
shortest
vertex
and
( T
given
to
shorter
then
13
is
t5j
of
a
is
,
Figure
on
contrary
is
minimal
that
,
Z
.
is
Y
of
B
size
of
( See
Ci
,
and
( X
is
a
d
- connecting
of
are
from
,
-
"
smaller
member
.
to
di
vertices
ancestor
that
,
minimal
)
.
,
)
B
and
}
Xn
~
+
are
Y
0
l
,
.
,
.
,
given
then
.
,
there
Xn
+
m
inseparableDs
= =
in
.
Since
an
S
a
of
a
if
not
,
that
)
)
length
is
of
,
S
intersect
vertex
If
DG
and
minimal
sequence
in
DG
is
not
2
is
,
JL
/
'
assumption
false
X
the
to
the
closest
Ci
are
T
X
is
- connects
than
Lemma
ZUS
is
t5i
,
IJ . '
that
this
Cj
Jl ,
less
Hence
d
and
on
is
shown
Jl , ( X
J1 , '
W
to
on
of
show
of
be
vertex
concatenation
to
from
Suppose
where
.
contrary
to
intersect
path
intersect
path
minimal
It
not
,
case
Z
- connecting
hj
directed
is
d
and
the
.
- - . . . . cj
Z
directed
in
Z
not
j
to
S
is
path
an
ancestors
of
some
at
OJ
path
vertex
,
and
6j
S
Z
that
is
given
ancestor
in
of
as
j
01
E
6j
S
.
Z
,
.
and
A
.
It
.
.
U
Z
vertex
,
( j
every
in
Ok
.
Let
follows
6jl
sequence
S
a
Z
8j
by
: #
of
j
' )
vertices
collider
.
be
a
Lemma
do
on
Denote
directed
that
fJ j
intersect
Xi
in
that
colliders
shortest
1
not
J1 ,
the
,
0
,
such
and
J1 ,
and
no
that
CHAIN
GRAPHS
AND
SYMMETRIC
ASSOCIATIONS
253
each Xi is either on J1 , or is on a directed path 8j from OJ to Zj can now be constructed
:
Base Step : Let Xo := X .
Inductive
Step : If Xi is on some path 8j then define Wi+l to be OJ;
otherwise , if Xi is on J1 " then let Wi + l be Xi , Let Vi + l be the next vertex on
J.L, after Wi+l , such that ~ +1 E O . If there is no vertex OJ' between Wi+l and ~ +1 on J1 " then let Xi +l := ~ +1' Otherwise let OJ* be the first collider on J1 " after
Wi + l , that
is not an ancestor
of S , and let Xi + l be the first
vertex in 0 on the directed path 8j* (such a vertex is guaranteed to exist since Zj *, the endpoint of 8j*, is in 0 ). It follows from the construction that if B is on J1 " and B E 0 , then for some i , Xi == B .
Claim : Xi and Xi +l are inseparableDs in DG (O , S, L ) under d-separation . If Xi and Xi + l are both on JL, then JL(Xi , Xi + l ) is a path on which every non-collider is in L , and every collider is an ancestor of S. Thus J1 , (Xi , Xi + l )
d-connects Xi and Xi +l given Z US for any Z ~ O\ { Xi , Xi +l } . So Xi and Xi +l are inseparableDs . If Xi lies on some path 8j, but Xi +l is on Jl, then the path 7r formed by concatenating the directed path Xi +- . . . +- Cj and
l1I(Oj , Xi +l ) again is such that every non-collider on 7r is in L , and every collider is an ancestor of S, hence again Xi and Xi + l are inseparableDs. The cases in which either
Xi + l alone , or both Xi and Xi + l are not on J1 ,
can be handled similarly . Corollary
0
1 If B is a vertex on a minimal
d- connecting
path 7r between
X and Y .qiven Z U S in DG (O, S, L ), Z U { X , Y, B } ~ 0 , then B E BetweenDS(X , Y ) . Proof . This follows directly from Lemma 2
0
Corollary 2 If J.L is a minimal d-connecting path between X and Y given ZUS in DG (O , S, L ) , C is a collider on J.L that is an ancestor ofZ but not S,
l5 is a shortest directedpath from C to some Z E Z , and Z U {X , Y, C} ~ 0 , then Z E COCOllvs(X , Y ). Proof : By Lemma
1, 5 does not intersect
IL except at C . Let the sequence of
vertices on 5that are in 0 be (VI , . . . , Vr .= Z ) . It follows from the construc tion in Lemma 1 that there is a sequence of vertices (X == Xo , Xl , . . . , Xn == VI , Xn + l , . . . , Xn +m . = Y ) in 0 such that consecutive pairs of vertices are inseparableDs, Since, by hypothesis , C is not an ancestor of S, it follows that no vertex on 5is in S. Hence <5(~ , ~ + l ) is a directed path from ~ to ~ + l on which , with the exception
of the endpoints , every vertex is in L and is a non -
collider on 6, it follows that ~ and ~ +I are inseparableDsin DG (O , S, L ). Thus the sequences(X .= Xo, Xl , . ' . ' Xn .= VI, . . . , Vr .= Z ) and (Y .=
254
THOMASS. RICHARDSON
Xn +m, . . . , Xn == VI , . . . , Vr == Z ) establish that Z E CoConDS(X , Y ) in
DG (O , S, L ).
0
Theorem 1 (i ) A directedgraph DG (O , S, L ) is betweenDS separatedunder d-separation .
Proof.- Suppose, for a contradiction , that DG (O , S, L ) FDS XlLY I W , (W U { X , Y } ~ 0 ), but DG (O, S, L ) ~ DSXJlY I W n BetweenDS (X , Y ). In this case there is some minimal path 7r d-connecting X and Y given
S U (WnBetweenDS(X , Y )) in DG (O, S, L ), but this path is not d-connecting given S U W . It is not possible for a collider on 7r to have a descendant in S U (W n BetweenDs(X , Y )), but not in S U W . Hence there is some
non-collider B on 7r, s.t . B E S U W , but B ft S U (W n BetweenDS (X , Y )). This implies B E W \ BetweenDS(X , Y ) , and since W ~ 0 , it follows that
B E O . But in this caseby Corollary 1, B E BetweenDS (X , Y ), which is a contradiction
Theorem mined
.
0
2 ( ii ) A directed graph DG (O , S, L ) is co-connectionvs deter -
.
Proof: Since BetweenDS(X, Y) ⊆ CoConDS(X, Y), the proof of Theorem 1 above, replacing 'BetweenDS' with 'CoConDS', suffices to show that if DG(O, S, L) ⊨DS X ⊥⊥ Y | W then DG(O, S, L) ⊨DS X ⊥⊥ Y | W ∩ CoConDS(X, Y). To prove the converse, suppose, for a contradiction, that DG(O, S, L) ⊨DS X ⊥⊥ Y | W ∩ CoConDS(X, Y), but DG(O, S, L) ⊭DS X ⊥⊥ Y | W, where W ∪ {X, Y} ⊆ O. It then follows that there is some minimal d-connecting path π between X and Y in DG(O, S, L) given W ∪ S. Clearly it is not possible for there to be a non-collider on π which is in S ∪ (W ∩ CoConDS(X, Y)) but not in S ∪ W. Hence there is some collider C on π which has a descendant in S ∪ W, but not in S ∪ (W ∩ CoConDS(X, Y)). Hence C is an ancestor of W \ CoConDS(X, Y), but not of S. Consider a shortest directed path δ from C to some vertex W in W. It follows from Lemma 1 and the minimality of π that δ does not intersect π except at C. It now follows, by Corollary 2, that W ∈ CoConDS(X, Y). Therefore if C is an ancestor of a vertex in S ∪ W, then C is also an ancestor of a vertex in S ∪ (W ∩ CoConDS(X, Y)). Hence π d-connects X and Y given S ∪ (W ∩ CoConDS(X, Y)), which is a contradiction. □

Lemma 3 Let CG be a chain graph with vertex set V, X, Y ∈ V, and W ⊆ V \ {X, Y}. Let H be the undirected graph Aug[CG; X, Y, W]. If there is a path μ connecting X and Y in H, then there is a path μ′ connecting X and Y in H such that if V is a vertex on μ′ then V is on μ, and V ∈ CoConAMP(X, Y) ∪ {X, Y}.

Figure 14. (a) A chain graph CG with CoConAMP(CG; X, Y) = {B, W}; (b) a path μ in Aug[CG; X, Y, {W}]; (c) a path μ′ in Aug[CG; X, Y, {W}] every vertex of which occurs on μ and is in CoConAMP(CG; X, Y).

Proof: If X and Y are adjacent in H then the claim is trivial, since μ′ = (X, Y) satisfies the lemma. Suppose then that X and Y are not adjacent in H. Let the vertices on μ be (X = X_1, …, X_n = Y). Let α be the greatest
j such that X_j is adjacent to X in H. Let β be the smallest k > α such that X_k is adjacent to Y in H. (Since X and Y are not adjacent, α, β < n.) It is sufficient to prove that {X_α, …, X_β} ⊆ CoConAMP(X, Y), since then the path μ′ = (X, X_α, …, X_β, Y) satisfies the conditions of the lemma. This can be proved by showing that there is a path in [CG]~~p from X to each X_i (α ≤ i ≤ β) which does not contain Y. A symmetric argument shows that there is also a path from Y to X_i (α ≤ i ≤ β) in [CG]~~p which does not contain X. The proof is by induction on i.

Base case: i = α. Since X is adjacent to X_α in H, either there is a (directed or undirected) edge between X and X_i in CG, or the edge was added via augmentation of a triplex or bi-flag in Ext(CG, Anc({X, Y} ∪ W)). In the former case there is nothing to prove, since X and X_i are adjacent in [CG]~~p. If the edge was added via augmentation of a triplex, then there is a vertex T such that (X, T, X_i) is a triplex in CG; hence T is adjacent to X and X_i in [CG]~~p. Since X and Y are not adjacent in H, T ≠ Y, so (X, T, X_i) is a path which satisfies the claim. If the edge was added via augmentation of a bi-flag, then there are two vertices T_0, T_1 forming a bi-flag (X, T_0, T_1, X_i). From the definition of augmentation it then follows that T_0 and T_1 are adjacent to X and X_i in H. Since we suppose that X and Y are not adjacent in H, it follows that neither T_0 nor T_1 can be Y. Hence (X, T_0, T_1, X_i) is a path in [CG]~~p satisfying the claim.

Inductive case: i > α; suppose that there is a path from X to X_{i−1} in [CG]~~p which does not contain Y. Since i − 1 < β, X_{i−1} is not adjacent to Y in H. By a similar proof to that in the base case it can easily be shown that there is a path from X_{i−1} to X_i in [CG]~~p which does not contain Y. This path may then be
concatenated with the path from X to X_{i−1} (whose existence is guaranteed by the induction hypothesis) to form a path connecting X and X_i in [CG]~~p which does not contain Y. □

Figure 15. (a) A chain graph CG in which CoConAMP(CG; X, Y) = {U}; (b) the undirected graph Aug[CG; X, Y, {U, W}]; (c) H1, the induced subgraph of Aug[CG; X, Y, {U, W}] over CoConAMP(X, Y) ∪ {X, Y} = {U, X, Y}; (d) the undirected graph H2, Aug[CG; X, Y, {U, W} ∩ CoConAMP(X, Y)].

Lemma 4 Let CG be a chain graph, and let H1 be the induced subgraph of Aug[CG; X, Y, W] over CoConAMP(X, Y) ∪ {X, Y}. Let H2 be the undirected graph Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. H1 is a subgraph of H2.

Proof: We first prove that if a vertex V is in H1 then V is in H2. If V occurs in H1 then V ∈ CoConAMP(X, Y) ∪ {X, Y}. Clearly X and Y occur in both H1 and H2, so suppose that V ∈ CoConAMP(X, Y). It follows from the definition of the extended graph that if V is a vertex in Ext(CG, T) then there is a path consisting of undirected edges from V to some vertex in T. Since V is in Ext(CG, Anc({X, Y} ∪ W)), there is a path π of the form (V = X_0 − ⋯ − X_n → ⋯ → X_{n+m} = W) in CG, where W ∈ {X, Y} ∪ W and n, m ≥ 0. Let X_k be the first vertex on π which is in {X, Y} ∪ W, i.e. X_i ∉ {X, Y} ∪ W for all i with 0 ≤ i < k. Now, if X_k ∈ W then, since (V = X_0, …, X_k) is a path from V to X_k which does not contain X or Y, it follows that X_k ∈ W ∩ CoConAMP(X, Y). Hence V occurs in Ext(CG, Anc(W ∩ CoConAMP(X, Y))), and so also in H2. Alternatively, if X_k ∈ {X, Y}, then again X_k occurs in Ext(CG, Anc({X, Y})), and thus in H2. Hence, if there is an edge A–B in H1, then A and B occur in H2. There are three reasons why there may be an edge in H1:
(a) There is an edge (directed or undirected) between A and B in CG. It then follows immediately that there is an edge between A and B in H2.
(b) The edge between A and B in H1 is the result of augmenting a triplex in Ext(CG, Anc({X, Y} ∪ W)). Then there is some vertex T such that (A, T, B) forms a triplex in Ext(CG, Anc({X, Y} ∪ W)). Since, by hypothesis, A, B ∈ CoConAMP(X, Y), it follows that T ∈ CoConAMP(X, Y) ∪ {X, Y}, and hence T occurs in H1. It then follows by the previous reasoning that T is in H2, and so the triplex is also present in Ext(CG, Anc({X, Y} ∪ (W ∩ CoConAMP(X, Y)))). Hence there is an edge between A and B in H2.
(c) The edge between A and B in H1 is the result of augmenting a bi-flag in Ext(CG, Anc({X, Y} ∪ W)). This case is identical to the previous one, except that there are two vertices T_0, T_1 such that (A, T_0, T_1, B) forms a bi-flag in Ext(CG, Anc({X, Y} ∪ W)). As before, it follows from the hypothesis that A, B ∈ CoConAMP(X, Y) that T_0, T_1 ∈ CoConAMP(X, Y) ∪ {X, Y}; hence T_0 and T_1 occur in H2 and the bi-flag is in Ext(CG, Anc({X, Y} ∪ (W ∩ CoConAMP(X, Y)))). Thus the A–B edge is also present in H2. □
Theorem 2 (iii) Chain graphs are co-connectionAMP determined.
Proof: (CG ⊨AMP X ⊥⊥ Y | W ⇒ CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y)) Let H be Aug[CG; X, Y, W]. Since CG ⊨AMP X ⊥⊥ Y | W, X and Y are separated given W in H. Claim: X and Y are separated in H by W ∩ CoConAMP(X, Y). Suppose, for a contradiction, that there is some path μ in H connecting X and Y on which there is no vertex in W ∩ CoConAMP(X, Y). It then follows from Lemma 3 that there is a path μ′ in H composed only of vertices on μ which are in CoConAMP(X, Y). Since no vertex on μ is in W ∩ CoConAMP(X, Y), it then follows that no vertex on μ′ is in W. So X and Y are not separated by W in H, contradicting the hypothesis. However, Aug[CG; X, Y, W ∩ CoConAMP(X, Y)] is a subgraph of H, so X and Y are separated by W ∩ CoConAMP(X, Y) in Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. Thus CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y).
(CG ⊨AMP X ⊥⊥ Y | W ∩ CoConAMP(X, Y) ⇒ CG ⊨AMP X ⊥⊥ Y | W) The proof is by contraposition. Suppose that there is a path μ from X to Y in Aug[CG; X, Y, W]. Lemma 3 implies that there is a path μ′ from X to Y in Aug[CG; X, Y, W] every vertex of which is in {X, Y} ∪ CoConAMP(X, Y). It then follows from Lemma 4 that this path exists in Aug[CG; X, Y, W ∩ CoConAMP(X, Y)]. □
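Throughout these proofs, statements of the form "X and Y are separated by W in H" are ordinary vertex-separation statements in an undirected graph. The following Python sketch is not taken from the chapter; the graph representation, function names, and toy example are illustrative assumptions. It shows the reachability test that such statements reduce to:

    from collections import deque

    def separated(adj, x, y, w):
        # Check whether x and y are separated by the vertex set w in an
        # undirected graph given as an adjacency dict {vertex: set(neighbours)}:
        # separation means x cannot reach y once the vertices in w are removed.
        if x in w or y in w:
            raise ValueError("x and y must lie outside the separating set")
        blocked = set(w)
        seen, queue = {x}, deque([x])
        while queue:
            v = queue.popleft()
            if v == y:
                return False          # found a path avoiding w: not separated
            for u in adj.get(v, ()):
                if u not in blocked and u not in seen:
                    seen.add(u)
                    queue.append(u)
        return True                   # y unreachable: x and y are separated by w

    # Toy graph standing in for an augmented graph Aug[CG; X, Y, W]:
    # X - B - Y and X - U - Y.
    aug = {"X": {"B", "U"}, "B": {"X", "Y"}, "U": {"X", "Y"}, "Y": {"B", "U"}}
    print(separated(aug, "X", "Y", {"B"}))        # False: the path X-U-Y remains
    print(separated(aug, "X", "Y", {"B", "U"}))   # True

The results above say that, for checking such separation statements, the conditioning set W can be shrunk to W ∩ CoConAMP(X, Y) without changing the answer.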
THE MULTIINFORMATION FUNCTION AS A TOOL FOR MEASURING STOCHASTIC DEPENDENCE

M. STUDENÝ
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
Pod vodárenskou věží 4, 182 08 Prague

AND

J. VEJNAROVÁ
Laboratory of Intelligent Systems
University of Economics
Ekonomická 957, 148 00 Prague
Czech Republic
Abstract. Given a collection of random variables [ξ_i]_{i∈N}, where N is a finite nonempty set, the corresponding multiinformation function assigns to each subset A ⊂ N the relative entropy of the joint distribution of [ξ_i]_{i∈A} with respect to the product of the distributions of the individual random variables ξ_i for i ∈ A. We argue that it is a useful tool for problems concerning stochastic (conditional) dependence and independence (at least in the discrete case). First, the multiinformation function makes it possible to express the conditional mutual information between [ξ_i]_{i∈A} and [ξ_i]_{i∈B} given [ξ_i]_{i∈C} (for every disjoint A, B, C ⊂ N), which can be considered a good measure of conditional stochastic dependence. Second, one can introduce reasonable measures of dependence of level r among variables [ξ_i]_{i∈A} (where A ⊂ N, 1 ≤ r < card A) which are expressible by means of the multiinformation function. Third, it enables one to derive theoretical results on the (nonexistence of an) axiomatic characterization of stochastic conditional independence models.

1. Introduction
tween two random variables, namely the mutual information [7, 3]. It is always nonnegative and vanishes if the corresponding two random variables 261
, , M. STUDENY ANDJ. VEJNAROVA
262
are stochastically independent . On the other hand it achieves its maximal
value iff one random variable is a function of the other variable [28]. Perez [15] wanted also to express numerically the degree of stochastic dependence among any finite number of random variables and proposed a numerical characteristic called "dependence tightness ." Later he changed the terminology , calling the characteristic multiinformation and encouraging research on asymptotic properties of an estimator of multiinformation
[18]. Note that multiinformation also appeared in various guises in earlier information -theoretical papers. For example, Watanabe [24] called it "total correlation" and Csiszar [2] showedthat the IPFP procedure convergesto the probability distribution minimizing multiinformation within the considered family of distributions having prescribed marginals . Further
prospects
occur
when
one considers
multiinformation
as a set
function . That means if [l;i]iEN is a collection of random variables indexed by a finite set N then the multiinformation function (corresponding to [~i]iEN) assigns the multiinformation of the subcollection [~i]iEA to every A c N . Such a function was mentioned already in sixties by Watanabe
[25] under the name "total cohesion function ." Some pleasant properties of the multiinformation function were utilized by Perez [15] in probabilistic decision -making . Malvestuto named the multiinformation
function "en-
taxy " and applied it in the theory of relational databases [9]. The multiinformation
function plays an important
role in the problem of finding
"optimal dependencestructure simplification " solved in thesis [21], too. Finally , it has appeared to be a very useful tool for studying of formal properties of conditional independence . The
first
author
in modern
statistics
to deal
with
those
formal
prop -
erties of conditional independencewas probably Dawid [5]. He characterized certain statistical concepts (e.g. the concept of sufficient statistics) in terms of generalizedstochastic conditional independence. Spohn [17] studied stochastic conditional independence from the viewpoint of philosophical logic and formulated the same properties as Dawid . The importance of conditional
independence in probabilistic
reasoning was explicitly
discerned
and highlighted by Pearl and Paz [13]. They interpreted Dawid's formal properties in terms of axioms for irrelevance models and formulated a nat ural conjecture that these properties characterize stochastic conditional in -
dependencemodels. This conjecture was refuted in [19] by substantial use of the multiinformation function and this result was later strengthened by showing that stochastic conditional independence models cannot be char-
acterized by a finite number of formal properties of that type [20]. However , as we have already mentioned , the original prospect of multiin formation was to express quantitatively the strength of dependence among random variables . An abstract view on measures of dependence was brought
MULTIINFORMATION
AND
STOCHASTIC
263
DEPENDENCE
by Renyi [16] who formulated a few reasonablerequirements on measures of dependenceof two real-valued random variables. Zvarova [28] studied in more detail information -theoretical measures of dependence including mutual information . The idea of measuring dependence appeared also in nonprobabilistic calculi for dealing with uncertainty in artificial intelligence
[22, 23]. This article is basically an overview paper , but it brings several minor new results which (as we hope ) support our claims about the usefulness of the
multiinformation
function
. The
basic
fact
here
is that
the
multiinfor
-
mation function is related to conditional mutual information . In the first part of the paper we show that
the conditional
mutual
information
com -
plies with severalreasonablerequirements(analogousto Renyi's conditions) which should be satisfied by a measure of degree of stochastic conditional dependence . The second part of the paper responds to an interesting suggestion from Naftali Tishby and Joachim Buhmann . Is it possible to decompose multi -
information (which is considered to be a measure of global dependence) into level-specific measures of dependence among variables ? That means one would like to measure the strength of interactions of the "first level" by a special measure of pairwise dependence, and similarly for interactions of "higher levels." We show that the multiinformation can indeed be viewed as a sum of such level-specific measures of dependence. Nevertheless , we have found recently that such a formula is not completely new : similar level-specific measures of dependence were already considered by Han [8] . Finally , in the third part of the paper , as an example of theoretical use of the multiinformation
function
we recall
the results
about
nonexistence
of
an axiomatic characterization of conditional independence models . Unlike
the original paper [20] we present a long didactic proof emphasizing the essential steps . Note
that
all results
of the
paper
are formulated
for random
taking a finite number of values although the multiinformation can
be used
also
in the
case of continuous
variables
. The
reason
variables
function is that
we wish to present really elementary proofs which are not complicated by measure
2.
- theoretical
Basic
technicalities
.
concepts
We recall well -known information
- theoretical
concepts in this section ; most
of them can be found in textbooks, e.g. [3]. The reader who is familiar with information
theory
can skip the section .
Throughout the paper N denotes a finite nonempty set of factors or in short a factor set. In the sequel, whenever A , B C N the juxtaposition AB
, , M. STUDENYAND J. VEJNAROVA
264 will
be
i
2
EN
used
,
. 1
.
to
the
The
should
nonempty
~
finite
Xi
: f :
B
c
c
By
a
B
Xi
and
a
of
frames
.
x
Or
0
i
Having
N
a
)
=
any
B
b
is
the
( pAB
)
1 b
b
=
a
{
.
C
Nand
probability
P
( a
,
b
)
defines
a
image
is
of
the
A
b
( a
)
;
distribution
( by
b
=
N
the
in
a
A
.
to
a
fixed
symbol
for
set
{
P
( y
set
E
N
}
Y
)
;
N
is
XA
Whenever
XB
we
y
will
be
is
an
understand
E
Y
}
=
1
.
understood
By
any
arbitrary
collection
distribution
of
P
over
E
X
( a
XN
\
A
}
B
b
,
1
a
discrete
for
such
N
the
as
every
a
[ ~
that
pB
( b
over
E
marginal
follows
XA
i ] iEA
:
.
.
In
the
sequel
N
)
\
>
B
0
the
conditional
defined
by
:
) for
c
over
defined
.
every
a
E
XN
\
B
.
)
distribution
B
A
subvector
: =
' ( b
)
is
L
of
N
one
can
frames
[ ~
i ] iEN
use
a
induces
probability
\
the
{
P
( y
)
;
)
.
set
y
E
the
Z
Y
P
transformed
is
,
&
on
Provided
of
a
B
under
the
symbol
condition
pAl
b
transformation
distribution
finite
distribution
f
i
random
E
=
nonempty
probability
P
:
factor
;
distribution
A
P
)
a
Xi
2
joint
p0
between
a
( z
discrete
values
to
denote
.
Supposing
Q
any
.
when
c
frame
distribution
the
b
disjoint
b )
into
}
A
take
projection
distribution
of
a
mapping
mapping
i
.
to
A
the
with
probability
( conditional
For
( pi
Every
tions
.
is
particular
PB
: =
for
{
situation
= f
finite
Y
{
convention
: ; f
describes
has
the
0
nonempty
on
where
distribution
pi
i ] iEB
In
that
over
XN
P
[ ~
.
N
coordinate
a
P
and
L
pi
It
i
E
Nand
,
its
probability
natural
0
and
of
.
c
is
the
distribution
then
on
,
i ] iEN
A
( a
the
E
Xi
function
on
pA
adopt
B
variables
i
for
i
distribution
[ ~
describes
,
equivalently
pA
we
XA
factor
frame
factor
distribution
vector
Having
U
instead
random
a
lliEA
E
probability
distribution
the
every
distribution
random
It
called
real
)
A
i
.
probability
( discrete
union
by
discrete
to
product
nonnegative
probability
to
to
N
x
set
DISTRIBUTIONS
Cartesian
A
the
denoted
corresponding
assigned
by
every
i
set
is
the
denoted
for
sometimes
correspond
variable
denotes
notation
be
PROBABILITY
factors
frame
the
will
DISCRETE
random
0
shorten
singleton
the
f
Z
on
the
( y
.
In
Y
and
f
:
z
E
distribu
Y
-
7
-
Z
is
a
formula
)
=
z
}
such
for
a
of
f
( ~
every
case
distribution
vector
of
)
.
we
a
random
say
Z
that
,
Q
vector
is
~
an
,
Q
MULTIINFORMATION
Supposing say B
A
that - t
It
A
A
( P
is )
on
c
there
exists
= = pB
( a , b )
=
when
E
XB
;
A
, B
, G
Supposing that
we
write
A
is
A
holds
for is
Jl
BIG
every
the
when
the
=
remaining
is
>
a
distribution
is
a
- t
XA
( b ) ,
to
such
b
E
a
and
XB
.
,
b
of
deterministic
E
a
random
vector
function
that
we
write
,
XA
f
N
P
that
XB E
function
outside
over
respect
distribution
the
O } ;
is with
f
the
The
symbol
T
serve
class
is
of
uniquely
set
it
another
determined
can
take
arbitrary
The
of
( N
)
of
Let
a
denote a
i
( [ Xi
P B
pAC
E
is
a
distribution
given
C
(a , c )
. pBC
Xc
. It
over
with
respect
N
to
we
P
and
stochastic
the
[ ~ i ] iEN
the
( b , c )
describes
vector
known
and
values
of
point
of
E
I
T
set
of N
situation
in
view
when
every
[ ~ i ] iEA
situation
and
[ ~ i ] iEB
are
) .
ordered
, where
A
triplets #
0
#
(A
B
independence
model (N
and
those
) , its
their
a
is
, Yi ] iEN
N
, BIG
. These
)
of
triplets
statements
model
induced ( A
a
c
, BIG
T
by
a
)
E
T
( over
within
N
)
are
Y
N
I
)
such
is N
that
an
on Y
i
T
(B c
T
A
1L
( N
)
=
P
( [ Xi ] iEN
)
. Q
( [ Yi ] iEN
E
the
P
over
BIG
N ( P
) .
model
.
independency model
XN
inducing
)
) .
)
is
inducing 3
. Put
mod
for
[ Xi
, Yi ] iEN
-
.
I
and Zi
=
Q Xi
E
ZN
be X
define
)
( N
, AIG
independency
independency
: = lliEN
class
distribution )
probabilistic
probabilistic
the
triplet
.
( N
over
( N
of the
model
probability
distribution on
subset is
images
distribution
, J
a
independency
model
also
is image
symmetric
probability
and
an
triplets
I
J
over
symmetric of
probability
n
N
collection
conditional
closure
be
E
C
factor
of
distribution
R
,
the
of
Supposing I
P
probability every
XB
=
random
( from
)
some
class
(c )
independency
in
2 . 1 the
E a
independency by
Lemma
b
. pC
are
independency
induced
and of
equality
[ ~ i ] iEC
symmetric
probabilistic
Proof
,
will
, BIG
just
~,'joint
.
triplets
The
di
MODELS
, an (A
consists
els
of
N
general
) .
the
identification
set
Supposing
A
XA
subsets
for
factor In
if
of
disjoint
(N
)
unrelated
INDEPENDENCY
the
are
(a , b , c )
E
values
pairwise
N
independent ( P
a
2 .2 .
will
c
distribution
completely
for
a
for
that
( b )
conditionally
pABC
T
for
[ ~ i ] iEA
. Note
pB
: XB
P
subvector [ ~ i ] iEB
P B
f
265
DEPENDENCE
.
say
p
mapping
( b )
situation
{ b
and on
0
random
set
disjoint
a
( a , b )
subvector
the
are
STOCHASTIC
dependent
pAB
whose
values
N
pAB
the
[ ~ i ] iEN
, B
functionally
if
reflects
random
AND
.
a Y
i
, , M. STUDENY ANDJ. VEJNAROVA
266
It is easy to verify that for every (A, B|C) ∈ T(N) one has A ⊥⊥ B | C (R) iff [ A ⊥⊥ B | C (P) & A ⊥⊥ B | C (Q) ]. □
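The definitions of this section are easy to experiment with numerically. The following Python sketch is our own illustration, not part of the paper; the function names and the toy table are assumptions. It checks a statement A ⊥⊥ B | C (P) for a discrete distribution given as a table of probabilities, using exactly the factorization from subsection 2.1:

    from itertools import product

    def marginal(p, keep):
        # p maps full assignment tuples to probabilities; `keep` lists the
        # coordinate positions of the requested marginal, in order.
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def cond_indep(p, a, b, c, tol=1e-12):
        # Check A _||_ B | C (P):
        # P^{ABC}(x) * P^C(x_C) = P^{AC}(x_{AC}) * P^{BC}(x_{BC}) for every x.
        pabc = marginal(p, a + b + c)
        pc, pac, pbc = marginal(p, c), marginal(p, a + c), marginal(p, b + c)
        values = [sorted({x[i] for x in p}) for i in a + b + c]
        for x in product(*values):
            xa, xb, xc = x[:len(a)], x[len(a):len(a) + len(b)], x[len(a) + len(b):]
            lhs = pabc.get(x, 0.0) * pc.get(xc, 0.0)
            rhs = pac.get(xa + xc, 0.0) * pbc.get(xb + xc, 0.0)
            if abs(lhs - rhs) > tol:
                return False
        return True

    # Toy joint distribution over three binary factors (coordinates 0, 1, 2).
    P = {(i, j, k): 0.125 for (i, j, k) in product([0, 1], repeat=3)}
    print(cond_indep(P, [0], [1], [2]))   # True: the uniform distribution factorizes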
2.3. RELATIVE ENTROPY

Supposing Q and R are probability distributions on a nonempty finite set Y, we say that Q is absolutely continuous with respect to R iff R(y) = 0 implies Q(y) = 0 for every y ∈ Y. In that case we can define the relative entropy of Q with respect to R as
H(Q|R) = ∑ { Q(y) · ln( Q(y) / R(y) ) ; y ∈ Y & Q(y) > 0 }.

Lemma 2.2 Suppose that Q and R are probability distributions on a nonempty finite set Y such that Q is absolutely continuous with respect to R. Then
(a) H(Q|R) ≥ 0,
(b) H(Q|R) = 0 iff Q = R.
Proof: Consider the real function φ(t) = t · ln t for t > 0, φ(0) = 0, and the function h on Y given by h(y) = Q(y)/R(y) if R(y) > 0, h(y) = 0 otherwise. Since φ is convex one can use Jensen's inequality [3] with respect to R and write:
0 = φ(1) = φ( ∑_{y∈Y} h(y) · R(y) ) ≤ ∑_{y∈Y} φ(h(y)) · R(y) = H(Q|R).
Owing to strict convexity of φ the equality holds iff h is constant on the set {y ∈ Y ; R(y) > 0}. That means h ≡ 1 there, i.e. Q = R. □

Supposing that (A, B|C) ∈ T(N) and P is a probability distribution over N, the formula
P̂(x) = P^{AC}(x_{AC}) · P^{BC}(x_{BC}) / P^C(x_C)  for x ∈ X_{ABC} with P^C(x_C) > 0,
P̂(x) = 0  for the remaining x ∈ X_{ABC},     (1)
defines a probability distribution P̂ on X_{ABC}. Evidently, P^{ABC} is absolutely continuous with respect to P̂. The conditional mutual information between A and B given C with respect to P, denoted by I(A; B|C ‖ P), is the relative entropy of P^{ABC} with respect to P̂. In case P is known from the context we write just I(A; B|C).
Consequence
2 . 1 Supposing
ity
over N
distribution
that
( A , BIG ) E T ( N ) and P is a probabil
I ( A ; BIG
II P ) ~ 0 ,
(b)
I ( A ; BIG
liP ) = 0 iff A ..l1- BIG
Owing
nothing
to Lemma
but
the
267
DEPENDENCE
-
one has
(a )
Proof
STOCHASTIC
(P ) .
2 .2 it suffices
corresponding
to realize
conditional
that
pABa
independence
=
P means
statement
.
0
distribution
P
2.4. MULTIINFORMATION FUNCTION The
multiinformation
(over a
factor as follows :
function
set
induced
N ) is a real
M ( D liP ) = 1i ( PDI
by
function
a probability
on the
power
set of N
defined
p {i}) for 0 # D c N, M (011 P) = o.
n iED
We again omit the symbol of P when the probability distribution is clear from the context. It follows from Lemma 2.2(b) that M (D ) = 0 whenever card D = 1 . Lemma
2
over
N
.
.
3
Let
(
A
,
BIG
)
E
T
(
N
)
and
P
be
a
probability
distribution
Then
I
(
A
;
BIG
)
=
=
M
(
ABG
)
+
M
(
G
)
-
M
(
AG
)
-
M
(
BG
)
(2)
.
Proof. Let us write 1l(pABCIP) as
'"L,."",{pABC pABC (X pC ((XC ));XEXABC XBC (X).InpAC (XAC ))'.PBC &pABC (X)>O }.
Now tor
lli
we of
~ A
can the
p
artificially ratio
{ i } ( Xi
strIctly
)
. lliEB
positive
erties
of
multiply in
for
logarithm
the
both
argument
p
{ i } ( Xi
)
any
considerea
one
can
. lli
c
pABC
( x )
write
. In
+
}: : : ; {
pABC
( x )
pC II
-
2: : {
pABC
( X )
. In
pAC n
-
~
{
pABC
( X )
iEC
. In
iEAC
pBC n
iEBC
as
and
logarithm
{ i } ( Xi
)
by
. lliEC
p X
a
sum
of
.
{ i } ( Xi Using
four
the a
)
;
x
E
XABC
&
pABC
is
- known
( x )
>
( XC
)
;
x
E
XABC
&
pABC
( x )
>
O
O
}
}
{ i } ( x '. )
( XAC
)
;
X
E
XABC
&
pABC
( x )
>
O }
;
X
E
XABC
&
pABC
( x )
>
a
p { i } ( x I. )
( XBC
)
P { i } ( X '' )
always prop
:
p { i } ( X ', )
p
-
product
which
well
terms
denomina
special
-r \ \ Xl
iEABC
. In
p
it
.& II
numerator
the
configuration
pABCI }: : : ; {
the
of
} .
-
, , M. STUDENYAND J. VEJNAROVA
268 The
first
ABC
term
. To
is nothing
see that
configurations
for
that
of xs
is groups
but
the
the
second
which
the
having
the
for
the
2 .5 .
ENTROPY
If
is a discrete
Q
of
other
AND
logarithm to
( y , xc ) . In
function
sum
there
has
for
in groups
the
same
of
value
,
C : ,:/
pc ( XC ) = lliEC P { i } ( Xi )
0
terms
.
L pABC JlEXAB pABC (JI, ZC) > 0
( y , xc ) =
probability
0
ENTROPY
distribution the
. pC ( xc ) .
.
CONDITIONAL
by
can
L In PC ( x ~ ) zcE Xc lliEC p { t } ( Xi ) pC (zc O
two
Q is defined
multiinformation
projection
pABC
=
entropy
same
L In PC ( x ~ ) II iEC P { t } ( x 1..) zc E Xc pc (zc ) > 0
Similarly
of the
is M ( C ) one
corresponding
L L oeCEXC YE XAB pc (zc 0 pABa (JI,ZC
=
value
term
on
a nonempty
finite
set
Y
the
formula 1
1 (Q ) = Lemma
2 .4
nonempty
1l ( Q ) ~ 0 , 1l ( Q ) = . Since
every
y ; the
factor
0 iff
equality
set
N
will
- the see that using
&
Q (y ) > O } .
probability
here
distribution
the
procedure
by
on
)
for
symbol
of
as in the
function
on
a
1 . It
power
0 #
D
when proof
set
one
gives
has
In Q ( y ) - l :2:: 0
0 for
both
of N
is clear
of Lemma
defined
from
the is not
0
over
as follows
M(D) = −H(D) + ∑_{i∈D} H({i})  for every D ⊂ N.
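The identity just stated gives a direct way to compute the multiinformation from marginal entropies. A minimal Python sketch (ours; the helper names are assumptions, and distributions are tables mapping assignment tuples to probabilities) is:

    import math

    def entropy(q):
        # Entropy of a distribution given as {value_tuple: probability}.
        return -sum(pr * math.log(pr) for pr in q.values() if pr > 0.0)

    def marginal(p, keep):
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def multiinformation(p, d):
        # M(D) = -H(D) + sum_{i in D} H({i}), with H(D) the entropy of P^D.
        h_d = entropy(marginal(p, d))
        return -h_d + sum(entropy(marginal(p, [i])) for i in d)

    # Two perfectly dependent binary factors: M({0, 1}) = ln 2.
    P = {(0, 0): 0.5, (1, 1): 0.5}
    print(multiinformation(P, [0, 1]))   # about 0.693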
such
(b ) . P
H ( 011 P ) =
2 .3 it
every
( a ) and
distribution
C N ,
it
1.
(y ) - l ~
a probability
the
P
Q (y ) =
Q ( y ) . InQ
if Q ( y ) =
induced
1l ( pD
that
real
O. Hence
only
function
II P ) =
such
increasing
Q (y ) >
function
omit
same
a discrete
y E Y
is an
with
is a real
often
is
exists
occurs
H (D
We
there
entropic
Q
; y E Y
Y . Then
logarithm
y E Y
The
that
set
(b) Proof
{ Q ( y ) . In Q( Y )
Suppose
.finite
(a )
for
L
a :
o .
context difficult
. By to
MULTIINFORMATION
Hence
, using
the
formula
I ( A ; BIG
Supposing is defined
AND
( 2 ) from
) =
A ,B
c
are
use
the
symbol
distribution
Proof
P
) =
One the
can
be
is nothing of pAl
) -
entropy
of
(3)
A
given
B
H (B ) .
indicate
the
corresponding
distribution
easily
see using
pB
but
gives
H ( AIB write
it
the
Measure this
over
probability
N , A ,B
section
c
N
are
sequence
used
the
in the
pB
(b
proof
O} .
(4 )
of Lemma
2 .3
are
for
2 . 1 zero one
from
let
, one
pAlb
& can
P
( ab ) >
utilize
the
a }
definition
dependence why
conditional
quantitative
mutual measure
of
informa
-
degree
of
. the
random known
following
vectors ( fixed
mutual
bound find
,
0
us consider
conditional
always
( a ) . ln ~
(4 ) .
arguments
discrete
is a lower can
hand
stochastic
already
the
b E XB
O
as a suitable
topic
~ c are
~ BC
other
L aE XA pAl b(a
dependence
this
values
since
(b) .
several
stochastic
and
the
&
form
considered
motivate
; a E XA
conditional
~ A , ~ B , and
possible
method
&
AB
( ab )
II P ) . On in
we give be
conditional
~ AC
the
) ; bEXB
(b)
expression
of
should
To
( b ) oH ( Allplb
( ab ) . In pAB
b and
which
bound
H ( AB
II P ) to
} : { pB
L pB bE XB pB (b O
of
) =
a probability
AB
L...., { P
that
conditional
) .
expression
~
tion
the
) + H ( BG
. Then
H ( AIBIIP
In
derives
H ( G ) + H ( AG
disjoint
H ( AIB
2 . 5 Let
disjoint
3.
2 . 3 one
269
P .
Lemma
that
) -
DEPENDENCE
difference
H ( AIB
We
Lemma
- H ( ABC
N
as a simple
STOCHASTIC
for
or
and
a distribution
the
values having
task
joint
prescribed
information those
specific
. Suppose
distributions
) . What
then
I ( A ; B I C ) ? By , and
it
prescribed
is the
are Con -
precise
marginals
, , M. STUDENYAND J. VEJNAROVA
270
for A AC and BC such that I (A ; BIG ) = 0 (namely the " conditional P given by the formula ( 1) ) . 3.1. MAXIMAL
DEGREE OF CONDITIONAL
product "
DEPENDENCE
But one can also find an upper bound . Lemma 3 .1 Let over N . Then
(A , BIG ) E T (N ) and P be a probability
distribution
I (A ; BIG ) ~ mill { H (AIG ) , H (BIC ) } .
Proof that :
It follows from ( 3) with help of the definition
of conditional
entropy
I (A ; BIG ) = H (AIG ) - H (AIBG ) . Moreover , 0 ~ H (AIBC ) follows from (4 ) with Lemma 2.4 ( a) . This implies I (A ; BIG ) ::::; H (AIG ) , the other estimate with H (BIC ) is analogous . 0 The following proposition generalizes an analogous result obtained in the unconditional case by Zvarova ( [28) , Theorem 5) and loosely corre sponds to the condition E ) mentioned by Renyi [16] . Proposition 3 . 1 Supposing tribution over N one has
(A , BIG ) E T (N ) and P is a probability
I (A ; BIG liP ) = H (AIG II P )
ProD/: By the formula
mentioned
iff
dis -
BG - t A (P ) .
in the proof of Lemma 3.1 the considered
equality occurs just in case H (AIBC II P ) = O. Owing to the formula (4 ) and Lemma 2.4 (a) this is equivalent to the requirement H (A II pi bc) = 0 for every (b, c) E XBC with pBC (b, c) > O. By Lemma 2.4 (b ) it means just that for every such a pair (b, c) E XBC there exists a E XA with pAl bc(a ) = 1. Of course , this a E XA is uniquely determined . This enables us to define the required function from XBC to XA . 0 A natural question that arises is how tight is I ( A ; BIG ) from Lemma 3.1? More exactly , we ask ways find a distribution having prescribed marginals I ( A ; BIG ) = min { H (AIG ) , H (BIG ) } . In general , the shown by the following example .
the upper bound for whether one can al for AC and BC with answer is negative as
MULTIINFORMATION
AND
STOCHASTIC
DEPENDENCE
271
Example 3.1 Let us put XA = XB = Xc = { O, 1} and define PAC and PBC as follows
PAC(O,0) = ~, PAC(O,1) = PAc(1, 1) = ! ' PAc(l ,0) = 1' PBC(O,O) = PBC(O, 1) = PBc(l , 0) = PBc(l , 1) = i . Since (PAC)C = (PBC)C there exists a distribution on XABC having them as marginals . In fact , any such distribution P (O, 0, 0)
=
a,
P (O, 0, 1)
=
(3,
P(O,l ,O) P(O, 1, 1) P(l , 0, 0) P(l , 0, 1) P(l , 1,0)
= = = = =
! ~~~a-
P (l , 1, 1)
=
(3,
P can be expressed as follows
a, (3, a, {3, ~,
wherea E [1\ , ! ]"B E [0, ! ]. It is easyto showthat H(AIC) < H (BIC). On the other hand, for every parameter a either P (O, 0, 0) and P (l , 0, 0) are simultaneously nonzero or P (O, 1, 0) and P (l , 1, 0) are simultaneously nonzero . Therefore A is not functionally dependent on BC with respect to P and by Proposition 3.1 the upper bound H (AIC ) is not achieved. <> However , the upper bound given in Lemma 3.1 can be precise for specific
prescribed marginals. Let us provide a general example. Example 3.2 Supposethat PBG is given, consider an arbitrary function 9 : XB - t XA and define PAC by the formula PAc(a, c) = L { PBC(b, c) ; bE XB & g(b) = a }
for a E XA , c E Xc .
One can always find a distribution P over ABC having such a pair of distri butions PAC, PBC as marginals and satisfying I (A ; BIG liP ) = H (AIG II P ). Indeed, define P over ABC as follows: P (a, b, c) = PBc (b, c) P (a, b, c) = 0
if g(b) = a, otherwise.
This ensuresthat BC - t A (P ), then use Proposition 3.1.
<>
3.2. MUTUAL COMPARISONOF DEPENDENCEDEGREES A natural intuitive requirement on a quantitative characteristic of degreeof dependenceis that a higher degreeof dependenceamong variables should
, , M. STUDENYAND J. VEJNAROVA
272
be reflected by a higher value of that characteristic . Previous results on conditional mutual information are in agreement with this wish : its minimal value characterizes independence , while its maximal values more or less correspond to the maximal degree of dependence. Well , what about the behavior "between" these "extreme " cases? One can imagine two "comparable " nonextreme cases when one case represents evidently a higher degree of dependence among variables than the other case. For example , let us consider two random vectors ~AB resp. 1]AB (take C = 0) having distributions PAB resp. QAB depicted by the following dia grams .
PAB
0
~
~
QAB
0
0
~
1
1
1
'7
'7
'7
!7
~ 7
0
!
!
0
!
!
0
7
7
7
7
Clearly , (PAB)A = (QAB )A and(PAB)B = (QAB)B. But intuitively , QAB expresses a higherdegree of stochastic dependence between 1JA = ~A and 1JB= ~B thanPAB. ThedistributionQABis more"concentrated " than PAB:QABisanimage ofPAB.Therefore , wecananticipate I (A;BI011 P) ~ I (A; BI011 Q), whichis indeed thecase . Thefollowing proposition saysthatconditional mutualinformation has the desired property . Notethat thepropertyis not derivable fromother properties of measures of dependence mentioned eitherbyRenyi[16] or by Zvarova [28] (in theunconditional case ). Proposition3.2 Suppose that(A,BIG) E T(N) andP, Q areprobability distributions overN suchthatpAC= QAC , pBC= QBCandQABC is an imageof pABC.Then I (A; BIG liP) :::; I (A; BIG IIQ) .
Proof Let us write P instead of pABG throughout the proof and similarly for Q. Supposethat Q is an image of P by f : XABC - t XABC. For every
MULTIINFORMATION AND STOCHASTIC DEPENDENCE
273
x E XABC with Q(x) > 0 put T = {y E XABC; f (y) = x & P(y) > O} and write (owing to the fact that the logarithm is an increasingfunction):
LP{y).lnP {y)~yET LP{y).In(L P{Z)) yET zET
= Q(x) . In Q(x) .
We can sum it over all such xs and derive
L P(y) .1nP(y) ~ L Q(x) .1nQ(x) . yEXABC zEXABC P(y O Q(z O Hence
- H(ABCIIP) ::::; - H(ABCIIQ) . Owingto the assumptionspAG = QAG, pBG = QBGonehasH (AC IIP) = H (AC IIQ), H (BC IIP) = H (BC IIQ) and H (C IIP) = H (C IIQ) . The formula (3) then givesthe desiredclaim. D Nevertheless
hold
laxed
,
when
,
the
as
mentioned
assumption
that
demonstrated
Example
depicted
the
3
by
. 3
the
by
Take
C
following
=
the
0
inequality from Proposition 3.2 may not marginals for AC and EC coincide is refollowing
and
consider
example
the
.
distri',but ionsPABand QAB
diagrams :
QAB 0
~
"18
"38
Evidently , QAB is an image of PAB, but I (A; BI011P ) > I (A ; BI011Q). 0 Remark One can imagine more general transformations of distributions : instead of "functional " transformations introduced in subsection 2.1 one can consider transformations by Markov kernels. However, Proposition 3.2 cannot be generalizedto such a case. In fact, the distribution PAB from the motivational example starting this subsection can be obtained from QAB by an "inverse" transformation realized by a Markov kernel.
, , M. STUDENYAND J. VEJNAROVA
274
3.3. TRANSFORMED DISTRIBUTIONS Renyi's condition F) in [16] states that a one-to-one transformation of a random variable does not change the value of a measure of dependence. Similarly , Zvarova [28] requires that restrictions to sub-u-algebras (which somehow correspond to separate simplifying transformations of variables) decreasethe value of the measureof dependence. The above mentioned requirements can be generalizedto the "conditional" case as shown in the following proposition. Note that the assumption of the proposition means (under the situation when P is the distribution of a random vector [~i]iEN) simply that the random subvector [~i]iEA is transformed while the other variables ~i , i E BG are preserved. Proposition 3.3 Let (A , BIG ) , (D , BIG ) E 7 (N ), P, Q be probability distributions over N . Suppose that there exists a mapping 9 : XA - t XD such that QDBC is an image of pABC by the mapping f : XABC - t XDBC defined by f (a, b, c) = [g(a), b, c]
for a E XA , (b, c) E XBC .
Then I (A ; BIG IIP ) ~ I (D ; BIG II Q) . Proof Throughout the proof we write P instead of pABa and Q instead of QDBC. Let us denote by Y the class of all (c, d) E XCD such that P (g- l (d) x XB x { c} ) > 0 where g- l (d) = { a E XA ; g(a) = d} . For every (c, d) E Y introduce a probability distribution RCdon 9- 1(d) x XB by the formula: Rcd(a, b) =
P (a, b, c) P (g- l (d) X XB X { c} )
for a E 9- 1(d), b E XB .
It can be formally considered as a distribution on XA x XB . Thus, by Consequence2.1(a) we have 0 ~ I (A ; BI011Rcd) for every (c, d) E Y . One can multiply this inequality by P (g- l (d) X XB x { c} ), sum over Y and obtain by simple cancellation of P (g- l (d) X XB X { c} ):
o~ L L (c,d)EY(a,b}E9 -1 (}> d)x P(abc OXB (abc ).P(g-l(d)XXBX{C}) P(abc ).InP({a}P xXBx{C}).P(g_l(d) x{b} x{c}) .
, , M. STUDENY ANDJ. VEJNAROVA
276
where the remaining values of P zero. Since A 1L BIG (P ) one has by Consequence2.1(b) I (A ; BIG liP ) = O. Let us consider a mapping 9 : XAC ~ XDE defined by
9(0,0) = 9(1,0) = (0,0)
9(0,1) = 9(1,1) = (1,0) .
Thenthe imageof P by the mappingf : XABC-t XDBEdefinedby f (a, b, c) = [g(a, c), b] for (a, c) E XAC, b E XB , is the followingdistributionQ on XDBE : 1 Q(O,0, 0) = Q(I , 1,0) = "2' Q(O, 1,0) = Q(I , 0, 0) = 0 . EvidentlyI (D; BIE IIQ) = In2. 4 . Different
levels of stochastic
<>
dependence
Let us start this section with some motivation . A quite common "philosoph ical " point of view on stochastic dependence is the following : The global strength of dependence among variables [~i ]iEN is considered as a result of various interactions among factors in N . For example , in hierarchical log-linear models for contingency tables [4] one can distinguish the first -order interactions , i .e. interactions of pairs of factors , the second-order interactions , i .e. interactions of triplets of factors , etc . In substance , the first -order interactions correspond to pairwise dependence relationships , i .e. to (unconditional ) dependences between ~i and ~j for i , j E N , i :tf j . Similarly , one can (very loosely ) imagine that the second-order interactions correspond to conditional dependences with one conditioning variable , i .e. to conditional dependences between ~i and ~j given ~k where i , j , kEN are distinct . An analogous principle holds for higher -order interactions . Note that we have used the example with loglinear models just for motivation - to illustrate informally the aim of this section . In fact , one can interpret only special hierarchical log-linear models in terms of conditional (in )dependence. This leads to the idea of distinguishing different "levels" of stochastic dependence. Thus , the first level could "involve " pairwise (unconditional ) dependences. The second level could correspond to pairwise conditional dependences between two variables given a third one, the third level to pairwise conditional dependences given a pair of variables , etc . Let us give a simple example of a probability distribution which exhibits different behavior for different levels. The following construction will be used in the next section , too .
MULTIINFORMATION
Construction distribution
AND
STOCHASTIC
DEPENDENCE
A Supposing A c N , card A ~ 2 , there P over N such that M ( B II P ) = In 2
whenever
M ( B II P ) = a
otherwise
exists
277
a probability
A c B c N , .
Proof Let us put Xi = {a, I } for i E A, Xi = {a} for i E N \ A. DefineP on XN as
follows
P([Xi]iEN) = 21-cardA P([Xi]iEN)= a
whenever EiEN Xi is even, otherwise . 0
The
distribution
dependences one
can
easily
ditionally to
is that
in
learning
the
per haps
help
the basis
order
and
subsets
of
get
N . In
a measure
[~ k ] kEK to
[~j ] jEB
given the
the
K
c
degree
models
degree
[26 ]
level of
provide
for
network
.
of depen
-
dependence is similar
a good
to
theoretical
that
the
\ { i , j } . This of dependence
mentioned
together
classification
with for
the
each
conditional
possibility
level
mutual
conditional
where
case
above
.
DEPENDENCE
of stochastic
[~ k ] kEC
a suitable
of each
of dependence
OF
argued
special
N
of the
log - linear
we
of
algorithms
measures
-
.
I ( i ; j I K ) of conditional
, where
measure
tests
is arbi
distributions
distribution
can
[~ i ] iEA D
conclusion
find
strength
. They
an analogue
measure
to
( with
A \ { i , j } . Or
variables
standard
a considered
MEASURES
section
the
model
the
main
level - specific
whether
in
- SPECIFIC
fail
i is con -
A \ { i ,j }
given
." Such
. The model
to measure
statistical
I ( A ; B I C ) is a good [~ i ] iEA
underlying
of
- level
E A , i i= j ,
[~ i ] i .ED , where
independent models
quantitative
to have
C
P , the
variables
highest
i ,j
2 . 3 ) that
ofj
distribution the
approximations
numerically
previous
the
the
pair
subset
independent
has
an
- independent
by
LEVEL
the
such
only
every Lemma
proper
" completely
a wish
necessary
any
- independent
recognize
, we wish
of expressing
In
to
interactions
4 .1.
of
, for
2 . 1 and
" although
. Good
pseudo
for Thus
[~ i ] iEN
justifies
one
fearful
given
network
separately
may
j
of A , are
case
A exhibits
A . Indeed
conditionally
[26 ] pseudo
Bayesian
dence
of
subset in
set
Consequence
dependent
proper
This
( by
, supposing
called
Construction
factor
i is not
" collectively
are
of
verify
P ) but
equivalently
trary
from
the
independent
respect
are
P
within
A ,B ,C
when
A
and
dependence leads for
directly a specific
C B
information
dependence N
are
are
singletons
between to level
pairwise
our .
~ i and proposal
between disjoint , we
will
~j given of how
278
, , M. STUDENY ANDJ. VEJNAROVA
Suppose that P is a probability distribution over N, A ⊂ N with card A ≥ 2. Then for each r = 1, …, card A − 1 we put:
Δ(r, A ‖ P) = ∑ { I(a; b | K ‖ P) ; {a, b} ⊂ A, K ⊂ A \ {a, b}, card K = r − 1 }.
If thedistribution P isknownfromthecontext , wewrite~(r, A) instead of ~(r,A IIP). Moreover , wewill occasionally writejust ~(r) asa shorthand for Ll(r, N). Weregardthisnumber asa basisof a measure of dependence of levelr among factorsfromA. Consequence 2.1 directlyimplies : Proposition4.1 Let P be a probabilitydistributionoverN, A c N, cardA~ 2, 1 ~ r ~ cardA- 1. Then (a) Ll(r, A IIP) ~ 0, (b) Ll(r, A IIP) ==0 iff [V(a, blK) E T(A) cardK = r - 1 a 11blK(P)]. So, the number d (r ) is nonnegative and vanishes just in case when there are no stochastic dependences of level r . Particularly , Ll (1) can be regarded as a measure of degree of pairwise unconditional dependence. The reader can ask whether there are different measures of the strength of level-specific interactions . Of course, one can find many such information -theoretical measures. However , if one is interested only in symmetric measures (i .e. measures whose values are not changed by a permutation of variables ) based on entropy , then (in our opinion ) the corresponding measure must be nothing but a multiple of d (r ). We base our conjecture on the result of Han [8] : he introduced certain level-specific measures which are positive multiples of ~ (r ) and proved that every entropy -based measure of mul tivariate "symmetric " correlation is a linear combination of his meMures with nonnegative coefficients . Of course, owing to Lemma 2.3 the number Ll (r ) can be expressed by means of the multiinformation function . To get a neat formula we introduce a provisional notation for sums of the multiinformation function over sets of the same cardinality . We denote for every A c N , card A ~ 2:
a (i , A ) = L { M (D II P ) ; DcA
, card D = i }
for i = 0, ..., card A .
Of course o-(i ) will be a shorthand for o-(i , N ). Let us mention that 0-(0) = 0' (1) = 0 . Lemma 4.1 For every r = 1, . . . , n - 1 (where n = cardN ~ 2)
Δ(r) = ½(r+1)r · σ(r+1) − r(n−r) · σ(r) + ½(n−r+1)(n−r) · σ(r−1).
279 MULTIINFORMATION AND STOCHASTIC DEPENDENCE Proof Let us fix1~r~n- 1and write byLemma 2.3 2Ll (r)=(a,bL)IK {M(abK )+M(K)- M(aK )- M(bK )}, (7) EC where.c is the classof all (a, blK) E T (N ) wherea, b are singletonsand cardK = r - 1. Note that in .c the triplets (a, blK) and (b, alK ) are distinguished: hencethe term 2d (r ) in (7). Evidently, the sumcontainsonly the terms M (D) suchthat r - 1 :::; cardD :::; r + 1 , and onecan write /),.(r ) = L { k(D) . M (D) ; D c N, r - 1 ~ cardD ~ r + 1 } , wherek(D) are suitable coefficients . However , sinceeverypermutation7r of factors in N transforms (a, blK) E .c into (7r(a),7r(b)I7r(K )) E .c the coefficientk(D) dependsonly on cardD . Thus, if one dividesthe number of overall occurrencesof terms M (E) with cardE = cardD in (7) by the number of sets E with cardE = cardD, the absolutevalue of 2k(D) is obtained. Sincecard.c = n . (n - 1) . (~--~) onecanobtain for cardD = r + 1 that k(D ) = ~.n(n- 1)(~=;)/ (r~1) = (r! l ). Similarly, in casecardD = r - 1 onehask(D ) = ! .n(n- l )(~--i )/ (r~l ) = (n- ; +l ). Finally, incasecardD = r onederives- k(D ) = ! . 2n(n - 1)(; --; )/ (~) = r (n - r ). To get the desired formula it sufficesto utilize the definitionsof a(r - 1), a(r ), a(r + 1). 0 Lemma4.1 providesa neat formula for ~ (r ), but in the casewhen a great numberof conditionalindependence statementsare known to hold, the definition formula is better from the computationalcomplexityviewpoint. 4.2. DECOMPOSITION OF MULTIINFORMATION Thus, for a factor set N , cardN ~ 2, the numberM (N ) quantifiesglobal dependence amongfactors in N and the numbers~ (r, N ) quantify levelspecificdependences . So, oneexpectsthat the multiinformationis at least a weightedsumof thesenumbers.This is indeedthe case,but asthe reader can expect, the coefficientsdependon cardN . For everyn ~ 2 and r E { I , . . . , n - I } we put (3(r, n) = 2 . r - l .
(~)-1,
Evidently , ,a(r , n) is always a strictly positive rational number.
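The level-specific sums Δ(r, N) and the coefficients β(r, n) are purely combinatorial, so the decomposition of the multiinformation can be checked numerically. The following Python sketch is ours, not part of the paper; it assumes the reading β(r, n) = 2 / (r · C(n, r)) of the definition above, which is the choice consistent with the identities used in the proof of Proposition 4.2 below, and all helper names are illustrative:

    import math
    from itertools import combinations, product
    from math import comb

    def marginal(p, keep):
        out = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in keep)
            out[key] = out.get(key, 0.0) + pr
        return out

    def entropy(q):
        return -sum(pr * math.log(pr) for pr in q.values() if pr > 0.0)

    def cmi(p, a, b, c):
        # I(A;B|C) = H(AC) + H(BC) - H(ABC) - H(C).
        h = lambda d: entropy(marginal(p, sorted(d)))
        return h(a + c) + h(b + c) - h(a + b + c) - h(c)

    def delta(p, n, r):
        # Delta(r, N): sum of I(a;b|K) over pairs {a,b} and sets K with card K = r-1.
        total = 0.0
        for a, b in combinations(range(n), 2):
            rest = [i for i in range(n) if i not in (a, b)]
            for k in combinations(rest, r - 1):
                total += cmi(p, [a], [b], list(k))
        return total

    def beta(r, n):
        # Assumed reading of the definition: beta(r, n) = 2 / (r * C(n, r)).
        return 2.0 / (r * comb(n, r))

    def multiinformation(p, d):
        return -entropy(marginal(p, d)) + sum(entropy(marginal(p, [i])) for i in d)

    # Check M(N) = sum_r beta(r, n) * Delta(r, N) on a small non-uniform example.
    n = 3
    P = {x: w / 36.0 for x, w in zip(product([0, 1], repeat=n), [1, 2, 3, 4, 5, 6, 7, 8])}
    lhs = multiinformation(P, list(range(n)))
    rhs = sum(beta(r, n) * delta(P, n, r) for r in range(1, n))
    print(abs(lhs - rhs) < 1e-10)   # True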
,
280
,
M . STUDENY
AND
J . VEJNAROVA
over Proposition 4.2 Let P be a probability distribution Then n- l M (N IIP) = L (3(r, n) . ~ (r, N IIP) . r=l
N , card N ~ 2 .
Proof . UsingLemma 4.1wewrite(notethatthesuperfluous symbol of P isomitted throughout theproofand,a(r) isused instead of,a(r,n)) n~ l ,e(r) 0~(r) = n~ l ,e(r) 0( r ; 1) 0u(r + 1) - n~ l f3(r) 0r 0(n - r) 00'(r) + n~ l f3(r) 0( n - ; + 1) 0u(r - 1) 0 Letusrewrite thisintoa moreconvenient form: t=2(3(j - l )o(;) ou(j )- ~ .1 .1=1{3(j )ojo(n- j )ou(j )+"t' .1=0{3(j +l )o( n; j ) ou(j )o Thisis, in fact, Ej=oI(j ) . a(j ), where I(j ) aresuitable coefficients . Thus , l(n) = ,fi(n - 1) . (~) = 1, I(n - 1) = ,B(n - 2) . (n; l) - {3(n - 1) . (n - 1) = ~ - ~ = 0, andmoreover , forevery2 :::;j :::; n - 2 onecanwrite l(j ) = (3(j - 1) . (~) - (3(j ) .j . (n - j ) + (3(j + 1) . (n2j) = = (j )-1. {(n - j + 1) - 2(n - j ) + (n - j - 1)} = O. Hence , owingto 0'(0) = 0'(1) = 0 andn ~ 2 weobtain
n-l E (3(r) .Ll(r) = r=l
n
L
j
l
=
(
j
)
.
u
(
j
)
=
u
(
n
)
=
M
(
N
)
.
2
0 If one considers
a subset A c
N
in
the
role
of
N
in the preceding
statement , then one obtains cardA
M(AIIP) = L
- l
(3(r, cardA) . ~ (r, A liP)
(8)
r = l
for every A c N , card A ~ 2. One can interpret this in the following way. Whenever [~i]iEA is a random subvector of [~i]iEN, then M (A II P ) is a measure of global dependenceamong factors in A , and the value {3(r , card A ) . ~ (r , A IIP ) expressesthe contribution of dependencesof level
MULTIINFORMATION AND STOCHASTICDEPENDENCE
281
r among factors in A . In this sense, the coefficient f3(r , card A ) then reflects the relationship between the level r and the number of factors. Thus, the "weights" of different levels (and their mutual ratios, too) depend on the number of factors in consideration. The formula (8) leads to the following proposal. We proposeto measure the strength of stochastic dependenceamong factors A c N (card A ~ 2) of level r (1 ~ r ~ card A - I ) by meansof the number:
A(r, A IIP) = (j (r, cardA) . d (r, A IIP) . The symbol of P is omitted whenever it is suitable. By Proposition 4.1 A(r , A) is nonnegative and vanishesjust in caseof absenceof interactions of degree r within A . The formula (8) says that M (A) is just the sum of A(r , A)s. To have a direct formula one can rewrite the definition of "\(r , A) using Lemma 4.1 as follows:
=(a-r).(r:1)-1 -2.(a-r).(;)-1.a (r,A)+(a-r).(r:1)-1
A(r, A)
oa(r + l , A )
oa(r - l , A ) ,
where a = card A , 1 :s; r :s; a - I . Let us clarify the relation to Han's measure[8] ~ 2e~n) of level r among n = card N variables .
We have:
A(r , N ) = (n - r ) . Ll2e~n) for every1 ~ r ~ n - 1, n 2: 2 . We did not study the computational complexity of calculating the particular characteristics introduced in this section - this can be a suhiect of .. future , more applied research. 5. Axiomatic
characterization
The aim of this section is to demonstrate that the multiinformation func tion can be used to derive theoretical results concerning formal properties of conditional independence . For this purpose we recall the proof of the result from [20] . Moreover , we enrich the proof by introducing several concepts which (as we hope ) clarify the proof and indicate which steps are substan tial . The reader may surmise that our proof is based on Consequence 2.1 and the formula from Lemma 2.3. However , these facts by themselves are not sufficient , one needs something more . Let us describe the structure of this long section . Since the mentioned result says that probabilistic independency models cannot be characterized
282
, , M. STUDENYAND J. VEJNAROVA
by means of a finite number of formal properties of (= axioms for ) indepen dency models one has to clarify thoroughly what is meant by such a formal property . This is done in subsection 5.1: first (in 5.1.1) syntactic records of those properties are introduced and illustrated by examples , and then (in 5.1.2) their meaning is explained . The aim to get rid of superfluous for mal properties motivates the rest of the subsection 5.1: the situation when a formal property of independency models is a consequence of other such formal properties is analyzed in 5.1.3; "pure " formal properties having in every situation a nontrivial meaning are treated in 5.1.4. The subsection 5.2 is devoted to specific formal properties of probabilis tic independency models . We show by an example that their validity (= probabilistic soundness) can be sometimes derived by means of the multi information function . The analysis in 5.2.1 leads to the proposal to limit attention to certain "perfect " formal properties of probabilistic indepen dency models in 5.2.2. Finally , the subsection 5.3 contains the proof of the nonaxiomatizability result . The method of the proof is described in 5.3.1: one has to find an infinite collection of perfect probabilistically sound formal properties of independency models . Their probabilistic soundness is verified in 5.3.2, their perfectness in 5.3.3. 5.1. FORMAL PROPERTIES OF INDEPENDENCY MODELS We have already introduced the concept of an independency model over N as a subset of the class T (N ) (see subsection 2.2.) . This is too general a concept to be of much use. One needs to restrict oneself to special independency models which satisfy certain reasonable properties . Many authors dealing with probabilistic independency models have formulated certain reasonable properties in the form of formal schemata which they named axioms . Since we want to prove that probabilistic independency models cannot be characterized by means of a finite number of such axioms we have to specify meticulously what is the exact meaning of such formal schemata . Thus , we both describe the syntax of those schemata and explain their semantics . Let us start with an example . A semigraphoid [14] is an independency model which satisfies four formal properties expressed by the following schemata having the form of inference rules .
(A,BIG) -+ (B, AIC) (A,BOlD) -+ (A, OlD) (A,BCID) -t (A, BICD) [(A,BICD) A (A, OlD)] -t
symmetry decomposition weak union (A , BOlD )
contraction .
Roughly, the schematashould be understood as follows : if an independency
MULTIINFORMATION
AND
STOCHASTIC
DEPENDENCE
283
model contains the triplets before the arrow , then it contains the triplet after the arrow . Thus , we are interested in formal properties of independency models of such a type .
5.1.1. Syntax of an inference rule
Let us start with a few technical definitions. Supposing S is a given fixed nonempty finite set of symbols, the formulas (K1, K2|K3), where K1, K2, K3 are disjoint subsets of S represented by juxtapositions of their elements, will be called terms over S. We write K ≈ L to denote that K and L are juxtapositions of all elements of the same subset of S (they can differ in their order). We say that a term (K1, K2|K3) over S is an equivalent version of the term (L1, L2|L3) over S if Ki ≈ Li for every i = 1, 2, 3. We say that (K1, K2|K3) is a symmetric version of (L1, L2|L3) if K1 ≈ L2, K2 ≈ L1, K3 ≈ L3. For example, the term (AE, BC|D) over S = {A, B, C, D, E, F} is an equivalent version of the term (AE, CB|D) and a symmetric version of the term (BC, EA|D).
A regular inference rule with r antecedents and s consequents is specified by
(a) positive integers r, s,
(b) a finite set of symbols S, possibly including a special symbol ∅,
(c) a sequence of ordered triplets [S1^k, S2^k, S3^k], k = 1, ..., r+s, of nonempty subsets of S such that for every k the sets S1^k, S2^k, S3^k are pairwise disjoint.
Moreover, we have several technical requirements:
- S has at least three symbols,
- if Si^k contains the symbol ∅, then no other symbol from S is involved in Si^k (for every k = 1, ..., r+s and every i = 1, 2, 3),
- if k, l ∈ {1, ..., r+s}, k ≠ l, then Si^k ≠ Si^l for some i ∈ {1, 2, 3},
- every σ ∈ S belongs to some Si^k,
- there is no pair of different symbols σ, τ ∈ S such that
  ∀k = 1, ..., r+s  ∀i = 1, 2, 3   [σ ∈ Si^k  ⇔  τ ∈ Si^k].
A syntactic record of the corresponding inference rule is then
[(S1^1, S2^1|S3^1) ∧ ... ∧ (S1^r, S2^r|S3^r)] → [(S1^{r+1}, S2^{r+1}|S3^{r+1}) ∨ ... ∨ (S1^{r+s}, S2^{r+s}|S3^{r+s})]
where each Si^k is represented by a juxtaposition of the involved symbols. Here the terms (S1^k, S2^k|S3^k) for k = 1, ..., r are the antecedent terms, while those for k = r+1, ..., r+s are the consequent terms.
Example 5.1 Take r = 2, s = 1, and S = {A, B, C, D}. Moreover, let us put [S1^1, S2^1, S3^1] = [{A}, {B}, {C, D}], [S1^2, S2^2, S3^2] = [{A}, {C}, {D}], [S1^3, S2^3, S3^3] = [{A}, {B, C}, {D}]. All our technical requirements are satisfied. One possible corresponding syntactic record was already mentioned under the label "contraction" in the definition of a semigraphoid. Thus, contraction is a regular inference rule with two antecedents and one consequent. Note that another possible syntactic record can be obtained, for example, by replacing the first antecedent term by its equivalent version:
[(A, B|DC) ∧ (A, C|D)] → (A, BC|D).  ◇
Of course, the remaining semigraphoid schemata are also regular inference rules in the sense of our definition.

Remark Our technical requirements in the above definition anticipate the semantics of the symbols. The symbols from S are interpreted as (disjoint) subsets of a factor set N, and the special symbol ∅ is reserved for the empty set. Terms are interpreted as elements of T(N). The third requirement ensures that no term in a syntactic record of an inference rule is an equivalent version of another (different) term. The further requirements avoid redundancy of symbols in S: the fourth one means that no symbol is unused, while the fifth one prevents their doubling, as for example in the "rule"
[(A, BE|CD) ∧ (A, C|D)] → (A, EBC|D)
where the symbol B is doubled by the symbol E.
5.1.2. Semantics of an inference rule
Let us consider a regular inference rule ω with r antecedents and s consequents. What is its meaning for a fixed nonempty factor set N? A substitution mapping (for N) is a mapping m which assigns a set m(σ) ⊂ N to every symbol σ ∈ S in such a way that:
- m(∅) is the empty set,
- {m(σ); σ ∈ S} is a disjoint collection of subsets of N,
- ⋃{m(σ); σ ∈ S1^k} ≠ ∅ for every k = 1, ..., r+s,
- ⋃{m(σ); σ ∈ S2^k} ≠ ∅ for every k = 1, ..., r+s.
Of course, it may happen that no such substitution mapping exists for a factor set N; for example, in the case of contraction for N with card N = 2. However, in case such a mapping m exists, an inference instance of the considered inference rule (induced by m) is the (r+s)-tuple [t1, ..., tr+s] of elements of T(N) defined as follows:
tk = ( ⋃{m(σ); σ ∈ S1^k} , ⋃{m(σ); σ ∈ S2^k} | ⋃{m(σ); σ ∈ S3^k} )   for k = 1, ..., r+s.
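To make the notion of an inference instance concrete, here is a small Python sketch; it is our own illustration, not part of the original text, and the helper names and the encoding of symbol images as frozensets are ours. Under these assumptions it enumerates the substitution mappings of the contraction rule for N = {1, 2, 3} and prints the induced inference instances; it finds exactly the six instances discussed in Example 5.2 below.

    from itertools import product

    N = [1, 2, 3]
    SYMBOLS = ['A', 'B', 'C', 'D']
    # Contraction, [(A,B|CD) and (A,C|D)] -> (A,BC|D), written as triplets of symbol sets.
    RULE = [({'A'}, {'B'}, {'C', 'D'}),    # antecedent 1
            ({'A'}, {'C'}, {'D'}),         # antecedent 2
            ({'A'}, {'B', 'C'}, {'D'})]    # consequent

    def contraction_instances():
        found = set()
        # A substitution mapping assigns pairwise disjoint (possibly empty) subsets of N
        # to the symbols; equivalently, each element of N is given to at most one symbol,
        # which is what we enumerate here.
        for assignment in product(SYMBOLS + [None], repeat=len(N)):
            m = {s: frozenset(x for x, owner in zip(N, assignment) if owner == s)
                 for s in SYMBOLS}
            inst = tuple((frozenset().union(*(m[s] for s in s1)),
                          frozenset().union(*(m[s] for s in s2)),
                          frozenset().union(*(m[s] for s in s3)))
                         for s1, s2, s3 in RULE)
            # the first and second component of every term must be nonempty
            if all(term[0] and term[1] for term in inst):
                found.add(inst)
        return found

    instances = contraction_instances()
    print(len(instances))                      # 6 instances for N = {1, 2, 3}
    for t1, t2, t3 in sorted(instances, key=repr):
        print(t1, t2, '->', t3)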
The (r+s)-tuple [t1, ..., tr | tr+1, ..., tr+s] is formally divided into the r-tuple made of the triplets t1, ..., tr, which are called antecedents, and the s-tuple made of the triplets tr+1, ..., tr+s, which are called consequents.

Example 5.2 Let us continue with Example 5.1 and consider the factor set N = {1, 2, 3}. Put
m(A) = {1},  m(B) = {2},  m(C) = {3},  m(D) = ∅.
It is a substitution mapping for N. The corresponding inference instance of contraction is [t1, t2 | t3] where
t1 = ({1}, {2} | {3}),  t2 = ({1}, {3} | ∅),  t3 = ({1}, {2, 3} | ∅).
Here t1, t2 are the antecedents and t3 is the consequent. Of course, other inference instances, induced by other substitution mappings for N, are possible. In this case one finds 5 other ones:
t1 = ({1}, {3} | {2}),  t2 = ({1}, {2} | ∅),  t3 = ({1}, {2, 3} | ∅),
t1 = ({2}, {1} | {3}),  t2 = ({2}, {3} | ∅),  t3 = ({2}, {1, 3} | ∅),
t1 = ({2}, {3} | {1}),  t2 = ({2}, {1} | ∅),  t3 = ({2}, {1, 3} | ∅),
t1 = ({3}, {1} | {2}),  t2 = ({3}, {2} | ∅),  t3 = ({3}, {1, 2} | ∅),
t1 = ({3}, {2} | {1}),  t2 = ({3}, {1} | ∅),  t3 = ({3}, {1, 2} | ∅).
However, the number of possible substitution mappings for a fixed factor set N is always finite. Therefore, the number of inference instances of a regular inference rule for a given fixed factor set is finite. ◇

Having a fixed factor set N and a regular inference rule ω with r antecedents and s consequents, we say that an independency model M ⊂ T(N) is closed under ω iff for every inference instance [t1, ..., tr+s] ∈ T(N)^{r+s} (of ω for N)
{t1, ..., tr} ⊂ M   implies   {tr+1, ..., tr+s} ∩ M ≠ ∅.

Example 5.3 Let us continue with the preceding example. The independency model I over N = {1, 2, 3} consisting of the triplet ({1}, {2} | ∅) only is closed under contraction, since no inference instance of contraction for N has both antecedents in I. On the other hand, the independency model M = {({1}, {2} | ∅), ({1}, {3} | {2})} is not closed under contraction. Indeed, one has i1, i2 ∈ M but i3 ∉ M for the inference instance [i1, i2 | i3] with
i1 = ({1}, {3} | {2}),  i2 = ({1}, {2} | ∅),  i3 = ({1}, {2, 3} | ∅).  ◇
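The closedness condition is easy to test mechanically for a small factor set. The following sketch is again our own illustration (the six contraction instances for N = {1, 2, 3} are written out by hand); it confirms the two claims of Example 5.3.

    # Triplets (X, Y | Z) are encoded as (frozenset, frozenset, frozenset).
    def T(x, y, z=()):
        return (frozenset(x), frozenset(y), frozenset(z))

    # The six inference instances of contraction for N = {1, 2, 3}
    # (antecedent 1, antecedent 2, consequent), cf. Example 5.2.
    INSTANCES = []
    for a, b, c in [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]:
        INSTANCES.append((T([a], [b], [c]), T([a], [c]), T([a], [b, c])))

    def closed_under_contraction(model):
        # M is closed iff whenever both antecedents lie in M, so does the consequent.
        return all(t3 in model
                   for t1, t2, t3 in INSTANCES
                   if t1 in model and t2 in model)

    I = {T([1], [2])}                          # closed: no instance has both antecedents in I
    M = {T([1], [2]), T([1], [3], [2])}        # not closed: the triplet ({1},{2,3}|emptyset) is missing
    print(closed_under_contraction(I))   # True
    print(closed_under_contraction(M))   # False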
5.1.3. Logical implication of inference rules
Many authors wish or hope to characterize the class of probabilistic independency models [20, 9], or various other reasonable classes of independency models, for example the class of graph-isomorphic independency models [14] or the classes of possibilistic and EMVD independency models [1, 6]. Such an approach hides a deeper wish: to characterize the respective class of independency models as the class of independency models closed under a finite collection of regular inference rules, that is, to give a finite axiomatic characterization. Such a characterization would be an ideal solution. Indeed, the verification whether a given independency model is closed under a finite number of regular inference rules is a laborious but completely automatic process which can in principle be done by a computer, without the need to construct an inducing probability distribution. Of course, one would be interested in a minimal collection of such rules; superfluous inference rules should be recognized and removed. For this purpose we need the following relation among collections of inference rules.

We say that a collection T of regular inference rules logically implies a regular inference rule ω if the following holds: whenever an independency model M over a (nonempty finite) factor set N is closed under every inference rule υ ∈ T, then M is closed under ω.

Usually, an easy way to show such a logical implication is to construct a derivation sequence. We hope that an illustrative example gives better insight than a pedantic (syntactic) definition of the derivability relation which we have in mind, and which would be too complicated.

Example 5.4 Let us show that the following inference rule ω with three antecedents and one consequent
[(A, B|E) ∧ (A, C|BE) ∧ (A, D|CE)] → (A, D|E)
is logically implied by the semigraphoid inference rules. The set of symbols is S = {A, B, C, D, E}. Here is the corresponding derivation sequence of terms over S:
1. (A, B|E) is an antecedent term,
2. (A, C|BE) is an antecedent term,
3. (A, D|CE) is an antecedent term,
4. (A, BC|E) is directly derived from 2. and 1. by contraction,
5. (A, C|E) is directly derived from 4. by decomposition,
6. (A, CD|E) is directly derived from 3. and 5. by contraction,
7. (A, D|E) is directly derived from 6. by decomposition, and it is the consequent term.
Every term in the derivation sequence is either an antecedent term or is "directly derived" from preceding terms in the sequence by one of the semigraphoid inference rules.
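A derivation of this kind can also be checked by brute force at the level of a small factor set: starting from the images of the antecedent terms one repeatedly applies the semigraphoid rules until a fixed point is reached, and then tests whether the image of the consequent term has been produced. The sketch below is our own illustration (helper names are ours); it assumes the substitution m(A) = {1}, m(B) = {2}, m(C) = {3}, m(D) = {4}, m(E) = ∅ over N = {1, 2, 3, 4} for the rule of Example 5.4.

    from itertools import combinations

    def T(x, y, z=()):
        return (frozenset(x), frozenset(y), frozenset(z))

    def nonempty_proper_subsets(s):
        return [frozenset(c) for r in range(1, len(s))
                for c in combinations(sorted(s), r)]

    def semigraphoid_closure(triplets):
        closure = set(triplets)
        while True:
            new = set()
            for (a, b, c) in closure:
                new.add((b, a, c))                                   # symmetry
                for b1 in nonempty_proper_subsets(b):
                    new.add((a, b1, c))                              # decomposition
                    new.add((a, b - b1, c | b1))                     # weak union
            for (a, b, cd) in closure:                               # contraction
                for (a2, c, d) in closure:
                    if a2 == a and d <= cd and cd - d == c:
                        new.add((a, b | c, d))
            if new <= closure:
                return closure
            closure |= new

    # Example 5.4 with m(A)={1}, m(B)={2}, m(C)={3}, m(D)={4}, m(E)=empty set:
    antecedents = [T([1], [2]), T([1], [3], [2]), T([1], [4], [3])]
    print(T([1], [4]) in semigraphoid_closure(antecedents))   # True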
Now, let us show that every independency model M closed under the semigraphoid inference rules is closed under ω as well. To this end consider an inference instance [t1, t2, t3 | t4] of ω for a factor set N, induced by a substitution mapping m, such that the antecedents t1, t2, t3 belong to M. One constructs a sequence u1, ..., u7 of elements of T(N) which "copies" the derivation sequence:
u1 = (m(A), m(B) | m(E)) := t1,
u2 = (m(A), m(C) | m(B) ∪ m(E)) := t2,
u3 = (m(A), m(D) | m(C) ∪ m(E)) := t3,
u4 = (m(A), m(B) ∪ m(C) | m(E)),
u5 = (m(A), m(C) | m(E)),
u6 = (m(A), m(C) ∪ m(D) | m(E)),
u7 = (m(A), m(D) | m(E)) := t4.
By induction on j = 1, ..., 7 one shows that every uj belongs to M: each uj is either an antecedent, and hence in M by assumption, or arises from earlier elements of the sequence by an inference instance of contraction or decomposition, under which M is closed. In particular t4 = u7 ∈ M, which was the desired conclusion. ◇

5.1.4. Pure inference rules
For technical reasons (which will become clear later, in 5.2.2) we wish to avoid inference rules having "trivial" inference instances, namely instances in which a consequent carries no new information because it coincides with an antecedent or with its symmetric image. It may happen that a substitution mapping maps some symbol to the empty set and thereby "collapses" a consequent term onto an antecedent term, as the following example demonstrates.

Example 5.5 Let us consider the following regular inference rule:
[(A, BC|D) ∧ (B, D|AC)] → (B, A|D).
Take the factor set N = {1, 2, 3} and put m(A) = {1}, m(B) = {2}, m(C) = ∅, m(D) = {3}. It induces the inference instance [t1, t2 | t3] with
t1 = ({1}, {2} | {3}),  t2 = ({2}, {3} | {1}),  t3 = ({2}, {1} | {3}).
Here the consequent t3 is the symmetric image of the antecedent t1. ◇

We say that a regular inference rule ω is pure if there is no inference instance [t1, ..., tr | tr+1, ..., tr+s] of ω (for any factor set N) in which some consequent coincides either with an antecedent or with a symmetric image of an antecedent.
Such a condition, being formulated in terms of inference instances and substitution mappings, is not convenient to verify directly. The following lemma gives a simple sufficient criterion formulated in terms of the syntactic record of the rule: roughly, it requires that every consequent term be distinguished, by suitable symbols of S, both from every antecedent term and from the symmetric versions of the antecedent terms.

Lemma 5.1 A regular inference rule ω is pure whenever every consequent term of ω is distinguished in this sense from all antecedent terms of ω and from their symmetric versions.

Proof. One first verifies that under any substitution mapping the images of distinguished terms are distinct elements of T(N); the distinguishing symbols guarantee that the corresponding components of the images differ. We leave the details to the reader. □

5.2. PROBABILISTICALLY SOUND INFERENCE RULES

We say that a regular inference rule ω is probabilistically sound if every probabilistic independency model is closed under ω. For example, each of the four semigraphoid inference rules is probabilistically sound; this expresses the well-known formal properties of conditional independence. The multiinformation function can be regarded as a good tool for verifying probabilistic soundness: this is the way the probabilistically sound inference rules in [10, 11] were found and verified. On the other hand, perhaps not every probabilistically sound inference rule can be verified by means of the multiinformation function: lately there have appeared certain nontrivial conditional inequalities
for the multiinformation (or entropic) function [27, 12]. Thus, the question whether every probabilistically sound inference rule can be derived by means of the multiinformation function remains open. However, to support our arguments about its usefulness we give an illustrative example. We believe that an example is more didactic than a technical description of the method.
Example 5.6 To show the probabilistic soundness of weak union one has to verify, for an arbitrary factor set N, for any probability distribution P over N, and for any collection of disjoint sets A, B, C, D ⊂ N which are nonempty with the possible exceptions of C and D, that
A ⊥ BC | D (P)   ⇒   A ⊥ B | CD (P).
The assumption A ⊥ BC | D (P) can be rewritten by Consequence 2.1(b) and Lemma 2.3 in terms of the multiinformation function M induced by the distribution P:
0 = M(ABCD) + M(D) - M(AD) - M(BCD).
Then one can "artificially" add and subtract the terms M(CD) - M(ACD) and by Lemma 2.3 derive:
0 = {M(ABCD) + M(CD) - M(ACD) - M(BCD)} + {M(ACD) + M(D) - M(AD) - M(CD)}
  = I(A; B|CD) + I(A; C|D).
By Consequence 2.1(a) both I(A; B|CD) and I(A; C|D) are nonnegative, and therefore they vanish! But that implies by Consequence 2.1(b) that A ⊥ B | CD (P). ◇
Note that, using the method shown in the preceding example, one can easily see that every semigraphoid inference rule is probabilistically sound.

5.2.1. Redundant rules
However, some probabilistically sound inference rules are superfluous for the purposes of providing an axiomatic characterization of probabilistic independency models. The following consequence follows directly from the given definitions.

Consequence 5.1 If ω is a regular inference rule which is logically implied by a collection of probabilistically sound inference rules, then ω is probabilistically sound.
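The identity exploited in Example 5.6 is also easy to check numerically. The sketch below is our own illustration; it assumes four binary variables, natural logarithms, and a randomly generated distribution satisfying A ⊥ BC | D. It computes conditional mutual information from the multiinformation function as in Lemma 2.3 and confirms that both I(A;B|CD) and I(A;C|D) vanish.

    import itertools, math, random

    random.seed(0)
    VALS = [0, 1]
    A, B, C, D = 0, 1, 2, 3          # variable positions in the joint tuple

    # A joint distribution over four binary variables in which A is independent
    # of (B, C) given D:  p(a, b, c, d) = p(d) p(a|d) p(b, c|d).
    p_a_d, p_bc_d = {}, {}
    for d in VALS:
        q = random.random()
        p_a_d[d] = {0: q, 1: 1.0 - q}
        w = [random.random() for _ in range(4)]
        p_bc_d[d] = {(b, c): w[2 * b + c] / sum(w) for b in VALS for c in VALS}
    joint = {(a, b, c, d): 0.5 * p_a_d[d][a] * p_bc_d[d][(b, c)]
             for a, b, c, d in itertools.product(VALS, repeat=4)}

    def entropy(indices):
        marg = {}
        for x, p in joint.items():
            key = tuple(x[i] for i in indices)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log(p) for p in marg.values() if p > 0)

    def M(indices):                  # multiinformation of the selected variables
        return sum(entropy([i]) for i in indices) - entropy(list(indices))

    def I(X, Y, Z):                  # conditional mutual information via Lemma 2.3
        return M(X + Y + Z) + M(Z) - M(X + Z) - M(Y + Z)

    print(round(I([A], [B], [C, D]), 12))   # ~ 0, i.e. A independent of B given CD
    print(round(I([A], [C], [D]), 12))      # ~ 0, i.e. A independent of C given D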
A clear example of a superfluous rule is an inference rule with redundant antecedent terms.

Example 5.7 The inference rule [(A, BC|D) ∧ (C, B|A)] → (A, B|CD) is a probabilistically sound regular inference rule. But it can be ignored since it is evidently logically implied by weak union. ◇

Therefore we should limit ourselves to "minimal" probabilistically sound inference rules, i.e. to probabilistically sound inference rules such that no antecedent term can be removed without violating the probabilistic soundness of the resulting reduced inference rule. However, even such a rule can be logically implied by probabilistically sound rules with fewer antecedents. We need the following auxiliary construction of a probability distribution to give an easy example.

Construction B Supposing A ⊂ N, card A ≥ 2, there exists a probability distribution P over N such that
M(B||P) = max{0, card(A ∩ B) - 1} · ln 2   for B ⊂ N.
Proof. Let us put Xi = {0, 1} for i ∈ A, Xi = {0} for i ∈ N \ A. Define P on X_N as follows:
P([xi]_{i∈N}) = 1/2   whenever   [∀i, j ∈ A   xi = xj],
P([xi]_{i∈N}) = 0     otherwise.  □
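Construction B can be implemented and tested directly. The sketch below is our own (multiinformation is computed in natural-logarithm units, so the claimed value appears as (card(A ∩ B) - 1) · ln 2); it builds P for N = {1, 2, 3, 4} and A = {1, 4} and compares M(B||P) with the formula above for several sets B.

    import itertools, math

    N = [1, 2, 3, 4]
    A = {1, 4}

    # Construction B: X_i = {0,1} for i in A, X_i = {0} otherwise; P puts mass 1/2
    # on each of the two configurations in which all coordinates in A agree.
    states = {i: [0, 1] if i in A else [0] for i in N}
    joint = {}
    for x in itertools.product(*(states[i] for i in N)):
        cfg = dict(zip(N, x))
        all_equal_on_A = len({cfg[i] for i in A}) == 1
        joint[x] = 0.5 if all_equal_on_A else 0.0

    def entropy(B):
        marg = {}
        for x, p in joint.items():
            if p == 0.0:
                continue
            key = tuple(v for i, v in zip(N, x) if i in B)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log(p) for p in marg.values())

    def M(B):
        return sum(entropy({i}) for i in B) - entropy(B)

    for B in [{1}, {1, 2}, {1, 4}, {1, 2, 4}, set(N)]:
        expected = max(0, len(A & B) - 1) * math.log(2)
        print(sorted(B), round(M(B), 6), round(expected, 6))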
Example 5.8 We have already verified that the inference rule ω from Example 5.4 is logically implied by the semigraphoid inference rules. Hence, ω is probabilistically sound by Consequence 5.1. Let us consider a "reduced" inference rule made from ω by the removal of an antecedent term:
[(A, B|E) ∧ (A, C|BE)] → (A, D|E).
This is a regular inference rule with 2 antecedents and one consequent which is not probabilistically sound. To disprove its probabilistic soundness one has to find a probabilistic independency model which is not closed under this "reduced" inference rule. Use Construction B with the factor set N = {1, 2, 3, 4} and A = {1, 4}. By Consequence 2.1 one verifies for the constructed distribution P that {1} ⊥ {2} | ∅ (P) and {1} ⊥ {3} | {2} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. As concerns an alternative "reduced" inference rule
[(A, B|E) ∧ (A, D|CE)] → (A, D|E),
use Construction B with A = {1, 3, 4}: one has a distribution P over N such that {1} ⊥ {2} | ∅ (P) and {1} ⊥ {4} | {3} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. As concerns the third possible "reduced" inference rule
[(A, C|BE) ∧ (A, D|CE)] → (A, D|E),
use again Construction B with A = {1, 2, 3, 4}. Thus, one has a distribution P with {1} ⊥ {3} | {2} (P) and {1} ⊥ {4} | {3} (P), but ¬[ {1} ⊥ {4} | ∅ (P) ]. ◇

5.2.2. Perfect rules
Thus, one should search for conditions which ensure that an inference rule is not logically implied by probabilistically sound inference rules with fewer antecedents. We propose the following condition. We say that a probabilistically sound regular inference rule with r antecedents (and s consequents) is perfect if there exists a factor set N and an inference instance [t1, ..., tr | tr+1, ..., tr+s] ∈ T(N)^{r+s} such that the symmetric closure of every proper subset of {t1, ..., tr} is a probabilistic independency model over N.
Lemma 5.2 Let ω be a perfect probabilistically sound pure regular inference rule with r antecedents (and s consequents). Then there exists a factor set N and an independency model M over N such that M is closed under every probabilistically sound inference rule with at most r - 1 antecedents, but M is not closed under ω.

Proof. By the definition of perfectness there exist a factor set N and an inference instance [t1, ..., tr | tr+1, ..., tr+s] ∈ T(N)^{r+s} of ω such that the symmetric closure of every proper subset of {t1, ..., tr} is a probabilistic independency model over N. Define M as the symmetric closure of {t1, ..., tr}. Let υ be a probabilistically sound inference rule with at most r - 1 antecedents and consider an inference instance of υ (for N) whose antecedents all belong to M. These antecedents, being at most r - 1 elements of M, are contained in the symmetric closure M' of a proper subset of {t1, ..., tr}. By the perfectness assumption M' is a probabilistic independency model, and therefore M' is closed under υ; hence some consequent of the considered inference instance belongs to M' ⊂ M. Thus M is closed under υ. However, M is not closed under ω: the antecedents t1, ..., tr of the above inference instance belong to M, while {tr+1, ..., tr+s} ∩ M = ∅ because ω is pure, so that no consequent tr+j coincides with some ti or with a symmetric version of some ti. □
The preceding lemma implies the following consequence with help of the definition of logical implication.
Consequence 5.2 No perfect probabilistically sound pure inference rule is logically implied by a collection of probabilistically sound inference rules with fewer antecedents.

Contraction
is an example of a perfect pure regular
inference rule .
5.3. NO FINITE AXIOMATIC CHARACTERIZATION

5.3.1. Method of the proof
It is clear in the light of Consequence 5.2 how we wish to disprove the existence of a finite axiomatic characterization of probabilistic independency models: it suffices to find, for every r ≥ 3, a perfect probabilistically sound pure regular inference rule with at least r antecedents.

Lemma 5.3 Let us suppose that for every r ≥ 3 there exists a perfect probabilistically sound pure regular inference rule with at least r antecedents. Then there is no finite system T of regular inference rules such that an independency model M (over a factor set N) is a probabilistic independency model iff it is closed under all rules in T.

Proof. Suppose for contradiction that such a finite system T exists. Since every probabilistic independency model is closed under every rule in T, every rule in T is probabilistically sound. Choose r ≥ 3 which exceeds the maximal number of antecedents of the rules in T, and take (by assumption) a perfect probabilistically sound pure regular inference rule ω with at least r antecedents. By Lemma 5.2 and Consequence 5.2, ω is not logically implied by the collection T: there exists a factor set N and an independency model M over N which is closed under every rule in T but is not closed under ω. By the assumed characterization M is a probabilistic independency model. However, ω is probabilistically sound, and therefore every probabilistic independency model, in particular M, is closed under ω, which is a contradiction. □
Thus, we need to verify the assumptions of the preceding lemma. Let us consider for each n ≥ 3 the following inference rule γ(n) with n antecedents and one consequent:
γ(n):   [(A, B1|B2) ∧ ... ∧ (A, Bn-1|Bn) ∧ (A, Bn|B1)] → (A, B2|B1).
It is no problem to verify that each γ(n) is indeed a regular inference rule. Moreover, one can verify easily using Lemma 5.1 that each γ(n) is a pure rule.
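The regularity of γ(n) can indeed be checked mechanically against the technical requirements of 5.1.1. The following sketch is our own illustration; the string encoding of the symbols and the helper names gamma_rule and is_regular are ours, and the check mirrors the requirements only in the straightforward way indicated by the comments.

    def gamma_rule(n):
        # Terms of gamma(n) as triplets (S1, S2, S3) of sets of symbols,
        # antecedents first, the single consequent last.
        S = ['A'] + ['B%d' % i for i in range(1, n + 1)]
        terms = [({'A'}, {'B%d' % k}, {'B%d' % (k % n + 1)}) for k in range(1, n + 1)]
        terms.append(({'A'}, {'B2'}, {'B1'}))
        return S, terms

    def is_regular(S, terms, empty_symbol='0'):
        ok = len(S) >= 3                                              # at least three symbols
        ok &= all(s1 and s2 and s3 for s1, s2, s3 in terms)           # nonempty components
        ok &= all(not (s1 & s2 or s1 & s3 or s2 & s3)                 # pairwise disjoint
                  for s1, s2, s3 in terms)
        ok &= all(comp == {empty_symbol} or empty_symbol not in comp  # special symbol stands alone
                  for term in terms for comp in term)
        ok &= all(any(t1[i] != t2[i] for i in range(3))               # no equivalent versions
                  for a, t1 in enumerate(terms)
                  for b, t2 in enumerate(terms) if a != b)
        used = [frozenset((k, i) for k, term in enumerate(terms)
                          for i in range(3) if s in term[i]) for s in S]
        ok &= all(u for u in used)                                    # every symbol is used
        ok &= len(set(used)) == len(S)                                # no doubled symbols
        return ok

    for n in range(3, 8):
        print(n, is_regular(*gamma_rule(n)))    # True for every n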
Proof. It suffices to find a probabilistic independency model Mt with M ⊂ Mt and t ∉ Mt for every t ∈ T(N) \ M. Indeed, then M = ⋂_{t ∈ T(N)\M} Mt, and by Lemma 2.1 M is a probabilistic independency model.
Moreover, one can limit oneself to the triplets of the form (a, b|C) ∈ T(N) \ M where a, b are singletons. Indeed, for a given general (A, B|C) ∈ T(N) \ M choose a ∈ A, b ∈ B and find the respective probabilistic independency model Mt for t = (a, b|C). Since Mt is a semigraphoid, t ∉ Mt implies (A, B|C) ∉ Mt.
In the sequel we distinguish 5 cases for a given fixed (a, b|C) ∈ T(N) \ M. Each case requires a different construction of the respective probabilistic independency model Mt, that is, a different construction of a probability distribution P over N such that {0} ⊥ {i} | {i+1} (P) for i = 1, ..., n-1, but ¬[ {a} ⊥ {b} | C (P) ]. One can verify these statements about P through the multiinformation function induced by P. If the multiinformation function is known (as it is in the case of our constructions), one can use Consequence 2.1(b) and Lemma 2.3 for this purpose. We leave this to the reader. Here is the list of cases.
I. ∀i = 1, ..., n-1: {a, b} ≠ {0, i} (C arbitrary). In this case use Construction A where A = {a, b}.
II. [∃j ∈ {1, ..., n-1} {a, b} = {0, j}] and C \ {j-1, j+1} ≠ ∅. In this case choose r ∈ C \ {j-1, j+1} and use Construction A where A = {0, j, r}.
III. [∃j ∈ {2, ..., n-1} {a, b} = {0, j}] and C = {j-1, j+1}. In this case use Construction A where A = {0, j-1, j, j+1}.
IV. [∃j ∈ {2, ..., n-1} {a, b} = {0, j}] and C = {j-1}. Use Construction B where A = {0, j, j+1, ..., n}.
V. [∃j ∈ {1, ..., n-1} {a, b} = {0, j}] and C = ∅. Use Construction B where A = N.  □
Consequence 5.3 Each above-mentioned rule γ(n) is perfect.

Proof. Let us fix n ≥ 3, put N = {0, 1, ..., n} and tj = ({0}, {j} | {j+1}) for j = 1, ..., n (convention n+1 := 1), tn+1 = ({0}, {2} | {1}). Evidently, [t1, ..., tn | tn+1] is an inference instance of γ(n). To show that the symmetric closure of every proper subset of {t1, ..., tn} is a probabilistic independency model it suffices to verify it only for every subset of cardinality n-1 (use Lemma 2.1). However, owing to possible cyclic re-indexing of N, it suffices to prove (only) that the symmetric closure M of {t1, ..., tn-1} is a probabilistic independency model. This follows from Lemma 5.5. □
Proposition 5.1 There is no finite system T of regular inference rules characterizing probabilistic independency models as the independency models closed under the rules in T.

Proof. An easy consequence of Lemmas 5.3, 5.4 and Consequence 5.3. □
Conclusions
Let us summarize the paper. Several results support our claim that conditional mutual information I(A;B|C) is a good measure of stochastic conditional dependence between random vectors ξA and ξB given ξC. The value of I(A;B|C) is always nonnegative and vanishes iff ξA is conditionally independent of ξB given ξC. On the other hand, the upper bound for I(A;B|C) is min{H(A|C), H(B|C)}, and the value H(A|C) is achieved just in case ξA is a function of ξBC. A transformation of ξABC which saves ξAC and ξBC increases the value of I(A;B|C); on the other hand, if ξA is transformed while ξBC is saved, then I(A;B|C) decreases. Note that the paper [29] deals with a more practical use of conditional mutual information: it is applied to the problem of finding relevant factors in medical decision-making.
Special level-specific measures of dependence were introduced. While the value M(A) of the multiinformation function is viewed as a measure of global stochastic dependence within [ξi]_{i∈A}, the value of λ(r, A) (for 1 ≤ r ≤ card A - 1) is interpreted as a measure of the strength of dependence of level r among the variables [ξi]_{i∈A}. The value of λ(r, A) is always nonnegative and vanishes iff ξi is conditionally independent of ξj given ξK for arbitrary distinct i, j ∈ A, K ⊂ A, card K = r - 1. And of course, the sum of the λ(r, A)'s is just M(A). Note that the measures λ(r, A) are certain multiples of Han's [8] measures of multivariate symmetric correlation.
Finally, we have used the multiinformation function as a tool to show that conditional independence models have no finite axiomatic characterization. A didactic proof of this result, originally shown in [20], is given. We analyze thoroughly the syntax and semantics of inference rule schemata (= axioms) which characterize formal properties of conditional independence models. The result of the analysis is that two principal features of such schemata are pointed out: the inference rules should be (probabilistically) sound and perfect. To derive the nonaxiomatizability result one has to find an infinite collection of sound and perfect inference rules. In the verification of both soundness and perfectness the multiinformation function proved to be an effective tool.
Let us add a remark concerning the concept of a perfect rule. We have used this concept only in the proof of the nonaxiomatizability result. However, our aim is a bit deeper, in fact. We (vaguely) guess that probabilistic
independency models have a certain uniquely determined "minimal" axiomatic characterization, which is of course infinite. In particular, we conjecture that the semigraphoid inference rules and the perfect probabilistically sound pure inference rules together form the desired axiomatic characterization of probabilistic independency models.

Acknowledgments We would like to express our gratitude to our colleague Frantisek Matus, who directed our attention to the paper [8]. We also thank both reviewers for their valuable comments and corrections of grammatical errors. This work was partially supported by the grant VS 96008 of the Ministry of Education of the Czech Republic and by the grant 201/98/0478 "Conditional independence structures: information theoretical approach" of the Grant Agency of the Czech Republic.

References
1. de Campos, L.M. (1995) Independence relationships in possibility theory and their application to learning in belief networks, in G. Della Riccia, R. Kruse and R. Viertl (eds.), Mathematical and Statistical Methods in Artificial Intelligence, Springer Verlag, 119-130.
2. Csiszar, I. (1975) I-divergence geometry of probability distributions and minimization problems, Ann. Probab., 3, 146-158.
3. Cover, T.M., and Thomas, J.A. (1991) Elements of Information Theory, John Wiley, New York.
4. Darroch, J.N., Lauritzen, S.L., and Speed, T.P. (1980) Markov fields and log-linear interaction models for contingency tables, Ann. Statist., 8, 522-539.
5. Dawid, A.P. (1979) Conditional independence in statistical theory, J. Roy. Stat. Soc. B, 41, 1-31.
6. Fonck, P. (1994) Conditional independence in possibility theory, in R.L. de Mantaras and D. Poole (eds.), Uncertainty in Artificial Intelligence: proceedings of the 10th conference, Morgan Kaufmann, San Francisco, 221-226.
7. Gallager, R.G. (1968) Information Theory and Reliable Communication, John Wiley, New York.
8. Han, T.S. (1978) Nonnegative entropy of multivariate symmetric correlations, Information and Control, 36, 113-156.
9. Malvestuto, F.M. (1983) Theory of random observables in relational data bases, Inform. Systems, 8, 281-289.
10. Matus, F., and Studeny, M. (1995) Conditional independencies among four random variables I., Combinatorics, Probability and Computing, 4, 269-278.
11. Matus, F. (1995) Conditional independencies among four random variables II., Combinatorics, Probability and Computing, 4, 407-417.
12. Matus, F. (1998) Conditional independencies among four random variables III., submitted to Combinatorics, Probability and Computing.
13. Pearl, J., and Paz, A. (1987) Graphoids: graph-based logic for reasoning about relevance relations, in B. Du Boulay, D. Hogg and L. Steels (eds.), Advances in Artificial Intelligence - II, North Holland, Amsterdam, pp. 357-363.
14. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: networks of plausible inference, Morgan Kaufmann, San Mateo.
15. Perez, A. (1977) ε-admissible simplifications of the dependence structure of a set of random variables, Kybernetika, 13, 439-449.
16. Renyi, A. (1959) On measures of dependence, Acta Math. Acad. Sci. Hung., 10, 441-451.
17. Spohn, W. (1980) Stochastic independence, causal independence and shieldability, J. Philos. Logic, 9, 73-99.
18. Studeny, M. (1987) Asymptotic behaviour of empirical multiinformation, Kybernetika, 23, 124-135.
19. Studeny, M. (1989) Multiinformation and the problem of characterization of conditional independence relations, Problems of Control and Information Theory, 18, 3-16.
20. Studeny, M. (1992) Conditional independence relations have no finite complete characterization, in S. Kubik and J.A. Visek (eds.), Information Theory, Statistical Decision Functions and Random Processes: proceedings of the 11th Prague conference - B, Kluwer, Dordrecht (also Academia, Prague), pp. 377-396.
21. Studeny, M. (1987) The concept of multiinformation in probabilistic decision-making (in Czech), PhD thesis, Institute of Information Theory and Automation, Czechoslovak Academy of Sciences, Prague.
22. Vejnarova, J. (1994) A few remarks on measures of uncertainty in Dempster-Shafer theory, Int. J. General Systems, 22, pp. 233-243.
23. Vejnarova, J. (1997) Measures of uncertainty and independence concept in different calculi, accepted to EPIA'97.
24. Watanabe, S. (1960) Information theoretical analysis of multivariate correlation, IBM Journal of Research and Development, 4, pp. 66-81.
25. Watanabe, S. (1969) Knowing and Guessing: a qualitative study of inference and information, John Wiley, New York.
26. Xiang, Y., Wong, S.K.M., and Cercone, N. (1996) Critical remarks on single link search in learning belief networks, in E. Horvitz and F. Jensen (eds.), Uncertainty in Artificial Intelligence: proceedings of the 12th conference, Morgan Kaufmann, San Francisco, 564-571.
27. Zhang, Z., and Yeung, R. (1997) A non-Shannon type conditional information inequality, to appear in IEEE Transactions on Information Theory.
28. Zvarova, J. (1974) On measures of statistical dependence, Casopis pro pestovani matematiky, 99, 15-29.
29. Zvarova, J., and Studeny, M. (1997) Information-theoretical approach to constitution and reduction of medical data, Int. J. Medical Informatics, 45, 65-74.
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
DAVID HECKERMAN
Microsoft Research, Bldg 98
Redmond WA, 98052-6399
heckerma@microsoft.com
Abstract. A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

1. Introduction

A Bayesian network is a graphical model for probabilistic relationships among a set of variables. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman et al., 1995a). More recently, researchers
have developed methods for learning Bayesian networks from data . The techniques that have been developed are new and still evolving , but they have been shown to be remarkably effective for some data -analysis prob lems. In this paper , we provide a tutorial on Bayesian networks and associated Bayesian techniques for extracting and encoding knowledge from data . There are numerous representations available for data analysis , including rule bases, decision trees, and artificial neural networks ; and there are many techniques for data analysis such as density estimation , classification , regression , and clustering . So what do Bayesian networks and Bayesian methods have to offer ? There are at least four answers. One, Bayesian networks can readily handle incomplete data sets. For example , consider a classification or regression problem where two of the explanatory or input variables are strongly anti -correlated . This correlation is not a problem for standard supervised learning techniques , provided all inputs are measured in every case. When one of the inputs is not observed , however , most models will produce an inaccurate prediction , because they do not encode the correlation between the input variables . Bayesian networks offer a natural way to encode such dependencies. Two , Bayesian networks allow one to learn about causal relationships . Learning about causal relationships are important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain , for example , during exploratory data analysis . In addi tion , knowledge of causal relationships allows us to make predictions in the presence of interventions . For example , a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product . To answer this question , the analyst can determine whether or not the advertisement is a cause for increased sales, and to what degree. The use of Bayesian networks helps to answer such questions even when no experiment about the effects of increased exposure is available . Three , Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data . Anyone who has performed a real-world analysis knows the importance of prior or domain knowledge , especially when data is scarce or expensive . The fact that some commercial systems (i .e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge . Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward . In addition , Bayesian networks encode the strength of causal relationships with probabilities . Consequently , prior knowledge and data can be combined with well-studied techniques from Bayesian statistics .
Four, Bayesian methods in conjunction with Bayesian networks and other types of models offer an efficient and principled approach for avoiding the overfitting of data. As we shall see, there is no need to hold out some of the available data for testing. Using the Bayesian approach, models can be "smoothed" in such a way that all available data can be used for training.
This tutorial is organized as follows. In Section 2, we discuss the Bayesian interpretation of probability and review methods from Bayesian statistics for combining prior knowledge with data. In Section 3, we describe Bayesian networks and discuss how they can be constructed from prior knowledge alone. In Section 4, we discuss algorithms for probabilistic inference in a Bayesian network. In Sections 5 and 6, we show how to learn the probabilities in a fixed Bayesian-network structure, and describe techniques for handling incomplete data including Monte-Carlo methods and the Gaussian approximation. In Sections 7 through 12, we show how to learn both the probabilities and structure of a Bayesian network. Topics discussed include methods for assessing priors for Bayesian-network structure and parameters, and methods for avoiding the overfitting of data including Monte-Carlo, Laplace, BIC, and MDL approximations. In Sections 13 and 14, we describe the relationships between Bayesian-network techniques and methods for supervised and unsupervised learning. In Section 15, we show how Bayesian networks facilitate the learning of causal relationships. In Section 16, we illustrate techniques discussed in the tutorial using a real-world case study. In Section 17, we give pointers to software and additional literature.
.
2 . The Bayesian To understand
Approach
to Probability
Bayesian networks
and Statistics
and associated learning
techniques , it is
important to understand the Bayesian approach to probability and statis tics . In this section , we provide an introduction to the Bayesian approach for those readers familiar only with the classical view . In a nutshell , the Bayesian probability of an event x is a person 's degree of beliefin that event . Whereas a classical probability is a physical property
of the world (e.g., the probability that a coin will land heads), a Bayesian probability is a property of the person who assignsthe probability (e.g., your degree of belief that the coin will land heads) . To keep these two concepts of probability distinct , we refer to the classical probability of an event as the true or physical probability of that event , and refer to a degree of belief in an event as a Bayesian or personal probability . Alternatively , when the meaning is clear , we refer to a Bayesian probability simply as a probability . One important difference between physical probability and personal
304
DAVIDHECKERMAN
probability is that , to measure the latter , we do not need repeated tri als. For example , imagine the repeated tosses of a sugar cube onto a wet surface . Every time the cube is tossed, its dimensions will change slightly . Thus , although the classical statistician has a hard time measurin ~ th ~ ......,
probability that the cube will land with a particular face up , the Bayesian simply restricts his or her attention to the next toss, and assigns a proba bility . As another example , consider the question : What is the probability
that the Chicago Bulls will win the championship in 2001? Here, the classical statistician
must remain silent , whereas the Bayesian can assign a
probability (and perhaps make a bit of money in the process). One common criticism of the Bayesian definition of probability is that probabilities seem arbitrary . Why should de~rees of belief satisfy the rules of -
-
probability ? On what scale should probabilities be measured? In particular ,
it makes senseto assign a probability of one (zero) to an event that will (not) occur, but what probabilities do we aBsignto beliefs that are not at the extremes ? Not surprisingly , these questions have been studied intensely . With regards to the first question , many researchers have suggested different sets of properties that should be satisfied by degrees of belief
(e.g., Ramsey 1931, Cox 1946, Good 1950, Savage1954, DeFinetti 1970). It turns out that each set of properties
leads to the same rules : the rules of
probability . Although each set of properties is in itself compelling , the fact that different sets all lead to the rules of probability provides a particularly strong argument for using probability to measure beliefs. The answer to the question of scale follows from a simple observation : people find it fairly easy to say that two events are equally likely . For exam-
ple, imagine a simplified wheel of fortune having only two regions (shaded and not shaded) , such M the one illustrated in Figure 1. Assuming everything about the wheel as symmetric (except for shading) , you should conclude that it is equally likely for the wheel to stop in anyone position .
From this judgment and the sum rule of probability (probabilities of mutually exclusive and collectively exhaustive sum to one) , it follows that your probability
that the wheel will stop in the shaded region is the percent area
of the wheel that is shaded (in this case, 0.3). This probability wheel now provides a reference for measuring your probabilities of other events. For example , what is your probability that Al Gore will run on the Democratic
ticket in 2000 ? First , ask yourself
the question :
Is it more likely that Gore will run or that the wheel when spun will stop in the shaded region ? If you think that it is more likely that Gore will run , then imagine another wheel where the shaded region is larger . If you think that it is more likely that the wheel will stop in the shaded region , then imagine another wheel where the shaded region is smaller . Now , repeat this process until you think that Gore running and the wheel stopping in the
Figure 1. The probability wheel: a tool for assessing probabilities.
shaded region are equally likely . At this point , yo.ur probability
that Gore
will run is just the percent surface area of the shaded area on the wheel .
In general , the process of measuring a degree of belief is commonly referred to as a probability assessment. The technique for assessment that we have just
described
is one of many available
techniques
discussed in
the Management Science, Operations Research, and Psychology literature . One problem with probability assessment that is addressed in this litera ture is that of precision . Can one really say that his or her probability for event
x is 0 .601 and
not
0 .599 ? In most
cases , no . Nonetheless
, in most
cases, probabilities are used to make decisions, and these decisions are not sensitive to small variations in probabilities . Well -established practices of sensitivity
analysis help one to know when additional
precision
is unneces -
sary (e.g., Howard and Matheson , 1983) . Another problem with probability assessment is that of accuracy . F'or example , recent experiences or the way a question is phrased can lead to assessmentsthat do not reflect a person 's true beliefs (Tversky and Kahneman , 1974) . Methods for improving accu-
racy can be found in the decision-analysis literature (e.g, Spetzler et ale (1975)) . Now let us turn
to the issue of learning
with
data . To illustrate
the
Bayesian approach , consider a common thumbtack - .- -one with a round , flat head that c.an be found in most supermarkets . If we throw the thumbtack
up in the air, it will come to rest either on its point (heads) or on its head (tails) .1 Supposewe flip the thumbtack N + 1 times, making sure that the physical properties of the thumbtack and the conditions under which it is flipped remain stable over time . From the first N observations , we want to determine the probability of heads on the N + 1th toss. In the classical analysis of this problem , we assert that there is some physical probability of heads, which is unknown . We estimate .this physical probability from the N observations using c,riteria such as low bias and low variance . We then use this estimate as our probability for heads on the N + 1th toss. In the Bayesian approach , we also assert that there is 1This example is taken from Howard (1970).
306
DAVIDHECKERMAN
some physical probability
of heads, but we encode our uncertainty about
this physical probability using (Bayesian) probabilities, and use the rules of probability to compute our probability of heads on the N + Ith toss.2 To examine the Bayesian analysis of this problem , we need some nota -
tion . We denote a variable by an upper-case letter (e.g., X , Y, Xi , 8 ), and the state or value of a corresponding variable by that same letter in-- lower -- ---
-
case (e.g., x , Y, Xi, fJ) . We denote a set of variables by a bold-face uppercase letter (e.g., X , Y , Xi ). We use a corresponding bold-face lower-case letter (e.g., x , Y, Xi) to denote an assignmentof state or value to each variable in a given set. We say that variable set X is in configuration x . We
use p(X = xl ~) (or p(xl~) as a shorthand) to denote the probability that X == x of a person with state of information ~. We also use p(x It;) to denote the probability
distribution
for X (both mass functions and density ~
functions) . Whether p(xl~) refers to a probability , a probability density, or a probability distribution will be clear from context . We use this notation for probability throughout the paper . A summary of all notation is given at the end of the chapter . Returning to the thumbtack problem , we define e to be a variable3 whose values () correspond to the possible true values of the physical probability . We sometimes refer to (J as a parameter . We eXDress the uncerA
tainty about e using the probability density function p(()I~) . In addition , we use Xl to denote the variable representing the outcome of the Ith flip ,
1 == 1, . . ., N + 1, and D = { X I = Xl , . . . , XN = XN} to denote the set of our observations . Thus , in Bayesian terms , the thumbtack
problem
reduces
to computing p(XN+IID , ~) from p((}I~). To do so, we first use Bayes' rule to obtain the probability for e given D and background knowledge ~:
distribution
p((}ID ,~)=!J!I~ p(DI p)(DI ~)(}~
(1)
where
p(DI~)=Jp(DlfJ ,~)p(fJl ~)dfJ
(2)
Next, we expand the term p(DlfJ, ~) . Both Bayesiansand claBsicalstatisti cians agree on this term : it is the likelihood function for binomial sampling . 2Strictly speaking, a probability belongsto a singleperson, not a collectionof people. Nonetheless , in parts of this discussion , we refer to "our " probability English .
to avoid awkward
3Bayesianstypically refer to e as an uncertain variable, becausethe value of e is uncertain
. In contrast
, classical
statisticians
often
refer to e as a random
variable . In this
text , we refer to e and all uncertain / random variables simply as variables.
A TUTORIALON LEARNINGWITH BAYESIANNETWORKS 307 In particular, giventhe valueof 8 , the observationsin D are mutually independent , and the probability of heads(tails) on anyone observationis () (1 - lJ). Consequently , Equation 1 becomes p(fJID,~) == !!(~I~) Oh(1 - _Of p(DI~)
(3)
wherehand t arethe numberof headsandtails observedin D, respectively . The probability distributions p(OI~) and p(OID,~) are commonlyreferred to as the prior and posteriorfor 8 , respectively . The quantitieshand tare said to be sufficientstatisticsfor binomialsampling, becausethey provide a summarizationof the data that is sufficientto computethe posterior from the prior. Finally, we averageoverthe possiblevaluesof 8 (usingthe expansionrule of probability) to determinethe probabilitythat the N + Ith tossof the thumbtackwill comeup heads: p(XN+l == headsID,~) == J p(XN+l == headsIO , ~) p(OID,~) dO = J 0 p(OID,~) dO== Ep(8ID,<)(0)
(4)
whereEp((JID,~)(fJ) denotesthe expectationof fJwith respectto the distribution p(lJlD , ~). To completethe Bayesianstory for this example, we needa methodto assessthe prior distribution for 8 . A commonapproach , usually adopted for convenience , is to assumethat this distribution is a betadistribution: p
(
fJl
~
)
=
Beta
(
fJiah
,
at
)
=
:
-
r
r
where
ah
ah
+
r
(
l
to
at
)
>
,
=
0
and
=
1
and
r
.
(
.
The
be
of
The
ID
,
~
)
ah
at
-
r (
(
ah
+
are
a
+ h
)
so
also
N r
(
that
the
_ _ +
t
( JQh
.
in
l
l
(
1
-
(
l
-
beta
)
( X.
t
-
l
(
distribution
r
(
to
Figure
)
x
+
,
l
)
=
as
hyperparameters
can
be
xr
(
5
Qt
+
t
-
a
)
=
and
and
normalized
.
By
Equation
3
,
the
:
l
=
=
Beta
(
( }
lah
+
h
,
at
+
tions
say
that
for
the
binomial
set
of
sampling
beta
distributions
.
Also
t
)
6
)
)
(
We
)
.
.
tion
x
ah
2
bu
( J
fJ
hyperparameters
distri
-
-
reasons
beta
h
h
the
The
several
+
( X.
distribution
shown
a
O )
referred
( )
be
) at
_
at
satisfies
often
for
will
_ (
of
parameter
convenient
tion
) r
which
are
is
bu
r
and
the
zero
a )
parameters
distributions
prior
=
the
(
ah
function
than
distri
( }
are
from
beta
beta
posterior
0
Gamma
greater
Examples
(
>
the
them
must
p
is
quantities
distinguish
at
at
)
(
is
,
the
expectation
a
conjugate
family
of
( J
with
of
respect
distribu
-
to
this
308
DAVIDHECKERMAN
B
o [ ZJ
Bela ( I , I )
Beta ( 2,2 )
Figure 2.
distribution
hag
a
simple
form
Beta ( 3,2 )
Beta ( 19,39 )
Several beta distributions .
:
J IJBeta(IJIG G 'h ' h, G 't) dIJ= -;; Hence of
, heads
given in
a the
beta N
prior +
Ith
, toss
(7)
we have a simple expressionfor the probability :
P(XN +l=headsID ,~)=~ ~.:!:_~ o:+N
(8)
Assuming p ((JI~) is a beta distribution , it can be assessedin a number of ways . For example , we can assessour probability for heads in the first toss of the thumbtack (e.g., using a probability wheel) . Next , we can imagine having seen the outcomes of k flips , and reassessour probability for heads in the next toss . From Equation 8, we have (for k = 1)
G 'h ah+1 p(X1=headsl ~)=G =heads ,~)=G 'h+at p(X2=headslX1 'h+at+1 Given these probabilities, we can solve for ah and at . This assessment technique is known as the method of imagined future data. Another assessmentmethod is basedon Equation 6. This equation says that , if we start with a Beta(O, O) prior4 and observe ah heads and at tails , then our posterior (i.e., new prior) will be a Beta(ah, at ) distribution . Recognizingthat a Beta(O, 0) prior encodesa state of minimum information , we can assessO'h and at by determining the (possibly fractional) number of observations of heads and tails that is equivalent to our actual knowledge about flipping thumbtacks. Alternatively , we can assessp(Xl = headsl~) and 0' , which can be regarded as an equivalent sample size for our current knowledge. This technique is known as the method of equivalent samples.
4Technically ,be the hyperparameters prior should besmall positive numbers so that p(81 ~)can normalized . ofthis
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 309 Other techniques for assessing beta distributions
are discussed by Winkler
(1967) and Chaloner and Duncan (1983) . Although the beta prior is convenient , it is not accurate for some prob lems. For example , suppose we think that the thumbtack may have been purchased at a magic shop . In this case, a more appropriate prior may be a mixture of beta distributions - for example ,
p((JI~) = 0.4 Beta(20, 1) + 0.4 Beta(l , 20) + 0.2 Beta(2, 2) where 0.4 is our probability that the thumbtack is heavily weighted toward
heads (tails) . In effect, we have introduced an additional hidden or unobserved variable H , whose states correspond to the three possibilities: (1) thumbtack is biased toward heads, (2) thumbtack is biased toward tails ,
and (3) thumbtack is normal; and we have assertedthat (J conditioned on each state of H is a beta distribution . In general , there are simple methods
(e.g., the method of imagined future data) for determining whether or not a beta prior is an accurate reflection of one's beliefs . In those cases where the beta prior
introducing
is inaccurate , an accurate
prior
can often
be assessed by
additional hidden variables , as in this example .
So far , we have only considered observations
drawn from a binomial
dis -
tribution . In general , observations may be drawn from any physical proba -
bility distribution :
p(xllJ, ~) = f (x , lJ)
where f (x , 6) is the likelihood function with parameters 6. For purposes of this discussion , we assume that
the number
of parameters
is finite . As
an example , X may be a continuous variable and have a Gaussian physical probability distribution with mean JLand variance v :
p(xI9,~) ==(27rV )-1/2e-(X-J.L)2/2v where (J == { J.L, v } . Regardless of the functional form , we can learn about the parameters given data using the Bayesian approach . As we have done in the binomial case, we define variables corresponding to the unknown parameters , assign priors to these variables , and use Bayes' rule to update our beliefs about these parameters given data :
p(8ID,~)
(pJ(,~ ) p ( 81 ~ ) --p(DI DI ~)
We then average over the possible values of e to make predictions.
(9) For
example ,
P(XN+lID,~) = J P(XN+119 ,~) p(9ID,~) d8
(10)
310
DAVIDHECKERMAN
For a class of distributions known a8 the exponential family , these computations can be done e-fficiently and in closed form .5 Members of this claBSinclude the binomial, multinomial , normal, Gamma, Poisson, and multivariate-normal distributions . Each member of this family has sufficient statistics that are of fixed dimension for any random sample, and a simple conjugate prior .6 Bernardo and Smith (pp. 436- 442, 1994) havecompiled the important quantities and Bayesian computations for commonlv ., used members of the exponential family. Here, we summarize these items for multinomial sampling, which we use to illustrate many of the ideas in this paper. In multinomial sampling, the observedvariable X is discrete, having r possible states xl , . . . , xr . The likelihood function is given by p(x = XkIlJ, ~) = {}k,
k = 1, . . . , r
where () = { (}2, . . . , Or} are the parameters. (The parameter (JI is given by 1 - )=:%=2 (Jk.) In -this case, as in the case of binomial sampling, the parameters correspond to physical probabilities. The sufficient statistics for data set D = { Xl = Xl , . . . , XN = XN} are { NI , . . . , Nr } , where Ni is the number of times X = xk in D . The simple conjugate prior used with multinomial Jampling is the Dirichlet distribution : p ( 81 ~ ) == Dir where p ( 8ID
0
=
distribution lent
Ei
, ~ ) == Dir samples
( 81G: l , . . . , G: r ) =
= l Ok , and ( 8lo1
, including , can
also
Q' k >
I1r k =r l ( rG: )( Ok ) kIIlJ =l
0 , k == 1 , . . . , r . The
+ N1 , . . . , Or + Nr ) . Techniques the be
methods used
to
conjugate prior and data set D , observation is given by
posterior for
imagined
future
assess
Dirichlet
distributions
probability
distribution
assessing
of
the
(11)
~k- l
distribution
data
and
the
. Given for
beta
equiva
the
-
this next
p(XN +l == x k ID , ~) == J (Jk Dlr :'l + Nl , . . ., O :'r + Nr ) d8 == O . (810 :'k Nk a+ N (12) As we shall see, another important quantity in Bayesian analysis is the marginal likelihood or evidencep(D I~). In this case, we have p(DI~) :=
r (a) - . II r (O :'k + N~l r (a: + N ) k=l r (O :'k)
(13)
5Recent advances in Monte-Carlomethodshavemadeit possibleto workefficiently with manydistributionsoutsidethe exponential family. See , for example , Gilkset al. (1996 ). 6Infact, exceptfor a few, well-characterized exceptions , theexponential familyis the only classof distributionsthat havesufficientstatisticsof fixeddimension(Koopman , 1936 ; Pitman, 1936 ).
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 311 We note
that
the explicit
mention
cause it reinforces
the notion
once
is firmly
the
this
concept
remainder
of this
In closing classical
this
section
probability
opposite
of the classical
Namely
, in the
imagine all the binomial some
and
data
will
adds
same
from
the
clutter
Bayesian
prediction
are
data . As an illustration
in a manner
" estimate
that
,
" for the
is essentially
, fJ is fixed
( albeit
unknown
the
) , and
may be generated by sampling by fJ. Each data set D will occur
produce
an estimate
the expectation
and variance
we
from with
fJ* ( D ) . To evaluate of the estimate
with
sets : L p ( DI (}) ()* ( D ) D
Varp (DIB) ( (}* )
==
L p ( DI (}) ( ()* ( D ) - Ep (DIB) ( (}* ) ) 2 D
an estimator
of these
and
, they
==
choose
. In
.
Ep (DIB) ( (}* )
variance
,
.
approach
p ( DlfJ ) and
the
. Nonetheless
~ explicitly
. Here , the Bayesian
is obtained
approach
classical
, we compute
to all such
We then
of heads
simply
that , although
for learning
problem
~ is useful , be -
are subjective mention
yield
data sets of size N that distribution determined
probability
an estimator respect
methods
of knowledge
notation
not
sometimes
the thumbtack
physical
, we shall
, we emphasize
may
different
let us revisit
of the state probabilities
in place , the
tutorial
approaches
fundamentally
that
that
estimates
somehow over the
balances possible
(14)
the bias ( () - Ep (D 18) ( (}* ) ) values
for fJ.7 Finally
, we
apply this estimator to the data set that used estimator is the maximum - likelihood
we actually observe . A commonly (ML ) estimator , which selects the
value
p ( D I 0 ) . For binomial
of lJ that
maximizes
the likelihood
sampling
, we
have
OML (D) == ~ r~ N -k L-Ik=l Forthis(andothertypes ) ofsampling , theML estimator is unbiased . That is, for all valuesof 0, theML estimator haszerobias. In addition , for all values of (J, thevariance of theML estimator is nogreater thanthatof any otherunbiased estimator (see , e.g., Schervish , 1995 ). In contrast , in the Bayesian approach , D is fixed, andweimagine all possible valuesof (Jfromwhichthisdataset couldhavebeen generated . Given(J, the "estimate " of thephysical probability of heads isjust (Jitself. Nonetheless , weareuncertain about(J, andsoour finalestimate is the expectation of (Jwithrespect to ourposterior beliefs aboutits value : Ep(BID,~)(O) = J 0 p (OID, E) d(}
(15)
7Low bias and varianceare not the only desirablepropertiesof an estimator. Other desirablepropertiesinclude consistencyand robustness .
312
DAVIDHECKERMAN
The expectations in Equations 14 and 15 are different and , in many cases, lead to different "estimates " . One way to frame this difference is to say that the classical and Bayesian approaches have different definitions for what it means to be a good estimator . Both solutions are "correct " in that they are self consistent . Unfortunately , both methods have their draw backs, which h~ lead to endless debates about the merit of each approach . For example , Bayesians argue that it does not make sense to consider the expectations in Equation 14, because we only see a single data set . If we saw more than one data set, we should combine them into one larger data set . In contrast , cl~ sical statisticians argue that sufficiently accurate priors can not be assessed in many situations . The common view that seems to be emerging is that one should use whatever method that is most sensible for the task at hand . We share this view , although we also believe that the Bayesian approach has been under used, especially in light of its advantages mentiol )ed in the introduction (points three and four ) . Consequently , in this paper , we concentrate on the Bayesian approach . 3 . Bayesian
N etwor ks
So far , we have considered only simple problems with one or a few variables . In real learning problems , however , we are typically interested in looking for relationships among a large number of variables . The Bayesian network is a representation suited to this task . It is a graphical model that efficiently encodes the joint probability distribution (physical or Bayesian ) for a large set of variables . In this section , we define a Bayesian network and show how one can be constructed from prior knowledge . A Bayesian network for a set of variables X = { Xl , . . . , Xn } consists of (1) a network structure S that encodes a set of conditional independence assertions about variables in X , and (2) a set P of local probability distri butions associated with each variable . Together , these components define the joint probability distribution for X . The network structure S is a di rected acyclic graph . The nodes in S are in one-to- one correspondence with the variables X . We use Xi to denote both the variable and its correspond ing node , and PSi to denote the parents of node Xi in S as well as the variables corresponding to those parents . The lack of possible arcs in S encode conditional independencies . In particular , given structure S , the joint probability distribution for X is given by
n p(x) ==i=l IIp (xilpai )
(16)
The local probabili ty distributions P are the distributions corresponding to the terms in the product of Equation 16. Consequently, the pair (8 , P)
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
313
encodesthe joint distribution p(x ). The probabilities encoded by a Bayesian network may be Bayesian or physical . When building Bayesian networks from prior knowledge alone , the probabilities will be Bayesian . When learning these networks from data , the
probabilities will be physical (and their values may be uncertain) . In subsequent sections , we describe how we can learn the structure and probabilities of a Bayesian
network
from data . In the remainder
of this section , we ex-
plore the construction of Bayesian networks from prior knowledge . As we shall see in Section 10, this procedure can be useful in learning Bayesian networks
~
well .
To illustrate the process of building a Bayesian network , consider the problem of detecting credit -card fraud . We begin by determining the vari ables to model . One possible choice of variables for our problem is Fraud
(F ) , Gas (G) , Jewelry (J ), Age (A), and Sex (8 ) , representing whether or not the current purchase is fraudulent , whether or not there was a gaB purchase in the last 24 hours , whether or not there w~ a jewelry purch ~ e in the last 24 hours , and the age and sex of the card holder , respectively . The states of these variables are shown in Figure 3. Of course, in a realistic problem , we would include many more variables . Also , we could model the
states
of one or more
of these
variables
at a finer
level
of detail
. For
example , we could let Age be a continuous variable . This initial task is not always straightforward . As part of this task we must (1) correctly identify the goals of modeling (e.g., prediction versus ex-
planation versus exploration), (2) identify many possibleobservationsthat may be relevant to the problem, (3) determine what subset of those observations is worthwhile to model, and (4) organize the observations into variables having mutually exclusive and collectively exhaustive states . Diffi culties here are not unique to modeling with Bayesian networks , but rather are common to most approaches. Although there are no clean solutions , some guidance is offered by decision analysts (e.g., Howard and Matheson ,
1983) and (when data are available) statisticians (e.g., Tukey, 1977). In the next phase of Bayesian-network construction , we build a directed acyclic graph that encodes assertions of conditional independence . One approach for doing so is based on the following observations . From the chain rule of probability , we have n
p(x) ==II p(xilxl, . . ., Xi- I)
(17)
i = 1
Now, for every Xi , there will be some subset IIi <; { XI , . . ., Xi - I } such that Xi and { X I , . . . , X i- I } \ lli are conditionally independent given Ili . That
is , for any x ,
p(xilxl , . . . , Xi- I ) = p(xil7ri)
(18)
314
DAVIDHECKERMAN p(a=<30) = 0.25 p(a=30- 50) = 0.40
p(g=yeslf=yes} = 0.2 p(g=yeslj=no} = 0.01
Figure 9 .
A Bayesian -network
p(j =yeslj=yes,a= *,s= *) = 0.05 p(j =yeslj=no,a=<30,s=male) = 0..0001 p(j =yeslj=no,a=30-50,s=male) = 0.0004 p(j =yeslj=no,a=>50,s=male) = 0.0002 p(j =yeslj=no,a=<30,s=jemale) = 0..0005 p(j =yeslj=no,a=30-50,s=female) = 0.002 p(j =yeslj=no,a=>50,s=female) = 0.001 for detecting
credit -card fraud . Arcs are drawn from
cause to effect. The local probability distribution (s) associated with a node are shown adjacent to the node . An asterisk is a shorthand
for "any state ."
Combining Equations 17 and 18, we obtain n
p(x) = IIp (xil7ri)
(19)
i= l
Comparing Equations 16 and 19, we seethat the variables sets (III , . . . , lln ) correspond to the Bayesian-network parents (Pal , . . . , Pan) , which in turn fully specify the arcs in the network structure S .
Consequently, to determine the structure of a Bayesian network we (1) order the variables somehow, and (2) determine the variables sets that satisfy Equation 18 for i = 1, . . . , n . In our example , using the ordering
(F, A , S, G, J ) , we have the conditional independencies p(alf ) = p(slf , a) = p(glf , a, s) = p(jlf , a, s, g) =
p(a) p(s) p(glf ) p(jlf , a, s)
(20)
Thus , we obtain the structure shown in Figure 3. This approach
has a serious drawback . If we choose the variable
order
carelessly, the resulting network structure may fail to reveal many conditional independenciesamong the variables. For example, if we construct a Bayesian network for the fraud problem using the ordering (J, G, S, A , F ) ,
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
315
we obtain a fully connected network structure . Thus , in the worst case, we have to explore n ! variable orderings to find the best one. Fortunately , there is another technique for constructing Bayesian networks that does not require an ordering . The approach is based on two observations : (1) people can often readily assert causal relationships among variables , and (2) causal relationships typically correspond to assertions of conditional dependence. In particular , to construct a Bayesian network for a given set of variables , we simply draw arcs from cause variables to their immediate effects . In almost all cases, doing so results in a network structure that satisfies the definition Equation 16. For example , given the assertions that Fraud is a direct cause of Gas, and Fraud , Age, and Sex are direct causes of Jewelry , we obtain the network structure in Figure 3. The causal semantics of Bayesian networks are in large part responsible for the success of Bayesian networks as a representation for expert systems (Heckerman et al ., 1995a) . In Section 15, we will see how to learn causal relationships from data using these causal semantics . In the final step of constructing a Bayesian network , we assessthe local probability distribution (s) p (xilpai ) . In our fraud example , where all vari ables are discrete , we assessone distribution for Xi for every configuration of P ~ . Example distributions are shown in Figure 3. Note that , although we have described these construction steps as a simple sequence, they are often intermingled in practice . For example , judg ments of conditional independence and/ or cause and effect can influence problem formulation . Also , assessments of probability can lead to changes in the network structure . Exercises that help one gain familiarity with the practice of building Bayesian networks can be found in Jensen (1996) .
4. Inference in a Bayesian Network Once we have constructed a Bayesian network (from prior knowledge , data , or a combination ) , we usually need to determine various probabilities of interest from the model . For example , in our problem concerning fraud detection , we want to know the probability of fraud given observations of the other variables . This probability is not stored directly in the model , and hence needs to be computed . In general , the computation of a probability of interest given a model is known as probabilistic inference . In this section we describe probabilistic inference in Bayesian networks . Because a Bayesian network for X determines a joint probability distri bution for X , we can- in principle - use the Bayesian network to compute any probability of interest . For example , from the Bayesian network in Fig ure 3, the probability of fraud given observations of the other variables can
316
DAVIDHECKERMAN
be computed
as follows :
(f,a ,g (f,a p(IIa,s,g,).)=pP (a,s,s,g .) =Llf ~p'P (f',s ,),j) ,a,g ,s,j) ,g,).)
(21)
For problems with many variables , however, this direct approach is not practical . Fortunately , at least when all variables are discrete , we can exploit the conditional independencies encoded in a Bayesian network to make this computation more efficient . In our example , given the conditional independencies in Equation 20, Equation 21 becomes
as
.
-
p(JI , , g,J) -
p(J)p(a)p(s)p(gIJ )p(jlf , a, s)
-
p(J )p(glf )p(jlf , a, s)
-
LI ' p(f ')p(glf ')pUlf ', a, s)
Several researchers have developed probabilistic for Bayesian
networks
with
-
(22)
LI ' p(J')p(a)p(s)p(gIJ')p(jIJ ', a, s)
discrete
variables
that
inference algorithms exploit
conditional
in -
dependence roughly as we have described , although with different twists . For example , Howard and Matheson (1981) , Olmsted (1983) , and Shachter
(1988) developed an algorithm that reversesarcs in the network structure until the answer to the given probabilistic query can be read directly from the graph . In this algorithm , each arc reversal corresponds to an applica -
tion of Bayes' theorem. Pearl (1986) developed a message - passing scheme that updates the probability distributions work
in response
to observations
for each node in a Bayesian net-
of one or more
variables
. Lauritzen
and
Spiegelhalter (1988) , Jensenet al. (1990) , and Dawid (1992) created an algorithm that first transforms the Bayesian network into a tree where each node in the tree corresponds to a subset of variables in X . The algorithm then exploits several mathematical properties of this tree to perform proba -
bilistic inference. Most recently, D 'Ambrosio (1991) developedan inference algorithm that simplifies sums and products symbolically , as in the trans formation from Equation 21 to 22. The most commonly used algorithm for
discrete variables is that of Lauritzen and Spiegelhalter (1988), Jensen et al (1990) , and Dawid (1992) . Methods multivariate
for
exact
- Gaussian
inference
or Gaussian
in
Bayesian
- mixture
networks
distributions
have
that
encode
been
devel -
oped by Shachter and Kenley (1989) and Lauritzen (1992), respectively. These methods also use assertions of conditional independence to simplify inference . Approximate methods for inference in Bayesian networks with other distributions , such as the generalized linear -regression model , have
also been developed (Saul et al., 1996; Jaakkola and Jordan, 1996) .
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
317
Although we use conditional independence to simplify probabilistic inference, exact inference in an arbitrary Bayesian network for discrete vari -
ables is NP-hard (Cooper, 1990). Even approximate inference (for example, Monte-Carlo methods) is NP-hard (Dagum and Luby, 1993) . The source of the difficulty
lies in undirected cycles in the Bayesian-network structure -
cycles in the structure where we ignore the directionality of the arcs. (If we add an arc from Age to Gas in the network structure of Figure 3, then we
obtain a structure with one undirected cycle: F - G - A - J - F .) When a Bayesian-network structure contains many undirected cycles, inference is intractable . For many applications , however, structures are simple enough
(or can be simplified sufficiently without sacrificing much accuracy) so that inference is efficient . For those applications where generic inference meth ods are impractical , researchers are developing techniques that are custom
tailored to particular network topologies (Heckerman1989; Suermondt and Cooper, 1991; Saul et al., 1996; Jaakkola and Jordan, 1996) or to particular inference queries (Ramamurthi and Agogino, 1988; Shachter et al., 1990; Jensen and Andersen, 1990; Darwiche and Provan, 1996) . 5 . Learning In the
next
probability
Probabilities
several
sections
distributions
set of techniques
in a Bayesian , we show
Network
how to refine
the structure
and local
of a Bayesian network given data . The result is
for data analysis that combines prior knowledge with data
to produce improved knowledge . In this section , we consider the simplest version of this problem : using data to update the probabilities of a given Bayesian
network
structure
.
Recall that , in the thumbtack problem , we do not learn the probability of heads. Instead , we update our posterior distribution for the variable that represents the physical probability of heads. We follow the same approach for probabilities in a Bayesian network . In particular , we assume- perhaps from causal knowledge about the problem - that the physical joint proba bility
distribution
for X can be encoded in some network
structure
S . We
write n
p(xllJs,sh) = IIp (xilpai, lJi, Sh)
(23)
i= l
where 0 i is the vector of parameters for the distribution p(Xi Ipai ' Oi, Sh) , Os is the vector of parameters (01' . . . , On), and Sh denotes the event (or "hypothesis" in statistics nomenclature) that the physical joint probability distribution
can be factored according to 5 .8 In addition , we assume
BAs defined here, network-structure hypotheses overlap. For example, given X = { X1 , X2 } , any joint distribution for X that can be factored according the network
318
DAVIDHECKERMAN
that
we
have
a
probability
in
Section
ing
2
a
(
(
can
(
}
encode
)
ISh
)
.
be
We
,
a
pervised
ing
more
than
/
ships
.
tic
a
Examples
of
(
e
. g
.
,
,
)
bilities
in
.
Buntine
In
a
learning
e
,
principle
,
)
Neal
In
this
and
,
,
,
each
tions
,
,
)
and
Xi
pat
by
1
,
=
-
(
E
~
.
(
~
2
.
,
ijk
Xl
pari
)
(
~
( Jijk
~
.
2
)
containing
-
-
t
Therefore
and
X
2
,
Geiger
.
)
3
1
Xi
=
=
we
overlap
should
(
1996
add
)
describe
,
. g
.
(
Ji
,
(
( } ij2
be
to
set
-
,
proba
-
deci
1996
)
,
to
-
kernel
(
learn
Fried
-
proba
-
techniques
for
models
include
the
Herskovits
1994
.
;
,
,
1992
)
Heckerman
,
and
MacKay
,
1992a
and
.
ideas
for
learning
probabilities
distribution
Ti
possible
collection
S
)
=
)
.
values
of
xt
multinomial
Pai
=
( Jijk
.
In
,
.
this
.
.
,
X
~
i
distribu
Namely
,
we
,
-
assume
we
of
(
define
,
.
.
.
0
configurations
The
the
,
( } ijri
of
parameter
( Jijl
vector
of
is
24
)
Pai
,
given
parameters
)
according
model
to
averaging
definition
conditions
(
the
.
factored
the
>
denote
for
conditions
such
used
. g
-
probabilistic
and
having
ri
also
,
Bayesian
e
,
-
probabilis
,
,
-
Thus
methods
Cooper
parameters
,
can
.
regression
Buntine
-
noth
relation
)
,
su
is
produce
be
(
)
of
for
classifi
dictionary
,
,
function
independence
studied
.
. g
a
Goldszmidt
and
of
problems
one
,
basic
I1XiEPat
presents
-
the
h
,
=
net
probabilistic
1992b
cases
e
e
1996
convenience
arc
,
regression
,
.
:
-
compute
methods
that
most
is
the
,
function
-
configuration
are
As
defin
function
of
most
,
Ipat
.
by
Bayesian
as
with
can
the
(
.
viewed
multinomial
each
qi
~
For
no
Such
in
discrete
lJij
structure
,
al
,
regression
forms
function
(
)
and
(
is
for
.
( }
Sh
1992a
)
the
X
a
D
linear
unrestricted
E
distribution
8i
,
generalized
noise
k
and
Ji
collection
1994
,
distribution
case
density
in
conditional
these
illustrate
P
where
(
Friedman
of
the
a
probability
distribution
a
linear
we
joint
as
88
prior
models
,
et
D
sample
local
,
;
Saul
using
variable
local
one
Book
physical
of
parameters
or
Nonetheless
and
tutorial
each
and
a
MacKay
generalized
;
,
regression
Gaussian
and
1993
structure
case
(
the
Xl
familiar
distribution
with
1996
;
,
any
multinomial
1992b
.
;
.
regression
Geiger
I pai
,
network
available
unrestricted
linear
. g
from
random
Readers
by
1993
,
Bayesian
are
Xi
regression
methods
1995
(
as
/
}
.
organized
(
XN
a
classification
classification
estimation
man
)
viewed
linear
,
element
the
a
.
be
networks
trees
.
probabilities
that
,
include
density
Sh
recognize
can
neural
,
.
assessing
Given
p
models
outputs
sion
siD
probabilistic
regression
bilistic
J
.
an
learning
junction
network
cation
(
and
:
distribution
will
Bayesian
,
of
distribution
learning
'
to
about
simply
the
Xl
uncertainty
stated
to
local
{
refer
8s
(
=
We
problem
p
refer
aE
.
variable
The
distribution
8i
D
X
our
valued
s
now
posterior
(
sample
of
we
-
p
work
,
vector
function
a
random
distribution
to
.
the
,
insure
network
described
no
structure
.
overlap
in
.
Section
Heckerman
7
.
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 319
sample 1
sample 2 . . .
Figure
. 4
.
A
pendence
and
Y
the
two
for
all
with
for
Bayesian
for
are
.
of
and
j
,
Given
use
We
x
use
and
x
the
to
class
The
D
is
.
that
assumption
of
structure
the
"
two
X
states
We
unrestricted
local
(
-
BsID
,
Sh
of
that
)
-
X
,
the
7
Y
parameter
.
and
inde
Both
-
variables
yand
OsISh
contrast
i )
X
to
denote
=
=
we
can
are
of
compute
Pai
the
form
no
D
Oij
distribution
-
.
closed
are
sample
vectors
)
in
there
random
this
functions
,
and
that
,
(
to
dimensional
model
efficiently
parameter
p
-
functions
is
the
low
regression
distribution
assumption
say
"
are
linear
first
sample
assumption
the
network
that
of
p
.
the
denote
term
generalized
distribution
is
depicting
of
.
the
this
sumptions
That
.
structure
parameters
distributions
example
dom
We
Y
multinomial
terior
network
the
binary
states
i
-
learning
missing
is
mutually
under
two
data
complete
pos
in
.
independent
The
the
-
as
ran
-
-
second
. 9
n qi II II p(Oij~Sh) i=1j =1
We refer to this assumption , which was introduced by Spiegelhalter and Lauritzen (1990) , as parameter independence. Given that the joint physical probability distribution factors according to some network structure S , the assumption of parameter independence can itself be represented by a larger Bayesian-network structure . For example , the network structure in Figure 4 represents the ~ sumption of parameter independence for X == { X , Y } (X , Y binary ) and the hypothesis that the network structure X ~ Y encodes the physical joint probability distribution for X . 9The computation is also straightforwardif two or more parametersare equal. For details, seeThiesson(1995).
320
DAVIDHECKERMAN
Under the assumptions of complete data and parameter independence , the parameters remain independent given a random sample :
p(lJsID , Sh) = Thus
, we
the
one
Dir
( Oij
can
update
- variable I O: ' ijl
each case
p ( OijID
where
Nijk As
rations
that
the of
p ( XN
Thus
is
in
+ IID ,
in
.
, . . . , O: ' ijri
85 , Sh
CaEe
of
Assuming
) '
, Sh
the
vector
we
)
=
number
parameters
each
obtain
( ( JijIO
cases
in , we
obtain
predictions
of
) ,
where
XN
XN
+ l ,
to
Xi
==
+ l x7
is and
+
the
D
in
Pai
which
Xi
average
pal
prior
to ,
+
=
over
. For case
==
the
, . . . , O: ' ijri
interest next
has
, just
as
in
distribution
distribution
Nijl
can
(25)
independently
( Jij
posterior
: ' ijl
example
Oij
vector
the
Dir
of
thumbtack
n qi IIlp{lJijID ,Sh ) iII=lj=
xf the
example be
k
and
Pai
=
possible
after and
(26)
)
, let
seen
where
Nijri
us D
j
pal
.
configu
-
compute .
depend
Suppose on
i .
,
p(XN +IID,Sh)= Ep (6sID ,Sh ) (g (Jijk ) To compute
this expectation
, we first
n
use the fact that
the parameters
n
J 1J1 ; II1J (Jijk p (8ijID , Sh) d8ij .=1(Jijk p (8sID , Sh) d8s = t=
fI O :'ijk + Nijk i=1 O :'ij + N ij
(27)
r' """",,r " where Qij = Lk ,: 1 Gijk and Nij = Lik -=1 Nijk These computations are simple because the unrestricted multinomial distributions are in the exponential family - Computations for linear re gression with Gaussian noise are equally straightforward (Buntine , 1994 ; Heckerman and Geiger , 1996) -
6. Methods for Incomplete
Data
Let us now discuss methods for learning about parameters when the random sample is incomplete (i .e., some variables in some cases are not observed) . An important distinction concerning missing data is whether or not the
A TUTORIAL
ON LEARNING
WITH
BAYESIAN
NETWORKS
321
absence of an observation is dependent on the actual states of the vari ables. For example , a missing datum in a drug study may indicate that a patient became too sick- perhaps due to the side effects of the drug - to
continue in the study. In contrast, if a variable is hidden (i.e., never observed in any case) , then the absenceof thi ~ data is independent of state. Although Bayesian methods and graphical models are suited to the analysis of both situations , methods for handling missing data where absence is independent of state are simpler than those where absence and state are dependent . In this tutorial , we concentrate on the simpler situation only .
Readers interested in the more complicated caseshould see Rubin (1978) , Robins (1986) , and Pearl (1995) . Continuing with our example using unrestricted multinomial distribu tions , suppose we observe a single incomplete case. Let Y C X and Z C X denote the observed and unobserved variables in the case, respectively . Under the assumption of parameter independence , we can compute the
posterior distribution of Oij for network structure S as follows:
p(OijIy, Sh) == L p(zly, Sh) p(OijIy, z, Sh)
(28)
z
== (1- p(patIY ,Sh )){P(OijISh )} + ri
L p(x7,paily ,Sh ) {p(Oijlxr ,pat,Sh )}
k= l
(See Spiegelhalter and Lauritzen (1990) for a derivation.) Each term in curly brackets in Equation 28 is a Dirichlet distribution . Thus , unless both Xi and all the variables in Pai are observed in case y , the posterior dis-
tribution of Oij will be a linear combination of Dirichlet distributions -
that is, a Dirichletmixturewithmixingcoefficients (1- p(pa1IY , Sh)) and p(xf,pa1Iy,Sh),k ==1,...,Ti. When we observe a second incomplete case, some or all of the Dirichlet components in Equation 28 will again split into Dirichlet mixtures . That is,
the posterior distribution for Oij we becomea mixture of Dirichlet mixtures. As we continue to observe incomplete cases, each missing values for Z , the
posterior distribution for Oij will contain a number of components that is exponential in the number of cases. In general , for any interesting set of local likelihoods and priors , the exact computation of the posterior distribution for Os will be intractable . Thus , we require an approximation for incomplete data .
322
DAVIDHECKERMAN
6.1. MONTE-CARLOMETHODS One class of approximations is based on Monte -Carlo or sampling methods . These approximations can be extremely accurate , provided one is willing to wait long enough for the computations to converge. In this
section
, we discuss
one of many
Monte
- Carlo
methods
known
as
Gibbs sampling, introduced by Geman and Geman (1984). Given variables X = { Xl , . . . , Xn } with some joint distribution p(x ) , we can use a Gibbs sampler to approximate the expectation of a function J (x ) with respect to p(x ) as follows. First , we choose an initial state for each of the variables in X somehow (e.g., at random). Next, we pick some variable Xi , unassign its current state , and compute its probability distribution given the states of the other n - 1 variables . Then , we sample a state for Xi
based on this probability distribution , and compute f (x ). Finally, we iterate the previous two steps, keeping track of the averagevalue of f (x ) . In the limit , as the number of cases approach infinity , this average is equal to
Ep(x) (f (x )) provided two conditions are met. First , the Gibbs sampler must be irreducible: The probability distribution p(x ) must be such that we can eventually sample any possible configuration of X given any possible ini -
tial configuration of X . For example, if p(x ) contains no zero probabilities, then the Gibbs sampler will be irreducible . Second, each Xi must be chosen infinitely often . In practice , an algorithm for deterministically rotating through the variables is typically used. Introductions to Gibbs sampling and other Monte -Carlo methods - including methods for initialization and
a discussion of convergence - are given by Neal (1993) and Madigan and York (1995) . To illustrate
Gibbs sampling , let us approximate the probability
den-
sity p(lJsfD, Sh) for someparticular configuration of lJs, given an incomplete data set D == { Yl , . . . , YN} and a Bayesian network for discrete variables with independent Dirichlet priors . To approximate p (8 siD , Sh) , we first ini tialize
the
states
of the
unobserved
variables
in each case somehow
. As a
result , we have a complete random sample Dc . Second, we choose some vari -
able Xii (variable Xi in case1) that is not observedin the original random sample D , and reassign its state according to the probability
p(xillDc \ Xii, Sh) =
distribution
P(x~l' Dc \ XilISh)
Lx ~~p(xii , Dc \ xillSh )
where Dc \ Xii denotes the data set Dc with observation Xii removed, and the sum in the denominator shall
see in Section
7 , the terms
runs over all states of variable in the numerator
Xii . As we
and denominator
can be
computed efficiently (seeEquation 35). Third , we repeat this reassignment for all unobserved variables in D , producing a new complete random sample
A TUTORIAL ON LEARNING WITH BAYESIAN NETWORKS
323
D~. Fourth , we compute the posterior density p(8sID~, Sh) as described in Equations 25 and 26. Finally , we iterate the previous three steps, and use the averageof p(8sID~, Sh) as our approximation.
6.2. THEGAUSSIAN APPROXIMATION Monte -Carlo methods yield accurate results , but they are often intractable for example , when the sample size is large . Another approximation that is more efficient than Monte -Carlo methods and often accurate for relatively large samples is the Gaussian approximation (e.g., Kass et al ., 1988; Kass and Raftery , 1995) . The idea behind this approximation is that , for large amounts of data , p ( 8sID cx
, Sh
p ( DI8s
) , Sh
Gaussian
)
. p ( 8s
ISh
distribution
.
can often be approximated as a multivariate-
) In
particular
g ( 8s
)
, let
= = log
( p ( DI8s
, Sh
)
. p ( 8sISh
(29)
) )
Also
,
define
9
s
configuration
be
also
posteriori
( MAP
nomial
to
of
g
)
configuration
maximizes )
( 8s
the
p
configura
( 8sID
! ion
about
the
8s
to
where
( 8s
negative and
using
p
-
8s
) t
Hessian
( 8sID
is of
Equation
, sh
)
( 8s
~
g
the g 29
<X
)
p
,
)
p
)
-
we
and
that is
Using
a g
maximizes
( 8s
of
-
ro at
Os
known
_w 8s
vector .
as
second
Raising
the
( 8s
) ,
( 8s
-
we
g
s ) .
This a
Taylor
poly
-
obtain
t
8s
( 8s
( 9
maximum
degree
) A
9
(30)
)
( 8s
lis )
) , to
and the
A power
is
the of
e
obtain
( DI8s
( DI8s
) ,
s
-
2
evaluated
, Sh -
~
.
1
( 8s
transpose
( 9s
8s
8
approximate
g
, Sh
of
of
)
p
h , S
( 8sISh -
)
p
( 8slS
) h )
( 31
)
1 - t exp{ - 2(9s- 9s)A(9s- 9s) }
Hence, p(9sID , Sh) is approximately Gaussian. To compute the Gaussian approximation,_we must compute Os as well as the negative Hessianof 9 (~s) evaluated at 9 s. In the following section, we discussmethods for finding 8s. Meng and Rubin (1991) describea numerical technique for computing the secondderivatives. Raftery (1995) shows how to approximate the Hessian using likelihood-ratio tests that are available in many statistical packages. Thiesson (1995) demonstrates that , for unrestricted multinomial distributions , the secondderivatives can be computed using Bayesian-network inference.
324
6
DAVID
. 3
.
THE
As
MAP
the
sample
sharper
,
limit
,
we
simply
A
further
increases
approximate
of
( J s
ML
size
tending
do
make
size
AND
of
to
not
a
need
APPROXIMATIONS
the
lis
increases
function
the
by
the
of
maximum
the
the
the
the
based
on
or
maximum
( ( J s
peak
will
)
become
-
( J s
expectations
.
.
In
Instead
this
,
we
.
observation
I Sh
ALGORITHM
configuration
configuration
the
p
EM
Gaussian
MAP
prior
THE
MAP
averages
on
is
effect
,
at
compute
based
approximation
,
AND
data
delta
to
predictions
HECKERMAN
that
diminishes
likelihood
,
.
( ML
as
the
Thus
)
sample
,
we
can
configuration
:
Os =arg max ,Sh )} Os{p(DIOs One class of techniques for finding a ML or MAP is gradient -based optimization . For example , we can use gradient ascent, where we follow the derivatives of g (88) or the likelihood p (DI88 , sh ) to a local maximum . Russell et ale ( 1995) and Thiesson (1995) show how to compute the deriva tives of the likelihood for a Bayesian network with unrestricted multino mial distributions . Buntine (1994) discusses the more general case where the likelihood function comes from the exponential family . Of course, these gradient -based methods find only local maxima . Another technique for finding a local ML or MAP is the expectation maximization (EM ) algorithm (Dempster et al ., 1977) . To find a local MAP or ML , we begin by assigning a configuration to 88 somehow (e.g., at ran dom ) . Next , we compute the expected sufficient statistics for a complete data set , where expectation is taken with respect to the joint distribution for X conditioned on the assigned configuration of 88 and the known data D . In our discrete example , we compute
N ,Os ,Sh ) Ep (xID ,Os,Sh )(Nijk) = L 1=1p(xf,pa1lYl
(32)
where Yl is the possibly incomplete lth case in D . When Xi and all the variables in Pai are observed in case Xl , the term for this case requires a trivial computation : it is either zero or one. Otherwise , we can use any Bayesian network inference algorithm to evaluate the term . This computa tion is called the expectation step of the EM algorithm . Next , we use the expected sufficient statistics as if they were actual sufficient statistics from a complete random sample Dc . If we are doing an ML calculation , then we determine the configuration of Os that maximize p (DcIOs, Sh) . In our discrete example , we have
()'l..Jk -Ek x(x,9 ID ,S s),S (N )) riEp =l(Ep ,9 ID sh )ijk (Nijk h
A TUTORIALON LEARNINGWITH BAYESIANNETWORKS 325 If we are doing a MAP calculation, then we determinethe configurationof Osthat maximizesp(OsIDc,Sh). In our discreteexample, we havel0
(JJok i=Lk rG Ep x(x,9 ID .,,S (Nijk h ,=:lijk (G :'i+jk +(Ep O ID "),S )(N h)ijk )) This assignment is called the maximization step of the EM algorithm . Dempster et ale (1977) showed that , under certain regularity conditions , iteration of the expectation and maximization steps will converge to a local maximum . The EM algorithm is typically applied when sufficient statistics exist (i .e., when local distribution functions are in the exponential fam ily ) , although generalizations of the EM algroithm have been used for more complicated local distributions (see, e.g., Saul et ale 1996) . 7 . Learning
Parameters
and Structure
Now we consider the problem of learning about both the structure and probabilities of a Bayesian network given data . Assuming we think structure can be improved , we must be uncertain about the network structure that encodes the physical joint probability distribution for X . Following the Bayesian approach , we encode this uncertainty by defining a (discrete ) variable whose states correspond to the possible network -structure hypotheses Sh, and assessing the probabilities p (Sh) . Then , given a random sample D from the physical probability distri bu tion for X , we compute the posterior distribution p (sh ID ) and the posterior distributions p (9 siD , Sh) , and use these distributions in turn to compute expectations of interest . For example , to predict the next case after seeing D , we compute
p(XN+IID ) =
Ep ) jP (XN+118s ,Sh) p(8sID,Sh) d8s Sh (ShID
In performing the sum , we assume that the network -structure are mutually exclusive . We return to this point in Section 9 .
(33)
hypotheses
lOThe MAP configuration lis depends on the coordinate system in which the parameter variables are expressed. The expression for the MAP configuration given here is obtained by the following procedure. First , we transform each variable set (Jij = ((Jij2, . . . , fJijri ) to the new coordinate system cPij = (cPij2, . . . , cPijri) ' where cPijk = log((Jijk/ fJijl ) , k = 2, . . . , ri . This coordinate system, which we denote by cPs , is sometimes referred to as the canonical coordinate system for the multinomial distribution (see, e.g., Bernardo and Smith , 1994, pp . 199- 202). Next , we determine the configuration of cPsthat maximizes p( cpsiD c, Sh) . Finally , we transform this MAP configuration to the original coordinate system. Using the MAP configuration corresponding to the coordinate system cPshas several advantages, which are discussedin Thiesson (1995b) and MacKay (1996) .
326
DAVIDHECKERMAN
The computation of p (OsID , Sh) is as we have described in the previous two sections . The computation of p (ShID ) is also straightforward , at least in principle . From Bayes' theorem , we have
p(ShlD ) = p(Sh)p(DISh )jp(D) where
p ( D ) is
ture we
. Thus need
to
possible
normalization
compute
nomial
the
the
an
plete
data
rate
we
marginal
parameter
( Jij
- sided
of
the
i- j
p ( DISh
was
first ,
impractical
data
pair
models
over
with
more
than
not
the
almost
problem
all
of :
proach
is
possi
is
ble
lated
?
whether The several good
models
, and
to
approaches
so , how or
not
question researchers hypothesis
each
do
of
models
of
use
often
it
as
a
sepa
-
, the
the
marginal
if
it
were
these
results
" good
accuracy
is
of
good
applied
good
for to .
to
models
in this
former ) from
ap among
. The
from
exhaustive
latter
among
all
. These
particular
And
.
address
The
Bayesian ?
can
decades
model
. In
is
user
intractable
models are
questions
when for
is
the
problem
correct
by
- network
hypotheses
hypothesis
the
is
produced
where
averaging
models
important
is
approaches
model
described
Bayesian
approach
two
) .
have
structure
this
(35)
)
( 1992 we
situations
by
number
search
yields
-
, each
have
of
consider
( i . e . , structure
that
shown
we
, the
selective
is
have
product
possible
, use
model
several
a model
data
, we
-
com
j . Consequently
that
confronted
and
we
Sec -
, and
missing
Herskovits
, in
hypotheses
been
accurate
no effect
bottleneck
of
these
pretend
raise
and
number
a manageable
and
in
multi
priors
. II r ( Qijk + Nijk k= 1 r ( G' ijk )
)
33 . If
of
"
detail
13 ) :
computation
the
" good
select
yield If
,
) ) for
unrestricted
and
the
approach
n . Consequently
selection
models
just
Cooper
in
types
a
.i
Equation
Equation
have
select
approaches
tures
all
other
model to
possible
approach
,
, who
context
-
ri
by
in
are . In
every
the by
important
variables
Statisticians
is
Bayesian
models
n
for
in
with
there
qi
full
exponential
exclude
struc
structures
( p ( DISh
, Dirichlet
when
( given
derived
the
. One
average
upon
network
likelihoods
) == II II r ( Qij ) i = 1 j = 1 r ( G' ij + Nij
Unfortunately
the
depend
data
example
independently
problem
of
each
,
updated
n
often
the
independence
thumbtack
for
formula
of
our
discussed
is
likelihood
This
not for
marginal
, consider
have
vector
likelihoods
does
distribution
likelihood
computation
,
. As
multi
that
posterior
marginal
introduction
distributions
parameter
the
.
discuss
9 . As
constant
determine
structure
We tion
a
, to
( 34 )
, do
- network how
re -
do
these struc
we
-
decide
" ? difficult
to
answer
experimentally accurate
predictions
that
in the
theory
. Nonetheless
selection
( Cooper
and
of
a single
Herskovits
,
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
327
1992; Aliferis and Cooper 1994; Heckermanet al., 1995b) and that model averaging using Monte -Carlo methods can sometimes be efficient and yield
even better predictions (Madigan et al., 1996). These results are somewhat surprising , and are largely responsible for the great deal of recent interest in learning with Bayesian networks . In Sections 8 through 10, we consider different definitions of what is means for a model to be "good" , and discuss the computations entailed by some of these definitions . In Section 11, we discuss
model
search .
We note that model averaging and model selection lead to models that generalize well to new data . That is , these techniques help- us to avoid the ~ overfitting of data . As is suggested by Equation 33 , Bayesian methods for model averaging and model selection are efficient in the sense that all cases . III D can be used to both smooth and train the model . As we shall see
. III
the
following
approach
in
8 .
Criteria
two general
for
sections
,
this advantage
holds true for the Bayesian
.
Model
Selection
Most of the literature on learning with Bayesian networks is concerned with model selection . In these approaches, some criterion is used to measure the degree to which a network structure (equivalence class) fits the prior knowl edge and data . A search algorithm is then used to find an equivalence class that receives a high score by this criterion . Selective model averaging is more complex , because it is often advantageous to identify network struc tures that are significantly different . In many cases, a single criterion is unlikely to identify such complementary network structures . In this section , we discuss criteria for the simpler problem of model selection . For a discussion of selective model averaging , see Madigan and Raftery (1994) .
8.1. RELATIVE POSTERIOR PROBABILITY A criterion that is often used for model selection is the log of the relative posterior probability log p (D , Sh) = log p (Sh) + log p (DISh ) .11 The logarithm is used for numerical convenience. This criterion has two components : the log prior and the log marginal likelihood . In Section 9, we examine the computation of the log marginal likelihood . In Section 10.2, we discuss the assessment of network -structure priors . Note that our comments about these terms are also relevant to the full Bayesian approach . l1An
equivalent
criterion
that
is
often used is log(p(ShID)/p(S~ID))
=
log(p(Sh)/ p(S~)) + log(p(DISh)/ p(DIS~)). The ratio p(DISh)/p(DIS~) is knownas a Bayes ' factor .
328
DAVID
The
log
marginal
described
by
HECKERMAN
likelihood
Dawid
ha . s the
( 1984
) . From
following
the
chain
log
p ( xllxl
interesting rule
of
interpretation
probability
, we
have
N
log
p ( DISh
)
=
L
, . . . , Xl
- I , Sh
)
( 36
)
1= 1
The
term
after
p ( xllx1
, . . . , Xl
averaging
of
as
log
over
the
utility
p ( x ) . 12
highest
or
Thus
,
model
that
utility
function Dawid
the
,
omitted
Finally
the
of
Xl
this
made
log
the
priors
predictor
can
of
function ( or
structure
)
data
D
this
criterion
Sh
thought
likelihood on
the
model be
utility
marginal
equal
by
term
under
highest
assuming
-
, we
,
of
v,
=
a { Xl
reward
this
each is
validation on
this
, known
all
' . . . ' X ' - l , Xl
but
is
under
for
p ( x ) , we
under
every .
case
If
the
obtain
of } .
the
also
a
the
log
in
the
cross
- one
cases
Then
,
- out
in
we
utility
the
predict function
random is
cross
and leave
the
some
prediction the
as
one
+ l , . . . , XN
prediction
prediction
log
between
cross
model
procedure
for
function
relationship
form train
and
repeat
rewards
utility
first
say ,
the
one
we
case
the
notes
using
sample
sum
for
log
prediction
sequential
) also
validation
random
The
this the
,
prediction
.
with
best
. When
cross
the
. ( 1984
validation
model
the
is
for
probability is
)
parameters
reward
a
posterior
- I , Sh
its
sample
. , and
probabilistic
- validation
and
criterion
N
CV
( Sh
, D
)
== E
logp
( x / IVz
, Sh
)
( 37
)
1= 1
which
is
similar
training p ( xIIVI
, Sh
Whereas
)
data
likelihood ,
8 .2 .
we
the
and
this
.
utility to
,
we
. use
to
Various we the
x2 , Sh
lead
with
For for
) , we
the
from
problem
this
example
selection
for of
a
Xl
that
altogether
. Namely test
this
that
cases
testing
and
model
the
log
, when
that
compute
for
attenuating 36
and
is we
training
Equation
training
when
and
Xl
for
criterion
,
training use
approaches see
interchange
problem
findings
12This
) . but
problem
.
x2
for
over
fits
problem - marginal using
this
.
CRITERIA
Consider
people
One
p ( x2IV2
avoids
never
LOCAL
of
1984 ,
criterion
criterion
37
can
described
.
interchanged
compute
interchanges ,
36
are
Equation we
( Dawid
been
Equation cases
in
. Such
have
set
test
, when
testing the
to
and
function
assess rule
of
in
diagnosing
Suppose
their particular
that
is true
known
an the
as
probabilities , see
Bernardo
set
a
ailment of
proper .
For ( 1979
given
ailments
scoring a ) .
the
under
rule
characterization
observation
of
consideration
, because of
its proper
use
a are
encourages scoring
rules
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 329
.
Figure 5.
.
.
A Bayesian -network structure
for medical diagnosis .
mutually exclusive and collectively exhaustive , so that we may represent these ailments using a single variable A . A possible Bayesian network for this classification problem is shown in Figure 5. The posterior -probability criterion is global in the sense that it is equally sensitive to all possible dependencies. In the diagnosis problem , the posterior probability criterion is just as sensitive to dependencies among the finding variables as it is to dependencies between ailment and findings . Assum ing that we observe all (or perhaps all but a few ) of the findings in D , a more reasonable criterion would be local in the sense that it ignores dependencies among findings and is sensitive only to the dependencies among the ailment and findings . This observation applies to all classification and regression problems with complete data . One such local criterion , suggested by Spiegelhalter et ale (1993) , is a variation on the sequential log-marginal -likelihood criterion :
N LC(Sh,D) ==}: logp " D" Sh) 1=1 (aIIF
(38)
where al and Fl denote the observation of the ailment A and findings F in the lth case, respectively . In other words , to compute the lth term in the product , we train our model S with the first 1 - 1 cages, and then determine how well it predicts the ailment given the findings in the lth case. We can view this criterion , like the log-marginal -likelihood , ag a form of cross validation where training and test cages are never interchanged . The log utility function has interesting theoretical properties , but it is sometimes inaccurate for real-world problems . In general , an appropriate reward or utility function will depend on the decision-making problem or problems to which the probabilistic models are applied . Howard and Math eson (1983) have collected a series of articles describing how to construct utility models for specific decision problems . Once we construct such util -
330
DAVIDHECKERMAN
ity models , we can use suitably modified forms of Equation 38 for model selection . 9 . Computation
of the Marginal
Likelihood
As mentioned , an often -used criterion for model selection is the log relative posterior probability log p (D , Sh) == log p (Sh) + log p (DISh ) . In this section , we discuss the computation of the second component of this criterion : the log marginal likelihood . Given (1) local distribution functions in the exponential family , (2) mutual independence of the parameters (Ji, (3) conjugate priors for these parameters , and (4) complete data , the log marginal likelihood can be computed efficiently and in closed form . Equation 35 is an example for unrestricted multinomial distributions . Buntine (1994) and Heckerman and Geiger (1996) discuss the computation for other local distribution func tions . Here , we concentrate on approximations for incomplete data . The Monte -Carlo and Gaussian approximations for learning about parameters that we discussed in Section 6 are also useful for computing the marginal likelihood given incomplete data . One Monte -Carlo approach , described by Chib (1995) and Raftery (1996) , uses Bayes' theorem : p ( DISh
For
any
uated
configuration directly
.
computed the
using
can
6 . 1 . Other
DiCiccio As
et
we
have
on
the
tationally based
as
Recall be
be
( 1995
Monte
that
approximated
- network
be
sophisticated
, Monte
into
Equation
obtain
the
- Carlo for
large
the
can
- Carlo
be
numerator
, the
sampling
posterior , as
we
methods
large
amounts
of
a multivariate
) =
eval
-
can
be
term
in
described are
closed
form
40 , integrating
large data
( DIOs . In
, and
efficient
data
sets
, p ( DllJs
in
described
particular
contrast ,
compu
-
, methods
and
can
, Sh ) . p ( lJsISh
) can
. Consequently
) dOs
be
as
logarithm
often ,
( 40 )
, substituting the
but
.
distribution
, Sh ) p ( OsISh
taking
accurate . In
more
- Gaussian
Jp
are
databases are
on
approximation
numerator in
methods
approximation methods
in
the
. Finally Gibbs
Monte
- Carlo
as
evaluated
inference using
, especially
, for
in term
( 39 )
) .
p ( DISh can
term
likelihood
computed
discussed
Gaussian
prior
, the
, more
ale
inefficient
accurate
Os , the
addition
Bayesian
denominator
Section by
of
In
) == p ( OsISh ) p ( DIOs , Sh ) p ( OsID , Sh )
Equation of
the
result
31 , we
:
h - h - h d 1 logp(DIS) ~ logp(DIOs ,S ) +logp(OsIS) + 2 log(27r ) - 2 logIAI (41)
A
where
TUTORIAL
d
is
ON
the
dimension
multinomial
1
)
Sometimes
Geiger
et
This
,
ale
(
errors
1996
we
)
.
,
1995
)
,
tering
who
(
We
1989
the
use
see
Thus
lin
for
,
large
we
~
N
,
can
derived
The
by
BIC
on
,
predicts
the
it
1987
_so
in
more
Kass
et
-
ale
,
,
1996
)
.
we
shown
see
,
.
)
,
Becker
efficient
for
Stutz
data
by
with
,
.
and
approximation
N
which
by
that
. g
clus
-
)
increase
IAI
e
Another
program
1996
the
incorrectly
Cheeseman
AutoClass
-
only
have
(
Carlo
large
using
doing
by
log
Monte
for
researchers
described
and
to
IAI
,
(
(
dj2
.
logp
.
(
Thus
,
provides
1978
:
log
p
ML
retaining
(
increases
the
DI98
,
as
d
Sh
)
log
configuration
13
of
,
N
.
88
.
)
interesting
in
,
Second
,
.
Third
Sh
connection
Sh
,
,
)
the
and
log
several
N
(
(
MDL
use
.
the
BIC
)
,
42
)
and
the
that
quite
described
9
,
does
intuitive
the
is
validation
it
.
model
punishes
criterion
cross
,
parameterized
approximation
Section
First
approximation
is
term
)
in
respects
can
well
BIC
(
between
~
approximation
a
the
discussion
-
criterion
we
how
)
)
.
measuring
DI8s
the
a
,
information
)
Length
recalling
DI8s
Consequently
term
logN
(
Bayesian
(
Description
.
-
cases
For
intensive
accurate
logp
the
prior
a
data
model
)
called
~
is
a
Minimum
likelihood
)
prior
contains
the
the
DISh
Schwarz
the
assessing
Namely
example
Kass
relative
of
.
circumstances
approximated
approximation
depend
without
(
(
is
first
not
for
in
that
,
number
approximate
the
less
N
be
.
' s
.
,
relative
,
41
-
obtain
approximation
Wag
but
with
s
logp
This
(
-
Heckerman
in
arly
8
to
Heckerman
Equation
the
nevertheless
some
is
efficient
is
efficient
is
and
and
in
increases
,
in
Chickering
ri
Laplace
accurate
Although
approximation
very
N
parameters
approximation
terms
which
Also
.
accurate
Chickering
a
those
the
be
s
the
also
I
is
A
as
conditions
extremely
see
(
lower
approximation
where
be
qi
.
I A
Hessian
and
'
obtain
only
,
Laplace
)
simplification
of
,
is
among
of
)
l
is
known
Laplace
,
1995
of
can
Cun
ljN
=
.
regularity
approximation
independencies
Le
of
(
One
approximation
and
(
Ei
dimension
point
the
by
this
is
approximation
computation
elements
variant
as
can
this
s
this
certain
O
Raftery
'
the
models
impose
41
unrestricted
given
,
integration
under
331
with
typically
of
are
Laplace
,
the
is
Equation
of
network
variables
for
,
NETWORKS
Bayesian
discussion
that
and
Although
dimension
a
approximation
Kass
approaches
For
hidden
a
to
Laplace
and
.
BAYESIAN
dimension
approximation
the
diagonal
this
for
shown
discussions
1988
)
are
)
refer
have
this
,
detailed
(
(
88
technique
)
Thus
(
there
ale
and
1988
in
.
when
approximation
method
et
(
9
,
,
WITH
of
distributions
.
See
D
LEARNING
we
see
complexity
exactly
minus
by
that
the
and
Rissanen
marginal
MDL
.
130ne of the technical assumptions used to derive this approximation is that the prior ,. is non-zero around (Js.
332
DAVIDHECKERMAN
10. Priors To
compute
must
the
assess
( unless
the
we
are
parameter tions
prior
priors
large
,
we
network
structures
Several
authors
have
ods
deriving
priors
for
SpiegeIhaIter 1996
et
) . In
10 . 1 .
First
this
PRIORS
, let
us
structures
.
address
the
holds
and
tures
for
the
example
as
of
of
, under
direct
, 1992
for
assessments
; Buntine
; Heckerman
these
-
cer -
priors
corresponding
, 1991
al . , 1995b
func
structures
parameter
and
Herskovits et
scoring
. Nonetheless
number
)
) . The
network
and
, we
p ( fJsISh
BICjMDL
alternative
intractable
some
approaches
.
meth
-
, 1991
;
and
Geiger
of
network
,
.
PARAMETERS
of
the
approach
the
local
and
approach
X
Y X
priors
for
of
Heckerman
the
distribution
the
X
parameters et
ale
functions
assumption
Z
structure
is one
conditional
if and
ordered to
of
( 1995b
are
)
who
unrestricted
parameter
. In
independence
between distribution
consideration
have per
that and
- 4
Y
Pearl
is
an
) . For - 4
, these
arc
network assertion are
each
possible
p ( x ) are
are
independence arc
) . A
from
of
n ! pos -
for
ignoring
, 1990
X
Z ,
assertion
no
structures
of
inde
-
direc
-
v - structure to
Y
and
is from
Z .
that
is all
closely
Bayesian
functions F
X
for
structure
and
equivalence
distribution
se , because
1990
structure
same
there
,
, there
structures
network
( Verma
. Suppose local
network
the
Pearl
encodes
variables
-
set
, a complete
is , it n
network
, two
X
that
same
Y . Consequently
example
-
struc
independence
given
contains
have
the
and
the
equiva
- network
structures
another
: one
, Y , Z ) such
of
equivalence
,
they
arc
concept
( Verma
only
-
independence
Bayesian represent
network
complete
general if
X
edge X
v - structures (X
no
missing
. All
dependence
restriction
no
:
two
they
represent
. As
structures
tuple
if for
. When
only
concepts that
independent
variables
same
Y , but
The
Z
network
and
the
f -
equivalent
has
equivalent
equivalent tions
Y
conditionally
that
the
say
equivalent
independence
of
pendence
f -
are
complete
key
. We
{ X , Y , Z } , the
X
are
two
assertions =
Z , and
and
on
independence
given
- t
structures
ordering
based equivalence
are
network
sible
is
- independence ,
f -
that
a
and
structure priors
many
assumptions
assessment
where
distribution
conditional
Z
such
examine
consider
the
structure
; Heckerman
distributions
Their
an
be the
network
.
lence
X
for
, when
a manageable
NETWORK
case
multinomial
will
( Cooper
, we
a
parameter such
required
derive from
consider We
also
discussed
al . , 1993
ON
the
8 . Unfortunately
can
section
of
and
approximations
assessments
assumptions
many
- sample
Section
, these
probability p ( Sh )
p ( 6 s I s h ) are in
possible
tain
posterior
using
discussed
are
relative structure
can
be
a
in large
related
to
networks the family
family .
F We
that
for
X
. This say
that
of
in -
under is
not two
A TUTORIAL
ON
LEARNING
WITH
BAYESIAN
NETWORKS
Bayesian-network structures 51 and 52 for X are distribution
333
equivalent
with respect to (wrt) F if they represent the same joint probability distributions for X - that is, if, for every θs1, there exists a θs2 such that p(x|θs1, S1^h) = p(x|θs2, S2^h), and vice versa. Distribution equivalence wrt some F implies independence equivalence, but the converse does not hold. For example, when F is the family of generalized linear-regression models, the complete network structures for n ≥ 3 variables do not represent the same sets of distributions. Nonetheless, there are families F - for example, unrestricted multinomial distributions and linear-regression models with Gaussian noise - where independence equivalence implies distribution equivalence wrt F (Heckerman and Geiger, 1996).

The notion of distribution equivalence is important, because if two network structures S1 and S2 are distribution equivalent wrt a given F, then the hypotheses associated with these two structures are identical - that is, S1^h = S2^h. Thus, for example, if S1 and S2 are distribution equivalent, then their probabilities must be equal in any state of information. Heckerman et al. (1995b) call this property hypothesis equivalence.

In light of this property, we should associate each hypothesis with an equivalence class of structures rather than a single network structure, and our methods for learning network structure should actually be interpreted as methods for learning equivalence classes of network structures (although, for the sake of brevity, we often blur this distinction). Thus, for example, the sum over network-structure hypotheses in Equation 33 should be replaced with a sum over equivalence-class hypotheses. An efficient algorithm for identifying the equivalence class of a given network structure can be found in Chickering (1995).

We note that hypothesis equivalence holds provided we interpret Bayesian-network structure simply as a representation of conditional independence. Nonetheless, stronger definitions of Bayesian networks exist, where arcs have a causal interpretation (see Section 15). Heckerman et al. (1995b) and Heckerman (1995) argue that, although it is unreasonable to assume hypothesis equivalence when working with causal Bayesian networks, it is often reasonable to adopt a weaker assumption of likelihood equivalence, which says that the observations in a database can not help to discriminate two equivalent network structures.
let
us return
to the
main
issue
of this
section
: the
derivation
of
priors from a manageable number of assessments. Geiger and Heckerman
(1995) show that the assumptionsof parameter independenceand likelihood equivalence imply that the parameters for any complete network structure Sc must have a Dirichlet distribution with constraints on the hyperparam eters given by
α_ijk = α p(x_i^k, pa_i^j | S_c^h)        (43)
where α is the user's equivalent sample size,14 and p(x_i^k, pa_i^j | S_c^h) is computed from the user's joint probability distribution p(x|S_c^h). This result is rather remarkable, as the two assumptions leading to the constrained Dirichlet solution are qualitative.

To determine the priors for parameters of incomplete network structures, Heckerman et al. (1995b) use the assumption of parameter modularity, which says that if Xi has the same parents in network structures S1 and S2, then
p(θ_ij | S_1^h) = p(θ_ij | S_2^h) for j = 1, ..., q_i. They call this property parameter modularity, because it says that the distributions for parameters θ_ij depend only on the structure of the network that is local to variable Xi - namely, Xi and its parents.

Given the assumptions of parameter modularity and parameter independence,15 it is a simple matter to construct priors for the parameters of an arbitrary network structure given the priors on complete network structures. In particular, given parameter independence, we construct the priors for the parameters of each node separately. Furthermore, if node Xi has parents Pai in the given network structure, we identify a complete network structure where Xi has these parents, and use Equation 43 and parameter modularity to determine the priors for this node. The result is that all terms α_ijk for all network structures are determined by Equation 43. Thus, from the assessments α and p(x|S_c^h), we can derive the parameter priors for all possible network structures. Combining Equation 43 with Equation 35, we obtain a model-selection criterion that assigns equal marginal likelihoods to independence-equivalent network structures.

We can assess p(x|S_c^h) by constructing a Bayesian network, called a prior network, that encodes this joint distribution. Heckerman et al. (1995b) discuss the construction of this network.
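As an illustration of Equation 43, the following sketch derives the Dirichlet hyperparameters for a single node from a user-supplied joint distribution p(x|S_c^h) and an equivalent sample size α. The table representation of the joint distribution, the function name, and the example values are illustrative assumptions, not part of the tutorial.

```python
from itertools import product

# Hypothetical example: three binary variables X0, X1, X2, with X2's parents {X0, X1}.
# prior_joint maps each full configuration (x0, x1, x2) to p(x | S_c^h), as might be
# encoded by a prior network.  The uniform values here are made up for illustration.
prior_joint = {cfg: 1.0 / 8 for cfg in product([0, 1], repeat=3)}

def dirichlet_hyperparams(prior_joint, alpha, child, parents, states):
    """Return alpha_ijk = alpha * p(x_i^k, pa_i^j | S_c^h) for one node (Equation 43)."""
    hyper = {}
    for pa_cfg in product(*[states[p] for p in parents]):   # parent configurations j
        for x_k in states[child]:                            # child states k
            mass = sum(p for cfg, p in prior_joint.items()
                       if cfg[child] == x_k
                       and all(cfg[p] == v for p, v in zip(parents, pa_cfg)))
            hyper[(pa_cfg, x_k)] = alpha * mass
    return hyper

states = {0: [0, 1], 1: [0, 1], 2: [0, 1]}
print(dirichlet_hyperparams(prior_joint, alpha=10, child=2, parents=(0, 1), states=states))
```

By parameter modularity, the same hyperparameters would be reused for node X2 in any structure in which it has parents {X0, X1}.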
10.2. PRIORS ON STRUCTURES

Now, let us consider the assessment of priors on network-structure hypotheses. Note that the alternative criteria described in Section 8 can incorporate prior biases on network-structure hypotheses. Methods similar to those discussed in this section can be used to assess such biases.

The simplest approach for assigning priors to network-structure hypotheses is to assume that every hypothesis is equally likely. Of course,

14 Recall the method of equivalent samples for assessing beta and Dirichlet distributions discussed in Section 2.
15 This construction procedure also assumes that every structure has a non-zero prior probability.
this assumption is typically inaccurate and used only for the sake of convenience. A simple refinement of this approach is to ask the user to exclude various hypotheses (perhaps based on judgments of cause and effect), and then impose a uniform prior on the remaining hypotheses. We illustrate this approach in Section 12.

Buntine (1991) describes a set of assumptions that leads to a richer yet efficient approach for assigning priors. The first assumption is that the variables can be ordered (e.g., through a knowledge of time precedence). The second assumption is that the presence or absence of possible arcs are mutually independent. Given these assumptions, n(n-1)/2 probability assessments (one for each possible arc in an ordering) determine the prior probability of every possible network-structure hypothesis. One extension to this approach is to allow for multiple possible orderings. One simplification is to assume that the probability that an arc is absent or present is independent of the specific arc in question. In this case, only one probability assessment is required.

An alternative approach, described by Heckerman et al. (1995b), uses a prior network. The basic idea is to penalize the prior probability of any structure according to some measure of deviation between that structure and the prior network. Heckerman et al. (1995b) suggest one reasonable measure of deviation.

Madigan et al. (1995) give yet another approach that makes use of imaginary data from a domain expert. In their approach, a computer program helps the user create a hypothetical set of complete data. Then, using techniques such as those in Section 7, they compute the posterior probabilities of network-structure hypotheses given this data, assuming the prior probabilities of hypotheses are uniform. Finally, they use these posterior probabilities as priors for the analysis of the real data.
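The following is a minimal sketch of the arc-independence structure prior described above (Buntine, 1991), under the simplification that a single probability applies to every possible arc in a fixed ordering. The function name, the ordering, and the example arcs are hypothetical.

```python
import math

def log_structure_prior(arcs, ordering, p_arc):
    """Log prior of a network structure under the arc-independence assumption:
    given a variable ordering, each of the n(n-1)/2 possible arcs is present
    independently, here with a single shared probability p_arc."""
    log_p = 0.0
    for i, child in enumerate(ordering):
        for parent in ordering[:i]:            # only arcs consistent with the ordering
            present = (parent, child) in arcs
            log_p += math.log(p_arc if present else 1.0 - p_arc)
    return log_p

# Hypothetical three-variable example with ordering (Age, Sex, Fraud).
ordering = ("Age", "Sex", "Fraud")
arcs = {("Age", "Fraud")}
print(log_structure_prior(arcs, ordering, p_arc=0.2))
```

Allowing a distinct probability per arc, or averaging over several orderings, are the straightforward extensions mentioned in the text.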
11. Search Methods

In this section, we examine search methods for identifying network structures with high scores by some criterion. Consider the problem of finding the best network from the set of all networks in which each node has no more than k parents. Unfortunately, the problem for k > 1 is NP-hard even when we use the restrictive prior given by Equation 43 (Chickering et al., 1995). Thus, researchers have used heuristic search algorithms, including greedy search, greedy search with restarts, best-first search, and Monte-Carlo methods.

One consolation is that these search methods can be made more efficient when the model-selection criterion is separable. Given a network structure for domain X, we say that a criterion for that structure is separable if it
can be written as a product of variable-specific criteria:

C(S^h, D) = ∏_{i=1}^{n} c(X_i, Pa_i, D_i)        (44)
where D_i is the data restricted to the variables X_i and Pa_i. An example of a separable criterion is the BD criterion (Equations 34 and 35) used in conjunction with any of the methods for assessing structure priors described in Section 10.

Most of the commonly used search methods for Bayesian networks make successive arc changes to the network, and employ the property of separability to evaluate the merit of each change. The possible changes that can be made are easy to identify. For any pair of variables, if there is an arc connecting them, then this arc can either be reversed or removed. If there is no arc connecting them, then an arc can be added in either direction. All changes are subject to the constraint that the resulting network contains no directed cycles. We use E to denote the set of eligible changes to a graph, and Δ(e) to denote the change in log score of the network resulting from the modification e ∈ E. Given a separable criterion, if an arc to X_i is added or deleted, only c(X_i, Pa_i, D_i) need be evaluated to determine Δ(e). If an arc between X_i and X_j is reversed, then only c(X_i, Pa_i, D_i) and c(X_j, Pa_j, D_j) need be evaluated.

One simple heuristic search algorithm is greedy search. First, we choose a network structure. Then, we evaluate Δ(e) for all e ∈ E, and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive value for Δ(e). When the criterion is separable, we can avoid recomputing all terms Δ(e) after every change. In particular, if neither X_i, X_j, nor their parents are changed, then Δ(e) remains unchanged for all changes e involving these nodes as long as the resulting network is acyclic. Candidates for the initial graph include the empty graph, a random graph, a graph determined by one of the polynomial algorithms described previously in this section, and the prior network.

A potential problem with any local-search method is getting stuck at a local maximum. One method for escaping local maxima is greedy search with random restarts. In this approach, we apply greedy search until we hit a local maximum. Then, we randomly perturb the network structure, and repeat the process for some manageable number of iterations.

Another method for escaping local maxima is simulated annealing. In this approach, we initialize the system at some temperature T0. Then, we pick some eligible change e at random, and evaluate the expression p = exp(Δ(e)/T0). If p > 1, then we make the change e; otherwise, we make the change with probability p. We repeat this selection and evaluation process α times or until we make β changes. If we make no changes in α repetitions,
then we stop searching. Otherwise, we lower the temperature by multiplying the current temperature T0 by a decay factor 0 < γ < 1, and continue the search process. We stop searching if we have lowered the temperature more than δ times. Thus, this algorithm is controlled by five parameters: T0, α, β, γ, and δ. To initialize this algorithm, we can start with the empty graph, and make T0 large enough so that almost every eligible change is made, thus creating a random graph. Alternatively, we may start with a lower temperature, and use one of the initialization methods described for local search.

Another method for escaping local maxima is best-first search (e.g., Korf, 1993). In this approach, the space of all network structures is searched systematically using a heuristic measure that determines the next best structure to examine. Chickering (1996) has shown that, for a fixed amount of computation time, greedy search with random restarts produces better models than does either simulated annealing or best-first search.

One important consideration for any search algorithm is the search space. The methods that we have described search through the space of Bayesian-network structures. Nonetheless, when the assumption of hypothesis equivalence holds, one can search through the space of network-structure equivalence classes. One benefit of the latter approach is that the search space is smaller. One drawback of the latter approach is that it takes longer to move from one element in the search space to another. Work by Spirtes and Meek (1995) and Chickering (1996) confirms these observations experimentally. Unfortunately, no comparisons are yet available that determine whether the benefits of equivalence-class search outweigh the costs.
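To make the use of separability concrete, here is a minimal sketch of greedy search over single-arc additions and deletions (arc reversals are omitted for brevity). It assumes the separable score is supplied as a function local_score(child, parent_set); the acyclicity check and the toy score at the end are illustrative assumptions, not part of the tutorial.

```python
def is_acyclic(parents):
    """Check that the parent-set map encodes a DAG by repeatedly peeling parentless nodes."""
    remaining = {x: set(ps) for x, ps in parents.items()}
    while remaining:
        leaves = [x for x, ps in remaining.items() if not ps]
        if not leaves:
            return False
        for leaf in leaves:
            del remaining[leaf]
        for ps in remaining.values():
            ps.difference_update(leaves)
    return True

def greedy_search(variables, local_score):
    """Greedy arc search: repeatedly apply the single-arc change with the largest
    positive score improvement Delta(e).  Separability means only the modified
    node's local score needs to be re-evaluated for each candidate change."""
    parents = {x: set() for x in variables}                     # start from the empty graph
    current = {x: local_score(x, frozenset()) for x in variables}
    while True:
        best_delta, best_move = 0.0, None
        for y in variables:
            for x in variables:
                if x == y:
                    continue
                new_pa = set(parents[y])
                new_pa.symmetric_difference_update({x})          # toggle the arc x -> y
                if not is_acyclic({**parents, y: new_pa}):
                    continue
                delta = local_score(y, frozenset(new_pa)) - current[y]
                if delta > best_delta:
                    best_delta, best_move = delta, (y, new_pa)
        if best_move is None:
            return parents
        y, new_pa = best_move
        parents[y] = new_pa
        current[y] = local_score(y, frozenset(new_pa))

# Illustrative (made-up) separable score that mildly rewards the single arc A -> B.
def toy_score(child, pa):
    return 1.0 if (child == "B" and pa == frozenset({"A"})) else 0.0

print(greedy_search(["A", "B", "C"], toy_score))
```

Greedy search with random restarts, as described above, would simply perturb the returned structure at random and call greedy_search again from the perturbed starting point.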
12. A Simple Example

Before we move on to other issues, let us step back and look at our overall approach. In a nutshell, we can construct both structure and parameter priors by constructing a Bayesian network (the prior network) along with additional assessments such as an equivalent sample size and causal constraints. We then use either Bayesian model selection, selective model averaging, or full model averaging to obtain one or more networks for prediction and/or explanation. In effect, we have a procedure for using data to improve the structure and probabilities of an initial Bayesian network.

Here, we present two artificial examples to illustrate this process. Consider again the problem of fraud detection from Section 3. Suppose we are given the database D in Table 12, and we want to predict the next case - that is, compute p(x_{N+1}|D). Let us assert that only two network-structure hypotheses have appreciable probability: the hypothesis corresponding to the network structure in Figure 3 (S1), and the hypothesis corresponding
TABLE 12. A database for the fraud problem.

Case   Fraud   Gas   Jewelry   Age     Sex
1      no      no    no        30-50   female
2      no      no    no        30-50   male
3      yes     yes   yes       >50     male
4      no      no    no        30-50   male
5      no      yes   no        <30     female
6      no      no    no        <30     female
7      no      no    no        >50     male
8      no      no    yes       30-50   female
9      no      yes   no        <30     male
10     no      no    no        <30     female
to the same structure with an arc added from Age to Gas (S2). Furthermore, let us assert that these two hypotheses are equally likely - that is, p(S1^h) = p(S2^h) = 0.5. In addition, let us use the parameter priors given by Equation 43, where α = 10 and p(x|S_c^h) is given by the prior network in Figure 3. Using Equations 34 and 35, we obtain p(S1^h|D) = 0.26 and p(S2^h|D) = 0.74. Because we have only two models to consider, we can model average according to Equation 33:

p(x_{N+1}|D) = 0.26 p(x_{N+1}|D, S1^h) + 0.74 p(x_{N+1}|D, S2^h)

where p(x_{N+1}|D, S^h) is given by Equation 27. (We don't display these probability distributions.) If we had to choose one model, we would choose S2, assuming the posterior-probability criterion is appropriate. Note that the data favor the presence of the arc from Age to Gas by a factor of three. This is not surprising, because in the two cases in the database where fraud is absent and gas was purchased recently, the card holder was less than 30 years old.
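As a minimal illustration of the model averaging just performed, the sketch below combines two structure-specific predictive distributions with the posterior weights 0.26 and 0.74. The per-model predictive values are placeholders, since the tutorial does not display them.

```python
def model_average(predictives, posteriors, x):
    """Bayesian model averaging over the structure hypotheses with
    appreciable posterior mass (Equation 33 restricted to two models)."""
    return sum(post * pred(x) for pred, post in zip(predictives, posteriors))

# Placeholder per-model predictive probabilities for one next case x_{N+1};
# these numbers are made up purely for illustration.
pred_s1 = lambda x: 0.10      # stands in for p(x_{N+1} | D, S1^h)
pred_s2 = lambda x: 0.04      # stands in for p(x_{N+1} | D, S2^h)
print(model_average([pred_s1, pred_s2], [0.26, 0.74], x=None))   # 0.0556
```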
An application of model selection, described by Spirtes and Meek (1995), is illustrated in Figure 6. Figure 6a is a hand-constructed Bayesian network for the domain of ICU ventilator management, called the Alarm network (Beinlich et al., 1989). Figure 6c is a random sample from the Alarm network of size 10,000. Figure 6b is a simple prior network for the domain. This network encodes mutual independence among the variables, and (not shown) uniform probability distributions for each variable.

Figure 6d shows the most likely network structure found by a two-pass greedy search in equivalence-class space. In the first pass, arcs were added until the model score did not improve. In the second pass, arcs were deleted
until the model score did not improve. Structure priors were uniform; and parameter priors were computed from the prior network using Equation 43 with α = 10. The network structure learned from this procedure differs from the true network structure only by a single arc deletion. In effect, we have used the data to improve dramatically the original model of the user.

13. Bayesian Networks for Supervised Learning
As we discussed in Section 5, the local distribution functions p(x_i | pa_i, θ_i, S^h) are essentially classification/regression models. Therefore, if we are doing supervised learning where the explanatory (input) variables cause the outcome (target) variable and data is complete, then the Bayesian-network and classification/regression approaches are identical.

When data is complete but input/target variables do not have a simple cause/effect relationship, tradeoffs emerge between the Bayesian-network approach and other methods. For example, consider the classification problem in Figure 5. Here, the Bayesian network encodes dependencies between findings and ailments as well as among the findings, whereas another classification model such as a decision tree encodes only the relationships between findings and ailment. Thus, the decision tree may produce more accurate classifications, because it can encode the necessary relationships with fewer parameters. Nonetheless, the use of local criteria for Bayesian-network model selection mitigates this advantage. Furthermore, the Bayesian network provides a more natural representation in which to encode prior knowledge, thus giving this model a possible advantage for sufficiently small sample sizes. Another argument, based on bias-variance analysis, suggests that neither approach will dramatically outperform the other (Friedman, 1996).

Singh and Provan (1995) compare the classification accuracy of Bayesian networks and decision trees using complete data sets from the University of California, Irvine Repository of Machine Learning databases. Specifically, they compare C4.5 with an algorithm that learns the structure and probabilities of a Bayesian network using a variation of the Bayesian methods we have described. The latter algorithm includes a model-selection phase that discards some input variables. They show that, overall, Bayesian networks and decision trees have about the same classification error. These results support the argument of Friedman (1996).

When the input variables cause the target variable and data is incomplete, the dependencies between input variables become important, as we discussed in the introduction. Bayesian networks provide a natural framework for learning about and encoding these dependencies. Unfortunately, no studies have been done comparing these approaches with other methods.
Figure 7. A Bayesian-network structure for AutoClass. The variable H is hidden. Its possible states correspond to the underlying classes in the data.
14. Bayesian Networks for Unsupervised Learning
The techniques described in this paper can be used for unsupervised learning. A simple example is the AutoClass program of Cheeseman and Stutz (1995), which performs data clustering. The idea behind AutoClass is that there is a single hidden (i.e., never observed) variable that causes the observations. This hidden variable is discrete, and its possible states correspond to the underlying classes in the data. Thus, AutoClass can be described by a Bayesian network such as the one in Figure 7. For reasons of computational efficiency, Cheeseman and Stutz (1995) assume that the discrete variables (e.g., D1, D2, D3 in the figure) and user-defined sets of continuous variables (e.g., {C1, C2, C3} and {C4, C5}) are mutually independent given H. Given a data set D, AutoClass searches over variants of this model (including the number of states of the hidden variable) and selects a variant whose (approximate) posterior probability is a local maximum.

AutoClass is an example where the user presupposes the existence of a hidden variable. In other situations, we may be unsure about the presence of a hidden variable. In such cases, we can score models with and without hidden variables to reduce our uncertainty. We illustrate this approach on a real-world case study in Section 16. Alternatively, we may have little idea about what hidden variables to model. The search algorithms of Spirtes et al. (1993) provide one method for identifying possible hidden variables in such situations.

Martin and VanLehn (1995) suggest another method. Their approach is based on the observation that if a set of variables are mutually dependent, then a simple explanation is that these variables have a single hidden common cause rendering them mutually independent. Thus, to identify possible hidden variables, we first apply some learning technique to select a model containing no hidden variables. Then, we look for sets of mutually dependent variables in this learned model. For each
such set of variables (and combinations thereof), we create a new model containing a hidden variable that renders those variables conditionally independent, and we then score the new and original models, possibly finding that one of the new models has a better score than the original. For example, the network structure in Figure 8a suggests the network structure containing a hidden variable shown in Figure 8b.

Figure 8. (a) A Bayesian-network structure learned for a set of observed variables. (b) A Bayesian-network structure containing a hidden variable (shaded) suggested by the structure in (a).

15. Learning Causal Relationships

As we have mentioned, the causal semantics of a Bayesian network provide a means by which we can learn causal relationships. In this section, we examine these semantics, and provide a basic discussion of how causal relationships can be learned from data. (The use of observational data to learn causal relationships is controversial; for critical discussions of this issue, see Spirtes et al. (1993), Pearl (1995), and Humphreys and Freedman (1995).)

For purposes of illustration, suppose we are marketing analysts who want to know whether or not we should increase, decrease, or leave alone the exposure of a particular advertisement in order to maximize our profit from the sales of a product. Let variables Ad (A) and Buy (B) represent whether or not an individual has seen the advertisement and has purchased the product, respectively. In one component of our analysis, we would like to learn the physical probability that B = true given that we force A to be true, and the physical probability that B = true given that we force A
to be false.16 We denote these probabilities p(b|â) and p(b|ā̂), respectively, where a circumflex indicates that the state of A is set by intervention. One method that we can use to learn these probabilities is to perform a randomized experiment: select two similar populations at random, force A to be true in one population and false in the other, and observe B. This method is conceptually simple, but it may be difficult or expensive to find two similar populations that are suitable for the study.

An alternative method follows from causal knowledge. In particular, suppose A causes B. Then, whether we force A to be true or simply observe that A is true in the current population, the advertisement should have the same causal influence on the individual's purchase. Consequently, p(b|â) = p(b|a), where p(b|a) is the physical probability that B = true given that we observe A = true in the current population. Similarly, p(b|ā̂) = p(b|ā). In contrast, if B causes A, forcing A to some state should not influence B at all. Therefore, we have p(b|â) = p(b|ā̂) = p(b). In general, knowledge that X causes Y allows us to equate p(y|x) with p(y|x̂), where x̂ denotes the intervention that forces X to be x. For purposes of discussion, we use this rule as an operational definition for cause. Pearl (1995) and Heckerman and Shachter (1995) discuss versions of this definition that are more complete and more precise.

In our example, knowledge that A causes B allows us to learn p(b|â) and p(b|ā̂) from observations alone - no randomized experiment is needed. But how are we to determine whether or not A causes B? The answer lies in an assumption about the connection between causal and probabilistic dependence known as the causal Markov condition, described by Spirtes et al. (1993). We say that a directed acyclic graph C is a causal graph for variables X if the nodes in C are in a one-to-one correspondence with X, and there is an arc from node X to node Y in C if and only if X is a direct cause of Y. The causal Markov condition says that if C is a causal graph for X, then C is also a Bayesian-network structure for the joint physical probability distribution of X. In Section 3, we described a method based on this condition for constructing Bayesian-network structure from causal assertions. Several researchers (e.g., Spirtes et al., 1993) have found that this condition holds in many applications.

Given the causal Markov condition, we can infer causal relationships from conditional-independence and conditional-dependence relationships that we learn from the data.17 Let us illustrate this process for the marketing example. Suppose we have learned (with high Bayesian probability)

16 It is important that these interventions do not interfere with the normal effect of A on B. See Heckerman and Shachter (1995) for a discussion of this point.
17 Spirtes et al. (1993) also require an assumption known as faithfulness. We do not need to make this assumption explicit, because it follows from our assumption that p(θs|S^h) is a probability density function.
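The following Monte Carlo sketch illustrates the distinction drawn above, under the made-up assumption that A causes B with the parameters shown: conditioning on observing A = true and forcing A = true then give the same probability for B, whereas if B caused A the interventional probability would instead equal p(b). All names and numbers are illustrative only.

```python
import random
random.seed(0)

# Illustrative (made-up) parameters for a two-variable domain: Ad (A) and Buy (B).
p_a = 0.3                                # p(A = true)
p_b_given = {True: 0.6, False: 0.2}      # p(B = true | A) when A causes B

def sample_a_causes_b(force_a=None):
    """Draw one case from the 'A causes B' model, optionally forcing A's value."""
    a = force_a if force_a is not None else (random.random() < p_a)
    b = random.random() < p_b_given[a]
    return a, b

def estimate(n, force_a=None):
    """Estimate p(B=true | A=true) by conditioning (force_a=None),
    or p(B=true | â) by forcing A=true in every sampled case."""
    samples = [sample_a_causes_b(force_a) for _ in range(n)]
    if force_a is None:
        obs = [b for a, b in samples if a]
        return sum(obs) / len(obs)
    return sum(b for _, b in samples) / n

n = 200_000
print(round(estimate(n), 3), round(estimate(n, force_a=True), 3))  # both close to 0.6
```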
Figure 9. (a) Causal graphs showing four explanations for an observed dependence between Ad and Buy. The node H corresponds to a hidden common cause of Ad and Buy. The shaded node S indicates that the case has been included in the database. (b) A Bayesian network for which A causes B is the only causal explanation, given the causal Markov condition.
that the physical probabilities p(b|a) and p(b|ā) are not equal. Given the causal Markov condition, there are four simple causal explanations for this dependence: (1) A is a cause for B, (2) B is a cause for A, (3) there is a hidden common cause of A and B (e.g., the person's income), and (4) A and B are causes for data selection. This last explanation is known as selection bias. Selection bias would occur, for example, if our database failed to include instances where A and B are false. These four causal explanations for the presence of the arcs are illustrated in Figure 9a. Of course, more complicated explanations - such as the presence of a hidden common cause and selection bias - are possible.

So far, the causal Markov condition has not told us whether or not A causes B. Suppose, however, that we observe two additional variables: Income (I) and Location (L), which represent the income and geographic location of the possible purchaser, respectively. Furthermore, suppose we learn (with high probability) the Bayesian network shown in Figure 9b. Given the causal Markov condition, the only causal explanation for the conditional-independence and conditional-dependence relationships encoded in this Bayesian network is that Ad is a cause for Buy. That is, none of the other explanations described in the previous paragraph, or combinations thereof, produce the probabilistic relationships encoded in Figure 9b. Based on this observation, Pearl and Verma (1991) and Spirtes et al. (1993) have created algorithms for inferring causal relationships from dependence relationships for more complicated situations.
TABLE 2. Sufficient statistics for the Sewell and Shah (1968) study.

  4  349  13   64   9  207  33   72
 12  126  38   54  10   67  49   43
  2  232  27   84   7  201  64   95
 12  115  93   92  17   79 119   59
  8  166  47   91   6  120  74  110
 17   92 148  100   6   42 198   73
  4   48  39   57   5   47 123   90
  9   41 224   65   8   17 414   54
 24    5 454    9  44    5 312   14
 47    8 216   20  35   13  96   28
 11  285  29   61  19  236  47   88
 12  164  62   85  15  113  72   50
  7  163  36   72  13  193  75   90
 12  174  91  100  20   81 142   77
  6   50  36   58   5   70 110   76
 12   48 230   81  13   49 360   98
Reproduced by permission from the University of Chicago Press. © 1968 by The University of Chicago. All rights reserved.

16. A Case Study: College Plans
Real-world applications of techniques that we have discussed can be found
in Madigan and Raftery (1994), Lauritzen et al. (1994), Singh and Provan (1995), and Friedman and Goldszmidt (1996). Here, we consider an application that comes from a study by Sewell and Shah (1968), who investigated factors that influence the intention of high school students to attend college. The data have been analyzed by several groups of statisticians, including Whittaker (1990) and Spirtes et al. (1993), all of whom have used non-Bayesian techniques.

Sewell and Shah (1968) measured the following variables for 10,318 Wisconsin high school seniors: Sex (SEX): male, female; Socioeconomic Status (SES): low, lower middle, upper middle, high; Intelligence Quotient (IQ): low, lower middle, upper middle, high; Parental Encouragement (PE): low, high; and College Plans (CP): yes, no. Our goal here is to understand the (possibly causal) relationships among these variables.
The data are described by the sufficient statistics in Table 2. Each entry denotes the number of cases in which the five variables take on some particular configuration. The first entry corresponds to the configuration SEX = male, SES = low, IQ = low, PE = low, and CP = yes. The remaining entries correspond to configurations obtained by cycling through the states of each variable such that the last variable (CP) varies most quickly. Thus, for example, the upper (lower) half of the table corresponds to male (female) students.
Figure 10. The a posteriori most likely network structures without hidden variables. For the most likely structure, log p(D|S^h) = -45653 and p(S^h|D) ≈ 1.0; for the second most likely, log p(D|S^h) = -45699 and p(S^h|D) = 1.2 × 10^-10.

As a first pass, we analyzed the data assuming no hidden variables. To generate priors for network parameters, we used the method described in Section 10.1 with an equivalent sample size of 5 and a prior network where
p(x|S_c^h) is uniform. (The results were not sensitive to the choice of parameter priors. For example, none of the results reported in this section changed qualitatively for equivalent sample sizes ranging from 3 to 40.) For structure priors, we assumed that all network structures were equally likely, except we excluded structures where SEX and/or SES had parents, and/or CP had children. Because the data set was complete, we used Equations 34 and 35 to compute the posterior probabilities of network structures. The two most likely network structures that we found after an exhaustive search over all structures are shown in Figure 10. Note that the most likely graph has a posterior probability that is extremely close to one.

If we adopt the causal Markov assumption and also assume that there are no hidden variables, then the arcs in both graphs can be interpreted causally. Some results are not surprising - for example the causal influence of socioeconomic status and IQ on college plans. Other results are more interesting. For example, from either graph we conclude that sex influences college plans only indirectly through parental influence. Also, the two graphs differ only by the orientation of the arc between PE and IQ. Either causal relationship is plausible. We note that the second most likely graph was selected by Spirtes et al. (1993), who used a non-Bayesian approach with slightly different assumptions.

The most suspicious result is the suggestion that socioeconomic status has a direct influence on IQ. To question this result, we considered new models obtained from the models in Figure 10 by replacing this direct influence with a hidden variable pointing to both SES and IQ. We also considered models where the hidden variable pointed to SES, IQ, and PE, and none, one, or both of the connections SES - PE and PE - IQ were removed. For each structure, we varied the number of states of the hidden variable from two to six. We computed
the posterior probability of these models using the Cheeseman-Stutz (1995) variant of the Laplace approximation. To find the MAP parameter configuration, we used the EM algorithm, taking the largest local maximum from among 100 runs with different random initializations of θs.

The model with the highest posterior probability among those we considered is shown in Figure 11. This model is 2 × 10^10 times more likely than the best model containing no hidden variable. The next most likely model containing a hidden variable, which has one additional arc from the hidden variable to PE, is much less likely than the model shown. Thus, if we adopt the causal Markov assumption and also assume that there are no hidden variables other than the one we have considered, then we have strong evidence that a hidden variable is influencing both socioeconomic status and IQ in this population - although we can not say anything about the identity of the variable. An examination of the probabilities in Figure 11 suggests that the hidden variable corresponds to some measure of "parent quality".

Figure 11. The a posteriori most likely network structure containing a hidden variable H. Probabilities shown are MAP values; some probabilities are omitted for lack of space. log p(S^h|D) = -45629.

  p(H=0) = 0.63, p(H=1) = 0.37

  p(SES=high|H):  H=0: 0.088;  H=1: 0.51

  p(IQ=high|PE,H):  PE=low, H=0: 0.098;  PE=low, H=1: 0.22;  PE=high, H=0: 0.21;  PE=high, H=1: 0.49

  p(PE=high|SES,SEX):  low, male: 0.32;  low, female: 0.166;  high, male: 0.86;  high, female: 0.81

  p(CP=yes|SES,IQ,PE):  low, low, low: 0.011;  low, low, high: 0.170;  low, high, low: 0.124;  low, high, high: 0.53;  high, low, low: 0.093;  high, low, high: 0.39;  high, high, low: 0.24;  high, high, high: 0.84

17. Pointers to Literature and Software

Like all tutorials, this one is incomplete. For those readers interested in learning more about graphical models and methods for learning them, we offer the following additional references and pointers to software. Buntine (1996) provides a guide to the literature on learning graphical models. Spirtes et al. (1993) and Pearl (1995) use methods based on large-sample approximations to learn causal relationships from observational data, as we have discussed.

In addition to directed models, researchers have explored network structures containing undirected edges as a knowledge representation. These representations are discussed in, for example, Lauritzen (1982), Verma and Pearl (1990), Frydenberg (1990), Whittaker (1990), and Richardson (1997). Bayesian methods for learning such models from data are described by Dawid and Lauritzen (1993) and Buntine (1994).

Finally, several research groups have developed software systems for learning graphical models. For example, Scheines et al. (1994) have developed a software program called TETRAD II for learning about cause and effect. Badsberg (1992) and Højsgaard et al. (1994) have built systems that can learn with mixed graphical models using a variety of criteria for model selection. Thomas, Spiegelhalter, and Gilks (1992) have created a system called BUGS that takes a learning problem specified as a Bayesian network and compiles this problem into a Gibbs-sampler computer program.

Acknowledgments

I thank Max Chickering, Usama Fayyad, Eric Horvitz, Chris Meek, Koos Rommelse, and Padhraic Smyth for their comments on earlier versions of this manuscript. I also thank Max Chickering for implementing the software used to analyze the Sewell and Shah (1968) data, and Chris Meek for bringing this data set to my attention.
Notation

X, Y, Z, ...       Variables or their corresponding nodes in a Bayesian network
X, Y, Z, ...       Sets of variables or corresponding sets of nodes
X = x              Variable X is in state x
X = x              The set of variables X is in configuration x
x, y, z            Typically refer to a complete case, an incomplete case, and missing data in a case, respectively
X \ Y              The variables in X that are not in Y
D                  A data set: a set of cases
D_l                The first l - 1 cases in D
p(x|y)             The probability that X = x given Y = y (also used to describe a probability density, probability distribution, and probability density function)
E_p(.)(x)          The expectation of x with respect to p(.)
S                  A Bayesian network structure (a directed acyclic graph)
Pa_i               The variable or node corresponding to the parents of node X_i in a Bayesian network structure
pa_i               A configuration of the variables Pa_i
r_i                The number of states of discrete variable X_i
q_i                The number of configurations of Pa_i
S_c                A complete network structure
S^h                The hypothesis corresponding to network structure S
θ_ijk              The multinomial parameter corresponding to the probability p(X_i = x_i^k | Pa_i = pa_i^j)
θ_ij               = (θ_ij2, ..., θ_ijr_i)
θ_i                = (θ_i1, ..., θ_iq_i)
θ_s                = (θ_1, ..., θ_n)
α                  An equivalent sample size
α_ijk              The Dirichlet hyperparameter corresponding to θ_ijk
α_ij               = Σ_{k=1}^{r_i} α_ijk
N_ijk              The number of cases in data set D where X_i = x_i^k and Pa_i = pa_i^j
N_ij               = Σ_{k=1}^{r_i} N_ijk
= Lk '= l Nijk
350
DAVIDHECKERMAN
References Aliferis , C. and Cooper, G. ( 1994). An evaluation of an algorithm for inductive learning of Bayesian belief networks using simulated data sets . In Proceedings of Tenth Con ference on Urtcertainty in Artificial Intelligence , Seattle , WA , pages 8- 14. Morgan Kaufmann
.
Badsberg, J. ( 1992). Model search in contingency tables by CoCo. In Dodge, Y . and Whittaker , J ., editors , Computational delberg .
Statistics , pages 251- 256. Physica Verlag , Hei -
Becker, S. and LeCun, Y . (1989). Improving the convergenceof back-propagation learning with second order methods . In Proceedings of the 1988 Connectionist School , pages 29- 37 . Morgan Kaufmann .
Models Summer
Beinlich, I ., Suermondt, H ., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system : A case study with two probabilistic inference techniques for belief networks . In Proceedings of the Second European Conference on Artificial Intelli gence in Medicine , London , pages 247- 256. Springer Verlag , Berlin .
Bernardo, J. ( 1979). Expected information as expected utility . Annals of Statistics , 7 :686 - 690 .
Bernardo, J. and Smith , A . (1994) . Bayesian Theory. John Wiley and Sons, New York . Buntine , W . ( 1991). Theory refinement on Bayesian networks. In Proceedingsof Seventh Conference on Uncertainty Morgan Kaufmann .
in Artificial
Intelligence , Los Angeles , CA , pages 52- 60 .
Buntine , W . (1993). Learning classification trees. In Artificial Intelligence Frontiers in Statistics : AI and statistics III . Chapman and Hall , New York .
Buntine , W . (1996) . A guide to the literature on learning graphical models. IEEE Transactions
on Knowledge and Data Engineering , 8:195- 210.
Chaloner, K . and Duncan, G. (1983). Assessmentof a beta prior distribution : PM elicitation
.
The
Statistician
, 32 : 174 - 180 .
Cheeseman, P. and Stutz , J. ( 1995). Bayesian classification (AutoClass) : Theory and results . In Fayyad , U ., Piatesky -Shapiro , G ., Smyth , P., and Uthurusamy , R ., editors , Advances in Knowledge Discovery and Data Mining , pages 153- 180. AAAI Press , Menlo
Park
, CA .
Chib , S. (1995). Marginal likelihood from the Gibbs output . Journal of the American Statistical
Association
, 90 : 1313 - 1321 .
Chickering, D . (1995). A transformational characterization of equivalent Bayesian network structures . In Proceedings of Eleventh Conference on Uncertainty Intelligence , Montreal , QU , pages 87- 98 . Morgan Kaufmann .
in Artificial
Chickering, D . (1996). Learning equivalence classesof Bayesian-network structures . In Proceedings of Twelfth Conference on Uncertainty 0 R . Morgan Kaufmann .
in Artificial
Intelligence , Portland ,
Chickering, D ., Geiger, D ., and Heckerman, D. ( 1995) . Learning Bayesian networks: Search methods and experimental results . In Proceedings of Fifth Conference on Artificial Intelligence and Statistics , Ft . Lauderdale , FL , pages 112- 128. Society for Artificial
Intelli
~ ence
in
Statistics
.
Chickering, D . and Heckerman, D . (Revised November, 1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network . Technical Report MSR - TR -96- 08 , Microsoft Research , Redmond , W A .
Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks (Research note). Artificial Intelligence, 42:393- 405. Cooper, G. and Herskovits, E. (1992) . A Bayesian method for the induction of probabilistic
networks from data . Machine Learning , 9:309- 347.
Cooper, G. and Herskovits, E. (January, 1991). A Bayesian method for the induction
A TUTORIALONLEARNING WITHBAYESIAN NETWORKS 351 Technical Report SMI-91- 1, Section on Medical
I
American Journal of
pages
Heckerman, D., The combination
Heckerman,
networks.
Heckerman,
D.
soning.
Journal
Hłjsgaard, report,
Geiger, D., and of knowledge
D.,
Bayesian
Mamdani,
and
of
S., Skjłth, Department
Howard, mentation.
HECKERMAN
A.,
Communications
Shachter,
Artificial
Strategic
of
Jaakkola,
in
T.
intractable
Artificial Jensen, F. Jensen, F.
Decision of the
IEEE,
Kass,
R.,
pages
M.
(1996).
In
Tierney,
L.,
J.,
DeGroot,
B.
Mathematical
and
M.,
Oxford
(1993).
B,
Pe
P
(1996).
Computing
Proceedings
repor K. (
University
(1936).
On
Lindley,
structures
J. (1988) D.,
Press.
an
distributions adm 39:399 409.
Society,
Linear-space
50:157 224.
Computational Bayes facto
Kadane,
Lauritzen, S. (1982). Lectures Aalborg, Denmark. Lauritzen, S. (1992). Propagation cal association models. Journal Lauritzen, S. and Spiegeihalter, graphical Society
(1 Com
Menlo
D.
based Denmark. systems. Technical Lauritzen, S., and Olesen,
261 278.
R.
and
B.
analysis: 58:632 643.
Group,
Freedman, 47:113 118.
Jordan,
Resear
Aalborg,
Koopman, the American Korf,
Decisions
networks.
A
Dec
J. (1981). Influence on the Principles Strategic Decisions J., editors (1983).
by local computations. and 90:773Raftery, A. (1995). 795.
Bernardo, 3,
and
the
(1995).
Intelligence, Portland, OR, pages (1996). An Introduction to Bayesian and Andersen, S. (1990). Approximat
knowledge University, Jensen, F.,
models Kass, R. Association,
P. and Science,
R.
Weilman
of
Intelligence
Howard, R. and Matheson, son, J., editors, Readings volume II, pages 721 762. Howard, R. and Matheson, Analysis.
and
F., and Thiesson, of Mathematics
R. Proceedings (1970).
Humphreys, Philosphy
Chickering, and statistical
and
best-first
search
on
Contingenc of probabi of the America D. (1988).
their
application
Lauritzen, S., Thiesson, B., and Spiegethalter, model selection methods: A case study. Al and NewStatistics IV, volume Lecture Notes Verlag, York. MacKay,
MacKay, Neural MacKay, Cavendish
D.
Madigan, D., the predictive tics:
Madigan, tainty
(1992a).
D. (1992b). Computation, D.Laboratory, (1996).
Theory
in
Bayesian
Garvin, J., performance
and
D. and graphical
interpolation.
A practical 4:448 472. Choice of Cambridge,
Methods,
Raftery, models
and
Bayesian basis for th UK.
Raftery, of Bayesian
24:2271 2292.
A.
(1995) graphic
A. (1994). Model using Occam s win
Interna -
909. Neal, R. (1993 ). Probabilistic inference usingMarkovchainMonteCarlomethods . TechnicalReportCRG-TR-93-1, Department of ComputerScience , Universityof Toronto. Olmsted , S. (1983 ). Onrepresenting andsolvingdecision problems . PhDthesis , Departmentof Engineering -EconomicSystems , StanfordUniversity . Pearl, J. (1986 ). Fusion , propagation , and structuringin beliefnetworks . Artificial Intelligence , 29:241-288. Pearl, J. (1995 ). Causaldiagramsfor empiricalresearch . Biometrika , 82:669- 710. Pearl, J. andVerma,T. (1991 ) . A theoryof inferredcausation . In Allen, J., Fikes, R., andSandewall , E., editors, Knowledge Representation andReasoning : Proceedings of theSecond InternationalConference , pages441-452. MorganKaufmann , NewYork. Pitman, E. (1936 ) . Sufficientstatisticsandintrinsicaccuracy . Proceedings of the CambridgePhilosophy Society , 32:567-579. Raftery, A. (1995 ). Bayesian modelselection in socialresearch . In Marsden , P., editor, Sociological Methodology . Blackwells , Cambridge , MA. Raftery, A. (1996 ). Hypothesis testingandmodelselection , chapter10. Chapmanand Hall. Ramamurthi , K. and Agogino , A. (1988 ). Realtime expertsystemfor fault tolerant supervisory control. In Tipnis, V. andPatton, E., editors , Computers in Engineering , pages333-339. AmericanSocietyof Mechanical Engineers , CorteMadera , CA. Ramsey , F. (1931 ). Truth andprobability . In Braithwaite , R., editor, TheFoundations of Mathematics andotherLogicalEssays . HumanitiesPress , London . Reprintedin KyburgandSmokIer , 1964 . Richardson , T. (1997 ) . Extensions of undirectedand acyclic , directedgraphicalmodels. In Proceedings of SixthConference on Artificial Intelligence andStatistics , Ft. Lauderdale , FL, pages407-419. Societyfor Artificial Intelligence in Statistics . Rissanen , J. (1987 ) . Stochastic complexity(with discussion ). Journalof theRoyalStatisticalSociety , SeriesB, 49:223-239and253-265. Robins , J. (1986 ). A newapproach to causalinterence in mortalitystudieswith sustained exposure results. Mathematical Modelling , 7:1393 - 1512 . Rubin, D. (1978 ). Bayesian inference for causaleffects : Theroleof randomization . Annals of Statistics , 6:34-58. Russell , S., Binder, J., Koller, D., andKanazawa , K. (1995 ). Locallearningin probabilis tic networkswith hiddenvariables . In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence , Montreal , QU, pages1146 - 1152 . Morgan Kaufmann , SanMateo, CA. Saul, L., Jaakkola , T., and Jordan , M. (1996 ). Meanfield theoryfor sigmoidbelief networks . Journalof Artificial Intelligence Research , 4:61- 76. Savage , L. (1954 ) . TheFoundations of Statistics . Dover,NewYork. Schervish , M. (1995 ). Theoryof Statistics . Springer -Verlag . Schwarz , G. (1978 ). Estimatingthedimension of a model.AnnalsofStatistics , 6:461-464. Sewell , W. andShah,V. (1968 ). Socialclass , parentalencouragement , andeducational aspirations . AmericanJournalof Sociology , 73:559-572. Shachter , R. (1988 ). Probabilistic inference andinfluence diagrams . Operations Research ,
36 : 589 Shachter
- 604
.
, R . , Andersen
posable
graphs
, S . , and .
In
Intelligence
, Boston
Intelligence
, Mountain
Shachter , R . and 35 : 527 - 550 . Silverman and Singh
, M . and
Provan
classifiers
Spiegelhalter expert
Density .
, D . and on
directed
, P . and
data
.
Data
In
Meek
Mining
Suermondt
, H . and
MS
Truesson .
Data
In
Mining
Truesson
, B .
complete Aalborg ,
( 1995b
) .
A .,
) .
A . and . Science
, T . and
, J . ( 1990
ings of Sixth Conference 220 - 227 . Morgan Kaufmann , J . ( 1990 Sons .
) .
Graphical
Artificial
-
,
Chapman
Bayesian
Information
net
-
Science
. in
, R . ( 1993
) .
of
- 605
decision
analysis
Bayesian
.
analysis
conditional
in
probabili
-
.
Causation
, Prediction
networks
Kaufmann
) .
A
and
,
and
Search
:
- 311
( 1992
In
Bernardo
Judgment
and
Uncertainty .
in
in
and
Applied
Bugs
:
A
- 842
. Oxford
- Wesley
.
synthesis
of
Multivariate
-
incomplete
models
and
with
causal
to
, J . , Dawid
perform
Press
Heuristics
models
.
, Boston
Statistics
,
, A . , and
University
:
in -
University
program
uncertainty
Intelligence
, 5 : 521
Discovery
, Aalborg
, J . , Berger
under
Artificial
with
exponential
) .
inference
.
Systems
837
for
I ( nowledge
Kaufmann
recursive
. Addison
) . Equivalence
from
Reasoning
networks on
Electronic
. 4, pages
) .
variables Discovery
algorithms
Bayesian
W . .
Analysis
Models
exact Approximate
. Morgan
of
,
Statistics
on
of
for
Gilks
, D . ( 1974 - 1131 .
of of
Conference
306
discrete
I ( nowledge
.
Journal
sampling
Data
with on
combination
, Institute
Gibbs
, Bayesian
Kahneman , 185 : 1124
in
.
selective
encoding
Conference
information
D .,
Exploratory
Pearl
Bayesian
, pages
and
,
Artificial
Science
Analysis
and
updating
) .
quantification
report
using
, A . , editors
, J . ( 1977
Tversky , biases
Score
of
, PA
, 20 : 579
( 1993
International
, QU
Spiegelhalter inference
, R .
. International
First
, Montreal
learning
) . Probability
Networks
. Morgan
Accelerated of
decom
in
Management
Data
- 95 - 36 , Computer
International
, G . ( 1991
data . Technical , Denmark .
Bayesian Smith
) .
) . Efficient
) . Sequential .
.
and
, S . , and Cowell , 8 : 219 - 282 .
) . Learning
, QU
Proceedings
Uncertainty
diagrams
, Philadelphia
Scheines
networks
, B . ( 1995a
data
for
Statistics
- CIS
, C . ( 1975 - 358 .
First
Cooper
belief
and
Uncertainty
.
, C . ( 1995
, Montreal
on Bayesian 542 .
Whittaker and
, 1995
, S . ( 1990
of
Association
for
structures
, C . , and , New York
algorithms
on
influence
Pennsylvania
Lauritzen
.
) . Gaussian
Report
graphical
Proceedings
- 244
reduction
Conference
.
, G . ( November
of
) . Directed
Sixth
Estimation
Stael von Holstein Science , 22 : 340
Spirtes , P . , Glymour Springer - Verlag
Verma
237
, CA
, C . ( 1989
the
, D . , Dawid , A . , Lauritzen systems . Statistical Science
ties
Thomas
, pages
, University
Spiegelhalter
Spirtes
MA
. Technical
Department
, K . ( 1990 of
View
Kenley
Spetzler , C . and Management
Tukey
,
, B . ( 1986 ) . Hall , New York
work
Poh
Proceedings
.
In
.
and
Proceed
, MA
, pages
John
Wiley
-
Winkler, R. (1967). The assessment of prior distributions in Bayesian analysis. American Statistical Association Journal, 62:776-800.
A VIEW OF THE EM ALGORITHM THAT JUSTIFIES INCREMENTAL, SPARSE, AND OTHER VARIANTS

RADFORD M. NEAL
Dept. of Statistics and Dept. of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~radford/

GEOFFREY E. HINTON
Department of Computer Science
University of Toronto, Toronto, Ontario, Canada
http://www.cs.toronto.edu/~hinton/

Abstract. The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that resembles negative free energy and show that the M step of the algorithm maximizes this function with respect to the model parameters and the E step maximizes it with respect to the distribution over the unobserved variables. From this perspective, it is easy to justify an incremental variant of the EM algorithm in which the distribution for only one of the unobserved variables is recalculated in each E step. This variant is shown empirically to give faster convergence in a mixture estimation problem. A variant of the algorithm that exploits sparse conditional distributions is also described, and a wide range of other variant algorithms are also seen to be possible.

1. Introduction

The Expectation-Maximization (EM) algorithm finds maximum likelihood parameter estimates in problems where some variables were unobserved. Special cases of the algorithm date back several decades, and its use has grown even more widespread since its generality and wide applicability were discussed by Dempster, Laird, and Rubin (1977). The scope of the algorithm's applications is evident in the book by McLachlan and Krishnan (1997).
The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an Expectation (E) step, which finds the distribution for the unobserved variables, given the known values for the observed variables and the current estimate of the parameters, and a Maximization (M) step, which re-estimates the parameters to be those with maximum likelihood, under the assumption that the distribution found in the E step is correct. It can be shown that each such iteration improves the true likelihood, or leaves it unchanged (if a local maximum has already been reached, or in uncommon cases, before then).

The M step of the algorithm may be only partially implemented, with the new estimate for the parameters improving the likelihood given the distribution found in the E step, but not necessarily maximizing it. Such a partial M step always results in the true likelihood improving as well. Dempster et al. refer to such variants as "generalized EM (GEM)" algorithms. A sub-class of GEM algorithms of wide applicability, the "Expectation-Conditional Maximization (ECM)" algorithms, have been developed by Meng and Rubin (1992), and further generalized by Meng and van Dyk (1997).

In many cases, partial implementation of the E step is also natural. The unobserved variables are commonly independent, and influence the likelihood of the parameters only through simple sufficient statistics. If these statistics can be updated incrementally when the distribution for one of the variables is re-calculated, it makes sense to immediately re-estimate the parameters before performing the E step for the next unobserved variable, as this utilizes the new information immediately, speeding convergence. An incremental algorithm along these general lines was investigated by Nowlan (1991). However, such incremental variants of the EM algorithm have not previously received any formal justification.

We present here a view of the EM algorithm in which it is seen as maximizing a joint function of the parameters and of the distribution over the unobserved variables that is analogous to the "free energy" function used in statistical physics, and which can also be viewed in terms of a Kullback-Liebler divergence. The E step maximizes this function with respect to the distribution over unobserved variables; the M step with respect to the parameters. Csiszar and Tusnady (1984) and Hathaway (1986) have also viewed EM in this light.

In this paper, we use this viewpoint to justify variants of the EM algorithm in which the joint maximization of this function is performed by other means - a process which must also lead to a maximum of the true likelihood. In particular, we can now justify incremental versions of the algorithm, which in effect employ a partial E step, as well as "sparse"
versions, in which most iterations update only that part of the distribution for an unobserved variable pertaining to its most likely values, and "winner-take-all" versions, in which, for early iterations, the distributions over unobserved variables are restricted to those in which a single value has probability one. We include a brief demonstration showing that use of an incremental algorithm speeds convergence for a simple mixture estimation problem.
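The following is a minimal sketch (not the authors' code) of standard batch EM for a two-component, unit-variance Gaussian mixture, the kind of model used in the demonstration mentioned above; all names and parameter values are illustrative. An incremental variant in the spirit of this paper would, after updating the responsibility of a single case, immediately refresh the sufficient statistics and re-estimate the parameters before moving on to the next case.

```python
import math, random
random.seed(1)

# Synthetic one-dimensional data from two components (made-up parameters).
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(4.0, 1.0) for _ in range(200)]

def em_mixture(data, iters=50):
    """Standard (batch) EM for a two-component Gaussian mixture with unit variances.
    Each E step computes responsibilities for every case; each M step then
    re-estimates the mixing proportion and the two component means."""
    pi, mu = 0.5, [min(data), max(data)]            # crude initialization
    for _ in range(iters):
        # E step: posterior probability that each case came from component 1.
        resp = []
        for x in data:
            p1 = pi * math.exp(-0.5 * (x - mu[1]) ** 2)
            p0 = (1 - pi) * math.exp(-0.5 * (x - mu[0]) ** 2)
            resp.append(p1 / (p0 + p1))
        # M step: maximize the expected complete-data log likelihood.
        n1 = sum(resp)
        pi = n1 / len(data)
        mu[1] = sum(r * x for r, x in zip(resp, data)) / n1
        mu[0] = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - n1)
    return pi, mu

print(em_mixture(data))   # roughly (0.5, [0.0, 4.0]) on this synthetic data
```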
2. General theory

Suppose that we have observed the value of some random variable, Z, but not the value of another variable, Y, and that based on this data, we wish to find the maximum likelihood estimate for the parameters of a model for Y and Z. We assume that this problem is not easily solved directly, but that the corresponding problem in which Y is also known would be more tractable. For simplicity, we assume here that Y has a finite range, as is often the case, but the results can be generalized.

Assume that the joint probability for Y and Z is parameterized using θ, as P(y, z | θ). The marginal probability for Z is then P(z | θ) = Σ_y P(y, z | θ). Given observed data, z, we wish to find the value of θ that maximizes the log likelihood, L(θ) = log P(z | θ).

The EM algorithm starts with some initial guess at the maximum likelihood parameters, θ^(0), and then proceeds to iteratively generate successive estimates, θ^(1), θ^(2), ..., by repeatedly applying the following two steps, for t = 1, 2, ...

E Step: Compute a distribution P̃^(t) over the range of Y such that P̃^(t)(y) = P(y | z, θ^(t-1)).        (1)

M Step: Set θ^(t) to the θ that maximizes E_{P̃^(t)}[log P(y, z | θ)].

Here, E_P̃[·] denotes expectation with respect to the distribution over the range of Y given by P̃. Note that in preparation for the later generalization, the standard algorithm has here been expressed in a slightly non-standard fashion.

The E step of the algorithm can be seen as representing the unknown value for Y by a distribution of values, and the M step as then performing maximum likelihood estimation for the joint data obtained by combining this with the known value of Z, an operation that is assumed to be feasible.

As shown by Dempster et al., each EM iteration increases the true log likelihood, L(θ), or leaves it unchanged. Indeed, for most models, the algorithm will converge to a local maximum of L(θ) (though there are exceptions to this). Such monotonic improvement in L(θ) is also guaranteed for any
GEM algorithm, in which the M step is only partially performed, in the sense that θ^(t) is simply set to some value of θ such that E_{P̃^(t)}[log P(y, z | θ^(t))] is greater than E_{P̃^(t)}[log P(y, z | θ^(t-1))] (or equal, if a local maximum has already been reached).

In this paper, we view the E and M steps of the EM algorithm, and of its variants, as each maximizing (or at least increasing) the same function, F, of both the parameters θ and of a distribution P̃ over the range of the unobserved variables. The E step maximizes F with respect to P̃, and the M step with respect to θ; a partial E step or partial M step only partially performs the corresponding maximization. A local maximum of F occurs at a local maximum of L as well. From this perspective, we can therefore contemplate a wide variety of algorithms that maximize F by other means, among which are incremental algorithms, in which the E step updates the distribution for only one of the unobserved variables (corresponding to one data item) at a time.

The function F(P̃, θ) is defined as follows:

F(P̃, θ) = E_P̃[log P(y, z | θ)] + H(P̃)        (2)

where H(P̃) = -E_P̃[log P̃(y)] is the entropy of the distribution P̃. Note that F is a function of θ and of the distribution P̃ over the values of Y; the observed data z is fixed throughout. For simplicity, we assume that P(y, z | θ) is non-zero for the values of y under consideration, so that F is well defined, and that P(y, z | θ) is a continuous function of θ, so that F varies continuously with θ.

Apart from a change of sign, F is analogous to the "free energy" function used in statistical physics, with P̃ corresponding to a distribution over physical states, -log P(y, z | θ) corresponding to the energy of state y, and the maximizing distribution corresponding to the Boltzmann distribution, whose normalizing constant is the partition function. One can also relate F to the Kullback-Liebler divergence between P̃ and the distribution P_θ given by P_θ(y) = P(y | z, θ):

F(P̃, θ) = -D(P̃ ‖ P_θ) + L(θ)        (3)

where D(P̃ ‖ P_θ) = E_P̃[log P̃(y) - log P_θ(y)] is the Kullback-Liebler divergence, which is always non-negative and is zero only when P̃ = P_θ. From this, one can see that, for a fixed θ, the value of P̃ that maximizes F is P_θ, and that the maximum value of F is then L(θ). The following two lemmas state the essential properties of F that we need.

Lemma 1 For a fixed value of θ, there is a unique distribution, P̃_θ, that maximizes F(P̃, θ), given by P̃_θ(y) = P(y | z, θ). Furthermore, this P̃_θ varies continuously with θ.
A VIEW OF THE EM ALGORITHM AND ITS VARIANTS
359
,..., PROOF. In maximizing F with respect to P , we are constrained by the requirement that P (y) ~ 0 for all y . Solutions with P (y) == 0 for some yare not possible , however - one can easily show that the slope of the entropy is infinite at such points , so that moving slightly away from the boundary will increase F . Any maximum of F ,.,.-must therefore occur at a critical point subject to the constraint that Ly P (y) == 1, and can be found using a Lagrange multiplier . At -such a maximum , Po, the gradient of F with respect to the components of P will be normal to the constraint surface , i .e. for some A and for all y ,
8F A=8P ,.,(y)(p(}) =log P(y,zIfJ )- log Po (y)- 1
(4)
From this, it follows that Pe(y) must be proportional to P (y, z 18) . Normalizing so that Ey p()(y) == 1, we have p()(y) == P (y I z, 8) as the unique solution. That Po varies continuously with () follows immediately from our
assumption thatP(y, z I (}) does . .., .., Lemma 2 If P(y) ==P(yIz, (J) ==p(}(y) then F(P,0) ==logP(zI0) ==L((J). .., PROOF . If P(y) ==P(yIz, (J), then "" "" F(P,O) == Ep[logP(y, zI0)] + H(P) == Ep[logP(y, z10)] - Ep[logP(yIz, 0)] == Ep[logP(y, zI0) - logP(yIz, 0)] == Ep[logP(z10)] == logP(zIlJ) Aniteration ofthethestandard EMalgorithm cantherefore beexpressed interms ofthefunction F asfollows : E Step: Setp(t) totheP thatmaximizes F(P,f}(t-l)). ,., (5) M Step: Setf}(t) tothef} thatmaximizes F(P(t),8). Theorem 1 Theiterations given by(1) andby(5) areequivalent . PROOF . ThattheE steps oftheiterations areequivalent follows directly fromLemma 1. ThattheMsteps areequivalent follows fromthefactthat the entropy term in the definition of F in equation (2) does not depend on (}. Once the EM iterations have been expressed in the form (5) , it is clear that the algorithm converges to values P * and (}* that locally maximize
360
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
,.., F ( P , (J) lowing will
( ignoring
the
theorem
shows
yield
a local
also
algorithm
of
formed
that
, in
well
of
P
and
If
F ( P , fJ )
to
, finding
for
as
to
convergence
general
variants
,,.., as
respect
of
maximum
( 5 ) , but
partially
with
possibility
a saddle
a local
it
in
which in
(J simultaneously
not
the
E
which
a
2
local
and
maximum
0 * , then
PROOF
. By
To
show
L
any
that
near
to
then
we
is
In
of
typical
applications
Z , can Y , as
then
I
fJ )
An be
==
incremental
independent ,.., P
that
maximum
on , we
factor . We
can
then
to
be
that
( Zl
need of
to
maximize
0 ( 0 ) , and
that
that
,..,near
guess
L (8)
-
has ,..,
at
P *
P (z I 0) =
=
is no
(Jt
a (} t existed
,
Pot
. But
since
P * , contradicting (J* . The
to this
there
,..,if such pt
to
P * and
for
the
proof
nearby
the for
global
of
(J. The
values
latter
result
.
=
IIi
,
the
F
. The
in
Y
the
as
identically
this
that of
will ,..,
, ,
factored
also
exploits
Note
P
be
variable variable
.
a maximum ,..,
form
can are
here that
2 .
Z
items
algorithm
parameter
observed
unobserved
and
that
Theorem for
the
data
assume
Pi ( Yi ) , since
write
likelihood
items
for
EM
of
maximum
, . . . , Zn ) , and
the " " search
the
to
distributions
F
have
structure
since
this
Yi
form ,..,
at
are
the
F ( P , fJ) == L: : i Fi ( Pi , fJ ) , where ,..,
==
algorithm F , and
some
done
maximum
L ( O ) == log
,..,
incremental
8 * , then
show
, note
data
to
the
basis
Fi { Pi , (J)
An
per
is
F { P {}* , {} * ) == F { P * , {} * ) .
to
restriction
find
I 0 ) . Often
can "" restrict P (y )
and
a global
need
at
probability
not
the
as
must
independent
joint
variant
justified
see
see ,..,this
the
as
will
has
F ( P * , (J* ) , where
unnecessary
wish
of
P ( Yi , Zi we
are
hms
, we
IIi
L , we
maximum
decomposed
, but
P *
, L {8* ) =
of
fJ , pt
F
2 , we
, fJt ) >
are
( Y1 , . . . , Y n ) . The
P (y , z
steps
maximization
(J* .
particular
without
a number be
distributed
can
, but
algorit
given
at
L ( (J* ) . To ,..,
a local
continuity
Incremental
estimate
>
F ( pt
at
, if
1 and
, in
with haB
analogous
aBsumptions
. Similarly
maximum
have
F
maximum
maximum
that
continuously
maxima
well
L ( (Jt )
also
M
F (P , 8 ) standard
.
local
Lemmas
0 , and
that
a
a global
which
would
assumption
3.
as
{} * is a local
(J* for
varies
8*
has
the
,..,fol -
,.., has
combining
F ( P {} , O ) , for
Po
at
for
only
and
the
,.., Theorem
) . The
maximum
L ( 8 ) , justifying
algorithms
point
E ~ [ logP
using
the
hence
L , starting
at
distribution
the
( Yi ' ziI8
following from
)]
+
H
( Pi )
iteration some
, ~ ( O) , which
(6 )
can
guess might
at
the or
then
be
used
parameters might
not
, be
A VIEW OF THE EM ALGORITHMAND ITS VARIANTS
361
consistentwith (}(O): E Step: Choosesomedata item, i , to be updated. Set pit ) = pJt- l ) for j =I i . (This takesno time). Set Pi(t) to the Pi that maximizesPi (Pi, (j(t- l )), given by Pi(t) (Yi) = P(Yi I Zi, (}(t- l )).
(7)
M Step: Set (}(t) to the (} that maximizesF (P(t), (}), or, equivalently , that maximizesEp(t)[logP(y, Z I (})]. Data items might be selected for updating in the Estep cyclicly , or by some scheme that gives preference to data items for which Pi has not yet -
stabilized
.
Each E step of the above algorithm requires looking at only a single data item , but , ~ written , it appears that the M step requires looking at -
all components
-
of P . This can be avoided in the common
case where the
inferential import of the complete data can be summarized by a vector of sufficient statistics that can be incrementally updated , as is the case with models in the exponential family .
Letting this vector of sufficient statistics be s(y, z) = Ei Si(Yi, Zi) , the standard EM iteration of (1) can be implemented ~ follows:
E Step: Set .s'(t) = Ep[s(y, z)], whereP(y) = P(y I Z, (}(t- l ). (In detail set s(t) = ~ . ~t) with ~ t) = E- [s'(y' z.))
- ' L..,t 1. , t where Pi (Yi) = P (Yi I Zi, (}(t- l )).)
Pi t 1., 't ,
(8)
M Step : Set (}(t) to the {} with maximum likelihood given s( t) . Similarly , the iteration of (7) can be implemented using sufficient statistics
thataremaintained incrementally , starting withaninitialguess , ~O), which mayor follows
may not be consistent with (}(O). Subsequent iterations proceed ag :
E Step :
Choose some data item , i , to be updated .
Sets)t) = ~t-l) forj =I i. (Thistakes notime .) Set~t) = Epi[Si(Yi, Zi)], for Pi(Yi) = P(YiI Zi, (}(t- l)).
(9)
Sets(t) = s
In iteration (9), both the E and the M steps take constant time, independent of the number of data items . A cycle of n such iterations , visiting
362
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
each data item once, will sometimes take only slightly more time than one iteration of the standard algorithm , and should make more progress, since the distributions for each variable found in the partial E steps are utilized immediately , instead of being held until the distributions for all the unobserved variables have been found . Nearly as fast convergence may be obtained with an intermediate variant of the algorithm , in which each E
step recomputes the distributions for several data items (but many fewer than n) . Use of this intermediate variant rather than the pure incremental algorithm reduces the amount of time spent in performing the M steps.
Note that an algorithm basedon iteration (9) must save the last value computed for each Si, so that its contribution a new value for Si is computed . This
to S may be removed when
requirement
will
onerous . The incremental update of s could potentially with
cumulative
avoided
round - off
error . If
in several ways -
necessary , this
one could
generally
lead to problems
accumulation
use a fixed - point
not be can
representation
be
of
5, in which addition and subtraction is exact , for example , or recompute 5 non-incrementally at infrequent intervals . An incremental variant of the EM algorithm somewhat similar to that
of (9) was investigated by Nowlan (1991). His variant does not maintain strictly accurate sufficient statistics , however . Rather , it uses statistics computed as an exponentially decaying average of recently -visited data points , with iterations of the following form : E Step :
Select the next data item , i , for updating .
Set~t) == Ep[Si(Yi, Zi)], forPi(Yi) ==P(YiIZi, (J(t- l)). a
t
Sets(t) == , s(t- l) + ~ ).
(10)
M Step : Set (J(t) to the () with maximum likelihood given s(t). where 0 < , < 1 is a decay constant . The above algorithm will not converge to the exact answer, at least not if , is kept at some fixed value. It is found empirically , however , that it can converge to the vicinity of the correct answer more rapidly than the standard EM algorithm . When the data set is large and redundant , one might expect that , with an appropriate value for " this algorithm could be
faster than the incremental algorithm of (9) , since it can forget out-of-date statistics more rapidly .
4. Demonstration
for a mixture
model
In order to demonstrate that the incremental algorithm of (9) can speed convergence, we have applied it to a simple mixture of Gaussians problem .
The algorithm using iteration (10) was also tested.
A VIEW OF THE EM ALGORITHMAND ITS VARIANTS
363
Given s(y, z) == LiSi (Yi, Zi) == (no, mo, qo, nl , ml , ql ), the maximum likelihood parameter estimates are given by 0: = nl / (no + nl ) , /-Lo = moina ,
0"5 == qolno- (molno)2, /-Ll == ml / nl , and ai == ql/ nl - (ml / nl )2. We synthetically tribution
with
generated a sample of 1000 points , Zi, from this dis-
0: == 0 .3 , /-Lo = 0 , 0"0 == 1 , /-Ll == - 0 .2 , and al == 0 . 1 . We then
applied the standard algorithm of (8) and the incremental algorithm of (9)
to thisdata. Asinitialparameter values , weused0:(0) = 0.5, /-L~O) ==+1.0, O "~O) = 1, /-L~O) = - 1, andaiD) = 1. Fortheincremental algorithm , a single iteration of the standard algorithm was then performed to initialize the distributions for the unobserved variables . This is not necessarily the best procedure , but was done to avoid any arbitrary selection for the starting distributions , which would affect the comparison with the standard algorithm . The incremental algorithm visited data points cyclicly . Both algorithms converged to identical maxima of L , at which 0: * = 0.269, /-La = - 0.016, 0' 0 = 0.959, /-Li = - 0.193, and 0' ; = 0.095. Special measures to control round -off error in the incremental algorithm were found
to be unnecessaryin this case (using 54-bit floating-point numbers) . The rates of convergence of the two algorithms are shown in Figure 1, in which the log likelihood , L , is plotted as a function of the number of " passes" - a pass being one iteration for the standard algorithm , and n iterations
for the incremental algorithm . (In both case, a pass visits each data point once.) As can be seen, the incremental algorithm reached any given level of L in about half as many passes as the standard algorithm . Unfortunately , each pass of the incremental algorithm required about twice as much computation time as did a pass of the standard algorithm , due primarily to the computation required to perform an M step after visiting every data point . This cost can be greatly reduced by using an
364
RADFORDM. NEALAND GEOFFREYE. HINTON
- 1080 . . .
- 1100
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
'
.
- 1120 - 1140
. . . . . . .
- 1160
.'
- 1180 - 1200 - 1220
-
- 1240 0
10
20
30
40
Figure 1. Comparison of convergencerates for the standard EM algorithm (solid line) and the incremental algorithm (dotted line). The log likelihood is shown on the vertical axis, the number of passesof the algorithm on the horizontal axis.
- 1080 ,
- 1100
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
'
- 1120 - 1140 - 1160 - 1180 - 1200 - 1220 - 1240 0
5
10
15
20
25
Figure 2. Convergencerates of the algorithm using exponentially decayed statistics with 'Y= 0.99 (dashed line) and 'Y = 0.95 (dotted line) . For comparison, the performance of the incremental algorithm (solid line) is reproduced as well (as in Figure 1) .
A VIEW OF THE EM ALGORITHM AND ITS VARIANTS
365
intermediate algorithm in which each E step recomputes the distributions for the next ten data points . The rate of convergence with this algorithm is virtually indistinguishable from that of the pure incremental algorithm , while the time required for each pass is only about 10% greater than for the standard algorithm , producing a substantial net gain in speed. The algorithm of iteration (10) was also tested . The same initialization procedure was used, with the elaboration that the decayed statistics were computed , but not used, during the initial standard iteration , in order to initialize them for use in later iterations . Two runs of this algorithm are shown in Figure 2, done with 'Y == 0.99 and with 'Y == 0.95. Also shown is the run of the incremental algorithm (as in Figure 1) . The run with I == 0.99 converged to a good (but not optimal ) point more rapidly than the incremental algorithm , but the run with 'Y = 0.95 converged to a rather poor point . These results indicate that there may be scope for improved algorithms that combine such fast convergence with the guarantees of stability and convergence to a true maximum that the incremental algorithm provides .
5. A sparse algorit hm A "sparse" variant of the EM algorithm may be advantageous when the unobserved variable , Y , can take on many possible values, but only a small set of " plausible " values have non-negligible probability (given the observed data and the current parameter estimate ) . Substantial computation may sometimes be saved in this case by "freezing" the probabilities of the im plausible values for many iterations , re- computing only the relative proba bilities of the plausible values. At infrequent intervals , the probabilities for all values are recomputed , and a new set of plausi ble values selected (which may differ from the old set due to the intervening change in the parameter estimate ) . This procedure can be designed so that F is guaranteed to increase with every iteration , ensuring stability , even though some iterations may decrease L . In detail , the sparse algorithm represents p (t) as follows :
p(t)(y)
q ~t )
if y ~ S ( t )
Q ( t ) r ~t )
if y E S (t )
(13)
Here , S (t) is the set of plausible values for Y , the qtt) are the frozen probabil ities for implausible values, Q (t) is the frozen total probability for plausible values, and the r ~t) are the relative probabilities for the plausible values, which are updated
every iteration
.
366
RADFORD
Most iterations E Step :
M . NEAL AND GEOFFREY
of the sparse algorithm
E . HINTON
go as follows :
Set S (t) == S (t - l ) , Q (t ) = Q (t - l ) , and qtt ) = qtt - l ) for all y ~ S (t) . (This takes no time .)
(14)
Set rtt ) == P (y I z , (}(t- l )) / P (y E S (t ) I z , (}(t - l )) for all y E S (t) . M
Step : Set (}(t ) to the () that maximizes
F (P (t) , (}) .
It can easily be shown that the above E step selects those rtt ) that maximize F (P (t) , e(t - l )) . For suitable models , this restricted E step will take time proportional only to the size of S (t), ind .ependent of how many values are in the full range for Y . For the method to be useful , the model must also be such that the M step above can be done efficiently , as is discussed below . On occasion , the sparse algorithm E Step :
performs
a full iteration
, as follows :
Set s (t) to those y for which P (y I z , (}(t - l ) ) is non - negligible . For all y ~ s (t) , set qtt ) == P (y I z , 8(t - l )) .
(15)
Set Q (t) == P (y E s (t) I z , (}(t - l ) ) . For all y E s (t ) , set rtt ) == P (y I z , O(t - l )) / Q (t). M Step : Set (}(t) to the 0 that maximizes
F (P (t), 8) .
The decisions as to which values have " non - negligible " probability can be made using various heuristics . One could take the N most probable values , for some predetermined N , or one could take as many values as are needed to account for some predetermined fraction of the total probability . The choice made will affect only the speed of convergence , not the stability of the algorithm - even with a bad choice for S (t) , subsequent iterations cannot decrease F . For problems with independent each data item , i , can be treated " plausible " values , s1t ) , and with
observations , where Y == (Y1, . . . , Yn) , independently , with a separate set of distributions
Pi(t ) expressed in terms of
~,y ", ) . For efficient implementation quantities q~ t) , Q ~t) , and r ~ty of the M step in ( 14) , it is probably necessary for the model to have simple sufficient statistics . The contribution to these of values with frozen probabilities can then be computed once when the full iteration of ( 15) is performed , and saved for use in the M step of ( 14) , in combination with the statistics for the plausible
values in Si(t )
A VIEWOFTHEEMALGORITHM ANDITSVARIANTS The
Gaussian
mixture
usefulness
of
mixture
,
having
the
each
data
come
the
algorithm
that
applied
in
.
distant
a
non
-
potential
in
negligible
the
probability
means
are
.
the
effect
of
nearby
avoids
negligible
the
components
components
have
of
many
whose
that
the
sparse
on
Freezing
continual
the
re
course
of
-
the
and
can
be
,
find
the
employ
For
,
but
(
,
,
stages
true
maximum
an
the
and
,
be
EM
-
can
)
.
"
winner
easily
be
-
"
-
,
-
-
all
-
all
method
also
,
,
but
which
they
though
In
'
t
L
in
seen
in
this
light
applied
proportions
Hidden
is
L
guaranteed
regard
sensible
the
the
estimating
to
the
.
as
instance
one
There
finding
mixing
used
.
in
algorithm
and
find
of
converged
be
EM
more
don
variant
variant
of
can
lead
appear
they
a
this
has
neither
might
such
capable
often
.
to
the
maximum
using
the
be
by
maximizes
variances
is
may
represented
variant
of
with
be
that
variant
all
version
(
view
probability
unconstrained
fJ
algorithm
recognition
hoc
even
-
clustering
take
iteration
take
,
This
algorithm
,
the
a
.
zero
,
of
to
jointly
EM
to
switching
fJ
Obviously
to
value
EM
could
methods
assign
course
.
converge
winner
means
speech
ad
F
,
and
the
to
of
One
.
advantages
take
each
,
a
P
of
- -
of
.
optimization
procedures
variant
find
problem
-
completely
maximizing
"
like
variants
F
-
P
,
not
winner
for
with
all
-
one
general
the
K
standard
to
can
F
,
increase
-
computational
known
Models
as
in
mixture
The
Markov
EM
only
maximizing
respect
probability
when
Gaussian
with
the
of
of
distribution
maximizing
incremental
ods
of
not
terms
distribution
need
only
well
fixed
to
,
)
take
assigned
hence
of
The
to
,
fJ
other
a
is
however
early
as
)
,
-
Such
in
variety
the
that
fJ
-
P
winner
cannot
P
might
"
.
algorithm
wide
into
a
value
value
F
(
constraining
one
single
a
F
insight
by
it
of
of
example
are
viewing
any
provide
obtained
the
variants
algorithms
by
maximum
also
all
incremental
sparse
justified
example
can
and
.
incremental
for
of
have
components
the
example
are
variants
that
of
there
typically
few
quantities
combination
Other
The
to
an
If
.
Note
6
a
for
of
.
will
only
probabilities
computation
provides
algorithm
point
from
small
problem
sparse
367
these
when
meth
seen
unconstrained
in
-
terms
maximum
.
ACKNOWLEDGEMENTS
We
thank
cke
,
This
Wray
and
work
of
tu
.
te
for
Geoffrey
Advanced
,
Bill
Titterington
was
Council
Centre
Buntine
Mike
Byrne
for
supported
by
Canada
and
Hinton
is
Research
the
by
,
Mike
Jordan
comments
the
the
Natural
,
an
Jim
Kay
earlier
Sciences
Ontario
Nesbitt
.
on
and
Burns
of
the
Stol
this
-
paper
.
Research
Technology
fellow
Andreas
of
Engineering
Information
-
,
version
Research
Canadian
Insti
-
368
RADFORD
M . NEAL
AND
GEOFFREY
E . HINTON
REFERENCES Csiszar I . and Tusnady, G. (1984) "Information geometry and alternating minimization procedures" , in E. J. Dudewicz, et al (editors) RecentResults in Estimation Theory and Related Topics (Statistics and Decisions, Supplement Issue No. 1, 1984) . Dempster, A . P., Laird , N. M ., and Rubin, D . B. (1977) "Maximum likelihood from incomplete data via the EM algorithm" (with discussion), Journal of the Royal Statistical Society B , vol . 39, pp . 1-38.
Hathaway, R. J. (1986) "Another interpretation of the EM algorithm for mixture distributions " , Statistics and Probability Letters , vol . 4, pp . 5356 .
McLachlan, G. J. and Krishnan,. T . (1997) The EM Algorithm and Extensions , New York : Wiley .
Meng, X . L. and Rubin, D. B. (1992) "Recent extensionsof the EM algorithm (with discussion)" , in J. M . Bernardo, J. O. Berger, A . P. Dawid, and A . F . M . Smith (editors ) , Bayesian Statistics 4, Oxford : Clarendon Press .
Meng , X . L . and van Dyk , D . (1997) "The EM algorithm -
an old folk -
song sung to a fast new tune" (with discussion), Journal of the Royal Statistical Society B , vol . 59, pp . 511-567.
Nowlan, S. J. (1991) Soft Competitive Adaptation.. Neural Network Learning Algorithms based on Fitting Statistical Mixtures , Ph . D . thesis , School of Computer Science, Carnegie Mellon University , Pittsburgh .
LATENT VARIABLE MODELS
CHRISTOPHER M. BISHOP Microsoft Research St. GeorgeHouse 1 Guildhall Street Cambridge CB2 3NH, U.K.
Abstract . A powerful approach to probabilistic modelling involves supplementing a set of observed variables with additional latent , or hidden , variables . By defining a joint distribution over visible and latent variables , the corresponding distribution of the observed variables is then obtained by marginalization . This allows relatively complex distributions to be expressed in terms of more tractable joint distributions over the expanded variable space. One well -known example of a hidden variable model is the mixture distribution
in which the hidden variable is the discrete component
the case of continuous
latent
variables
we obtain
models
label . In
such as factor
ana -
lysis . The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation , usually in terms of a directed acyclic graph , or Bayesian network . In this chapter we provide an overview
of latent variable
models for representing
continuous
variables .
We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well -known technique of princi -
pal components analysis (PCA ). By extending this technique to mixtures, and hierarchical mixtures , of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization . We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (G TM ) . Finally , we show how GTM can itself be extended to model temporal data . 371
372
CHRISTOPHER M. BISHOP
1. Density
Modelling
One of the central problems in pattern recognition and machine learning is that of density estimation , in other words the construction of a model of a probability distribution given a finite sample of data drawn from that distribution . Throughout this chapter we will consider the problem of modelling the distribution of a set of continuous variables tl , . . . , td which we will collectively denote by the vector t . A standard approach to the problem of density estimation involves parametric models in which a specific form for the density is proposed which contains a number of adaptive parameters . Values for these parameters are
then determined from an observeddata set D = {tl , . . . , tN } consisting of N data vectors . The most widely used parametric model is the normal , or Gaussian , distribution
given by
p(tlJ -L,~)=(27r )-d/21 ~1-1/2exp {-~(t - J-L)~-l(t - J-L)T} (1) where JL is the mean , ~ is the covariance matrix , and I~ I denotes the determinant of E . One technique for setting the values of these parame ters is that of maximum likelihood which involves consideration of the log probability of the observed data set given the parameters , i .e.
N (I.L, ~ ) = Inp(DII.L, ~ ) = LInp .l"E) n=l (tnIJ
(2)
in which it is assumed that the data vectors tn are drawn independently from the distribution . When viewed as a function of IL and E , the quantity p (DIIL , E ) is called the likelihood function . Maximization of the likelihood (or equivalently the log likelihood ) with respect to IL and E leads to the set of parameter values which are most likely to have given rise to the observed data set. For the normal distribution (1) the log likelihood (2) can be maximized analytically , leading"""- to the intuitive result [1] that the maximum likelihood solutions [L and E are given by
... ~
-
1
N
NLtn
(3)
n = l
... E
-
1
N
N L (tn - j:L)(tn - j:L)T
(4)
n = l
corresponding to the sample mean and sample covariance respectively . As an alternative to maximum likelihood , we can define priors over J..I, and E use Bayes' theorem , together with the observed data , to determine
LATENTVARIABLEMODELS
373
the posterior distribution . An introrlluction to Bayesian inference for the normal distribution is given in rS]. While the simple normal distribution (1) is widely used, it suffers from some significant limitations . In particular , it can often prove to be too flexible in that the number of independent parameters in the model can be excessive. This problem is addressed through the introduction of conti nuous latent variables . On the other hand , the normal distribution can also be insufficiently flexible since it can only represent uni -modal distributions . A more general family of distributions can be obtained by considering mix tures of Gaussians , corresponding to the introduction of a discrete latent variable . We consider each of these approaches in turn .
1.1. LATENTVARIABLES Consider
the
~
is
a
further
symmetric
.
large
data
E
d
free
assumption and
, such
a
different
can
now
be
to
that
data
d and
the
to
reduce
of to
.
3 ) /
2
t
are
There
are in
large
numbers
maximum
likelihood
the
number
of
matrix
corresponds
to
a
statistically
capture
Since
parameters
covariance
,
components
+
excessively
diagonal
however
( d
that way
unable
the
while
number still
' hidden
the
first
,
very
free
which strong
independent
the
p
latent
to
)
of
assume
,
so
( t
)
correlations
that
the
( x
of
the
the
of
( Xl
, between
joint
within
of
be
a
'
tl
. . . ' Xq
)
conditional
, . . . , td
, x
variables given
the
latent
by
)
intro
terms <
into
-
model
in q
and
distribution
distribution
,
model
variable
where
( t
latent
the
captured
latent
distributionp the
variables the
goal variables
=
joint )
data that
freedom to
The
x
the p
of
correlations
.
variables
decomposing
( tlx
degrees
variables
distribution p
of
allowing
' )
distribution of
variables
parameters
,
ensure One
( 1 ) .
.
,
convenient
the
distribution
making d2
a ,
normal
independent JL ,
consider This
therefore
how
( or
distribution
to
to
the
like
.
is
2
in
required
.
is
show
by
often
the
namely
marginal
1 ) /
determined
model
number
the
+
grows
parameters
model
express
achieved
nal
the
latent
smaller
of
be well
controlled
ducing
( d
number
components
We
is
this
is
in just
d
in
parameters
may
for
parameters
contains
d
points
parameters has
it
free
independent
For
solution
of
, d
total of
number
d .
the the variables factorizes
of
a
This
is
product conditio
.
It
is
over
becomes
d p (t , x ) = p (x )p (tlx ) = p (X ) n p (tiIX ) . i=l
(5)
This factorization property can be expressedgraphically in terms of a Bayesian network, as shown in Figure 1.
374
CHRISTOPHER M. BISHOP p(X)
p(tdlx) -
-
-
-
-
-
-
Figure 1. Bayesian network representation of the latent variable distribution given by (5) , in which the data variables tl , . . . , td are independent given the latent variables x .
11 y(x;w)
s
., ~~""-'- - - """"'------.... Xz
Xl
13
t2
Figure2. The non-linearfunctiony (x ; w) definesa manifoldS embeddedin data spacegivenby the imageof the latent spaceunderthe mappingx -t y . We next expressthe conditional distribution p(tlx ) in terms of a mapping from latent variables to data variables, so that t = y (x ; w ) + u
(6)
where y (x ; w ) is a function of the latent variable x with parameters w , and u is an x -independent noise process. If the components of u are uncorrela ted , the conditional distribution for t will factorize as in (5) . Geometrically the function y (x ; w ) defines a manifold in data space given by the image of the latent space, as shown in Figure 2. The definition of the latent variable model is completed by specifying the distribution p (u ) , the mapping y (x ; w ) , and the marginal distribution p (x ) . As we shall see later , it is often convenient to regard p (x ) as a prior distribution over the latent variables .
LATENT VARIABLE MODELS
375
The desired model for the distribution p(t ) of the data is obtained by marginalizing over the latent variables p(t ) = jP (tlx )p(x ) dx .
(7)
This integration will , in general, be analytically intractable except for specific forms of the distributions p(tlx ) and p(x ). One of the simplest latent variable models is called factor analysis [3, 4] and is based on a linear mapping y (x ; w ) so that t = Wx
+ Jl, + u ,
(8)
in which Wand I'" are adaptive parameters . The distribution p (x ) is chosen to be a zero- mean unit covariance Gaussian distribution N (O, I ), while the noise model for u is also a zero mean Gaussian with a covariance matrix 'li which is diagonal . Using (7) it is easily shown that the distribution p (t ) is also Gaussian , with mean J.L and a covariance matrix given by 'l' + WWT . The parameters of the model , comprising W , 'Ii and Jl" can again be determined by maximum likelihood . There is, however, no longer a closedform analytic solution , and so their values must be determined by iterative procedures . For q latent variables , there are q x d parameters in W together with d in 'Ii and d in J-L. There is some redundancy between these para meters , and a more careful analysis shows that the number of independent degrees of freedom in this model is given by
(d + 1)(q + 1) - q(q + 1)/ 2.
(9)
The number of independent parameters in this model therefore only grows linearly with d, and yet the model can still capture the dominant correlations between the data variables. We consider the nature of such models in more detail in Section 2.
1.2. MIXTUREDISTRIBUTIONS The density models we have considered so far are clearly very limited in terms of the variety of probability distributions which they can model since they can only represent distributions which are uni - modal . However , they can form the basis of a very general framework for density modelling , obtained by considering probabilistic mixtures of M simpler parametric distributions . This leads to density models of the form
p(t ) =
M L (tli) i==l7rip
(10)
376 in
CHRISTOPHER M. BISHOP
which
the
and
p
might
each
(
7ri
integrate
have
10
~
to
these
example
)
1
,
of
are
Bayesian
called
normal
7ri
=
can
network
so
,
)
will
individual
represent
as
( t
shown
in
form
.
the
non
-
1
)
The
requi
negative
-
and
densities
also
distribution
3
(
Ei
satisfy
be
mixture
Figure
n
the
component
the
mixture
matrix
and
p
the
of
covariance
that
the
We
and
coefficients
1
( assuming
.
of
distributions
lLi
mixing
Ei
unity
)
components
mean
and
properties
simple
individual
independent
in
~
the
for
own
7ri
0
represent
,
its
rements
will
)
consist
with
parameters
a
( tli
(
10
)
as
.
. I
p
Figure
3
The
.
Bayesian
mixing
values
of
to
can
label
i
.
evaluate
tli
For
a
the
be
of
value
takes
to
ofp
for
( iltn
)
can
' explaining
reverse
the
The
log
'
Maximization
of
of
due
powerful
to
the
the
for
based
Zni
point
of
on
the
specifying
tn
)
then
,
}
)
=
In
the
] ,
log
likelihood
)
if
and
we
i
~
was
( tli
would
' comp ({7ri, ILi, Ei } ) =
Bayes
the
)
}
(
.
the
introductory
a
single
An
[ 5
a
] .
set
for
The
of
EM
12
)
com
elegant
-
and
expectation
account
in
-
of
EM
in
algorithm
indicator
generating
NM L=liL=lZni In{7riP (tli)} n the
theorem
form
for
logarithm
responsible
take
i
'
.
called
given
component
using
than
the
were
)
.
7riP
given
by
11
is
complex
an
given
'
.
optimization
is
,
{
the
Bayes
( 1 J
takes
inside
this
use
which
3
more
sum
for
then
,
responsibility
Figure
E
can
)
tn
distribution
distributions
that
Ii (
P
this
in
[ 11
component
the
is
algorithm
( tn j
Effectively
arrow
Ei
of
we
probabilities
7r
.
probabilities
tn
~ L. . . . . , j
mixture
performing
observation
the
.
likelihood
mixture
which
,
JLi
=
as
the
presence
( EM
context
,
)
distribution
prior
point
7rip
( 1, ltn
tn
the
log
technique
maximization
7ri
this
p
regarded
for
mixture
posterior
point
direction
likelihood
( {
ponent
be
data
simple
as
data
. = =
a
interpreted
given
corresponding
Rni
The
)
representation
coefficients
the
theorem
network
(
is
variables
each
data
form
(13)
LATENTVARIABLEMODELS
377
and its optimization would be straightforward , with the result that each component is fitted independently to the correspondinggroup of data points, and the mixing coefficientsare given by the fractions of points in eachgroup. The { Zni} are regarded as 'missing data', and the data set { tn } is said to be 'incomplete'. Combining { tn } and { Zni} we obtain the corresponding 'complete' data set, with a log likelihood given by (13). Of course, the values of { Zni} are unknown, but their posterior distribution can be computed using Bayes' theorem, and the expectation of Zni under this distribution is just the set of responsibilities Rni given by (11). The EM algorithm is based on the maximization of the expected complete-data log likelihood given from (13) by
N M ( comp ({1Ti , .ui, ~i})) == L L RniIn{7riP (tli )} . n==li ==l
(14)
It alternates between the E-step, in which the Rni are evaluated using (11) , and the M -step in which (14) is maximized with respect to the model parameters to give a revised set of parameter values. At each cycle of the EM algorithm the true log likelihood is guaranteed to increase unless it is already at a local maximum [11]. The EM algorithm can also be applied to the problem of maximizing the likelihood for a single latent variable model of the kind discussed in Section 1.1. We note that the log likelihood for such a model takes the form N
N
(W,J.L,"lI) ==L Inp (tn)==L In{! p(tnlxn )p(xn)dxn }. n == l
(15)
n == l
Again , this is difficult to treat because of the integral inside the logarithm . In this case the values of xn are regarded as the missing data . Given the prior distribution p (x ) we can consider the corresponding posterior distri bution obtained through Bayes' theorem
p(xnltn) =
p(tnIxn)p(Xn) p(tn)
(16)
and the sufficient statistics for this distribution are evaluated in the E-step . The M -step involves maximization of the expected complete -data log like lihood and is generally much simpler than the direct maximization of the true log likelihood . For simple models such as the factor analysis model discussed in Section 1.1 this maximization can be performed analytically . The EM (expectation -maximization ) algorithm for maximizing the likelihood function for standard factor analysis was derived by Rubin and Thayer
[23].
378
CHRISTOPHER M. BISHOP
We can combine the technique of mixture modelling with that of latent variables , and consider a mixture of latent -variable models . The correspon ding Bayesian network is shown in Figure 4. Again , the EM algorithm
-
-
-
-
-
-
Figure .4. Bayesian network representation of a mixture of latent variable models. Given the values of i and x , the variables tl , . . . , td are conditionally independent.
provides
a
and
riable
of
to
the
be
the
values
treated
to
and
obtain
a
clude
data
data
,
on
linear
and
a
,
most
set
and
the
largest
axes
d
v
j
,
j
variance
vectors
v
associated
j
are
the
eigenvalues
.
be
subject
maximizes
the
- dimensional
I
,
.
.
the
a
-
concepts
fruitful
part
density
-
modelling
for
be
its
data
PCA
found
,
dimensio
in
many
-
practically
applications
visualization
is
,
q
}
,
the
,
are
terms
in
those
of
the
vectors
is
q
in
variance
projection
by
may
of
data
.
va
in
-
exploratory
.
of
{
,
latent
how
in
technique
,
derivation
E
see
used
for
Examples
processing
under
given
shall
can
- established
recognition
which
observed
retained
well
on
pattern
the
Analysis
a
image
parameters
of
.
analysis
,
model
and
.
algorithms
visualization
chapter
common
of
principal
the
a
compression
projection
For
powerful
is
the
i
we
Component
multivariate
analysis
The
of
data
analysis
reduction
text
data
chapter
distributions
range
component
of
label
missing
this
Principal
Principal
every
component
of
and
Probabilistic
nality
determination
mixture
classification
.
for
the
as
sections
variables
pattern
of
together
subsequent
latent
nership
q
framework
both
x
In
2
natural
allows
{
}
dominant
)
of
It
eigenvectors
the
,
n
sample
space
E
{
axes
.
can
( i . e
N S=~n L= Il(tn-j:L)(tn-j'L)T Aj
standardized
projected
tn
orthonormal
maximal
a
covariance
.
.
.
[ 14
N
onto
be
.
I
those
}
,
] .
the
which
shown
that
with
the
matrix
(17)
such that SVj = AjVj. Here ~ is the samplemean, given by (3). The q principal componentsof the observedvector tn are given by the vector
LATENTVARIABLE MODELS
379
Un = yT (tn - Ii ) , where yT = (VI , . . . , Vq) T , in which the variables Uj are decorellated such that the covariance matrix for U is diagonal with elements { Aj } . A complementary property of FaA , and that most closely related to the original discussions of Pearson [20] , is that , of all orthogonal linear projections Xn = yT (tn - fL) , the principal component projection minimizes the squared reconstruction error "- En IItn - in 112 , where the optimal linear reconstruction of tn is given by tn = V Xn + it . One serious disadvantage
of both these definitions
of PCA is the absence
of a probability density model and associated likelihood measure . Deriving PCA from the perspective of density estimation would offer a number of important
advantages , including
the following :
. The corresponding likelihood measure would permit comparison with other density - estimation techniques and would facilitate statistical tes ting . . Bayesian inference methods could be applied (e.g . for model comparison ) by combining the likelihood with a prior . . If PCA were used to model the class - conditional densities in a classifica tion problem , the posterior computed .
probabilities
of class membership
could be
. The value of the probability density function would give a measure of the novelty of a new data point . . The single PCA model could be extended to a mixture of such models . In this section we review the key result of Tipping and Bishop [25] , which shows that principal component analysis may indeed be obtained from a probability model . In particular we show that the maximum - likelihood estimator of W in ( 8) for a specific form of latent variable models is given by the matrix of (scaled and rotated ) principal axes of the data .
2.1. RELATIONSHIP TO LATENTVARIABLES Links between principal component analysis and latent variable models have already been noted by a number of authors . For instance Anderson [2] observed that principal components emerge when the data is assumed to comprise a systematic component , plus an independent error term for each variable having common variance 0-2. Empirically , the similarity between the columns of Wand the principal axes has often been observed in situa tions in which the elements of 'l1 are approximately equal [22]. Basilevsky [4] further notes that when the model WwT + 0-21 is exact , and therefore equal to S, the matrix W is identifiable and can be determined analytically through eigen-decomposition of S, without resort to iteration .
380
CHRISTOPHER M. BISHOP
As
well
consider of
the
that
' I1
= be
likelihood
analysis 0- 21 , we
terms
now
principal
THE
the
a probability
show
WML
that
is
of
this
density
component
analysis
PROBABILITY
the
the
that
noise
when
the
is
are
may refer
to
so matrix
maximum
the
scaled
matrix
PCA shall
isotropic
0- 21 , the
columns
that we
is
not case
covariance
+
covariance
, which
do
a particular
data
wwT
sample
derivation
observations
covariance
whose
the
( PPCA
, such considering
form
matrix
model
exact
. By
even
of
a probability
is
which
using
eigenvectors
isotropic
model context
in
exactly
consequence of
the
model
expressed
principal
important
that - likelihood
estimator
rotated
For
assuming
maximum
factor
cannot
2 .2 .
as
the
S
and
[ 25 ] . An
be
expressed
as
probabilistic
in
) .
MODEL
noise
model
distribution
u
over
f "V N
( O , a2I
t - space
for
) , equations a given
p(tlx ) = (27ra2 )- d/2exp{ - ~ llt In the case of an isotropic
x
Wx
( 6 ) and given
(8 )
by
- JL112 }.
Gaussian prior over the latent
imply
variables
(18) defined
by
p(x)=(27r )-q/2exp {_~xTx } we then obtain the marginal distribution
(19 )
of t in the form
p(t) = jP(tlx)p(X)dX
(20 )
= (27r )-d/2ICI -1/2exp {-~(t - J.I,)TC -1(t - J.I,)} (21 ) w here the model
covariance
is
C = a2I + WWT . U sing Bayes ' theorem , the posterior
distribution
(22) of the latent
variables
x given the observed t is given by p (xlt )
= (27r )- q/2\a2M\- 1/2 x
exp{ - ~(x - (x))T(a2M)- I(X- (x))}
(23)
where the posterior covariancematrix is given by
a2M = a2(u2I + WTW)- l
(24)
LATENTVARIABLE MODELS and the mean of the distribution
381
is given by
(x) = M- 1WT(t - JL).
(25)
Note that M has dimension . q x q while C has dimension d x d. The
log - likelihood
for the observed
data
under
this
model
is given
by
N .c
where
=
L In { p ( tn ) } n= l
=
- Nd 2
the sample
In principle mizing
the
covariance
, we could
of the form
for the model
parameters
Our key result span
derivative
show
S of the observed the EM
that , for
model
of Rubin
is an exact
by ( 17 ) . by maxi -
and
case of an isotropic
, there
MAXIMUM
- LIKELIHOOD
the log - likelihood
the principal
subspace
of ( 26 ) with
{ tn } is given for this
algorithm
the
(26)
-N Tr { C - I S } 2
Thayer
noise
analytical
cova -
solution
.
OF THE
is that
-
the parameters
we are considering
2 .3 . PROPERTIES
of W
matrix , using
, we now
-N lnICI 2
determine
log - likelihood
[23 ] . However riance
In ( 27r ) -
respect
SOLUTION
( 26 ) is maximized
when
of the data . To show this
the columns
we consider
the
to W :
a .c == N ( C - 1SC - 1W
aw which
may
be obtained
from
standard
[ 19 ] , pp 133 ) . In [25 ] it is shown zero stationary
points
W where
the q column
eigenvalues rotation
matrix
point
corresponding
U q comprises to the
eigenvectors
q largest
represent
( 28 ) , the
columns
principal
eigenvectors
eigenvalues
together
matrix
to the
(.28)
Aq , and
global
, it
is also
maximum
eigenvectors
of S , with shown
of the
) and
that
likelihood
maximum
- likelihood scalings
the parameter
that
q x q ortho the
likelihood
stationary
occurs
of S ( i .e . the eigenvectors
of the
of S , with
corresponding
R is an arbitrary
eigenvalues
with
( see non -
for :
saddle - points
of the
results
by ( 22 ) , the only
= U q (Aq - a2J ) 1/ 2R
. Furthermore
the principal
( 27 )
differentiation
C given
in U q are eigenvectors
in the diagonal
gonal
ponding
vectors
C - 1W )
matrix
that , with
of ( 27 ) occur
-
all
other
estimator
determined a2 , and with
when corres -
combinations
of
surface . Thus , from WML
contain
the
by the corresponding arbitrary
rotation
.
382
CHRISTOPHER M. BISHOP
I t may also be shown that for W estimator for a2 is given by
W ML, the maximum-likelihood
d 2-- ~ Lq+lAj aML d-qj=
(29)
which has a clear interpretation as the variance'lost' in the projection, averagedover the lost dimensions . Note that the columnsof W ML are not orthogonalsince (30) WT MLW ML - RT(A q - a2I)R , which in generalis not diagonal. However , the columnsof W will be orthogonalfor the particular choiceR f I . In summary, we can obtain a probabilistic principal componentsmodel by finding the q principal eigenvectorsand eigenvalues of the sample covariancematrix. The density model is then given by a Gaussiandistribution with meanJ.t given by the samplemean, and a covariancematrix WwT + a2I in which W is givenby (28) and a2 is givenby (29). 3. Mixtures of Probabilistic PCA We now extend the latent variable model of Section 2 by considering a mix ture of probabilistic principal component analysers [24], in which the model distribution is given by (10) with component densities given by (22) . It is straightforward to obtain an EM algorithm to determine the parameters 7ri , JLi, W i and at . The E-step of the EM algorithm involves the use of the current parameter estimates to evaluate the responsibilities of the mixture components i for the data points tn , given from Bayes' theorem by
Rni =p(ptnli ))1Ti (tn .
(31)
In .the M-step, the mixing coefficientsand componentmeansare re-estimated usIng -
-
7ri
-
ILi ==
1N Nn L=lRni E ~=lRnitn - --NLn =l Rni
(32) (33)
while the parametersWi and at areobtainedby first evaluatingthe weighted covariance matrices given by
Si = }:::;:=1Rni (tN- [L)(tn- [L)T Ln =l Rni
(34 )
LATENTVARIABLEMODELS
383
and then applying (28) and (29).
3.1. EXAMPLE APPLICATION : HAND -WRITTENDIGIT CLASSIFICATION One
ten
will
potential
application
digit
recognition
generally
lie
geometry
of
scaling
' similar
Hinton
et
applied
set
digit
.
=
the
best
for
used
the
the
split
the
is
to
of
the
a
digit
,
as
,
of
model
of
model
such
each
to
the
rotation
classification
the
-
given
such
a
to
handwrit
manifold
digit
build
according
problem
CEDAR
digit
which
into
11
10
training
to
they
the
Hinton
et
to
,
but
also
,
set
rather
ale
be
the
than
[ 12
a
]
same
reported
a
single
of
individually
,
of
)
. 64
error
the
,
gray
] .
of
set
the
the
'
'
-
values
the
. 91
%
localized
- chosen
bs
probabi
used
of
4
The
data
parameter
%
-
each
,
and
digits
.
We
in
would
clustering
- estimated
arbitrarily
'
and
was
4
partly
of
' br
classification
an
8
[ 15
the
using
choice
of
-
reconstructed
data
misclassified
result
use
best
,
-
- by
database
of
sets
same
method
validation
service
subset
the
problem
l ' econstruction
8
validation
the
same
soft
smoothed
model
with
The
postal
- digit
which
utilizing
.
.
using
and
and
experiment
)
. S
, OOO
digit
,
scaled
U
an
handwritten
models
of
approach
=
of
PCA
according
while
model
]
images
to
)
is
continuous
of
best
conventional
improvement
[ 12
models
pixel
approach
digits
using
on
component
One
classification
from
q
model
the
in
to
mixture
PPCA
scale
properties
the
of
the
and
,
.
discussed
classified
set
each
]
'
repeated
test
-
smooth
the
constructed
was
10
expect
the
[ 12
,
PCA
( M
stroke
unseen
further
We
listic
the
ale
taken
was
gray
by
necessarily
' mixture
images
( which
of
density
' .
clustering
were
- dimensional
determined
classify
a
models
high
- dimensional
of
and
most
test
lower
not
,
scale
Examples
is
( although
based
.
a
thickness
separately
and
on
which
and
digits
are
for
,
of
values
of
global
value
a
[
.
One of the advantages of the PPCA methodology is that the definition of the density model permits the posterior probabilities of class membership to be computed for each digit and utilized for subsequent classification . After optimizing the parameters M and q for each model to obtain the best performance on the validation set, the model misclassified 4.61% of the test set. An advantage of the use of posterior probabilities is that it is possible to reject (using an optimal criterion ) a proportion of the test samples about which the classifier is most 'unsure ' , and thus improve the classification performance on the remaining data . Using this approach to reject 5% of the test examples resulted in a misclassification rate of 2.50%.
384
CHRISTOPHER
4
.
Hierarchical
An
Mixtures
interesting
,
is
to
extension
problem
a
spaces
.
of
1
.
standard
principal
orthogonal
leading
1
-
,
and
tion
of
*
(
is
,
)
is
is
an
not
into
because
of
.
W
=
xn
is
derived
mation
of
in
to
a2
>
The
onto
set
as
a
with
developed
-
in
(
,
-
of
the
of
tn
When
then
onto
prior
over
We
x
note
since
,
.
0
,
-
may
the
is
,
infor
point
taking
this
that
still
shrinkage
given
by
1
}
M
variables
(
xn
)
,
(
convey
data
*
-
of
,
reconstruction
WML
-
manifold
Because
data
by
2
becomes
the
however
each
-
0 -
projec
model
projection
,
.
posterior
orthogonal
density
.
)
the
projection
the
It
an
the
by
that
tn
variable
latent
vector
the
necessary
optimally
Figure
.
that
.
and
,
35
)
infor
even
two
.
the
.
The
benefits
3
.
A
in
-
the
case
in
to
two
other
the
this
model
while
toy
(
has
hierarchical
-
small
spaced
third
has
data
Gaus
closely
the
set
.
three
flat
data
latent
a
of
are
close
using
are
interactive
dimensional
also
relatively
,
which
are
clusters
of
plot
space
mixture
is
these
each
-
data
a
point
latent
visualization
which
this
data
dimensional
of
space
Gaussian
structure
-
type
of
of
each
two
from
two
of
single
mapping
the
points
latent
Each
parallel
first
in
this
properties
,
4
)
generated
planes
Section
xn
that
in
points
space
demonstrate
Note
points
dimension
the
by
(
in
to
data
principal
in
5
dimensional
from
visualized
mean
visualization
450
one
their
to
0
required
be
map
the
three
)
>
WML
property
will
of
in
1WT
by
spanned
this
seen
In
visualized
(
model
be
-
latent
original
therefore
topographic
illustrate
separated
order
the
in
consisting
variance
{
posterior
close
We
M
shrinkage
the
WML
in
.
are
becomes
result
this
Thus
can
illustrated
sufficiently
sians
.
corresponding
,
2
the
the
data
satisfies
set
]
=
data
.
the
space
[ 25
=
reconstruct
O
=
T
tn
-
visualization
points
may
the
"
and
)
0 -
WML
of
plane
it
although
a
inter
framework
data
data
then
For
from
With
for
the
)
projection
reconstructed
account
25
1WT
as
orthogonal
lost
optimally
origin
.
further
powerful
structure
PCA
(
(
)
a
probabilistic
components
(
-
recovered
the
not
,
by
WM
undefined
towards
xn
mation
be
is
model
and
given
and
to
the
probabilistic
)
PPCA
a
PCA
our
23
of
.
principal
For
(
is
l
thus
.
From
tn
-
PCA
and
shrunk
(
)
so
singular
W
WTW
a
into
analysis
)
.
mixtures
considering
led
retains
PPCA
the
eigenvectors
slightly
]
single
onto
projection
-
a
component
modified
mean
M
10
and
By
are
PROBABILISTIC
of
projection
two
is
use
we
insight
[
the
,
which
USING
first
,
.
model
dimensionality
Consider
model
considerable
VISUALIZATION
BISHOP
visualization
visualization
provide
high
PPCA
data
mixture
for
can
.
Visualization
the
of
hierarchical
algorithm
which
Data
for
the
to
active
4
for
application
models
and
M
is
been
well
chosen
approach
variable
model
is
385
Figure 5. Illustration of the projection of a data vector tn onto the point on the principal subspace corresponding to the posterior mean.
trained on this data set, and the result of plotting the posterior the data points is shown in Figure 6.
means of
Figure 6. Plot of the posterior means of the data points from the toy data set, obtained from the probabilistic PCA model, indicating the presence of (at least) two distinct clusters.
4.2. MIXTURE MODELS FOR DATA VISUALIZATION Next we consider the application of a simple mixture of PPCA models to data visualization . Once a mixture of probabilistic PCA models has been fitted to the data set, the procedure for visualizing the data points involves plotting each data point tn on each of the two-dimensional latent spaces at the corresponding posterior mean position (Xni ) given by
(Xni) = (W; Wi + a[I)- lW; (tn - J.Li)
(36)
as illustrated in Figure 7. As a further refinement, the density of 'ink ' for each data point tn is weighted by the corresponding responsibility Rni of model i for that data point , so that the total density of 'ink ' is distributed by a partition of
386
CHRISTOPHER M. BISHOP tn
~ ~ ~
.
, ~
~
" " ~
~ ~ ~ ~
" " ~ ~
" " ~
~
" " ~ ~
Figure 7. Illustration of the projection probabilistic PCA mixture model .
of a data vector onto two principal
surfaces in a
unity across the plots . Thus , each data point is plotted on every compone ~t model projection , while if a particular model takes nearly all of the posterior probability for a particular data point , then that data point will effectively be visible only on the corresponding latent space plot . We shall regard the single PPCA plot introduced in Section 4.1 as the top
level
in a hierarchical
visualization
model , in which
the mixture
model
forms the second level . Extensions to further levels of the hierarchy will be developed
in Section 4 .3.
The model can be extended
to provide
an interactive
data exploration
tool as follows . On the basis of the single top -level plot the user decides on an appropriate number of models to fit at the second level , and selects points x (i) on the plot , corresponding , for example , to the centres of apparent clusters . The resulting points y (i) in data space, obtained from y (i ) = W x (i ) + Jl" are then used to initialize
the means .t i of the respective
sub-models . To initialize the matrices Wi we first assign the data points to their nearest mean vector JLi and then compute the corresponding sample covariance matrices . This is a hard clustering analogous to K -means and represents
an approximation
to the posterior
probabilities
Rni in which the
largest posterior probability is replaced by 1 and the remainder by O. For each of these clusters we then find the eigenvalues and eigenvectors of the sample covariance matrix and hence determine the probabilistic PCA density model . This initialization is then used as the starting point for the EM algorithm . Consider the application of this procedure to the toy data set intro duced in Section 4 .1. At the top level we observed two apparent clusters ,
and so we might select a mixture of two models for the second level , with centres
initialized
somewhere
near
the
centres
of the
two
clusters
seen at
the top level . The result of fitting this mixture by EM leads to the two- level visualization plot shown in Figure 8. The visualization process can be enhanced further by providing infor -
LATENTVARIABLEMODELS and
the
were
model
given
second become
would
collapse
to a simple
a set of indicator
level
generated
variables
each N
data
we only
posterior points ding
have
tn , obtained
to the posterior
Rni
distribution
reduces
to the
Maximization as shown mixture
that
likelihood
would
.
( 39 )
in the
i having
form
generated
of the hierarchy
the expectation
of the
the
. The
data
correspon
of ( 39 ) with
-
respect
7ri L 7rjliP ( tli , j ) jEQi
as constants
( 40 )
. In the particular
to complete
certainty
for each data
point
case in which about
which
the
model
, the log likelihood
( 40 )
( 39 ) .
of ( 40 ) can again
, discussed
, we
i at the
Mo
is responsible
in [10 ] . This
probability
level
= L L Rni In n= l i= l
form
log
model
of the Zni to give
are all 0 or 1 , corresponding level
the
, information model
by taking
the Rni are treated
in the second
each
the second
N
in which
for
is obtained
.
then
7ri L 7rjliP (tli , j ) jEQi
, probabilistic
Rni
from
log likelihood
tn
which
Mo
partial
responsibilities
model . If , however
zni specifying
point
.c = L L Zni In n= l i = l In fact
mixture
389
in Section model
be performed
has the same form 1 .2 , except
(i , j ) generated
using
as the EM that
data
in the
point
the EM
algorithm
algorithm
E - step , the
tn is given
,
for a simple posterior
by
Rni ,j = RniRnjli
( 41 )
in which R
This
result
automatically
. .= nJ 17 , satisfies
7rjliP (tnli , j ) ~ L. "j ' 7rj ' liP (t n 10 1" J0' ) .
( 42 )
the relation
L Rni ,j = Rni jEQi so that point
the responsibility n is shared
of offspring
models
hierarchical
approach
The Figure
result 10 .
of each model
by a partition at the
third this
at the second
level
of unity
between
level . It
is straightforward
to any desired
of applying
( 43 )
number
approach
for a given
the corresponding
data group
to extend
this
of levels .
to the
toy
data
set is shown
in
LATENTVARIABLE MODELS
391
consisting of 1000 points is obtained synthetically by simulating the phy sical processes in the pipe , including the presence of noise determined by photon statistics . Locally , the data is expected to have an intrinsic dimen sionality of 2 corresponding to the 2 degrees of freedom given by the fraction
of oil and the fraction of water (the fraction of gas being redundant). However , the presence of different configurations , as well as the geometrical interaction between phase boundaries and the beam paths , leads to numerous distinct clusters . It would appear that a hierarchical approach of the kind discussed here should be capable of discovering this structure . Results from fitting the oil flow data using a 3-level hierarchical model are shown in Figure 11.
Figtl.re 11. Results of fitting the oil data. The symbols denote different multi -phase flow configurations corresponding to homogeneous(e), annular (0) and laminar (+ ). Note, for example, how the apparently single cluster, number 2, in the top level plot is revealed to be two quite distinct clusters at the second level.
In the case of the toy data , the optimal choice of clusters and subclusters is relatively unambiguous and a single application of the algorithm is sufficient to reveal all of the interesting structure within the data . For more complex data sets, it is appropriate to adopt an exploratory perspective and investigate alternative hierarchies through the selection of differing numbers of clusters and their respective locations . The example shown in Figure 11 has clearly been highly successful. Note how the apparently single
392
CHRISTOPHER M. BISHOP
cluster
,
number
clusters
at
guration
a2
.
- linear
The
latent from
,
third
that
this
cluster
the
physics
is
data
and
the
revealed
points
can .
is
to
from
be
level
of
y
( x 2
seen
be
the to
on
of to
a
diagnostic
two
quite
distinct
' homogeneous
lie
Inspection
confined
the
then
models
An
alternative
; w
a
the
'
two
confi
-
- dimensional
corresponding
nearly
va
planar
data
is
for
sub
the
-
- space
,
homogeneous
non
.
derived
Thus
the
a
far
based
form S
in
data
is
a
the
,
shown hyper
Section
3 . 1 )
variable
consider
map
not
in
latent
a
which
space
considered
to
in
which
linear
be
on
( 6 )
manifold
of
would
are
the
data
mixture ,
Mapping
manifold on
digits a
models
latent
.
variable
.
a
non
- linear
integration ,
called
so of
living
- written
using
However ,
.
however
- linear
the
considered
Data
hand
,
with
model
x .
Topographic
variables
using
general
intractable
have data
in
approach
in
linear
linear
the
difficulty
that
we
hyperplane
is
Generative
to
approximated
which
The
)
a
example be
The
variables
is
( for
can
model
:
variable
Figure
planar
Models
latent
function
by
making
the
mapping
over
x
in
careful
Generative
function ( 7 )
will
model
y
choices
Topographic
( x
become a
Mapping
; w
)
in
( 6 )
analytically tractable or
,
GTM
,
non
can
be
[8 ] .
The a
plot
Also
.
ping
is
in
from
Non
in
- level
.
isolated
confirms
expected
top
level
been
configurations
5
the
structure of
as
in
second
have
triangular lue
2 ,
the
central
sum
of
concept
delta
is
to
functions
introduce
centred
a on
the
prior
distribution
nodes
of
a
p regular
( x
)
grid
given in
by latent
space
1
p
in
which
case
linear
the
an
isotropic
to
deal
y
data
in
; w
) .
point ,
Xi
which
Figure space
. then
=
is
can
From takes
the
)
( 44
analytically
.
( Note
and a
that
a
Gaussian
( 44
)
we
see
this
is
data
is
for
chosen
)
easily
the
function distribution
be
generalized
considering
y
non to
the
distributions point
density that
)
by
multinomial
corresponding
of
even ( tlx
categorical
Gaussian
and
Xi
performed
a2
to
-
distributionp
and
centre ( 7 )
l5 ( x l
be
variance
mapped
the
L i =
conditional
of then
K
K
continuous
forms 12
)
( 7 )
The
with
mixed product
latent space
( x
Gaussian with
corresponding
in
integral
functions
( x
( Xi ,
; w as
)
. )
Each
in
data
illustrated function
in
form
K 2 1~ p(tIW ,a)=KL ..,p ,W,a2) i= l (tIXi
(45)
LATENTVARIABLEMODELS
t1
y(x;w) .
.
393
.
' -_'~ - _..._..'_ ...'~ _.'----.-..... .
X2
.
.
.
.
.
.
t2
t3
Xl
Figure 12. In order to formulate a tractable non-linear latent variable model, we consider a prior distribution p(x ) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node Xi is mapped to a corresponding point Y(Xi ; w ) in data space, and forms the centre of a corresponding Gaussian distribution .
which corresponds to a constrained Gaussian mixture model [13] since the centresof the Gaussians, given by Y(Xi; w ), cannot move independently but are related through the function y (x ; w ). Note that , provided the mapping function Y(X; w ) is smooth and continuous, the projected points Y(Xi; w ) will necessarilyhave a topographicordering in the sensethat any two points xA and xB which are close in latent spacewill map to points Y(XA; w ) and Y(XB; w ) which are close in data space.
5.1. ANEMALGORITHM FORGTM Since
G
for
TM
is
a
form
maximizing
form
for
M
- step
by
a
of
the the
has
mixture
y
simple
form
generalized
linear
log
( x
; w .
)
In
we
is
the a
d
x
elements M
universal
cJ > ( x
the
tion
of
basis such
typically the
models
present
consist
)
of
however
are
,
exponentially context
) is
that
with this
is
choosing
significant
particular
( x
; w
)
which to
the
be
given
( 46
functions
>j
models - layer
number
basis q
problem
) ,
The
)
W same ,
pro
usuallimita functions
of
and
the
networks .
of
( X
possess
adaptive
dimensionality a
a in
y
appropriately the
the not
basis
multi
algorithm
)
fixed
chosen
EM
form
regression as
( X
the
an
algorithm
choose
q, ( x
M
linear
>j ,
w
seek
By
EM
shall of
=
capabilities functions
grow In
)
Generalized
approximation
vided
[5 ] .
.
an
we
; w
to .
obtain
model
( x
natural
likelihood
can
regression
of
matrix
is
particular
y
where
it
corresponding
mapping a
model
-
must
the
input
since
the
space dimen
-
394
CHRISTOPHER M. BISHOP
sionality is governed by the number of latent variables which will typically be small . In fact for data visualization applications we generally use q = 2. In the E-step of the EM algorithm we evaluate the posterior probabilities for each of the latent points i for every data point tn using
~n
-
p(xiltn, W, 0'2) P(tnIXi,W,0'2)
Then in the M-step we obtain a revisedvalue coupledlinear equationsof the form
for W
(47) (48) by solving
~TG~WT = ~TRT
a set of
(49)
where tl is a K x M matrix with elements ij = cPj(Xi ), T is a N x d matrix with elements tnk , R is a K x N matrix with elements ~ n, and G is a K x K diagonal matrix with elements Gii =
N L=l~n(W ,0-2). n
(50)
We can now solve (49) for W using singular value decomposition to allow for possible ill -conditioning . Also in the M -step we update 0-2 using the following fe-estimation formula
NK 2 1 """"" 2 2 a = NdnL,., LII Rin ( W , a ) IIW l/J (Xi ) tnll =li=l
(51)
Note that the matrix ~ is constant throughout the algorithm , and so need only be evaluated once at the start . When using G TM for data visualization we can again plot each data point at the point on latent space corresponding to the mean of the posterior distribution , given by
(xltn,W,0-2)
-
f xp(xltn, W , 0-2) dx
K = L RinXi . i=l It should
(52) (53)
be borne in mind ,. however ,. that as a consequence of the non linear mapping from latent space to data space the posterior distribution can be multi -modal in which case the posterior mean can potentially give a
LATENTVARIABLE MODELS
395
very misleading summary of the true distribution . An alternative approach is therefore to evaluate the mode of the distribution , given by
=arg max . '/.,max {i} Rin
(54)
In practice it is often convenient to plot both the mean and the mode for each data point , as significant differences between them can be indicative of a multi -modal distribution . One of the motivations for the development of the GTM algorithm was to provide a principled alternative to the widely used 'self-organizing map ' (SaM ) algorithm [17] in which a set of unlabelled data vectors tn (n = 1, . . . , N ) in a d-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally ) two -dimensional sheet. These reference vectors are analogous to the projections of the latent points into data space given by Y(Xi ; w ) . While the SaM algorithm has achieved many successesin practical appli cations , it also suffers from some significant deficiencies , many of which are highlighted in [18]. These include : the absence of a cost fuIJ.ction , the lack of any guarantee of topographic ordering , the absence of any general proofs of convergence, and the fact that the model does not define a probability density . These problems are all absent in GTM . The computational complexities of the GTM and SOM algorithms are similar , since the dominant cost in each case is the evaluation of the Euclidean distanced between each data point and each reference point in data space, and is the same for both algorithms . Clearly , we can easily formulate a density model consisting of a mixture of GTM models , and obtain the corresponding EM algorithm , in a prin cipled manner . The development of an analogous algorithm for the SaM would necessarily be somewhat ad-hoc.
5.2. GEOMETRYOF THE MANIFOLD An additional advantageof the GTM algorithm (compared with the SOM) is that the non-linear manifold in data spaceis defined explicitly in terms of the analytic function y (x ; w ). This allows a whole variety of geometrical properties of the manifold to be evaluated [9]. For example, local magnification factors can be expressedin terms of derivatives of the basis functions appearing in (46). Magnification factors specify the extent to which the area of a small patch of the latent spaceof a topographic mapping is magnified on projection to the data space, and are of considerable interest in both neuro-biological and data analysis contexts. Previous attempts to consider magnification factors for the SOM were been hindered becausethe manifold is only defined at discrete points (given by the referencevectors).
396
CHRISTOPHER M. BISHOP
We tion
can
determine
factors
der
a
each
, using
standard point
in
techniques
of
coordinates
of
P
~ 1, in
'
is
data
space
manifold
differential
in
x '" of
P
, as
Y
xi by
a
, the
mapping
including
which
each
; W
as in
[9 ] .
latent
continuous
space
function
point in
magnifica
follows
the
defines
illustrated
( X
,
geometry
mapped
manifold .
~ ." =
the
coordinates
space in
the .
values
of
Cartesian
latent
p ~ int
coordinate
properties
set P
responding
the
a P
'
Figure
is
of
.
-
Since a
cor
-
curvilinear
labelled
13
Consi .
to
set
-
with
the
Throughout
this
)
~ --, , '- ' ~ - - - - - - - - - --~. . .. 2
X
~
dA
Figure
Xl
13 .
Xi in manifold
This
latent
diagram
space
shows
onto
a
the
mapping
curvilinear
of
the
coordinate
1
Cartesian
system
~i
coordinate in
the
system
q - dimensional
S .
section
we
raised
indices
shall
use
components
covariant
- contravariant
We
the
denote
covariant
local
~
2
first
,
discuss
coordinates
coordinates
is
given
d
2 s
where
gij
is
the
notation
-
of
differential
components
with
an
implicit
indices
transformation
tesian
standard
contravariant
geometry
and
lowered
summation
over
in
indices pairs
of
which denote
repeated
.
the
metric
,
at
some
properties
( i
=
( i (~ ) .
point
P '
Then
in
the
of
the
S ,
to
manifold a
squared
set
S . of
length
Consider
a
rectangular element
Car in
-
these
by
~ dl " J1. dl " V u J1.l.I ~ ~
metric
tensor
-
~ 8 ( J-L 8 ( 11 d (: id u J1.V W W ~
, which
is
(: j ~
-
therefore
d (: id ~
gij
given
(: j ~
( 55
)
( 56
)
by
8 ( JL 8 ( II gij
We
now
seek
Consider Since to
S the
an
again is
expression the
embedded
squared
for
squared
9ij
length
within length
=
k i-
in
terms
element
the
element
c} JLIIWW
Euclidean of
the
.
of
the
ds2
non
lying
data
space
- linear
mapping
within , this
the
manifold
also
y
(x ) . S .
corresponds
form
~ .ayl
.
.
.
.
ds2 == <5kidy dy - <5ki8xi axJ.dxzdxJ = 9ijdxzdxJ
(57)
LATENTVARIABLE MODELS and
397
so we have
oyk oyl gij = <5klo XZ00xJ o.
(58)
Using (46) the metric tensor can be expressedin terms of the derivatives of the basis functions cPj(x ) in the form g = nTwTwn
(59)
where n has elements nji = 8>j / 8xi . It should be emphasizedthat , having obtained the metric tensor as a function of the latent space coordinates, many other geometrical properties are easily evaluated, such as the local curvatures of the manifold. Our goal is to find an expression for the area dAI of the region of S corresponding to an infinitesimal rectangle in latent spacewith area dA == IIi dx't ~ shown in Figure 13. The area element in the manifold S can be related to the corresponding area element in the latent space by the Jacobian of the transformation ~ - + ( dA' = II d( J-L= J II d~i = J II dxi = JdA J-L i i
(60)
where the Jacobian J is given by 8( J-L) = det ( ~ 8( J-L) . J = det ( ~
(61)
We now introduce the determinant 9 of the metric tensor which we can write in the form 9 = det (gij ) = det ( <5J-LII~ 8( IJ.& 8(J11 ) = det ( & 8(i IJ.) det ( & 8(J11 ) = J2
(62)
.
and so, using (60), we obtain an eXpreSSIC.n for the local mag-nification factor in the form
dA ' =J=det dA 1/2g.
(63)
Although the magnification factor represents the extent to which areas are magnified on projection to the data space, it gives no information about which directions in latent space correspond to the stretching . We can recover this information by considering the decomposition of the metric tensor g in terms of its eigenvectors and eigenvalues. This information can be conveniently displayed by selecting a regular grid in latent space (which could correspond to the reference vector grid , but could also be much finer ) and plotting at each grid point an ellipse with principal axes oriented according to the eigenvectors , with principal radii given by the square roots of
LATENTVARIABLEMODELS large
stretching
partial
in
The
is
the
region
of
between
males
corresponding
metric
of
the
separation
plot
given
in
stretching
of
Figure
15
,
local
and
Within
each
eigenvector
shows
"
cluster
there
decomposition
both
the
~
.
"
.
,
.
Plots
Figure
of
.
"
~
.
'
the
14
,
.
.
"
'
1
.
-
-
,
.
_
.
.
.
~
~
.
.
.
.
is
of
direction
a
and
the
magnitude
.
/
-
'
.
~
"
.
.
.
-
"
.
-
.
,
,
,
,
'
.
.
.
,
,
-
-
.
~
4 ~
-
,
.
.
18
.
.
~
.
.
stretching
the
,
.
1
~
.
.
8e
local
using
.
.
"
.
,
"
/
in
'
"
-
15
.
.
the
"
.
Figure
females
.
.
example
them
from
399
"
of
ellipse
.
the
latent
representation
space
discussed
,
corresponding
to
in
the
text
the
.
6. Temporal Models : GTM Through Time In all of the models we have considered so far, it has been assumedthat the data vectors are independent and identically distributed . One common situation in which this assumption is generally violated in where the data vectors are successivesamples in a time series, with neighbouring vectors having typically a high correlation. As our final example of the use of latent variables, we consider an extension of the GTM algorithm to deal with temporal data [6]. The key observation is that the hidden states of the GTM model are discrete, as a result of the choice of latent distribution p(x ), which allows the machinery of hidden Markov models to be combined with GTM to give a non-linear temporal latent variable model. The structure of the model is illustrated in Figure 16, in which the hidden states of the model at each time step n are labelled by the index in corresponding to the latent points {Xin} . We introduce a set of transition probabilities p(in+l \in ) corresponding to the probability of making a transition to state in+l given that the current state is in . The emission density for the hidden Markov model is then given by the GTM density model (45). It should be noted that both the transition probabilities p(in+l \in )
CHRISTOPHER M. BISHOP
400
p(i2Ii})
nil
p(i3Ii2 ) ..
.
p(t}li}) p(t2Ii2 ) p(t3Ii3 ) Figure 16 . The temporal version ofGTM consists ofahidden Markov model in which the hidden states are given by the latent points of the GTM model , and theemission probabilities aregoverned bytheGTM mixture distribution . Note thattheparameters oftheGTM model , MwellMthetransition probabilities b_etween states ,are tiedtocommon values across alltime steps .Forclarity wehave simplified the graph and not made the factorization property of the conditional distribution p(tli) explicit . and the parameters Wand a2 governing the G TM model are common to all time steps, so that the number of adaptive parameters in the model is independent of the length of the time series. We also introduce separate prior
probabilities
7ril on each of the latent
points at the first time step of
the algorithm . Again we can obtain an EM algorithm for maximizing the likelihood for the temporal
G TM
model . In the context
of hidden
Markov
models , the
EM algorithm is often called the Baum - Welch algorithm , and is reviewed
in [21]. The E-step involves the .evaluation of the posterior probabilities of the hidden states at each time step , and can be accomplished efficiently using a technique called the forward -backward algorithm since it involves two counter -directional propagations along the Markov chain . The M -step equations again take the form given in Section 5.1. As an illustration of the temporal GTM algorithm we consider a data set obtained from a series of helicopter test flights . The motivation behind this application is to determine the accumulated stress on the helicopter air frame . Different flight modes, and transitions between flight modes, cause different
levels
of stress , and at present
maintenance
intervals
are determi
-
ned using an assumed usage spectrum . The ultimate goal in this application would be to segment each flight into its distinct regimes, together with the transitions between those regimes, and hence evaluate the overall integrated stress with
greater accuracy .
The data used in this simulation was gathered from the flight recorder over four test flights , and consists of 9 variables (sampled every two seconds)
402
CHRISTOPHER M. BISHOP
are
the
the
observed
inference
and
the
of
the
integrating
of
over
all
in
time
,
The
of
the
classes
for
mult
.iple
may
grow
hidden
the
since
the
focus
of
the
research
communities
to
of
within
the
for
considered
at
motivate
anyone
the
,
which
this
-
in
-
states
.
models
graphical
to
hidden
variables
re
there
leads
of
such
-
general
in
models
with
develop
more
considered
hidden
deal
the
choices
active
configurations
number
to
)
we
be
many
of
approximations
extensive
computing
number
For
hidden
Gaussian
instance
be
for
.
.
For
can
,
or
variables
can
helps
.
However
with
controlled
also
states
.
exponentially
of
,
model
)
summing
variables
states
algorithm
continuous
and
hidden
given
EM
hidden
distribution
however
the
involves
the
( linear
hidden
variables
of
over
discrete
probabilistic
variables
algorithms
lopment
simple
the
hidden
step
integration
mixture
discrete
hidden
tractable
of
,
of
the
of
standard
viewpoint
new
presentations
are
case
one
to
graphical
ment
,
of
the
-
which
of
chapter
In
the
E
configurations
only rise
the
function
because
.
which
giving
likelihood
this
possible
structure
models
the
in
was
model
of
to
possible
considered
variables
distribution
( corresponding
evaluation
models
the
posterior
variables
The
is
modelling
deve
-
currently
and
neural
.
ACKNOWLEDGEMENTS The
author
the
would
work
like
reported
Svensen
,
in
Michael
to
thank
this
the
chapter
Tipping
:
and
following
for
Geoffrey
Chris
their
Hinton
Williams
,
contributions
lain
to
Strachan
,
Markus
.
References 1. Anderson York
, :
T
John
.
W
.
Wiley
( 1958
) .
An
Introduction
to
Multivariate
Statistical
Analysis
.
New
.
2. Anderson of
,
T
.
W
.
Mathematical
( 1963
)
.
Asymptotic
Statistics
34
theory
,
122
-
148
for
principal
component
analysis
.
Annals
.
3. Bartholomew
,
Charles
D
Griffin
.
&
J
.
Co
( 1987 .
) .
Ltd
Latent
Variable
Models
and
Factor
Analysis
.
London
:
.
4. Basilevsky
,
Wiley
A
.
( 1994
) .
M
.
( 1995
) .
M
. ,
Statistical
Factor
Analysis
and
Related
Methods
.
New
York
:
.
5. Bishop Press
,
C
.
,
C
.
Neural
Networks
for
Pattern
Recognition
.
Oxford
University
.
6. Bishop Proceedings
G
lEE
bridge
,
U
. K
,
C
. ,
.
E
.
Fifth pp
.
Hinton
,
and
I
International
111
-
116
.
G
.
D
.
Strachan
Conference
( 1997 on
) .
Artificial
GTM
through
Neural
time
Networks
,
.
In
Cam
-
dual
-
.
7 . Bishop energy
8.
in
.
and
G
Research
,
nerative tion
M
C
.
M
To : /
/
appear wwv
D
.
James
M
. ncrg
( 1993
and A
. ,
topographic .
.
densitometry
Physics
Bishop
http
.
gamma
.
327
,
580
Svensen
-
593
,
in
.
volume . ac
10 . uk
/
, .
Analysis
networks
of .
Nuclear
multiphase
Hows
using
Instruments
and
Methods
.
and
mapping
. aston
) .
neural
C
.
K
.
I .
Accepted
for
number
1 .
Williams
( 1997a
publication Available
) . in
as
NCRG
GTM
:
Neural
the
ge
Computa /
96
/
015
-
from
LATENTVARIABLE MODELS
403
9. Bishop , C. M., M. Svensen , andC. K. I. Williams(1997b ). Magmfication factorsfor
theGTMalgorithm . In Proceedings lEE FifthInternational Conference onArtificial NeuralNetworks , Cambridge , U.K., pp. 64-69. 10. Bishop , C. M. andM. E. Tipping(1996 ). A hierarchical latentvariablemodel for datavisualization . Technical ReportNCRG /96/028,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Accepted forpublication in IEEEPAMI. 11. Dempster , A. P., N. M. Laird,andD. B. Rubin(1977 ). Maximum likelihood fromincomplete dataviatheEMalgorithm . JournaloftheRoyalStatistical Society , B 39(1), 1-38. 12. Hinton , G. E., P. Dayan , andM. Revow (1997 ). Modeling themanifolds of images of handwritten digits. IEEETransactions onNeuralNetworks 8(1), 65-74. 13. Hinton,G. E., C. K. I. Williams , andM. D. Revow (1992 ). Adaptive elasticmodels for hand -printedcharacter recognition . In J. E. Moody , S. J. Hanson , andR. P. Lippmann (Eds.), Advances in Neural Information Processing Systems , Volume 4, pp. 512 - 519.Morgan Kauffmann . 14. Hotelling , H. (1933 ). Analysis of a complex of statistical variables intoprincipal components . Journalof Educational Psychology 24, 417 - 441. 15. Hull, J. J. (1994 ). A database for handwritten text recognition research . IEEE Transactions onPatternAnalysis andMachine Intelligence 16, 550 - 554. 16. Jordan , M. I. andR. A. Jacobs (1994 ). Hierarchical mixtures of expertsandthe EMalgorithm . NeuralComputation 6(2), 181 - 214 . 17. Kohonen , T. (1982 ). Self -organized formation oftopologically correct feature maps . Biological Cybernetics 43, 59-69. 18. Kohonen , T. (1995 ). Self-Organizing Maps . Berlin:Springer -Verlag . 19. Krzanowski , W. J. andF. H. C. Marriott(1994 ). Multivariate Analysis Part I: Distributions , Ordination andInference . London : Edward Arnold . 20. Pearson , K. (1901 ). Onlinesandplanes of closest fit to systems ofpointsin space . TheLondon , Edinburgh andDublinPhilosophical Magazine andJournalof Science , SixthSeries2, 559 - 572. 21. Rabiner , L. R. (1989 ). A tutorialonhidden Markov models andselected applications in speech recognition . Proceedings of theIEEE77(2), 257 - 285 . 22. Rao,C. R. (1955 ). Estimation andtestsof significance in factoranalysis . Psycho metrika20, 93- 111 . 23. Rubin,D. B. andD. T. Thayer(1982 ). EM algorithms for ML factoranalysis . Psychometrika 47(1), 69-76. 24. Tipping , M. E. andC. M. Bishop(1997a ). Mixturesof probabilistic principal component analysers . Technical ReportNCRG /97/003,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Submitted to NeuralComputation . 25. Tipping , M. E. andC. M. Bishop(1997b ). Probabilistic principal component analysis.Technical report,NeuralComputing Research Group , AstonUniversity , Birmingham , UK. Submitted to Journalof theRoyalStatistical Society , B.
STOCHASTIC ANAL
ALGORITHMS
YSIS
DATA
FOR
EXPLORATORY
DATA
:
CLUSTERING
JOACHIMM
AND
DATA
. BUHMANN
Institut Jur Informatik Rheinische
VISUALIZATION
Friedrich
III ,
- Wilhelms
- Universitiit
D - 53117 Bonn , Germany jb ~ informatik . uni - bonn . de http : ! ! www- dbv . informatik . uni - bonn . de
Abstract . Iterative , EM -type algorithms for data clustering and data vi sualization are derived on the basis of the maximum entropy principle . These algorithms allow the data analyst to detect structure in vectorial or relational data . Conceptually , the clustering and visualization procedures are formulated
as combinatorial
or continuous
optimization
problems which
are solved by stochastic optimization .
1.
INTRODUCTION
Exploratory Data Analysis addresses the question of how to discover and model structure hidden in a data set. Data clustering (JD88 ) and data visualization are important algorithmic tools in this quest for explanation of data relations . The structural
relationships
between
data points , e.g .,
pronounced similarity of groups of data vectors , have to be detected in an unsupervised fashion . This search for prototypes poses a delicate tradeoff : a sufficiently rich modelling approach should be able to capture the essential structure
in a data
set but
we should
restrain
ourselves
from
imposing
too much structure which is absent in the data . Analogously , visualization techniques should not create structure which is caused by the visualization methodology rather than by data properties . In the first part of this article we discuss the topic of grouping vectorial and relational data . Conceptually , there exist two different approaches to data clustering : 405
406
JOACHIM M. BUHMANN
-
Parameter
estimation
of mixture
models by parametric
statistics .
-
Vector quantization of a data set by combinatorial optimization .
Parametric statistics assumes that noisy data have been generated by an unknown number of qualitatively similar , stochastic processes. Each indi vidual process is characterized by a unimodal probability density . The density of the full data set is modelled by a parametrized mixture model, e.g., Gaussian mixtures
are used most frequently
(MB88 -) . This model -based an . -
a
. proach to data clustering requires estimating the mixture parameters , e.g., the mean and variance of four Gaussians for the data set depicted in Fig . la . Bayesian statistics provides a unique conceptual framework to compare
and validate
different
mixture
models
.
The second approach to data clustering which has been popularized as vector quantization in information and communication theory , aims at finding a partition of a data set according to an optimization principle . Clustering as a data partitioning problem arises in two different forms depending on the data format 1:
- Central clustering of vectorial data B = {Xi E n:td : 1 ~ i ~ N } ; - Pairwise clustering of proximity data V = { Vik E ill. : 1 :::; i , k :::; N } . The goal of data clustering is to determine a partitioning which either
minimizes
of a data set
the average distance of data points to their cluster
centers for central clustering or the average distance between data points of the same cluster for pairwise clustering . Note that the dissimilarities Vik do not necessarily
respect the requirements
of a distance
measure , e.g ., dissi -
milarities or confusion values of protein , genetic , psychometric or linguistic data frequently violate the triangle inequality and the self-dissimilarity Vii is not necessarily zero. For Sect. 3 we only assume symmetry Vik = Vki . The second main topic of this paper addresses the question
how rela -
tional data can be represented by points in a low-dimensional Euclidian space. A class of algorithms known as Multidimensional determine
coordinates
in a two - or three - dimensional
Scaling (Sect. 4)
Euclidian
space such
that pairwise distances IIXi - xkll match as close as possible the pairwise dissimilarities V . A combination of data visualization and data clustering is discussed in Sect. 5. This algorithm preserves the grouping structure of relational data during the visualization procedure . 2. Central
Clustering
The most widely used nonparametric technique to find data prototypes is central clustering or vector quantization . Given a set of d-dimensional data JIn principle , it is possible to consider triple or even more complicated data relations in an analogous fashion but the discussion in this paper is restricted to vectorial and relational
data
.
DATA
vectors
B
to
=
{ Xi
determine
: rn . d
:
CLUSTERING
E
an
1
::::;
v
specified
::::;
by
: n: td
AND
:
1
~
optimal
K
}
i
set
of
according
Boolean
~
N
=
}
,
an
central
clustering
)
E
.
and
{ O , l
}
the
vectors
criterion M
i = l ,...,N
poses
reference
optimality variables
( Miv
407
VISUALIZATION
d - dimensional
to
assignment
M
DATA
a
A
problem Y
data
configuration
NXK
=
{ y
v
E
partition
is
space
M
,
(1)
,
lI = l , . . . , K
(2) M
Miv
=:
1 ( 0 )
vector
y
rations
v .
Eq
.
The
quality
of
tion
which
favors
.
1i
( a
)
favor
~
set
of
,
( a
' )
is
+
specific
1i
we
point
constraint
( a
" ) .
Xi
is
as
the
CENTRAL
1 ,
)
Vi
}
.
assigned
set
of
unique
to
configu
assignment
=
1 ,
reference
admissible
Vi
of
-
data
to
.
CLUSTERING
vectors
is
assessed
with
a
this
number
' , a
"
design
of
.
objective
costs
always a
,
i .e . ,
favorable bias
function
-
H
be
spurious
cost
-
compact
clustering
The
func
cluster
should
principle
clusters
an
intra
of
clusters
Without
by
high
superadditivity
two
"
=
( not
1 Miv
solutions
" optimal
Miv
a
): : : ; ~ =
require
into
E
defined
reference
a
:
requires
assignment
cluster
1i
a
data
FOR
Furthermore
a
the space
the
FUNCTION
a
{ a , 1 } NXK
Admissibility by
COST
E
that
( 2 ) .
2 . 1 .
splitting
M
solution
expressed
ness
{
denotes The
in
clusters
=
could
for
central
clustering N
HCC
( M
)
=
L
L
i =
fulfills
V
these
( Xi
Yv
, Yv
)
. An
between
, Yv
The
)
: =
the
most
Ilxi
-
standard
( Mac67
A
sion
probability
code
vector
cation
setup
the
to
to
the
the
Y
.
cluster
index
'Y.
The
, Yv
( 3 )
average
)
distortion
the
on
squared
known
its as
cluster
( 3 )
has
a
and
following
been
' Y
make
it
.
and
reesti
-
convergence
.
discussed a
displays
vector algorithm
centers
with
diagram
distance
- means
until
channels
vector application
reference
K
assignments
Noisy
the
Euclidian
and
closest
data
error
reference
depends
vector
function
- coding
) .
corresponding
( Xi
data
objective
- channel
close
the
( 3 ) ,
to
, Yv
the
being
the
data
between Ya
and
minimizing
assigns
of
source
up
choice
for
( Xi
l
V
between
according
generalization of
Xi
common
centers
MivV
v =
measure
Yvl12
iteratively
these
context
vector
algorithm
) ,
mates
data
l
summing
distortion
with
( Xi
by
a
appropriate
domain
V
requirements
K
in
significant
preferable
to the
the
confu
place
communi
:
~~~ ! ; J .!.
{Xi} -+ IEncoder Xi -+ YaI ~
~ [:~:2!!:i~;!!!!~~!1 4 ~ _ecod
-
-t {Y'Y}
-
408 In
JOACHIMM. BUHMANN
addition
nel
to
the
distortion
quantization
due
function
for
to
the
error
index
total
V
( Xi
corruption
,
Ya
into
distortions
)
we
have
account
between
to
.
sender
An
and
take
the
chan
appropriate
-
cost
receiver
K
V
( Xi
,
Ya
)
=
L v
favors
the
topological
measure
v
Toll
due
to
noise
of
popular
as
Lut89
. 2
;
.
BSW97
this
)
paper
cost
advocate
1 {
cc
costs
uniqueness
( M
,
we
)
have
principle
)
,
,
states
( 3
.
to
,
e
a
. ,
"
the
)
closeness
a
- dimensional
-
with
,
dimensional
neural
( M
index
topological
grid
are
computation
"
very
( Koh84
;
)
)
stochastic
optimization
of
requires
data
many
clusters
this
by
suggested
for
are
The
distri
the
introducing
central
-
of
yield
an
principle
.
The
clustering
distributed
assign
probability
might
entropy
to
optimization
the
ambiguity
assignments
.
Stochastic
assignments
maximum
principle
to
.
determining
Since
the
originally
that
to
index
low
two
variables
break
. g
1lcC
random
M
constraint
in
assignments
be
assignments
expected
( RGF90
we
to
function
over
a
maps
and
considered
according
confusing
a
or
OF
,
( 4
.
centroids
are
entropy
with
chain
OPTIMIZATION
optimized
the
Distortions
a
Yvl12
vectors
of
topological
;
Throughout
bution
code
probability
defining
selforganizing
RMS92
ments
.
clusters
STOCHASTIC
find
the
-
l
of
specifies
transmission
arrangement
2
organization
which
Tavllxi =
unbiased
maximum
by
according
-
same
Rose
to
et
the
ale
Gibbs
distribution
pGibbS
( 1 {
CC
F
( M
)
( 1 - lCC
)
=
exp
)
=
-
(
T
-
( 1lcC
log
( M
L
)
-
:
exp
(
-
F
( 1icC
)
1icC
)
( M
jT
)
)
/
T
,
)
( 5
)
( 6
)
MEM
The
"
the
in
computational
expected
Eq
.
costs
( 6
function
exp
term
temperature
)
can
1lcc
(
-
1lcC
exp
.
( M
(
-
:
.
be
F
/
pointed
as
factor
T
( 1lcC
)
.
)
exp
( :
F
a
( 1lcC
Constraining
/
T
)
can
T
serves
out
interpreted
The
)
As
"
rewritten
( HB97
smoothed
)
the
be
in
as
/
T
a
)
,
Lagrange
the
free
version
)
the
by
for
energy
of
normalizes
assignments
parameter
the
F
orie
'-'
exponential
E
~
l
( 1 - lcC
: inal
)
cost
terms
Mill
=
1
the
as
(7)
(8)
DATACLUSTERING AND DATAVISUALIZATION
409
for predefinedreferencevectorsT = {y 1I} ' This Gibbsdistribution can also be interpreted as the completedata likelihood for mixture modelswith parametersT . Basically, the distribution (8) describesa mixture model with equalpriors for eachcomponentand equal, isotropiccovariances . The optimal referencevectors{y~} are derivedby maximizingthe entropy of the Gibbs distribution, keepingthe averagecosts(1lCC ) fixed, i.e., Y* = argmax - L pGibbs (1lcC(M )) logpGibbs (1lcC(M )) T MEM = argmax 2::: 1{cC(M )pGibbs (1{cC(M ))/ T Y MEM N K + 2::: log 2::: exp(- V (Xi, YJL )/ T) . i=l JL =l
(9)
pGibbs (M ) is the Gibbsdistribution of the assignments for a set 1 of fixed referencevectors. To determineclosedequationsfor the optimal reference vectorsy~ we differentiatethe argumentin Eq. (9) with the expectedcosts being kept constant. The resultingequation N a 0 = L (MiV)aV i=l Yv (Xi, Yv) with exp(- V (Xi, Yv)/ T) (Miv) = "L".,.. ,,,,JLexp(- "n( )/ T) Vv E { I , . . . ,K } v X1 ". YJL
(10) (11)
is known as the centroid equationin signalprocessingwhich are optimal in the senseof rate distortion theory (CT91). The angular bracketsdenote Gibbsexpectationvalues, i.e., (f (M )) := L:::MEMj (M )pGibbs (M ). The equations(10,11) are efficientlysolvedin an iterative fashionusingthe expectation maximization(EM) algorithm (DLR77): The EM algorithm alternatesan estimationstep to determinethe expectedassignments(M iv) with a maximizationstepto estimatemaximum likelihood valuesfor the clustercentersYv. Dempsteret ale(DLR77) have proventhat the likelihood increasesmonotonicallyunder this alternation scheme . The algorithm convergestowardsa local maximumof the likelihood function. The log-likelihoodis up to a factor (- T) equivalentto the freeenergyfor centralclustering. The function f (.) denotesan appropriate annealingschedule , e.g., f (T) == T / 2. The sizeK of the cluster set, i.e., the complexityof the clusteringsolution, has to be determinedby a problem-dependentcomplexitymeasure
410
JOACHIMM . BUHMANN EM
Algorithm
I for Centroid
Estimation
INITIALIZE y~o) E Rd randomlyand (Miv)(O) E (0, 1) arbitrarily; temperature
T +- To ;
WHILE T > T FIN AL t +- O., REPEAT
E-step: estimate(M iv) (t+1) as a function of y ~t) ; M-step: calculatey~t+l ) for given (Miv)(t+l ); t +-
t +
1;
UNTIL all { {Miv)(t), y~t)} satisfyEqs. (10,11) T +- f (T );
(BK93 ) which monotonically grows with the number of clusters . Simulta neous minimization of the distortion costs and the complexity costs yields an optimal number K * of clusters . Constant complexity costs per cluster
or logarithmiccomplexitycosts- log(E ~ l Miv/N) (Shannon information ) are utilized in various applications like signal processing , image compression or speech recognition . A clustering result with logarithmic complexity costs is shown in Fig . lc . It is important to emphasize that the vector quantiza tion approach to clustering optimizes a data partitioning and , therefore , is not suited to detect univariate components of a data distribution . The split -
ting of the four Gaussiansin Fig. la into 45 clusters (Fig. lc ) results from low complexity costs and is not a defect of the method . Note also, that the distortion costs as well as the complexity costs determine the position of the
prototypes { Yv} . For example, the density of clusters in Fig. lc is asymptotically independent of the density of data (in the area of non-vanishing data density ) , which is a consequence of the logarithmic complexity costs. 3 . Pairwise 3 .1 .
COST
Clustering
FUNCTION
FOR
PAIRWISE
CLUSTERING
The second important class of data are relational data which are encoded by a proximity or dissimilarity matrix . Clustering these non-metric data which are characterized by relations and not by explicit Euclidian coordinates is usually formulated as an optimization problem with quadratic assignment costs. A suggestion what measure we should use to evaluate a clustering solution for relational data is provided by the identity
N !2~ =lEk Mkvllxi -xkl12 (12) -Yvl12 .l,M 7 ,.vEf N iL=lMivllxi iL= =lMkv -
UALIZA TI0N DATACLUSTERING ANDDATAVIS
Figure
1 .
Clustering
estimated the
Gaussian .
Data connected
This . rIng
and ( c )
by
a
data
model stars
is (* )
shows
a
self
- organizing
are
data
( a )
depicted
generated in
cluster
centers
partitioning
using
chain
is
by
( b ) , the
shown
four
plus
. The
Gaussian
signs
( +
circles
a
logarithmic
in
( d ) ,
sources
)
are
denote
the the
The of
covariance
complexity neighboring
:
centers
measure clusters
.
being
.
with
tances
Figure
clustering
multivariate
mixture
sources
estimates
Ilxi
of
Gaussian
411
Yv -
= Xk is
identity
E 112
~ =
l
MkvXk
pairwise
identical
to strongly
/
E
clustering central
~ =
l
Mkv
.
with
For
squared
Euclidian
normalized
clustering
with
distances
average cluster
intra means
Vik - cluster
as
prototypes
INK 1lPC ({Miv })=2 ~~ i,L k=lvL=l~El =lMlv motivates
the
objective
function
for
pairwise
= dis
.
cluste
-
( 13
)
with Miv E { a, I } and E ~=l Miv = 1. Whenever two data items i and k are assignedto the same cluster v the costs are increasedby the amount Vik / E ~ l Mlv . The objective function has the important invariance property that a constant shift of all dissimilarities Vik - t Vik + Va does not effect the assignmentsof data to clusters (for a rigorous axiomatic frame-
412
JOACHIM M. BUHMANN Algorithm
INITIALIZE
Clustering
iv (O) and (Miv ) (O) E (0, 1) randomly ; temperature
WHILE
II for Pairwise T f-- To ;
T > T FIN AL t f -- O., REPEAT
E-like step: estimate (M iv) (t+ 1) as a function of tiv (t) ; M -like step : calculate t ~
iv (t+ l ) for given (Miv ) (t+ l )
t + 1;
UNTIL all { (M iv) (t) , [;iv (t)} satisfy (14) T f-- f (T );
work, see (Hof97)) which dispensesthe data analyst from estimating absolute dissimilarity values instead of (relative) dissimilarity differences.
3.2. MEANFIELDAPPROXIMATION OF PAIRWISECLUSTERING Minimization of the quadraticcost function (13) turns out to be algorithmically complicateddue to pairwise, potentially conflicting interactions betweenassignments . The deterministicannealingtechnique, which producesrobust reestimationequationsfor centralclusteringin the maximum entropy framework, is not directly applicableto pairwiseclusteringsince there is no analyticaltechniqueknownto capturecorrelationsbetweenassignmentsMill and Mkll in an exact fashion. The expectedassignments (Mill)' however , can be approximatedby calculatingthe averageinfluence ill exertedby all Mkll, k # i on the assignmentMill (pp. 29, (HKP91)), thereby neglectingpair-correlations((MillMkll) = (Mill) (Mkll))' A maximum entropyestimateof ill yieldsthe transcendentalequations (Mill) =
iLl({ (Mill)} IV)
-
Kexp(- ill/ T) }:::JL =l exp(- iJL / T) ,
(14)
-Ef Mjv )V) jv jk kt=l}:;N j= jv 1Vik 2}:=;~ j= #li(M)+kv # il(M (15)
The partial assignmentcosts [,iv of object i to cluster v measurethe average dissimilarity in cluster v weighted by its cluster size. The secondsummand in the bracket in (15) can be interpreted as a reactive term which takes the increased size of cluster v after assignment of object i into account. Equation (14) suggestsan algorithm for learning the optimized cluster assignments which resemblesthe EM algorithm : In the E- step, the assignments { Mill } are estimated for given { 'iv} . In the M-step the { 'iv} are
DATA
reestimated
CLUSTERING
on
gorithm
the
converges
data
basis
to
clustering
AND
of
a
new
(
estimates
solution
HB97
)
413
VISUALIZATION
assignment
consistent
problem
DATA
of
.
This
assignments
iterative
for
al
the
-
pairwise
.
3.3. TOPOLOGICALPAIRWISECLUSTERING Central
clustering
gical break
the
titioning in
with
clustering
, i .e . , to
pairwise
to same
break
clustering
clustering
for
criterion
yields
( 3 ) has
distortions
symmetry
between in
index
can
be
generalization
the
permutation
cost
function
generalized
which
and
space
favor
for
clusters
coincides
distances
a topolo
with
-
distortions a data
, e .g . , a linear
introduced of
to
generalized
clusters
symmetry
Euclidian
cost
been
( 4 ) . These
neighborhood
quadratic the
function
with
permutation according
Fig . Id . The
ring
cost
problem
pairwise . We
par -
chain
as
cluste
introduce
topological
as dissimilarities
. This
a
central design
function N
1ltPC ( { Miv
} )
==
~
K
L L MivMkv i ,k = lv = l
--- : ~ El = l Mlv
( 16 )
K Miv
=
L T J.tvMiJ .t8 J1 .= 1
( 17 )
-The
effective
ments
assignments
which
encode
maximum
entropy
clustering
we
and
transform
apply the
Mill
the
estimates the
can
of
the
inverse
solution
be
interpreted
neighborhood
of
assignments
transformation
of the
as
structure
resulting
for to
costs
transformed clusters
. To
topological
the
cost
( 13 ) which
function
assign find
-
the
pairwise ( 17 )
yields
-
(Mio) =
; xp(- ia~T) ,
LJ1.=l exp(- iJ1 ./ T )
(18)
K
io = L Tov iv({(Miv)}IV),
(19)
v = l
iv in -r .h .s. of replaced byMiv.
where
4 . Data
Visualization
(19) are the assignment costs (15) with Miv being
by Multidimensional
Scaling
Grouping data into clusters is an important concept in discovering struc ture . Apart from partitioning a data set, the data analyst often explores data by visual inspection to discover correlations and deviations from ran domness. The task of embedding given dissimilarity data V in ad -dimensional
415
DATACLUSTERING ANDDATAVISUALIZATION intermediate (l ) Wik -
normalization 1
N
which
(N
corresponds
error
( DH73 We
the
to
the
minimization
same
clustering
expected the
the
1 . w ~m ) = 'L.... """ "'~,k = l V tk ? ' ."k of
relative
1 D 'l.k . ,L.... """ lN,m = l V2 lm
, absolute
or
( 21 )
intermediate
) .
pursue
pairwise
of
. w ~g ) = I ) V ."k ? " "k
-
optimization
case
strategy
, i . e . , we
coordinates
derive
( Xi ) using
to
the
the
minimize
maximum
( 20 )
entropy
approximation
that
as
in
the
of
distribution
exp (exp -fi((-xfi)(/X T))i/T)i iII =l-00 J00 dXi
embedding
coordinates
pO(XIi)
-
X
factors
according
the
estimate
to
(22)
f i (Xi) = a?IIXi114 + IIXi112xThi + Tr -[xixTHi] + xThi. (23) Utilizing the symmetry Vik == Vki and neglecting constant terms an expansion of 1lmdsyields the expected costs (expectations w .r .t . 22)
N (1{mdS )
:::1Wik[ 2(llxi114 ) - 8(IIXi112Xi )T(Xk) + 2(IIXi112 )(11Xk 112 ) ?',2 ,k= +4Tr [(XiX[ )(XkX [ )] - 4Vik((llxiI12 ) - (Xi)T(Xk)) ] , (24)
Tr
[
A
]
denoting
The
the
statistics
2
} =: : :
k
=
1
wik
,
=
f
statistics
of
(
a
with
)
of
clearly
the
ported
} =: : :
8
=
226
(
I
k
=
XkX
,
hi
,
(
b
>
1
)
,
,
1
wik
[
)
(
+
)
(
.
.
.
,
< I >
N
)
used
)
information
,
A
detailed
,
1
hi
=
lxk
12
in
in
derivation
~
8
)
-
i
the
box
an
iterative
~
,
is
N
(
are
allows
c
)
intra
in
(
defined
KB97
as
the
(
or
the
data
.
with
d
)
a
?
.
-
)
)
(
to
.
12xk
)
)
-
compute
2
dark
gives
an
grey
by
weighting
levels
minimizing
.
with
IXk
dissimilarity
The
different
embed
-
accuracy
dissimilarities
.
Embedding
to
visualize
structure
rather
(
determinis
Fig
derived
data
visualized
-
the
family
(
)
III
globin
the
Xk
a
fashion
cluster
set
(
Algorithm
local
-
Vik
propose
experimentalist
that
in
wik
We
Starting
and
the
.
are
Clustering
guarantee
1
)
for
the
of
and
=
embeddings
structure
-
k
)
.
from
inter
} =: : :
Vik
practice
)
cluster
scaling
no
(
in
Pairwise
,
Xk
4I
intermediate
of
the
.
hi
sequences
the
Simultaneous
by
,
dissimilarities
reveal
however
Hi
sketched
protein
representation
,
?
be
small
Multidimensional
exists
a
A "
N
(
global
dings
.
8
(
might
to
)
-
Wik
MDS
correspond
5
1
< I >
how
matrix
in
=
=
algorithm
idea
20
(
matrix
N
hi
} =: : :
annealing
the
(
=
"
Hi
tic
of
< I > i
N
and
trace
than
data
is
being
indeed
generated
.
There
sup
-
by
416
JOACHIM M. BUHMANN Algorithm
INITIALIZE WHILE
III : MDS
by Deterministic
Annealing
the parameters <1> of pO(XI ) randomly.
T > TFINAL REPEAT
E -like step :
Calculate(Xi)(t+l ), (xixT )(t+l ), (1IxiI12Xi )(t+l ) w.r .t . p (t)(XI (t ) M -like step :
compute it+l), 1 ~ k ~ N, k # i t ~
UNTIL
t +
1;
convergence
T f - f (T );
the visualization process. We, therefore, have proposed a combination of pairwise clustering and visualization with an emphasis on preservation of the grouping statistics (HB97). The coordinates of data points in the embedding spaceare estimated in such a way that the statistics of the resulting cluster structure matches the statistics of the original pairwise clustering solution. The relation of this new principle for structure preserving data embedding to standard multidimensional scaling is summarized in the following diagram: { Dik } -- + .t 1lmds { llxi - Xk112 } -- +
1lpc(MI { Vik } )
-- +
1lcC(MI { Xi} )
- -t
pGibbS (1lpc(MI { Vik } )) .t I (1lCC II1lPc) pGibbS (1{ cC(MI { Xi} )).
Multidimensional scaling provides the left/ bottom path for structure detection, i .e., how to discover cluster structure in dissimilarity data. The dissimilarity data are first embedded in a Euclidian space and clusters are derived by a subsequent grouping procedure as K -means clustering. In contrast to this strategy for visualization, we advocate the top fright path , i .e., the pairwise clustering statistic is measured and afterwards, the points are positioned in the embedding space to match this statistic by minimizing the Kullback-Leibler divergence I (.) between the two Gibbs distributions pGibbs (1lcC(MI {Xi} )) and pGibbs (1lPC (MI { Vik } ))' This approach is motivated by the identity (12) which yields an exact solution (.T(pGibbs (1lCC ) IIPGibbS (1lPC )) = 0) for pairwise clustering instanceswith ' Dik= IIXi - Xk112 . Supposewe have found a stationary solution of the mean- field equations (14). For the clustering problem it sufficesto considerthe mean assignments (Mill ) with the parameters eill being auxiliary variables. The identity (12)
417
DATACLUSTERING AND DATA
VISUALIZATION
allows
us
to
centroid
interpret
these
under
scaling
problem
tentials
ill
definition
the
=
for
~
~
1
be
MivXk
squared
E
the
the
~
1
distance
data
are
of
/
embedding
the
Euclidian
Xi
to
E
the
1
KiXi
of
coordinates
restricted
Yll
equations
as
assumption
the
are
variables
In
the
unknown
form
Miv
.
-
then
Y
the
are
v
112
.
with
If
the
the
following
fulfilled
cluster
multidimensional
po
reestimation
K
}
:
11 =
( Miv
)
(
11Y1I112
-
t : iv
(
{
( Miv
)
}
IV
)
-
centroid
:
K
' 2
the
quantities
II Xi
,
coordinates
to
)
(
Yv
-
}
1
(25) :
( MiJ
J- L =
- L ) YJ
-L )
'
1
K
Ki
=
( yyT
)
i
-
( Y
) i
( Y
) ;
with
( Y
) i
=
} v
The
t : iv
dissimilarity
which
values
are
Appendix
C
iteratively
determine
defined
of
in
( HB97
solving
)
the
.
(
The
the
15
)
.
Algorithm
the
{
)
according
IV : Structure
) Yv
Xi
of
coordinates
( 25
( Miv
Xi
(26)
.
1
coordinates
Details
equations
: =
through
the
derivation
}
can
and
to
{
the
Preserving
YII
}
are
potentials
be
found
in
calculated
Algorithm
by
IV
.
MDS
INITIALIZEx~o) randomly and(Miv)(O) E (0,1) arbitrarily ; WHILE
temperature T > TFIN AL
T +- To ;
t t - O', REPEAT
E-likestep:estimate (Miv)(t+l) asa function of {x~t),y~t)} M - like step : REPEAT
calculate x~t+l) given(Miv)(t+l) andy~t) updatey~t+l ) to fulfill the centroidcondition tf UNTIL
UNTIL - t + 1
convergence
convergence
T f - f (T );
The derived system of transcendental equations given by (11) with quadratic distortions , by (25) and by the centroid condition explicitly reflects the dependencies between the clustering procedure and the Euclidian representation . Simultaneous solution of these equations leads to an efficient algorithm which interleaves the multidimensional scaling process and the clustering process, and which avoids an artificial separation into two uncor related data processing steps. The advantages of this algorithm are convin cingly demonstrated in the case of dimension reduction . 20-dimensional
418
JOACHIMM. BUHMANN 1
a)
I
M
L
LL
LL
L
:
E
iM
ME1u -.
E
EE ~
0
p
H
GKK
NEN ~tf .r ~ Q ~
C
N
2
f$G --q~
:
DC C~
"fr ~H
K
~
R
~H
K
I
~
_0 5
a ~i 0S :A :
p
0
.
0
s
I
S 0
i !
p
G~
-2
FF
0 0 O~ ~AA
A
"
I: " Y
At !
...
F
0 : !
~
C
F F ~r
~
:
4, 5 S
B ~ ~~ L. ~
0
P all!! s ~Ss R IftQ Q
C
5
M.
KK Q K K
D\ pQ !.. ~~ T .T ~! T ~RI ~r.iI 13 ; roLI
!
J
J JG_ ,JG
~
.
b)
J
G
,' PJI' I
' LL EEEiM E M N !
O. 5
4
G
,.
I
: ~
~
i A : A - 1 - 1
. 0
- 0 . 5
0 . 5
1
- 4
- 4
- 2
0
2
Figure 3. Embedding of 20-dimensional data into two dimensions: (a) projection of the data onto the first two principal components; (b) cluster preserving embedding with Algorithm IV . Only 10% of the data are shown.
data generated by 20 Gaussians located on the unit sphere are projected onto a two- dimensional plane in Fig . 3 either by Principal Component Analysis (a) or by Structure Preserving MDS (b) . Clearly the cluster structure is only well preserved in case (b ) .
6. Discussion Exploratory sets
data
. Standard
methods
setting
algorithms
. The
value
. Extensions ) or
conceptually
discovery this
rich
be which
straightforward
with and
for
by
shown
to
to
data
been
clustering
clustering
as yields
selecting
an well
- all
in
a
well
as
robust
appropriate on
real
clustering
published
data
algorithms
perform
- take
in
data
deriving
hierarchical
K - winner
have
structure
methodology
precision
algorithms
approaches
for
both
been
hidden
encompasses
annealing in
have
clustering
goal
discussed
tuned
of
framework
deterministic
can
clustering
the
achieve
been
and of
at
. A
has
which
temperature
HB96b
to
visualization
visualization
data
aims
methods
and
probabilistic data
analysis
- world ( HB95
rules
( HB96a
elsewhere
.
;
) are
References
J . Buhmann
and
H . Kuhnel
tions on Information C . M . Bishop - , M . Svensen self - organizing T . F . Cox Statistics
and
map
. Vector
. Neural
M .A .A . Cox . and
Applied
quantization
with
complexity
Theory , 39 ( 4 ) : 1133 - 1145 , July ,. and C . K . I . Williams . GTM Computation
. IEEE
Transac
alternative
to
-
the
, ( in press ) , 1997 .
Multidimensional
Probability
costs
1993 . : A principled . .
. Chapman
Scaling &
Hall
.
Number , London
59
in
, 1994 .
Monographs
on
DATA
T
.
M
.
Cover
New R
. O
A
.
and
York
P .
.
, the
- field
pages
197
Thomas
- 202
, A
Cie
,
.
Theory
and
,
.
John
Wiley
&
Sons
,
Scene
Analysis
.
Wiley
, New
York
,
Joachim
M
1996
,
. . .
.
Maximum Soc
.
likelihood Se1
.
B
,
Hierarchical
ICANN
Buhmann
39
from
: 1 - 38
pairwise ' 95
.
An
Proceedings
Springer .
.
Rubin Statist
of
In .
.
,
A
,
incomplete
1977
.
data
clustering
NEURONIMES
' 95
,
by
volume
II
,
.
M
1996
B
Buhmann
1995
.
York
annealed
of
" neural
ICANN
' 96
,
gas
pages
"
151
network - 156
for
,
Berlin
,
.
Buhmann
In
.
Infering
Proceedings
Portland
,
1997
hierarchical
of
, Redwood
the
City
clustering
Knowledge ,
CA
structures
Discovery
, USA
,
1996
.
and
AAAI
Data
Press
.
( in
, and
Wesley
,
Hofmann
.
,
Jain
1997
and ,
Hansj6rg
N
R
J
Lecture
.
and
.
.
,
E .
, T
281
metrika
Jr
Computers
.
A ,
Theory
of
Neural
Deterministic
determinis
-
Intelligence
,
Computation
Annealing
Mathematisch
.
Framework
- Naturwissenschaftliche
Bonn
Clustering
Buhmann
and
.
, D - 53117
Data
.
Bonn
Prentice
,
Springer
Verlag
, Fed
Hall
scaling
editors
, ,
,
.
lEE
Proceedings
and
. Rep
.
Englewood
Symposium
on
, ,
of
determinis
-
EMMCVPR
' 97
,
.
Springer
analysis
by
Proceedings
1997
Memory .
classification
1967
E .
Multidimensional
. Hancock
Associative
Berkeley
,
.
.
.R
quantizations for
5th
K
E
Science
Berlin
136
,
: 405
1984
- 413
multivariate
. ,
1989
.
observations
Mathematical
Statistics
.
and
Pro
-
.
Basford
and
and
42
G
.
.
Fox
Letters
,
, and , New .
A ,
An ,
M
- 297
,
Wesley
:
the
- Universitat
for
and
the
. Martinetz
Takane scaling
by
Machine
Mixture
Models
.
Marcel
Dekker
,
INC
,
New
York
,
.
. Sammon
Yoshio
:
thesis
Algorithms
vector
Gurewitz
Addison
.
. Pellilo
Recognition
. Ritter
clustering and
.
Computer
and
Pattern
to
Beyond
PhD
- Wilhelms
methods
, pages
Rose
Dubes 1988
M
of
1988
data Analysis
Introduction
and
Joachim
Some
McLachlan ,
Pairwise
Pattern
.
.
- organization
Proceedings
Basel
.
1991
Analysis
Hierarchical
Macqueen
,
Clustering
In In
Self .
bability
. ,
.
Luttrell
.
on
Friedrich
C
Notes
. Kohonen
Buhmann
. Palmer
York
Data
07632
Klock
.
.
annealing
on
. G
, Rheinische
Cliffs
In
R
New
Data
Germany
M
Transactions
.
Exploratory
tic
Joachim
IEEE
. Krogh
Fakultat
J . W
&
.
Royal
M
and
( 1 ) : 1 - 14
for
H
D J .
Proceedings
In
annealing
19
Thomas
.
.
Joachim
New
.
Addison
K
and
quantization
Hofmann
J .
Information
) .
J . Hertz
.
,
.
EC2
and
annealing
.
of
419
VISUALIZATION
Classification
Joachim
and
tic
. K
Laird
Conference
Thomas
G
.
deterministic
press
S .P .
M
and
,
Mining
Elements
Pattern
algorithm
Hofmann by
J .
.
vector
Thomas
.
.
annealing
Heidelberg
T
N
Hofmann robust
. Hart
EM
Hofmann mean
Thomas
DATA
3 .2 .
Dempster
Thomas
.
AND
.
P . E
Sect
via
A
1991
and .
data
A
J .
,
. Duda 1973
CLUSTERING
Forest
W
, .
least ,
March
A
.
1992
1969
- 594 Neural
annealing ,
1990
approach
to
clustering
.
.
Computation
and
Self
- organizing
Maps
.
. for
data
structure
analysis
.
IEEE
Transactions
.
Young
.
squares 1977
deterministic
) : 589
mapping
- 409
alternating ( 1 ) : 7 - 67
,
- linear
: 401
. ( 11
. Schulten
York
non
18
K
11
.
Nonmetric method
individul with
differences optimal
scaling
multidimensional features
.
Psycho
-
LEARNING BAYESIAN NETWORKS WITH LOCAL STRUCTURE
NIR FRIEDMAN Computer ScienceDivision, 387 SodaHall, University of California , Berkeley, CA 94720. nir @cs.berkeley.edu AND MOISESGOLDSZMIDT SRI International , 333 RavenswoodAvenue, EK329, Menlo Park, CA 94025. [email protected] .com Abstract . We examine a novel addition to the known methods for learning Bayesian networks from data that improves the quality of the learned networks . Our approach explicitly represents and learns the local structure in the condi tional probability distributions (CPDs ) that quantify these networks . This increases the space of possible models, enabling the representation of CPDs with a variable number of parameters . The resulting learning procedure induces models that better emulate the interactions present in the data . We descri be the theoretical foundations and practical aspects of learning local structures and provide an empirical evaluation of the proposed learning procedure. This evaluation indicates that learning curves characterizing this procedure converge faster , in the number of training instances , than those of the standard procedure , which ignores the local structure of the CPDs . Our results also show that networks learned with local structures tend to be more complex (in terms of arcs) , yet require fewer parameters .
1. Introduction

Bayesian networks are graphical representations of probability distributions; they are arguably the representation of choice for uncertainty in artificial intelligence. These networks provide a compact and natural representation, effective inference, and efficient learning. They have been successfully applied in expert systems, diagnostic engines, and optimal decision making systems.

[Figure 1. A simple network structure and the associated CPD for variable S (showing the probability values for S = 1): Pr(S | A, B, E) = 0.95, 0.95, 0.20, 0.05, 0.00, 0.00, 0.00, 0.00.]

A Bayesian network consists of two components. The first is a directed acyclic graph (DAG) in which each vertex corresponds to a random variable. This graph describes conditional independence properties of the represented distribution. It captures the structure of the probability distribution, and is exploited for efficient inference and decision making. Thus, while Bayesian networks can represent arbitrary probability distributions, they provide computational advantage to those distributions that can be represented with a sparse DAG. The second component is a collection of conditional probability distributions (CPDs) that describe the conditional probability of each variable given its parents in the graph. Together, these two components represent a unique probability distribution (Pearl, 1988).

In recent years there has been growing interest in learning Bayesian networks from data; see, for example, Cooper and Herskovits (1992); Buntine (1991b); Heckerman (1995); and Lam and Bacchus (1994). Most of this research has focused on learning the global structure of the network, that is, the edges of the DAG. Once this structure is fixed, the parameters in the CPDs quantifying the network are learned by estimating a locally exponential number of parameters from the data. In this article we introduce methods and algorithms for learning local structures to represent the CPDs as a part of the process of learning the network. Using these structures, we can model various degrees of complexity in the CPD representations. As we will show, this approach considerably improves the quality of the learned networks.

In its most naive form, a CPD is encoded by means of a tabular representation that is locally exponential in the number of parents of a variable X: for each possible assignment of values to the parents of X, we need to specify a distribution over the values X can take. For example, consider the simple network in Figure 1, where the variables A, B, E and S correspond to the events "alarm armed," "burglary," "earthquake," and "loud alarm sound," respectively. Assuming that all variables are binary, a tabular representation of the CPD for S requires eight parameters, one for each possible state of the parents. One possible quantification of this CPD is given in Figure 1. Note, however, that when the alarm is not armed (i.e., when A = 0) the probability of S = 1 is zero, regardless of the values of B and E. Thus, the interaction between S and its parents is simpler than the eight-way situation that is assumed in the tabular representation of the CPD.

The locally exponential size of the tabular representation of the CPDs is a major problem in learning Bayesian networks. As a general rule, learning many parameters is a liability, since a large number of parameters requires a large training set to be assessed reliably. Thus learning procedures generally encode a bias against structures that involve many parameters. For example, given a training set with instances sampled from the network in Figure 1, the learning procedure might choose a simpler network structure than that of the original network. When the tabular representation is used, the CPD for S requires eight parameters. However, a network with only two parents for S, say A and B, would require only four parameters. Thus, for a small training set, such a network may be preferred, even though it ignores the effect of E on S. This example illustrates that by taking into account the number of parameters, the learning procedure may penalize a large CPD, even if the interactions between the variable and its parents are relatively benign.

Our strategy is to address this problem by explicitly representing the local structure of the CPDs. This representation often requires fewer parameters to encode CPDs. This enables the learning procedure to weight each CPD according to the number of parameters it actually requires to capture the interaction between a variable and its parents, rather than the maximal number required by the tabular representation. In other words, this explicit representation of local structure in the network's CPDs allows us to adjust the penalty incurred by the network to reflect the real complexity of the interactions described by the network.

There are different types of local structures for CPDs; a prominent example is the noisy-or gate and its generalizations (Heckerman and Breese, 1994; Pearl, 1988; Srinivas, 1993). In this article, we focus on learning local structures that are motivated by properties of context-specific independence (CSI) (Boutilier et al., 1996). These independence statements imply that in some contexts, defined by an assignment to variables in the network, the conditional probability of variable X is independent of some of its parents. For example, in the network of Figure 1, when the alarm is not set
(i.e., the context defined by A = 0), the conditional probability does not depend on the value of B and E; P(S | A = 0, B = b, E = e) is the same for all values b and e of B and E.

[Figure 2. Two representations of a local CPD structure: (a) a default table, and (b) a decision tree.]

As we can see, CSI properties induce equality constraints among the conditional probabilities in the CPDs. In this article, we concentrate on two different representations for capturing the local structure that follows from such equality constraints. These representations, shown in Figure 2, in general require fewer parameters than a tabular representation. Figure 2(a) describes a default table, which is similar to the usual tabular representation, except that it does not list all of the possible values of S's parents. Instead, the table provides a default probability assignment to all the values of the parents that are not explicitly listed. In this example, the default table requires five parameters instead of the eight parameters required by the tabular representation. Figure 2(b) describes another possible representation based on decision trees (Quinlan and Rivest, 1989). Each leaf in the decision tree describes a probability for S, and the internal nodes and arcs encode the necessary information to decide how to choose among leaves, based on the values of S's parents. For example, in the tree of Figure 2(b) the probability of S = 1 is 0 when A = 0, regardless of the state of B and E; and the probability of S = 1 is 0.95 when A = 1 and B = 1, regardless of the state of E. In this example, the decision tree requires four parameters instead of eight.

Our main hypothesis is that incorporating local structure representations into the learning procedure leads to two important improvements in the quality of the induced models. First, the induced parameters are more reliable. Since these representations usually require fewer parameters, the frequency estimation for each parameter takes, on average, a larger number of samples into account and thus is more robust. Second, the global
structure of the induced network is a better approximation to the real (in)dependencies in the underlying distribution. The use of local structure enables the learning procedure to explore networks that would have incurred an exponential penalty (in terms of the number of parameters required) and thus would not have been taken into consideration. We cannot stress enough the importance of this last point. Finding better estimates of the parameters for a global structure that makes unrealistic independence assumptions will not overcome the deficiencies of the model. Thus, it is crucial to obtain a good approximation of the global structure. The experiments described in Section 5 confirm our main hypothesis. Moreover, the results in that section show that the use of local representations for the CPDs significantly affects the learning process itself: the learning procedures require fewer data samples in order to induce a network that better approximates the target distribution.
The main contributions of this article are: the derivation of the scoring functions and algorithms for learning the local representations; the formulation of the hypothesis introduced above, which uncovers the benefits of having an explicit local representation for CPDs; and the empirical investigation that validates this hypothesis.

CPDs with local structure have often been used and exploited in tasks of knowledge acquisition from experts; as we already mentioned above, the noisy-or gate and its generalizations are well known examples (Heckerman and Breese, 1994; Pearl, 1988; Srinivas, 1993). In the context of learning, several authors have noted that CPDs can be represented via logistic regression, noisy-or, and neural networks (Buntine, 1991b; Diez, 1993; Musick, 1994; Neal, 1992; Spiegelhalter and Lauritzen, 1990). With the exception of Buntine, these authors have focused on the case where the network structure is fixed in advance, and motivate the use of local structure for learning reliable parameters. The method proposed by Buntine (1991b) is not limited to the case of a fixed structure; he also points to the use of decision trees for representing CPDs. Yet, in that paper, he does not provide empirical or theoretical evidence for the benefits of using local structured representations with regards to a more accurate induction of the global structure of the network. To the best of our knowledge, the benefits that relate to that, as well as to the convergence speed of the learning procedure (in terms of the number of training instances), have been unknown in the literature prior to our work.
The remainder of this article is organized as follows: In Section 2 we review the definition of Bayesian networks, and the scores used for learning these networks. In Section 3 we describe the two forms of local structured CPDs we consider in this article. In Section 4 we formally derive the score for learning networks with CPDs represented as default tables and decision trees, and describe the procedures for learning these structures. In Section 5 we describe the experimental results. We present our conclusions in Section 6.
2. Learning Bayesian Networks

Consider a finite set U = {X_1, ..., X_n} of discrete random variables, where each variable X_i may take on values from a finite domain. We use capital letters, such as X, Y, Z, for variable names, and lowercase letters, such as x, y, z, to denote specific values taken by those variables. The set of values a variable X can attain is denoted Val(X), and the cardinality of this set is denoted ||X|| = |Val(X)|. Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z (we use Val(X) and ||X|| in the obvious way). Let P be a joint probability distribution over the variables in U, and let X, Y, Z be subsets of U. X and Y are conditionally independent given Z if, for all x ∈ Val(X), y ∈ Val(Y), and z ∈ Val(Z), we have P(x | z, y) = P(x | z) whenever P(y, z) > 0.

A Bayesian network is an annotated DAG that encodes a joint probability distribution over a set of random variables. Formally, a Bayesian network for U is a pair B = (G, Θ). The first component, G, is a DAG whose nodes correspond to the random variables X_1, ..., X_n, and whose edges represent direct dependencies between the variables. The graph structure G encodes the following set of independence statements: each variable X_i is independent of its nondescendants, given its parents in G (Pearl, 1988). The second component, Θ, represents the set of parameters that quantifies the network. A Bayesian network B encodes a joint probability distribution over U that can be factored as

P(X_1, ..., X_n) = ∏_i P(X_i | Pa_i),    (1)

where Pa_i denotes the set of parents of X_i in G; the set consisting of X_i and its parents is usually referred to as the family of X_i. Note that the right-hand side of Equation 1 is composed of exactly one conditional probability of the form P(X_i | Pa_i) for each variable X_i. It follows immediately that to completely specify the distribution, we need only provide the conditional probabilities appearing in Equation 1, namely the CPDs P(X_i | Pa_i). When we deal with discrete variables, as we do in this article, these CPDs are usually represented as tables, such as the one shown in Figure 1. These tables contain a parameter θ_{x_i|pa_i} for each possible value x_i ∈ Val(X_i) and pa_i ∈ Val(Pa_i).
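As a concrete illustration of the factorization in Equation 1, the following minimal Python sketch (the dictionary layout, variable names, and numerical values are illustrative only, not taken from the article) stores tabular CPDs keyed by parent assignments and evaluates the joint probability of a complete instance by multiplying the local conditional probabilities.

```python
# Minimal sketch of Equation 1: P(X1,...,Xn) = prod_i P(Xi | Pa_i).
# CPDs are plain tables: cpds[var] maps a tuple of parent values to a
# distribution over the variable's own values.

parents = {
    "A": (), "B": (), "E": (),
    "S": ("A", "B", "E"),
}

cpds = {
    "A": {(): {0: 0.1, 1: 0.9}},
    "B": {(): {0: 0.99, 1: 0.01}},
    "E": {(): {0: 0.999, 1: 0.001}},
    # P(S | A, B, E): zero probability of S = 1 whenever the alarm is not armed (A = 0).
    "S": {
        (1, 1, 1): {0: 0.05, 1: 0.95}, (1, 1, 0): {0: 0.05, 1: 0.95},
        (1, 0, 1): {0: 0.80, 1: 0.20}, (1, 0, 0): {0: 0.95, 1: 0.05},
        (0, 1, 1): {0: 1.0, 1: 0.0},   (0, 1, 0): {0: 1.0, 1: 0.0},
        (0, 0, 1): {0: 1.0, 1: 0.0},   (0, 0, 0): {0: 1.0, 1: 0.0},
    },
}

def joint_probability(instance):
    """Multiply the local conditional probabilities (Equation 1)."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(instance[y] for y in pa)
        p *= cpds[var][pa_values][instance[var]]
    return p

print(joint_probability({"A": 1, "B": 0, "E": 0, "S": 1}))  # 0.9 * 0.99 * 0.999 * 0.05
```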
The problem of learning a Bayesian network can be stated as follows: Given a training set D = {u_1, ..., u_N} of instances of U, find a network B that best matches D.¹ To formalize the notion of the degree to which a candidate network fits the data, we introduce a scoring function; the learning problem then becomes an optimization problem, namely, to find the network that has the best value of the scoring function with respect to D. Several different scoring functions and search techniques to solve this optimization problem have been proposed in the literature (e.g., Buntine, 1991b; Heckerman et al., 1995a; Lam and Bacchus, 1994); most of them rely on heuristic search. In this article we focus our attention on two scoring functions that are frequently used in the literature: the MDL score and the BDe score, which we now describe in detail.

[¹ Throughout this article we will assume that the training data is complete, i.e., that each u_i assigns values to all variables in U. Existing solutions to the problem of missing values apply to the approaches we discuss below; see Heckerman (1995).]

2.1. THE MDL SCORE

The MDL score is motivated by the Minimal Description Length principle (Rissanen, 1989) and its use in universal coding. Suppose that we are given a set D of instances, which we would like to store. Naturally, we would like to conserve space and save a compressed version of D. One way of compressing the data is to find a suitable model for D that the encoder can use to produce a compact version of D. Moreover, to be able to recover D, we must also store the model used by the encoder to compress it. The total description length is the sum of the length of the compressed version of the data and the length of the description of the model. The MDL principle dictates that the optimal model is the one that minimizes this total description length.

In the context of learning Bayesian networks, the model is a network B. Such a network describes a probability distribution P_B over the instances appearing in the data. Using this distribution, we can build an encoding scheme (e.g., a Huffman code or Shannon coding; see Cover and Thomas (1991)) that assigns shorter code words to more probable instances. According to the MDL principle, we should choose the network B such that the combined length of the network description and of the data encoded with respect to P_B is minimized. This implies that the learning procedure balances the complexity of the induced network against the degree of accuracy with which the network represents the frequencies in D.

The MDL score of a candidate network is defined as the total description length: the length required to store the network itself, and the length of the data encoded using that network.
To describe a network B = (G, Θ), we need to describe U, G, and Θ.

To describe U, we store the number of variables, n, and the cardinality of each variable X_i. Since U is the same for all candidate networks, we can ignore the description length of U in the comparisons between networks.

To describe the DAG G, it is sufficient to store for each variable X_i a description of Pa_i (namely, its parents in G). This description consists of the number of parents, k, followed by the index of the set Pa_i in some (agreed upon) enumeration of all (n choose k) sets of this cardinality. Since we can encode the number k using log n bits, and we can encode the index using log (n choose k) bits, the description length of the graph structure is²

DL_graph(G) = Σ_i ( log n + log (n choose |Pa_i|) ).

[² Since description lengths are measured in terms of bits, we use logarithms of base 2 throughout this article.]

To describe the CPDs, we must store the parameters in each conditional probability table. For the table associated with X_i, we need to store ||Pa_i|| (||X_i|| - 1) parameters. The encoding length depends on the number of bits we use for each numeric parameter; the usual choice in the literature is 1/2 log N bits per parameter (see Friedman and Yakhini (1996) for a thorough discussion of this point). Thus, the encoding length of X_i's CPD is

DL_tab(X_i, Pa_i) = 1/2 ||Pa_i|| (||X_i|| - 1) log N.

To encode the training data, we use the probability measure P_B defined by the network B to construct a code. The optimal encoding length for each instance u is approximated by -log P_B(u) (there is no known closed-form description of the exact optimal code; it can, however, be approximated, for example, by a Huffman code defined using P_B; see Cover and Thomas (1991)). Thus, the description length of the data is

DL_data(D | B) = - Σ_{i=1}^{N} log P_B(u_i).
We can rewrite this expression in a more convenient form. We start by introducing some notation. Let P̂_D be the empirical probability measure induced by the data set D. More precisely, we define
P̂_D(A) = (1/N) Σ_{i=1}^{N} 1_A(u_i),  where 1_A(u) = 1 if u ∈ A and 0 if u ∉ A,

for all events of interest, i.e., A ⊆ Val(U). Let N_D(x) be the number of instances in D where X = x (from now on, we omit the subscript from P̂_D, and the superscript and the subscript from N_D, whenever they are clear from the context). Clearly, N(x) = N · P̂(x). We use Equation 1 to rewrite the representation length of the data as

DL_data(D | B) = - Σ_{j=1}^{N} log P_B(u_j)
               = - N Σ_u P̂(u) log ∏_i P(x_i | pa_i)
               = - Σ_i Σ_{x_i, pa_i} N(x_i, pa_i) log P(x_i | pa_i).    (2)

Thus, the encoding of the data can be decomposed as a sum of terms that are "local" to each CPD: these terms depend only on the counts N(x_i, pa_i). Standard arguments show the following.

Proposition 2.1: If P(X_i | Pa_i) is represented as a table, then the parameter values that minimize DL_data(D | B) are θ_{x_i|pa_i} = P̂(x_i | pa_i).

Thus, given a fixed network structure G, learning the parameters that minimize the description length is straightforward: we simply compute the appropriate long-run fractions from the data.

Assuming that we assign parameters in the manner prescribed by this proposition, we can rewrite DL_data(D | B) in a more convenient way in terms of conditional entropy: N Σ_i H_P̂(X_i | Pa_i), where H(X | Y) = - Σ_{x,y} P̂(x, y) log P̂(x | y) is the conditional entropy of X, given Y. This formula provides an information-theoretic interpretation to the representation of the data: it measures how many bits are necessary to encode the value of X_i, once we know the value of Pa_i.

Finally, the MDL score of a candidate network structure G, assuming that we choose parameters as prescribed above, is defined as the total description length

DL(G, D) = DL_graph(G) + Σ_i DL_tab(X_i, Pa_i) + N Σ_i H_P̂(X_i | Pa_i).    (3)
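As a worked illustration of Equation 3, the sketch below (our own code, not the authors'; it assumes complete discrete data given as a list of dictionaries) computes the three components of the description length, namely DL_graph, DL_tab, and the N · H(X_i | Pa_i) term, directly from empirical counts.

```python
import math
from collections import Counter

def mdl_score(data, parents, domains):
    """Total description length of Equation 3 (smaller is better)."""
    n, N = len(parents), len(data)
    total = 0.0
    for x, pa in parents.items():
        # DL_graph: log n bits for |Pa_i| plus the index of the chosen parent set.
        total += math.log2(n) + math.log2(math.comb(n, len(pa)))
        # DL_tab: 1/2 * ||Pa_i|| * (||X_i|| - 1) * log N bits for the parameters.
        pa_card = math.prod(len(domains[y]) for y in pa)
        total += 0.5 * pa_card * (len(domains[x]) - 1) * math.log2(max(N, 2))
        # N * H(X_i | Pa_i), estimated from the empirical counts N(x_i, pa_i).
        joint = Counter((tuple(u[y] for y in pa), u[x]) for u in data)
        marg = Counter(tuple(u[y] for y in pa) for u in data)
        for (pa_val, _), c in joint.items():
            total += -c * math.log2(c / marg[pa_val])
    return total

# Toy usage: two binary variables with B depending on A.
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
print(mdl_score(data, {"A": (), "B": ("A",)}, {"A": [0, 1], "B": [0, 1]}))
```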
According to the MDL principle, we should strive to find the network structure that minimizes this description length. In practice, this is usually done by searching over the space of possible networks.

2.2. THE BDE SCORE
Scores for learning Bayesian networks can also be derived from methods of Bayesian statistics. A prime example of such scores is the BDe score, proposed by Heckerman et al. (1995a). This score is based on earlier work by Cooper and Herskovits (1992) and Buntine (1991b). The BDe score is (proportional to) the posterior probability of each network structure, given the data. Learning amounts to searching for the network(s) that maximize this probability.

Let G^h denote the hypothesis that the underlying distribution satisfies the independencies encoded in G (see Heckerman et al. (1995a) for a more elaborate discussion of this hypothesis). For a given structure G, let Θ_G represent the vector of parameters for the CPDs quantifying G. The posterior probability we are interested in is Pr(G^h | D). Using Bayes' rule we write this term as

Pr(G^h | D) = α Pr(D | G^h) Pr(G^h),    (4)

where α is a normalization constant that does not depend on the choice of G. The term Pr(G^h) is the prior probability of the network structure, and the term Pr(D | G^h) is the probability of the data, given that the network structure is G.

There are several ways of choosing a prior over network structures. Heckerman et al. suggest choosing a prior Pr(G^h) ∝ κ^{Δ(G,G')}, where Δ(G, G') is the difference in edges between G and a prior network structure G', and 0 < κ < 1 is a penalty for each such edge. In this article, we use a prior based on the MDL encoding of G. We let Pr(G^h) ∝ 2^{-DL_graph(G)}.

To evaluate Pr(D | G^h) we must consider all possible parameter assignments to G. Thus,

Pr(D | G^h) = ∫ Pr(D | Θ_G, G^h) Pr(Θ_G | G^h) dΘ_G,    (5)

where Pr(D | Θ_G, G^h) is defined by Equation 1, and Pr(Θ_G | G^h) is the prior density over parameter assignments to G. Heckerman et al. (following Cooper and Herskovits (1992)) identify a set of assumptions that justify decomposing this integral. Roughly speaking, they assume that each distribution P(X_i | pa_i) can be learned independently of all other distributions. Using this assumption, they rewrite Pr(D | G^h) as

Pr(D | G^h) = ∏_i ∏_{pa_i} ∫ ∏_{x_i} θ_{x_i|pa_i}^{N(x_i, pa_i)} Pr(Θ_{X_i|pa_i} | G^h) dΘ_{X_i|pa_i}.    (6)

(This decomposition is analogous to the decomposition in Equation 2.) When the prior on each multinomial distribution Θ_{X_i|pa_i} is a Dirichlet prior, the integrals in Equation 6 have a closed-form solution (Heckerman, 1995).

We briefly review the properties of Dirichlet priors. For a more detailed description, we refer the reader to DeGroot (1970). A Dirichlet prior for a multinomial distribution of a variable X is specified by a set of hyperparameters {N'_x : x ∈ Val(X)}. We say that Pr(Θ_X) ~ Dirichlet({N'_x : x ∈ Val(X)}) if

Pr(Θ_X) = α ∏_x θ_x^{N'_x - 1},

where α is a normalization constant. If the prior is a Dirichlet prior, the probability of observing a sequence of values of X with counts N(x) is

∫ ∏_x θ_x^{N(x)} Pr(Θ_X | G^h) dΘ_X = Γ(Σ_x N'_x) / Γ(Σ_x (N'_x + N(x))) · ∏_x Γ(N'_x + N(x)) / Γ(N'_x),

where Γ(x) = ∫_0^∞ t^{x-1} e^{-t} dt is the Gamma function, which satisfies the properties Γ(1) = 1 and Γ(x + 1) = xΓ(x).

Returning to the BDe score, if we assign to each Θ_{X_i|pa_i} a Dirichlet prior with hyperparameters N'_{x_i|pa_i}, then

Pr(D | G^h) = ∏_i ∏_{pa_i} [ Γ(Σ_{x_i} N'_{x_i|pa_i}) / Γ(Σ_{x_i} N'_{x_i|pa_i} + N(pa_i)) · ∏_{x_i} Γ(N'_{x_i|pa_i} + N(x_i, pa_i)) / Γ(N'_{x_i|pa_i}) ].    (7)

There still remains a problem with the direct application of this method. For each possible network structure we would have to assign priors on the parameter values. This is clearly infeasible, since the number of possible structures is extremely large. Heckerman et al. propose a set of assumptions that justify a method by which, given a prior network B^P and an equivalent sample size N', we can assign prior probabilities to parameters in every possible network structure. The prior assigned to Θ_{X_i|pa_i} in a structure G is computed from the prior distribution represented in B^P. In this method, we assign N'_{x_i|pa_i} = N' · P_{B^P}(x_i, pa_i). (Note that Pa_i are the parents of
X_i in G, but the probability is computed from the prior distribution represented by B^P, where the parents of X_i might be a completely different set of variables.) According to this methodology, the hyperparameter N'_{x_i|pa_i} corresponds to the expected number of occurrences of the event X_i = x_i, Pa_i = pa_i in N' instances sampled from the prior network B^P. Thus, the equivalent sample size N' expresses the magnitude of our confidence in the prior network.

Once again using the Dirichlet priors and the independence assumptions stated above, we can compute the probability that the candidate network assigns to a new instance u, given the data. This predictive distribution decomposes as Pr(u | D, G^h) = ∏_i Pr(x_i | pa_i, D, G^h), where

Pr(x_i | pa_i, D, G^h) = ∫ θ_{x_i|pa_i} Pr(Θ_{X_i|pa_i} | D, G^h) dΘ_{X_i|pa_i} = (N'_{x_i|pa_i} + N(x_i, pa_i)) / (Σ_{x_i} N'_{x_i|pa_i} + N(pa_i)).

Similarly, predictions about sets of observable variables average over all possible parameter settings, weighted by their posterior probability.

Finally, we consider the relation between the BDe score and the MDL score. Using an asymptotic analysis of the integrals in Equation 7, similar to the one done by Schwarz (1978), one can show that if the priors satisfy some regularity constraints, then for large N

log Pr(D | G^h) ≈ log Pr(D | Θ̂_G, G^h) - (d/2) log N,    (8)

where Θ̂_G is the maximum-likelihood setting of the parameters (which, as in Proposition 2.1, can be computed in closed form), and d is the number of free parameters, or dimension, of the structure G. Bouckaert (1994) shows a similar result using this methodology. Note that the right-hand side of Equation 8 is essentially the negative of the MDL score of Equation 3, when we ignore the prior term Pr(G^h) (which does not depend on N). Thus, a score that attempts to maximize the logarithm of Equation 7 and one that attempts to minimize the description length of Equation 3 are asymptotically equivalent.
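Before turning to local structure, it is worth noting that the closed form in Equation 7 is a product of Gamma-function ratios, which is best evaluated in log space. The hypothetical helper below (our sketch, not the authors' code) scores a single family using scipy.special.gammaln, taking the Dirichlet hyperparameters N'_{x_i|pa_i} and the data counts N(x_i, pa_i) as inputs.

```python
from scipy.special import gammaln

def log_bde_family(hyper, counts):
    """log of the Equation 7 term for a single family X_i with parents Pa_i.

    hyper[pa][x]  -- Dirichlet hyperparameter N'_{x|pa}
    counts[pa][x] -- data count N(x, pa); both dicts are keyed by parent tuples.
    """
    score = 0.0
    for pa, prior in hyper.items():
        obs = counts.get(pa, {})
        n_prior = sum(prior.values())
        n_obs = sum(obs.values())
        score += gammaln(n_prior) - gammaln(n_prior + n_obs)
        for x, a in prior.items():
            score += gammaln(a + obs.get(x, 0)) - gammaln(a)
    return score

# Toy usage: binary X with one binary parent and equivalent sample size 2.
hyper = {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.5, 1: 0.5}}
counts = {(0,): {0: 3, 1: 1}, (1,): {0: 0, 1: 4}}
print(log_bde_family(hyper, counts))
```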
3. Local Structure

In the discussion above, we have assumed the standard tabular representation of the CPDs quantifying the networks. This representation requires that for each variable X_i we encode a locally exponential number, ||Pa_i|| (||X_i|| - 1), of parameters. In practice, however, the interaction between X_i and its parents Pa_i can be more benign, and some regularities can be exploited to represent the same information with fewer parameters. In the example of Figure 1, the CPD for S can be encoded with four parameters, by means of the decision tree of Figure 2(b), in contrast to the eight parameters required by the tabular representation.

A formal foundation for representing and reasoning with such regularities is provided in the notion of context-specific independence (CSI) (Boutilier et al., 1996). Formally, we say that X and Y are contextually independent, given Z and the context c ∈ Val(C), if

P(X | Z, c, Y) = P(X | Z, c)  whenever P(Y, Z, c) > 0.    (9)

CSI statements are more specific than the conditional independence statements captured by the Bayesian network structure. CSI implies the independence of X and Y, given a specific value of the context variable(s), while conditional independence applies for all value assignments to the conditioning variable. As shown by Boutilier et al. (1996), the representation of CSI leads to several benefits in knowledge elicitation, compact representation, and computational efficiency. As we show here, CSI is also beneficial to learning, since models can be quantified with fewer parameters.

As we can see from Equation 9, CSI statements force equivalence relations between certain conditional probabilities. If X_i is contextually independent of Y given Z and c ∈ Val(C), then P(X_i | Z, c, y) = P(X_i | Z, c, y') for y, y' ∈ Val(Y). Thus, if the parent set Pa_i of X_i is equal to Y ∪ Z ∪ C, such CSI statements will induce equality constraints among the conditional probabilities of X_i given its parents.

This observation suggests an alternative way of thinking of local structure in terms of the partitions it induces on the possible values of the parents of each variable X_i. We note that while CSI properties imply such partitions, not all partitions can be characterized by CSI properties. These partitions impose a structure over the CPDs for each X_i. In this article we are interested in representations that explicitly capture this structure and thereby reduce the number of parameters to be estimated by the learning procedure. We focus on two representations that are relatively straightforward to learn. The first one, called default tables, represents a set of singleton partitions with one additional partition that can contain several values of Pa_i. Thus, the savings it will introduce depends on how many values in
Val(Pa_i) can be grouped together. The second representation is based on decision trees and can represent more complex partitions. Consequently it can reduce the number of parameters even further. Yet the induction algorithm for decision trees is somewhat more complex than that of default tables.

We introduce a notation that simplifies the presentation below. Let L be a representation for the CPD for X_i. We capture the partition structure represented by L using a characteristic random variable Υ_L. This random variable maps each value of Pa_i to the partition that contains it. Formally, Υ_L(pa_i) = Υ_L(pa_i') for two values pa_i and pa_i' of Pa_i, if and only if these two values are in the same partition in L. It is easy to see that, from the definition of Υ_L, we get P(X_i | Υ_L) = P(X_i | Pa_i), since if pa_i and pa_i' are in the same partition, it must be that P(X_i | pa_i) = P(X_i | pa_i'). This means that we can describe the parameterization of the structure L in terms of the characteristic random variable Υ_L as follows: Θ_L = { θ_{x_i|v} : x_i ∈ Val(X_i), v ∈ Val(Υ_L) }.

As an example, consider the tabular CPD representation, in which no CSI properties are taken into consideration. This implies that the corresponding partitions contain exactly one value of Pa_i. Thus, in this case, Val(Υ_L) is isomorphic to Val(Pa_i). CPD representations that specify CSI relations will have fewer partitions, and thus will require fewer parameters. In the sections below we formally describe default tables and decision trees, and the partition structures they represent.

3.1. DEFAULT TABLES
A default table is similar to a standard tabular representation of a CPD, except that only a subset of the possible values of the parents of a variable are explicitly represented as rows in the table. The values of the parents that are not explicitly represented as individual rows are mapped to a special row called the default row. The underlying idea is that the probability of a node X is the same for all the values of the parents that are mapped to the default row; therefore, there is no need to represent these values separately in several entries. Consequently, the number of parameters explicitly represented in a default table can be smaller than the number of parameters in a tabular representation of a CPD. In the example shown in Figure 2(a), all the values of the parents of S where A = 0 (the alarm is not armed) are mapped to the default row in the table, since the probability of S = 1 is the same in all of these situations, regardless of the values of B and E.

Formally, a default table is an object V = (S_V, Θ_V). S_V describes the structure of the table, namely, which rows are represented explicitly, and which are represented via the default row. We define Rows(V) ⊆ Val(Pa_i)
to be the set of values of Pa_i that are explicitly represented as rows in the default table; all the values that are not in Rows(V) are represented by the default row. The structure S_V defines the following partition of Val(Pa_i): if pa_i ∈ Rows(V), then the partition that contains pa_i is the singleton {pa_i}; if pa_i ∉ Rows(V), then the partition that contains pa_i is Val(Pa_i) - Rows(V), the set of values corresponding to the default row. Thus, the characteristic random variable Υ_V induced by a default table is defined by these two cases. Finally, the parameterization Θ_V contains a parameter θ_{x_i|v} for each explicitly represented row and one for the default row; to determine P(x_i | pa_i), we need only know whether pa_i corresponds to an explicit row, and if so to which one, and then retrieve the appropriate parameter (e.g., as in Figure 2(a)).

3.2. DECISION TREES

A decision tree represents the partitions of Val(Pa_i) by means of a tree structure. Each internal node of the tree is a composite node that is annotated with a parent variable Y ∈ Pa_i and has outgoing edges, one for each value y ∈ Val(Y), each leading to a subtree. Each leaf of the tree is annotated with a distribution over the values X_i can take. To determine the conditional probability P(X_i | pa_i), we traverse the tree, starting at the root: at each composite node we test the variable that annotates that node and follow the outgoing edge annotated with the value that pa_i assigns to it, until we reach a leaf. The distribution annotating that leaf is P(X_i | pa_i).

For example, consider the tree shown in Figure 2(b), which represents the CPD of S. The root of this tree tests the variable A. To retrieve the probability of S given the assignment A = 1, B = 0, E = 1, we start at the root and follow the edge annotated with A = 1; we then reach a node that tests B and follow the edge annotated with B = 0; finally, we test E and follow the edge annotated with E = 1 until we reach a leaf, which is annotated with the value 0.20. If, instead, the assignment is consistent with A = 0, we immediately reach a leaf annotated with the value 0, since once we know that A = 0 there is no need to test the values of B and E.

Formally, a decision tree is an object T = (S_T, Θ_T). The structure S_T is defined recursively: it is either a leaf, denoted by a special symbol ℓ, or a composite tree of the form (Y, {ST_y : y ∈ Val(Y)}), where Y is the variable in Pa_i that annotates the root, and each ST_y is the structure of the subtree reached by the outgoing edge annotated with the value y. We denote by Label(T) the variable that annotates the root of T, and by SubTree(T, y) the subtree at the edge annotated with y. A value pa_i ∈ Val(Pa_i) is consistent with a path in the tree if, at every composite node along the path, the path follows the edge annotated with the value that pa_i assigns to the variable tested at that node. It is easy to verify that each pa_i is consistent with exactly one path from the root to a leaf.

The partitions induced by a decision tree correspond to the set of paths in the tree, where the partition that corresponds to a particular path p consists of all the value assignments that p is consistent with. Again, we define the set of parameters, Θ_T, to contain parameters θ_{x_i|v} for each x_i ∈ Val(X_i) and v ∈ Val(Υ_T). That is, we associate with each (realizable) path in the tree a distribution over X_i. To determine P(X_i | pa_i) from this representation, we simply choose θ_{x_i|v}, where v = {pa_i' : pa_i' is consistent with p}, and p is the (unique) path that is consistent with pa_i.
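To illustrate the two local representations and the partitions they induce, here is a small Python sketch (an illustration of ours; the class names are not from the article) in which both a default table and a decision tree map a full parent assignment to the partition, and hence the parameter, that Section 3 associates with it.

```python
# A default table: explicit rows plus a shared default row.
class DefaultTable:
    def __init__(self, rows, default):
        self.rows = rows          # dict: parent tuple -> P(X = 1 | parents), for binary X
        self.default = default    # parameter shared by all remaining parent tuples

    def lookup(self, pa):
        return self.rows.get(pa, self.default)

# A decision tree node: either a leaf (a parameter) or a test with children.
class TreeNode:
    def __init__(self, leaf=None, test=None, children=None):
        self.leaf, self.test, self.children = leaf, test, children

    def lookup(self, pa):
        node = self
        while node.leaf is None:            # follow edges until a leaf is reached
            node = node.children[pa[node.test]]
        return node.leaf

# The CPD of S from Figure 2, with parents ordered (A, B, E).
default_cpd = DefaultTable(
    rows={(1, 1, 1): 0.95, (1, 1, 0): 0.95, (1, 0, 1): 0.20, (1, 0, 0): 0.05},
    default=0.00)

tree_cpd = TreeNode(test=0, children={              # test A
    0: TreeNode(leaf=0.00),
    1: TreeNode(test=1, children={                  # test B
        1: TreeNode(leaf=0.95),
        0: TreeNode(test=2, children={              # test E
            1: TreeNode(leaf=0.20), 0: TreeNode(leaf=0.05)})})})

for pa in [(0, 1, 1), (1, 0, 1)]:
    print(pa, default_cpd.lookup(pa), tree_cpd.lookup(pa))
```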
4. Learning Local Structure

We start this section by deriving the scoring functions for learning Bayesian networks with local structured representations of the CPDs: we derive a generalized MDL score, and then describe the changes necessary for the BDe score. We then describe the procedures that search for high-scoring default tables and decision trees. (See Boutilier et al. (1996) for a discussion of such structured representations of CPDs.)

4.1. SCORING FUNCTIONS

We introduce some notation that allows us to produce both generalizations easily. Let L_i denote a local representation of the CPD P(X_i | Pa_i) for the parent set Pa_i in G; L_i can be a (complete) table, a default table, or a decision tree. Any such representation L consists of two components: the structure S_L, which represents a partition of Val(Pa_i) (as encoded by the characteristic random variable Υ_L), and the parameterization Θ_L = {θ_{x_i|v} : x_i ∈ Val(X_i), v ∈ Val(Υ_L)}. A network is now described by the DAG G and the local representations L_1, ..., L_n of the CPDs, where P(X_i | pa_i) = θ_{x_i|Υ_{L_i}(pa_i)}.

4.1.1. MDL Score for Local Structure
The derivation of the MDL score remains the same as in Section 2.1, except for the changes that occur in the encoding of the CPDs; in particular, the encoding of the DAG G is unchanged. Additionally, the encoding of the parameters and of the data now depends on the choice of local representation. We now have to describe, for each variable X_i, the local structure S_{L_i} and the parameters Θ_{L_i}.
First, we describe the encoding of S_L for both default table and tree representations. When L is a default table V, we need to describe the set of rows that are represented explicitly in the table, that is, Rows(V). We start by encoding the number k = |Rows(V)|; then we describe Rows(V) by encoding its index in some (agreed upon) enumeration of all (||Pa_i|| choose k) sets of this cardinality. Thus, the description length of the structure V is

DL_local-struct(V) = log ||Pa_i|| + log (||Pa_i|| choose k).
L
the
all
the
be
any
in
'
tree
In
the
the
node
In
,
from
the
if
the
(
k
tree
)
.
,
since
along
are
The
the
the
(
T
,
k
)
=
node
in
by
the
+
log
(
k
)
+
Ei
DLr
(
1i
,
k
1
r
if
is
this
Next
formula
,
we
,
we
encode
define
the
(
DLlocal
11Xill
-
-
1
)
struct
liT
(
11
T
)
=
tested
then
we
leaf
a
(
parameters
T
,
tree
for
'
,
IPail
L
:
composite
subtrees
DLr
need
description
formula
a
is
with
Using
,
test
variable
total
T
a
been
recurring
)
,
can
of
each
tree
following
-
variable
not
The
leaf
depends
test
the
.
is
a
choice
have
variable
leaf
description
test
the
we
if
1
the
that
and
composite
variable
,
1
DLr
,
path
test
a
a
from
the
test
subtree
single
current
described
of
and
variables
:
it
root
a
a
k
the
encoding
the
,
by
follows
differentiate
variable
At
at
the
is
.
to
tree
proposed
as
of
tree
describe
structure
,
encoding
the
to
to
O
1
the
encoding
recursively
to
The
of
the
test
there
root
bits
encoded
contraEt
,
structure
use
associated
in
.
We
value
.
general
log
is
.
equal
to
restricted
.
path
of
the
the
tree
value
set
parents
more
only
length
S
encode
the
A
of
of
the
. 3
subtrees
once
store
to
in
with
bit
description
Xi
is
most
need
)
bit
a
position
of
we
1989
immediate
variable
yet
(
single
a
the
,
nodes
with
by
on
to
a
starts
followed
T
Rivest
by
tree
tree
internal
and
encoded
at
a
of
Quinlan
of
is
labeling
)
Tl
,
.
.
.
,
T
m
.
.
with
description
length -1
1
DLparam
Finally
,
given
as
the
We
we
model
now
a
,
.
.
.
,
=
2
2
Equation
network
4
1
)
n
.
2
,
:
then
If
P
we
(
IIXill
1
I
can
P
~
rewrite
,
.
CPDs
Xi
-
we
)
IIYLlllogN
.
describe
1
to
the
describe
are
)
l
encoding
of
the
data
.
2
when
1
(
.
Proposition
Proposition
=
L
Section
using
for
i
in
generalize
rameters
for
did
(
is
the
optimal
represented
using
represented
D
Ldata
by
(
D
I
B
local
)
choice
local
representation
of
pa
structure
-
.
Li
,
as
DLdata (DIB) = -NL L L P(Xi,lLi =v)log (Jxilv8 i vE Val (TLi ) Xi 3Wallace and Patrick ( 1993) note that this encoding is inefficient, in the sense that the number of legal tree structures that can be described by n-bit strings, is significantly smaller than 2n. Their encoding, which is more efficient, can be easily incorporated into our MDL encoding. For clarity of presentation, we use the Quinlan and Rivest encoding in this article .
Moreover, the parameter values for L that minimize DL_data(D | B) are θ_{x_i|Υ_{L_i}=v} = P̂(x_i | Υ_{L_i} = v).

As in the case of the tabular CPD representation, DL_data is minimized when the parameters correspond to the appropriate frequencies in the training data. As a consequence of this result, we find that for a fixed local structure L, the minimal representation length of the data is simply N · H_P̂(X_i | Υ_L). Thus, once again we derive an information-theoretic interpretation of DL_data(Θ_L, D). This interpretation shows that the encoding of X_i depends only on the values of Υ_L. From the data processing inequality (Cover and Thomas, 1991) it follows that H(X_i | Υ_L) ≥ H(X_i | Pa_i). This implies that a local structure cannot fit the data better than a tabular CPD. Nevertheless, as our experiments confirm, the reduction in the number of parameters can compensate for the potential loss in information.

To summarize, the MDL score for a graph structure augmented with a local structure L_i for each X_i is

DL(G, L_1, ..., L_n, D) = DL_graph(G) + Σ_i ( DL_local-struct(L_i) + DL_param(L_i) ) + N Σ_i H_P̂(X_i | Υ_{L_i}).
ture. Given the hypothesisGh, we denoteby ~ the hypothesisthat the underlying distribution
satisfies the constraints of a set of local structures
.c == { Li : 1 ~ i ~ n} , where Li is a local structure for the CPD of Xi in G. Using Bayes' rule , it follows that
Pr(Gh, L~ I D) cxPr(D I L~,Gh) Pr(L~ I Gh) Pr(Gh). The specification of priors on local structures presents no additional compli cations other than the specification of priors for the structure of the network
Gh. Buntine (1991a, 1993), for example, suggestsseveral possible priors on decision trees . A natural prior over local structures is defined via the MDL
descriptionlength, by setting Pr(.c~ I Gh) cx : 2- Ei DLlocal -struct (Li). For the term Pr(D I L~, Gh), we makean assumptionof parameterindependence, similar to the one made by Heckerman et ale (1995a) and by Buntine (1991b): the parameter valuesfor eachpossiblevalue of the characteristic variable Y Li are independent of each other . Thus , each multinomial
LEARNINGBAYESIANNETWORKS WITH LOCALSTRUCTURE439 sample is independent of the others, and we can derive the analogue of Equation 6:
Pr(D I ~, Gh) = U II J II (}~ f~i'V) Pr(8 Xilv I Lih, Gh)d8 Xilv t vEVal (TLi) Xi (10) (This decompositionis analogousto the onedescribedin Proposition4.1.) As before, we assumethat the priors Pr(8 Xilv I ~, Gh) are Dirichlet, and thus we get a closed -form solutionfor Equation10,
Pr
( D
I
. c ~
, Gh
)
=
II
II i
Once
more
priors
,
of
global
is
to
that
is and
set
BP
Recall
First
,
instances
that
the
Pr
from
N
N
xilv
~ ilv +
)
Xilv
I
.
Our
objective
prior
~
II
N
( v
problem
( 8
) )
of
, Gh
)
for ,
r
( N
; ilv
+
r
( N
Xi
specifying
each
ag
a
the
N
( Xi
~ ilv
)
, V ) ) .
multitude
possible
in
distribution
of
combination
cage
tabular
represented
on
, Z
the
that
and to
prior
that
CPDs
by
tests
then Y
=
, Z
Y =
a
. by
example on
.
Our
z ,
be
variable ) .
value
grouped
first
on y
for
For
( Pai
We
a
, Y
and
the
,
specific
this
then
this
on
same
.
Z
the
,
the
trees
and
another that
prior
of
of
possible
the in
both
-
set
value
two
-
vari on
particular
requires the
par
assump
partition
only
consider
the
two
characteristic
depends
assumption assigned
by
of It
are
make
generated
structure
are .
Val
groupings
local
variable
one
random over
the
the
on parents
CPD
characteristic
structure and
that
depend the
the
local
priors
assume not
of
the
random
correspond
( l:::~ i
( l : : : xi
the
a
values
the
of
first
r
with
structures
by
characteristic same
)
.
we
does
( iLj
faced
priors
regarding
able
r
Val
specifying
imposed
tions
are
local
that
titions
tests
,
these
network
the
we
' lIE
for that leaves trees
.
Second, we assume that the vector of Dirichlet hyperparameters assigned to an element of the partition that corresponds to a union of several smaller partitions in another local structure is simply the sum of the vectors of Dirichlet hyperparameters assigned to these smaller partitions. Again, consider two trees, one that consists of a single leaf, and another that has one test at the root. This assumption requires that for each x_i ∈ Val(X_i), the Dirichlet hyperparameter N'_{x_i|v}, where v is the root in the first tree, is the sum of the N'_{x_i|v'} over all the leaves v' in the second tree. It is straightforward to show that if a prior distribution over structures, local structures, and parameters satisfies these assumptions and the assumptions of Heckerman et al. (1995a), then there must be a distribution P' and a positive real N' such that for any structure G and any choice of
local structure ℒ for G,

Pr(Θ_{X_i|v} | ℒ^h, G^h) ~ Dirichlet({N' · P'(x_i, Υ_{L_i} = v) : x_i ∈ Val(X_i)}).    (11)

This result allows us to represent the prior using a Bayesian network B' that specifies the distribution P' and a positive real N'. From these two, we can compute the Dirichlet hyperparameters we need for every prior distribution we use during learning. Finally, we note that Schwarz's (1978) result can be used to show that the MDL and BDe scores for learning local structure are asymptotically equivalent.
4.2. LEARNING PROCEDURES

Once we define the appropriate score, the learning task reduces to finding the network that maximizes the score, given the data. Unfortunately, this is an intractable problem. Chickering (1996) shows that finding the network (quantified with tabular CPDs) that maximizes the BDe score is NP-hard. Similar arguments also apply to learning with the MDL score. Moreover, there are indications that finding the optimal decision tree for a given family also is an NP-hard problem; see Quinlan and Rivest (1989). Thus, we suspect that finding a graph G and a set of local structures {L_1, ..., L_n} that jointly maximize the MDL or BDe score is also an intractable problem.

A standard approach to dealing with hard optimization problems is heuristic search. Many search strategies can be applied. For clarity, we focus here on one of the simplest, namely greedy hillclimbing. In this strategy, we initialize the search with some network (e.g., the empty network) and repeatedly apply to the "current" candidate the local change (e.g., adding and removing edges) that leads to the largest improvement in the score. This "upward" step is repeated until a local maximum is reached, that is, until no modification of the current candidate improves the score. Heckerman et al. (1995a) compare this greedy procedure with several more sophisticated search procedures. Their results indicate that greedy hillclimbing can be quite effective for learning Bayesian networks in practice. The greedy hillclimbing procedure for learning network structure can be summarized as follows.

procedure LearnNetwork(G_0)
  Let G_current ← G_0
  do
    Generate all successors S = {G_1, ..., G_n} of G_current
    ΔScore = max_{G ∈ S} Score(G) - Score(G_current)
    If ΔScore > 0 then
      Let G_current ← arg max_{G ∈ S} Score(G)
  while (ΔScore > 0)
  return G_current

The successors of G_current are generated by either adding an arc, removing an arc, or reversing the direction of an arc. (We consider only legal successors, i.e., modifications that do not introduce a cycle.) Since each of these operations modifies only one or two parent sets, and since the scores we use are decomposable, we do not need to recompute the score of the whole network for each successor: we need to reevaluate only the terms that correspond to the modified parent sets. Moreover, during the search we can cache the scores of previously considered parent sets, which makes this greedy loop particularly efficient; see the discussion by Buntine (1991b) and Bouckaert (1994) of these and related approximations.

When learning networks with local structured representations of the CPDs, we fix the choice of representation (i.e., default tables or decision trees), and the evaluation of each successor invokes a procedure that learns a local structure L_i for each modified CPD. That is, each arc operation requires us to find the best local structure for the CPD of the variable whose parent set was modified, and the score of the successor is computed using that local structure. Both learning procedures described below rely on the decomposability properties of the scoring functions. More precisely, both the MDL score and the logarithm of Pr(D | G^h, ℒ^h) in the BDe score can be written as a sum of terms, one for each variable X_i; and the term for X_i further decomposes into a sum over the values v ∈ Val(Υ_{L_i}) of the characteristic random variable, where each of these terms depends only on the counts N(x_i, v). This implies that when we modify a local structure by refining one of its partitions (that is, replacing a partition by several subpartitions whose union corresponds to the partition we replace), we need to recompute only the terms that correspond to the new subpartitions. Both procedures use this underlying decomposition.

The procedure for learning default tables uses a greedy strategy. It starts with the trivial default table, in which all the values of Pa_i are mapped to the default row, and iteratively refines it. At each iteration, the procedure considers each value v ∈ Val(Pa_i) that is currently in the default row, and evaluates the refinement in which v is explicitly represented as a single row; it then applies the refinement that leads to the biggest improvement in the score. Since the score decomposes, the additional computations involve only the terms that correspond to the new explicit row and
the new default row. This greedy expansion is repeated until no improvement in the score can be gained by adding another row. The procedure is summarized as follows.

procedure LearnDefault()
  Let Rows(V) ← ∅
  do
    Let r = arg max_{r ∈ Val(Pa_i) - Rows(V)} Score(Rows(V) ∪ {r})
    if Score(Rows(V) ∪ {r}) < Score(Rows(V)) then
      return Rows(V)
    Rows(V) ← Rows(V) ∪ {r}
  end
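A runnable rendering of this greedy loop (our sketch, not the authors' code): the caller supplies a score function to maximize, for instance the negation of the default-table MDL term above; a toy score stands in here.

```python
def learn_default(parent_vals, score):
    """Greedy construction of Rows(V); `score(rows)` returns a value to maximize."""
    rows = set()
    while True:
        candidates = [r for r in parent_vals if r not in rows]
        if not candidates:
            return rows
        best = max(candidates, key=lambda r: score(rows | {r}))
        if score(rows | {best}) < score(rows):
            return rows            # no refinement improves the score
        rows = rows | {best}

# Toy score: reward separating parent value (1, 1), penalize table size.
toy_score = lambda rows: (2.0 if (1, 1) in rows else 0.0) - 0.5 * len(rows)
print(learn_default({(0, 0), (0, 1), (1, 0), (1, 1)}, toy_score))  # {(1, 1)}
```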
For inducing decision trees, we adopt the approach outlined by Quinlan and Rivest (1989). The common wisdom in the decision-tree learning literature (e.g., Quinlan (1993)) is that greedy search for decision trees tends to become stuck at bad local minima. The approach of Quinlan and Rivest attempts to circumvent this problem using a two-phased approach. In the first phase we "grow" the tree in a top-down fashion. We start with the trivial tree consisting of one leaf, and add branches to it in a greedy fashion, until a maximal tree is learned. Note that in some stages of this growing phase, adding branches can lower the score; the rationale is that if we continue to grow these branches, we might improve the score. In the second phase, we remove harmful branches by "trimming" the tree in a bottom-up fashion.

We now describe the two phases in more detail. In the first phase we grow a tree in a top-down fashion. We repeatedly replace a leaf with a subtree that has as its root some parent of X, say Y, and whose children are leaves, one for each value of Y. In order to decide on which parent Y we should split the tree, we compute the score of the tree associated with each parent, and select the parent that induces the best scoring tree. Since the scores we use are decomposable, we can compute the split in a local fashion, by evaluating it on the training instances that are compatible with the path from the root of the tree to the node that is being split. This recursive growing of the tree stops when the node has no training instances associated with it, when the value of X is constant in the associated training set, or when all the parents of X have been tested along the path leading to that node.

In the second phase, we trim the tree in a bottom-up manner. At each node we consider whether the score obtained by replacing the subtree rooted at that node with a leaf is at least as good as the score of the subtree itself; if so, the subtree is trimmed and replaced with a leaf.

These two phases can be implemented by a simple recursive procedure, LearnTree, that receives a set of instances and returns the "best" tree for
this set of instances.

procedure SimpleTree(Y)
  For y ∈ Val(Y), let l_y ← ℓ  (i.e., a leaf)
  return (Y, {l_y : y ∈ Val(Y)})
end

procedure LearnTree(D)
  if D = ∅ or X_i is homogeneous in D then return ℓ
  // Growing phase
  Let Y_split = arg max_{Y ∈ Pa_i} Score(SimpleTree(Y) | D)
  for y ∈ Val(Y_split)
    Let D_y = {u_i ∈ D : Y_split = y in u_i}
    Let T_y = ExpandTree(ℓ, D_y)
  let T = (Y_split, {T_y : y ∈ Val(Y_split)})
  // Trimming phase
  if Score(ℓ | D) > Score(T | D) then return ℓ
  else return T
end
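A compact Python rendering of the grow-then-trim idea under our own simplifying assumptions (a leaf is scored by its log-likelihood minus a one-parameter penalty, and the target is binary); this is an illustration of the two phases, not the authors' implementation.

```python
import math

def leaf_score(targets):
    """Score of one leaf over 0/1 targets (higher is better)."""
    if not targets:
        return 0.0
    n1 = sum(targets)
    ll = sum(c * math.log2(c / len(targets)) for c in (n1, len(targets) - n1) if c)
    return ll - 0.5 * math.log2(len(targets))   # log-likelihood minus parameter cost

def learn_tree(rows, targets, attrs):
    """Grow a tree greedily, then trim bottom-up; returns (tree, score)."""
    as_leaf = (("leaf", targets), leaf_score(targets))
    if not rows or len(set(targets)) <= 1 or not attrs:
        return as_leaf
    # Growing phase: choose the attribute whose one-level split scores best.
    def one_level(a):
        return sum(leaf_score([t for r, t in zip(rows, targets) if r[a] == v])
                   for v in {r[a] for r in rows})
    best = max(attrs, key=one_level)
    children, subtree_score = {}, -1.0   # -1 bit pays for storing the test
    for v in {r[best] for r in rows}:
        sub_rows = [r for r in rows if r[best] == v]
        sub_targets = [t for r, t in zip(rows, targets) if r[best] == v]
        child, s = learn_tree(sub_rows, sub_targets, [a for a in attrs if a != best])
        children[v], subtree_score = child, subtree_score + s
    # Trimming phase: keep the subtree only if it beats a single leaf.
    return as_leaf if as_leaf[1] >= subtree_score else (("split", best, children), subtree_score)

rows = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1) for _ in range(5)]
targets = [1 if r["A"] == 1 and r["B"] == 1 else 0 for r in rows]
print(learn_tree(rows, targets, ["A", "B"])[0])
```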
5. Experimental Results

The main purpose of our experiments is to confirm and quantify the hypothesis stated in the introduction: a learning procedure that learns local structures for the CPDs will induce more accurate models for two reasons: 1) fewer parameters will lead to a more reliable estimation, and 2) a flexible penalty for larger families will result in network structures that are better approximations to the real (in)dependencies in the underlying distribution.

The experiments compared networks induced with table-based, tree-based, and default-based procedures, where an X-based procedure learns networks with X as the representation of CPDs. We ran experiments using both the MDL score and the BDe score. When using the BDe score, we also needed to provide a prior distribution and an equivalent sample size. In all of our experiments, we used a uniform prior distribution, and examined several settings of the equivalent sample size N'. All learning procedures were based on the same search method discussed in Section 4.2. We ran experiments with several variations, including different settings of the BDe prior equivalent size and different initialization points for the search procedures. These experiments involved learning approximately 15,000 networks. The results are summarized below.
TABLE 1. Description of the networks used in the experiments.

Alarm (n = 37, ||U|| = 253.95, |Θ| = 509): A network by medical experts for monitoring patients in intensive care (Beinlich et al., 1989).
Hailfinder (n = 56, ||U|| = 2106.56, |Θ| = 2656): A network for modeling summer hail in northeastern Colorado (http://www.lis.pitt.edu/~dsl/hailfinder).
Insurance (n = 27, ||U|| = 244.57, |Θ| = 1008): A network for classifying insurance applications (Russell et al., 1995).
al ., 1995 ) .
5 .1.
METHODOLOGY
The
data
sets
networks these
, 16000
them
have
. The
the
results of
By the
virtue
the
. In
same
having
and
local
the
only
data model
could
in
model
on
the
Bayesian each
of
the
learning
data
sets , and
increase
10
the
did
not
accuracy
( independently
, the
,
procedures
of
sampled
methods
we
)
compared
. each
precisely
original
the
to
experiments
three 1 . From
250 , 500 , 1000 , 2000 , 4000 ran
order with
from
in Table
-
and
experiment of the
a golden
structures
-
. In
training
, we
ofsizes
received
network
all
experiment
quantify . We
the
were
parameter
also
, represented error able
estimation
to
and
by
between
the
quantify
the
the
structure
.
MEASURES
the
procedures
networks
of the
selection
As
of
sets
sampled
described
instances
generating
the
were are
training 32000
data
models
effect
experiments
repeated
as input
induced
5 .2 .
the
, we
original
, and
learning
to
training
received
the
characteristics
we sampled , 24000
access
sets
in
main
networks
8000 on
used
whose
main
as Kullbak tribution bu tion
P
OF
ERROR
measurement
- Leibler to
the
to
an
of error
divergence induced
we use
and
relative
distribution
approximation
the
entropy
. The
Q
entropy
distance
) from
entropy
is defined
the
( also
known
generating
distance
from
dis -
a distri
-
as P (x )
D ( PIIQ
This the tropy
quantity distri
is
bu tion
distance
is
D ( Q IIP ) . Another that
D ( PIIQ
a measure is Q not
) =
of
when
the
the
symmetric
important
) ~ 0 , where
L :x : P ( x ) log inefficiency
real
distri
property holds
incurred
bu tian
, i .e . , D ( PIIQ
equality
Q( ; ) .
of
the
if and
) is
is P . not
entropy only
by
assuming
N ate equal
in
distance if P =
that
Q .
that the
en -
general
to
function
is
There are several axiomatic justifications for using the entropy distance as a measure of the error of an approximating distribution (e.g., Shore and Johnson, 1980; Cover and Thomas, 1991). On the more motivating side, several examples show that the entropy distance measures both the expected additional compression loss (in bits) and the expected gambling (monetary) losses incurred by using the approximation Q instead of the true distribution P. We refer the interested reader to Cover and Thomas (1991) for a detailed discussion of these properties.

Measuring the entropy distance from the generating distribution P* to the distributions of the induced networks allows us to compare the error of the different learning procedures. In addition, we define error measures that separate the error incurred by the selection of the network structure from the error incurred by the estimation of the parameters. Let P* be the target distribution and let G be a network structure. We define the inherent error of G with respect to P* as

Dstruct(P* || G) = min_Θ D(P* || (G, Θ)).

Intuitively, the inherent error is the error of the "best" possible choice of parameters (CPDs) for G: no network with structure G can have a smaller entropy distance to P*. It measures the error that we cannot hope to avoid, even with perfect parameter estimation, as long as we use the structure G. As it turns out, this error can be evaluated by means of a closed-form equation.

Proposition 5.1: Dstruct(P* || G) = D(P* || (G, Θ*)), where Θ* is such that P_{Θ*}(Xi | Pai) = P*(Xi | Pai) for each i.

An alternative way of thinking about this error measure is as an attempt to quantify the degree to which the conditional independence assumptions encoded in the network structure G are violated in P*. To do so, we need a way to measure the strength of the dependencies between sets of variables. One reasonable measure is the conditional mutual information. Let X, Y, and Z be sets of variables, and define

I_{P*}(X; Y | Z) = H_{P*}(X | Z) − H_{P*}(X | Y, Z),

which measures how much information Y provides about X when we already know Z. It is well known that I_{P*}(X; Y | Z) ≥ 0, where equality holds if and only if X is conditionally independent of Y given Z (Cover and Thomas, 1991).
Using the mutual information as a quantitative measure of the strength of dependencies, we can measure the extent to which the independence assumptions represented in G are violated in the real distribution. This suggests that we evaluate this measure for all conditional independencies represented by G. However, many of these independence assumptions "overlap" in the sense that they imply each other. Thus, we need to find a minimal set of independencies that implies all the other independencies represented by G. Pearl (1988) shows how to construct such a minimal set of independencies. Assume that the variable ordering X1, ..., Xn is consistent with the arc direction in G (i.e., if Xi is a parent of Xj, then i < j). If, for every i, Xi is independent of {X1, ..., X_{i-1}} − Pai given Pai, then using the chain rule we find that P can be factored as in Equation 1. As a consequence, we find that this set of independence assumptions implies all the independence assumptions that are represented by G. Starting with different consistent orderings, we get different minimal sets of assumptions. However, the next proposition shows that evaluating the error of the model with respect to any of these sets leads to the same answer.

Proposition 5.2: Let G be a network structure, X1, ..., Xn be a variable ordering consistent with the arc direction in G, and P* be a distribution. Then
Dstruct(P* || G) = Σ_i I_{P*}(Xi; {X1, ..., X_{i-1}} − Pai | Pai).

This proposition shows that Dstruct(P* || G) = 0 if and only if G is an I-map of P*; that is, all the independence statements encoded in G are also true of P*. Small values of Dstruct(P* || G) indicate that while G is not an I-map of P*, the dependencies not captured by G are "weak." We note that Dstruct(P* || G) is a one-sided error measure, in the sense that it penalizes structures for representing wrong independence statements, but does not penalize structures for representing redundant dependence statements. In particular, complete network structures (i.e., ones to which we cannot add edges without introducing cycles) have no inherent error, since they do not represent any conditional independencies.

We can postulate now that the difference between the overall error (as measured by the entropy distance) and the inherent error is due to errors introduced in the estimation of the CPDs. Note that when we learn a local structure, some of this additional error may be due to the induction of an inappropriate local structure, such as a local structure that makes assumptions of context-specific independencies that do not hold in the target distribution. As with the global structure, we can measure the inherent error in the local structure learned. Let G be a network structure, and let S_L1, ..., S_Ln be structures for the CPDs of G. The inherent local error of
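The terms summed in Proposition 5.2 are conditional mutual informations, which can be computed directly from a joint probability table. The sketch below is our own illustration (not the authors' code); it uses the identity I_P(X; Y | Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z), with variable sets given as tuples of axis indices of the table.

import numpy as np

def _marginal_entropy(joint, axes_keep):
    # Entropy (in bits) of the marginal over the variables indexed by axes_keep.
    if not axes_keep:
        return 0.0
    axes_drop = tuple(a for a in range(joint.ndim) if a not in axes_keep)
    m = joint.sum(axis=axes_drop) if axes_drop else joint
    m = m[m > 0]
    return -float(np.sum(m * np.log2(m)))

def cond_mutual_information(joint, X, Y, Z=()):
    # I_P(X; Y | Z) >= 0, with equality iff X is independent of Y given Z.
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    return (_marginal_entropy(joint, X + Z) + _marginal_entropy(joint, Y + Z)
            - _marginal_entropy(joint, X + Y + Z) - _marginal_entropy(joint, Z))

Summing one such term per variable Xi, with Y the predecessors of Xi that are not in Pai and Z = Pai, gives Dstruct(P* || G) as in the proposition.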
G and S_L1, ..., S_Ln is

Dlocal(P* || G, {S_L1, ..., S_Ln}) = min_L D(P* || (G, L)),

where the minimization is over parameterizations L that are consistent with the local structures S_L1, ..., S_Ln. From the above, we get the following expected generalization of Proposition 5.1.

Proposition 5.3: Let G be a network structure, let S_L1, ..., S_Ln be structures for the CPDs of G, and let P* be a distribution. Then Dlocal(P* || G, {S_L1, ..., S_Ln}) = D(P* || (G, L*)), where L* is such that P(Xi | Y_Li) = P*(Xi | Y_Li) for all i.

From the definitions of inherent error above it follows that for any network B = (G, Θ),

D(P* || P_B) ≥ Dlocal(P* || G, {S_L1, ..., S_Ln}) ≥ Dstruct(P* || G).

Using these measures in the evaluation of our experiments, we can measure the "quality" of the global independence assumptions made by a network structure (Dstruct), the quality of the local and global independence assumptions made by a network structure and a local structure (Dlocal), and the total error, which also includes the quality of the parameters.

5.3. RESULTS

We want to characterize the error in the induced models as a function of the number of samples used by the different learning algorithms for the induction. Thus, we plot learning curves where the x-axis displays the number of training instances N, and the y-axis displays the error of the learned model. In general, these curves exhibit exponential decrease in the error. This makes visual comparisons between different learning procedures hard, since the differences in the large-sample range (N ≥ 8000) are obscured, and, when a logarithmic scale is used for the x-axis, the differences at the small-sample range are hard to visualize. See for example Figure 3(a) and (b).

To address this problem, we propose a normalization of these curves, motivated by the theoretical results of Friedman and Yakhini (1996). They show that learning curves for Bayesian networks generally behave as a linear function of log N / N. Thus, we plot the error scaled by N / log N. Figure 5(a) shows the result of applying this normalization to the curves in Figure 3. Observe that the resulting curves are roughly constant. The thin dotted diagonal lines in Figure 5(a) correspond to the lines of constant error in Figure 3. We plot these lines for entropy distances of 1/2^i for i = 0, ..., 6.
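As a small numeric illustration of this normalization (a sketch, not from the paper; the values are the table-based MDL entropy distances for the Alarm domain from Table 2, and the natural logarithm is our own choice of base):

import numpy as np

N = np.array([250, 500, 1000, 2000, 4000, 8000, 16000, 24000, 32000])
err = np.array([5.7347, 3.5690, 1.9787, 1.0466, 0.6044,
                0.3328, 0.1787, 0.1160, 0.0762])

# If err behaves like c * log(N) / N, the scaled curve is roughly the constant c.
print(np.round(err * N / np.log(N), 1))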
Figure 3. (a) Error curves showing the entropy distance achieved by procedures using the MDL score in the Alarm domain. The x-axis displays the number of training instances N, and the y-axis displays the entropy distance of the induced network. (b) The same curves with a logarithmic y-axis. Each point is the average of learning from 10 independent data sets.
Figures 4 and 5 display the entropy distance of the networks learned via the BDe and MDL scores (Table 2 summarizes these values). In all the experiments, the learning curves appear to converge to the target distribution: eventually they would intersect the dotted line of ε entropy distance for any ε > 0. Moreover, all of them appear to (roughly) conform to the behavior specified by the results of Friedman and Yakhini.

With respect to the entropy distance to the target distribution, tree-based procedures performed better in all our experiments than table-based procedures. With few exceptions, the default-based procedures also performed better than the table-based methods in the Alarm and Insurance domains. The default-based procedures performed poorly in the Hailfinder domain.
Figure 4. Normalized error curves showing the entropy distance achieved by procedures using the BDe score (with N' = 1) in the (a) Alarm domain, (b) Hailfinder domain, and (c) Insurance domain. The x-axis displays the number of training instances N, and the y-axis displays the normalized entropy distance of the induced network (see Section 5.2).
Figure 5. Normalized error curves showing the entropy distance achieved by procedures using the MDL score in the (a) Alarm domain, (b) Hailfinder domain, and (c) Insurance domain.
As a general rule, we see a constant gap in the curves corresponding to different representations. Thus, for a fixed N, the error of the procedure representing local structure is a constant fraction of the error of the corresponding procedure that does not represent local structure (i.e., one that learns tabular CPDs). For example, in Figure 4(a) we see that in the large-sample region (e.g., N ≥ 8000), the errors of the procedures that use trees and default tables are approximately 70% and 85% (respectively) of the error of the table-based procedure. In Figure 5(c) the corresponding ratios are 50% and 70%.
Another way of interpreting these results is obtained by looking at the number of instances needed to reach a particular error rate . For example ,
in Figure 4(a), the tree-based procedure reaches the error level of 1/32 with approximately 23,000 instances. On the other hand, the table-based procedure barely reaches that error level with 32,000 instances. Thus, if we want to ensure this level of performance, we would need to supply the table-based procedure with 9,000 additional instances. This number of instances might be unavailable in practice.

We continued our investigation by examining the network structures learned by the different procedures. We evaluated the inherent error, Dstruct, of the structures learned by the different procedures. In all of our experiments, the inherent error of the network structures learned via tree-based and default-based procedures is smaller than the inherent error of the networks learned by the corresponding table-based procedure; for example, examine the Dstruct column in Tables 3 and 4. From these results, we conclude that the network structures learned by procedures using local representations make fewer mistaken assumptions of global independence, as predicted by our main hypothesis.

Our hypothesis also predicts that procedures that learn local representations are able to assess fewer parameters by making local assumptions of independence in the CPDs. To illustrate this, we measured the inherent local error, Dlocal, and the number of parameters needed to quantify these networks. As we can see in Tables 3 and 4, the networks learned by these procedures exhibit smaller inherent error, Dstruct; they also require fewer parameters, and their inherent local error, Dlocal, is roughly the same as that of the networks learned by the table-based procedures. Hence, instead of making global assumptions of independence, the local representation procedures make local assumptions of independence that better capture the regularities in the target distribution and require fewer parameters. As a consequence, the parameter estimation for these procedures is more accurate.
Finally, we investigated how our conclusions depend on the particular choices we made in the experiments. As we will see, the use of local structure leads to improvements regardless of these choices. We examined two aspects of the learning process: the choice of the parameters for the priors and the choice of the search procedure.
We start by looking at the effect of changing the equivalent sample size N'. Heckerman et al. (1995a) show that the choice of N' can have drastic effects on the quality of the learned networks. On the basis of their experiments in the Alarm domain, Heckerman et al. report that N' = 5 achieves the best results. Table 5 shows the effect of changing N' from 1 to 5 in our experiments. We see that the choice of N' influences the magnitude of the errors in the learned networks and the sizes of the error gaps between the different methods. Yet these influences do not suggest any changes in the benefits of local structures.
Unlike the BDe score, the MDL score does not involve an explicit choice of priors. Nonetheless, we can use Bayesian averaging to select the parameters for the structures that have been learned by the MDL score, as opposed to using maximum likelihood estimates. In Table 6 we compare the error between the maximum likelihood estimates and Bayesian averaging with N' = 1. As expected, averaging leads to smaller errors in the parameter estimation, especially for small sample sizes. However, with the exception of the Alarm domain, Bayesian averaging does not improve the score for large samples (e.g., N = 32,000). We conclude that even though changing the parameter estimation technique may improve the score in some instances, it does not change our basic conclusions.
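A minimal sketch of the two estimation methods being compared (our own illustration; the way N' is split across the states of one parent configuration is a simplification of the BDe prior, not the authors' exact choice):

import numpy as np

def ml_estimate(counts):
    # Maximum likelihood estimate of one CPD column from the counts N_ijk.
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def bayesian_estimate(counts, equiv_sample_size=1.0):
    # Posterior-mean estimate under a Dirichlet prior whose hyperparameters
    # sum to the equivalent sample size N' (split evenly across the states).
    counts = np.asarray(counts, dtype=float)
    alpha = equiv_sample_size / counts.size
    return (counts + alpha) / (counts.sum() + equiv_sample_size)

counts = [3, 0, 1]                      # a small-sample parent configuration
print(ml_estimate(counts))              # [0.75 0.   0.25] -- hard zeros
print(bayesian_estimate(counts, 1.0))   # smoothed toward the uniform distribution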
Finally, another aspect of the learning process that needs further investigation is the heuristic search procedure. A better search technique can lead to better induced models, as illustrated in the experiments of Heckerman et al. (1995a). In our experiments we modified the search by initializing the greedy search procedure with a more informed starting point. Following Heckerman et al. (1995a), we used a maximal branching as the starting state for the search. A maximal branching network is a highest-scoring network among those in which |Pai| ≤ 1 for all i; a maximal branching can be found efficiently (e.g., in low-order polynomial time) (Heckerman et al., 1995a). Table 7 reports the results of this experiment. In the Alarm domain, the use of the maximal branching as an initial point led to improvements in all the learning procedures. On the other hand, in the Insurance domain, this choice of starting point led to a worse error. Still, we observe that the conclusions described above regarding the use of local structure held for these runs as well.
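The sketch below shows one way such a starting point can be assembled from family scores. It is a hedged illustration: score_family is a stand-in for the MDL or BDe family score, and it assumes networkx's Edmonds-based maximum_branching; this is not the authors' implementation.

import networkx as nx

def maximal_branching_start(variables, score_family):
    # score_family(child, parents) returns the score of the family {child, parents};
    # each candidate edge is weighted by the gain of adding that single parent.
    g = nx.DiGraph()
    g.add_nodes_from(variables)
    for child in variables:
        base = score_family(child, ())
        for parent in variables:
            if parent == child:
                continue
            gain = score_family(child, (parent,)) - base
            if gain > 0:                      # keep only beneficial single parents
                g.add_edge(parent, child, weight=gain)
    # Edmonds' algorithm: a maximum-weight acyclic edge set in which every node
    # keeps at most one incoming edge (a branching).
    return nx.algorithms.tree.branchings.maximum_branching(g)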
6. Conclusion

The main contribution of this article is the introduction of structured representations of the CPDs in the learning process, the identification of the
benefits of using these representations, and the empirical validation of our hypothesis. As we mentioned in the introduction (Section 1), we are not the first to consider efficient representations for the CPDs in the context of learning. However, to the best of our knowledge, we are the first to consider and demonstrate the effects that these representations may have on the learning of the global structure of the network.

In this paper we have focused on the investigation of two fairly simple structured representations of CPDs: trees and default tables. There are certainly many other possible representations of CPDs, based, for example, on decision graphs, rules, and CNF formulas; see Boutilier et al. (1996). Our choice was mainly due to the availability of efficient computational tools for learning the representations we use. The refinement of the methods studied in this paper to incorporate these representations deserves further attention. In the machine learning literature, there are various approaches to learning trees, all of which can easily be incorporated in the learning procedures for Bayesian networks. In addition, certain interactions among the search procedures for global and local structures can be exploited to reduce the computational cost of the learning process. We leave these issues for future research.
It is important to distinguish between the local representations we examine in this paper and the noisy-or and logistic regression models that have been examined in the literature. Both noisy-or and logistic regression (as applied in the Bayesian network literature) attempt to estimate the CPD with a fixed number of parameters; this number is usually linear in the number of parents in the CPD. In cases where the target distribution does not satisfy the assumptions embodied by these models, the estimates of CPDs produced by these methods can diverge arbitrarily from the target distribution. On the other hand, our local representations involve learning the structure of the CPD, which can range from a lean structure with few parameters to a complex structure with an exponential number of parameters. Thus, our representations can scale up to accommodate the complexity of the training data. This ensures that, in theory, they are asymptotically correct: given enough samples, they will construct a close approximation of the target distribution.
In conclusion, we have shown that the induction of local structured representations for CPDs significantly improves the performance of procedures for learning Bayesian networks. In essence, this improvement is due to the fact that we have changed the bias of the learning procedure to reflect the nature of the distribution in the data more accurately.
TABLE 2. Summary of entropy distance for networks learned by the procedures using the MDL score and the BDe score with N' = 1.

                                MDL Score                       BDe Score
Domain       Size       Table    Tree     Default      Table    Tree     Default
            (x 1,000)
Alarm         0.25      5.7347   5.5148   5.1832       1.6215   1.6692   1.7898
              0.50      3.5690   3.2925   2.8215       0.9701   1.0077   1.0244
              1.00      1.9787   1.6333   1.2542       0.4941   0.4922   0.5320
              2.00      1.0466   0.8621   0.6782       0.2957   0.2679   0.3040
              4.00      0.6044   0.4777   0.3921       0.1710   0.1697   0.1766
              8.00      0.3328   0.2054   0.2034       0.0960   0.0947   0.1118
             16.00      0.1787   0.1199   0.1117       0.0601   0.0425   0.0512
             24.00      0.1160   0.0599   0.0720       0.0411   0.0288   0.0349
             32.00      0.0762   0.0430   0.0630       0.0323   0.0206   0.0268
Hailfinder    0.25      9.5852   9.5513   8.7451       6.6357   6.8950   6.1947
              0.50      4.9078   4.8749   4.7475       3.6197   3.7072   3.4746
              1.00      2.3200   2.3599   2.3754       1.8462   1.8222   1.9538
              2.00      1.3032   1.2702   1.2617       1.1631   1.1198   1.1230
              4.00      0.6784   0.6306   0.6671       0.5483   0.5841   0.6181
              8.00      0.3312   0.2912   0.3614       0.3329   0.3117   0.3855
             16.00      0.1666   0.1662   0.2009       0.1684   0.1615   0.1904
             24.00      0.1441   0.1362   0.1419       0.1470   0.1279   0.1517
             32.00      0.1111   0.1042   0.1152       0.1081   0.0989   0.1223
Insurance     0.25      4.3750   4.1940   4.0745       2.0324   1.9117   2.1436
              0.50      2.7909   2.5933   2.3581       1.1798   1.0784   1.1734
              1.00      1.6841   1.1725   1.1196       0.6453   0.5799   0.6335
              2.00      1.0343   0.5344   0.6635       0.4300   0.3316   0.3942
              4.00      0.5058   0.2706   0.3339       0.2432   0.1652   0.2153
              8.00      0.3156   0.1463   0.2037       0.1720   0.1113   0.1598
             16.00      0.1341   0.0704   0.1025       0.0671   0.0480   0.0774
             24.00      0.1087   0.0506   0.0780       0.0567   0.0323   0.0458
             32.00      0.0644   0.0431   0.0570       0.0479   0.0311   0.0430
TABLE 3. Summary of inherent error, inherent local error, and number of parameters for the networks learned by the table-based and the tree-based procedures using the BDe score with N' = 1.

                              Table                                Tree
Domain       Size      D        Dstruct/Dlocal   Param      D        Dlocal    Dstruct   Param
            (x 1,000)
Alarm           1      0.4941   0.1319            570       0.4922   0.1736    0.0862     383
                4      0.1710   0.0404            653       0.1697   0.0570    0.0282     453
               16      0.0601   0.0237            702       0.0425   0.0154    0.0049     496
               32      0.0323   0.0095           1026       0.0206   0.0070    0.0024     497
Hailfinder      1      1.8462   1.2166           2066       1.8222   1.1851    1.0429    1032
                4      0.5483   0.3434           2350       0.5841   0.3937    0.2632    1309
               16      0.1684   0.1121           2785       0.1615   0.1081    0.0758    1599
               32      0.1081   0.0770           2904       0.0989   0.0701    0.0404    1715
Insurance       1      0.6453   0.3977            487       0.5799   0.3501    0.2752     375
                4      0.2432   0.1498            724       0.1652   0.0961    0.0654     461
               16      0.0671   0.0377            938       0.0480   0.0287    0.0146     525
               32      0.0479   0.0323            968       0.0311   0.0200    0.0085     576
TABLE 4. Summary of inherent error, inherent local error, and number of parameters for the networks learned by the table-based and tree-based procedures using the MDL score.

                              Table                                Tree
Domain       Size      D        Dstruct/Dlocal   Param      D        Dlocal    Dstruct   Param
            (x 1,000)
Alarm           1      1.9787   0.5923            361       1.6333   0.4766    0.3260     289
                4      0.6044   0.2188            457       0.4777   0.1436    0.0574     382
               16      0.1787   0.0767            639       0.1199   0.0471    0.0189     457
               32      0.0762   0.0248            722       0.0430   0.0135    0.0053     461
Hailfinder      1      2.3200   1.0647           1092       2.3599   1.1343    0.9356    1045
                4      0.6784   0.4026           1363       0.6306   0.3663    0.2165    1322
               16      0.1666   0.1043           1718       0.1662   0.1107    0.0621    1583
               32      0.1111   0.0743           1864       0.1042   0.0722    0.0446    1739
Insurance       1      1.6841   1.0798            335       1.1725   0.5642    0.4219     329
                4      0.5058   0.3360            518       0.2706   0.1169    0.0740     425
               16      0.1341   0.0794            723       0.0704   0.0353    0.0187     497
               32      0.0644   0.0355            833       0.0431   0.0266    0.0140     544
TABLE 5. Summary of entropy distance for procedures that use the BDe score with N' = 1 and N' = 5.

                              N' = 1                         N' = 5
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      0.4941   0.4922   0.5320      0.3721   0.3501   0.3463
                4      0.1710   0.1697   0.1766      0.1433   0.1187   0.1308
               16      0.0601   0.0425   0.0512      0.0414   0.0352   0.0435
               32      0.0323   0.0206   0.0268      0.0254   0.0175   0.0238
Hailfinder      1      1.8462   1.8222   1.9538      1.4981   1.5518   1.6004
                4      0.5483   0.5841   0.6181      0.4574   0.4859   0.5255
               16      0.1684   0.1615   0.1904      0.1536   0.1530   0.1601
               32      0.1081   0.0989   0.1223      0.0996   0.0891   0.0999
Insurance       1      0.6453   0.5799   0.6335      0.5568   0.5187   0.5447
                4      0.2432   0.1652   0.2153      0.1793   0.1323   0.1921
               16      0.0671   0.0480   0.0774      0.0734   0.0515   0.0629
               32      0.0479   0.0311   0.0430      0.0365   0.0284   0.0398
TABLE 6. Summary of entropy distance for procedures that use the MDL score for learning the structure and local structure, combined with two methods for parameter estimation.

                              Maximum Likelihood             Bayesian, N' = 1
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      1.9787   1.6333   1.2542      0.8848   0.7495   0.6015
                4      0.6044   0.4777   0.3921      0.3251   0.2319   0.2229
               16      0.1787   0.1199   0.1117      0.1027   0.0730   0.0779
               32      0.0762   0.0430   0.0630      0.0458   0.0267   0.0475
Hailfinder      1      2.3200   2.3599   2.3754      1.7261   1.7683   1.8047
                4      0.6784   0.6306   0.6671      0.5982   0.5528   0.6091
               16      0.1666   0.1662   0.2009      0.1668   0.1586   0.1861
               32      0.1111   0.1042   0.1152      0.1133   0.0964   0.1120
Insurance       1      1.6841   1.1725   1.1196      1.1862   0.7539   0.8082
                4      0.5058   0.2706   0.3339      0.3757   0.1910   0.2560
               16      0.1341   0.0704   0.1025      0.1116   0.0539   0.0814
               32      0.0644   0.0431   0.0570      0.0548   0.0368   0.0572
TABLE 7. Summary of entropy distance for two methods for initializing the search, using the BDe score with N' = 1.

                              Empty Network                  Maximal Branching Network
Domain       Size      Table    Tree     Default     Table    Tree     Default
            (x 1,000)
Alarm           1      0.4941   0.4922   0.5320      0.4804   0.5170   0.4674
                4      0.1710   0.1697   0.1766      0.1453   0.1546   0.1454
               16      0.0601   0.0425   0.0512      0.0341   0.0350   0.0307
               32      0.0323   0.0206   0.0268      0.0235   0.0191   0.0183
Hailfinder      1      1.8462   1.8222   1.9538      1.7995   1.7914   1.9972
                4      0.5483   0.5841   0.6181      0.6220   0.6173   0.6633
               16      0.1684   0.1615   0.1904      0.1782   0.1883   0.1953
               32      0.1081   0.0989   0.1223      0.1102   0.1047   0.1162
Insurance       1      0.6453   0.5799   0.6335      0.6428   0.6350   0.6502
                4      0.2432   0.1652   0.2153      0.2586   0.2379   0.2242
               16      0.0671   0.0480   0.0774      0.1305   0.0914   0.1112
               32      0.0479   0.0311   0.0430      0.0979   0.0538   0.0856
Acknowledgments

The authors are grateful to an anonymous reviewer and to Wray Buntine and David Heckerman for their comments on previous versions of this paper and for useful discussions relating to this work.

Part of this research was done while both authors were at the Rockwell Science Center,4 Palo Alto Laboratory. Nir Friedman was also at Stanford University at the time. The support provided by Rockwell and Stanford University is gratefully acknowledged. In addition, Nir Friedman was supported in part by an IBM graduate fellowship and NSF Grant IRI-95-03109. A preliminary version of this article appeared in the Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996.

4 All products and company names mentioned in this article are the trademarks of their respective holders.

References

I. Beinlich, G. Suermondt, R. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. 2nd European Conf. on AI and Medicine. Springer-Verlag, Berlin, 1989.
R. R. Bouckaert. Properties of Bayesian network learning algorithms. In R. Lopez de Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence (UAI '94), pages 102-109. Morgan Kaufmann, San Francisco, CA, 1994.
C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In E. Horvitz and F. Jensen, editors, Proc. Twelfth Conference on Uncertainty in Artificial Intelligence (UAI '96), pages 115-123. Morgan Kaufmann, San Francisco, CA, 1996.
W. Buntine. A theory of learning classification rules. PhD thesis, University of Technology, Sydney, Australia, 1991.
W. Buntine. Theory refinement on Bayesian networks. In B. D. D'Ambrosio, P. Smets, and P. P. Bonissone, editors, Proc. Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI '91), pages 52-60. Morgan Kaufmann, San Francisco, CA, 1991.
W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence Frontiers in Statistics, number III in AI and Statistics. Chapman & Hall, London, 1993.
D. M. Chickering. Learning Bayesian networks is NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996.
G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
F. J. Diez. Parameter adjustment in Bayes networks: The generalized noisy or-gate. In D. Heckerman and A. Mamdani, editors, Proc. Ninth Conference on Uncertainty in Artificial Intelligence (UAI '93), pages 99-105. Morgan Kaufmann, San Francisco, CA, 1993.
N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In E. Horvitz and F. Jensen, editors, Proc. Twelfth Conference on Uncertainty in Artificial Intelligence (UAI '96). Morgan Kaufmann, San Francisco, CA, 1996.
D. Heckerman and J. S. Breese. A new look at causal independence. In R. Lopez de Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence (UAI '94), pages 286-292. Morgan Kaufmann, San Francisco, CA, 1994.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
D. Heckerman. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.
W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10:269-293, 1994.
R. Musick. Belief Network Induction. PhD thesis, University of California, Berkeley, CA, 1994.
R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA, 1988.
J. R. Quinlan and R. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248, 1989.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, River Edge, NJ, 1989.
S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In Proc. Fourteenth International Joint Conference on Artificial Intelligence (IJCAI '95), pages 1146-1152. Morgan Kaufmann, San Francisco, CA, 1995.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
J. E. Shore and R. W. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, IT-26(1):26-37, 1980.
D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605, 1990.
S. Srinivas. A generalization of the noisy-or model. In D. Heckerman and A. Mamdani, editors, Proc. Ninth Conference on Uncertainty in Artificial Intelligence (UAI '93), pages 208-215. Morgan Kaufmann, San Francisco, CA, 1993.
C. Wallace and J. Patrick. Coding decision trees. Machine Learning, 11:7-22, 1993.
ASYMPTOTIC MODEL SELECTION FOR DIRECTED NETWORKS WITH HIDDEN VARIABLES

DAN GEIGER
Computer Science Department
Technion, Haifa 32000, Israel
dang@cs.technion.ac.il

DAVID HECKERMAN
Microsoft Research, Bldg 98
Redmond WA, 98052-6399
heckerma@microsoft.com

AND

CHRISTOPHER MEEK
Carnegie-Mellon University
Department of Philosophy
meek@cmu.edu

Abstract. We extend the Bayesian Information Criterion (BIC), an asymptotic approximation for the marginal likelihood, to Bayesian networks with hidden variables. This approximation can be used to select models given large samples of data. The standard BIC as well as our extension punishes the complexity of a model according to the dimension of its parameters. We argue that the dimension of a Bayesian network with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. We compute the dimensions of several networks including the naive Bayes model with a hidden root node. This manuscript was previously published in The Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996, Morgan Kaufmann.
1. Introduction
Learning Bayesian networks from data extends their applicability to situations where data is easily obtained and expert knowledge is expensive. Consequently, it has been the subject of much research in recent years (e.g., Heckerman, 1995a; Buntine, 1996). Researchers have pursued two types of approaches for learning Bayesian networks: one that uses independence tests to direct a search among valid models, and another that uses a score to search for the best-scored network, a procedure known as model selection. Scores based on exact Bayesian computations have been developed by (e.g.) Cooper and Herskovits (1992), Spiegelhalter et al. (1993), Buntine (1994), and Heckerman et al. (1995), and scores based on minimum description length (MDL) have been developed in Lam and Bacchus (1993) and Suzuki (1993).

We consider a Bayesian approach to model selection. Suppose we have a set {X_1, ..., X_n} = X of discrete variables and a set {x_1, ..., x_N} = D of cases, where each case is an instance of some or of all the variables in X. Let (S, Θ_S) be a Bayesian network, where S is the network structure of the Bayesian network, a directed acyclic graph such that each node X_i of S is associated with a random variable X_i, and Θ_S is a set of parameters associated with the network structure. Let S^h be the hypothesis that precisely the independence assertions implied by S hold in the true or objective joint distribution of X. Then, a Bayesian measure of the goodness-of-fit of network structure S to D is p(S^h | D) ∝ p(S^h) p(D | S^h), where p(D | S^h) is known as the marginal likelihood of D given S^h.

The problem of model selection among Bayesian networks with hidden variables, that is, networks with variables whose values are not observed, is more difficult than model selection among networks without hidden variables. First, the space of possible networks becomes infinite, and second, scoring each network is computationally harder because one must account for all possible values of the missing variables (Cooper and Herskovits, 1992). Our goal is to develop a Bayesian scoring approach for networks that include hidden variables. Obtaining such a score that is computationally effective and conceptually simple will allow us to select a model from among a set of competing models.

Our approach is to use an asymptotic approximation of the marginal likelihood. This asymptotic approximation is known as the Bayesian Information Criterion (BIC) (Schwarz, 1978; Haughton, 1988), and is equivalent to Rissanen's (1987) minimum description length (MDL). Such an asymptotic approximation has been carried out for Bayesian networks by Herskovits (1991) and Bouckaert (1995) when no hidden variables are present. Bouckaert (1995) shows that the marginal likelihood of data D given a
network structure S is given by

log p(D | S^h) = H(S, D) N − (1/2) dim(S) log(N) + O(1),    (1)

where N is the sample size of the data, H(S, D) is the entropy of the probability distribution obtained by projecting the frequencies of observed cases into the conditional probability tables of the Bayesian network S, and dim(S) is the number of parameters in S. Eq. 1 reveals the qualitative preferences made by the Bayesian approach. First, with sufficient data, a network structure that is an I-map of the true distribution is more likely than a network structure that is not an I-map of the true distribution. Second, among all network structures that are I-maps of the true distribution, the one with the minimum number of parameters is more likely.

Eq. 1 was derived from an explicit formula for the probability of a network given data by letting the sample size N run to infinity and using a Dirichlet prior for its parameters. Nonetheless, Eq. 1 does not depend on the selected prior. In Section 3, we use Laplace's method to rederive Eq. 1 without assuming a Dirichlet prior. Our derivation is a standard application of asymptotic Bayesian analysis. This derivation is useful for gaining intuition for the hidden-variable case. In Section 4, we provide an approximation to the marginal likelihood for Bayesian networks with hidden variables, and give a heuristic argument for this approximation using Laplace's method. We obtain the following equation:

log p(D | S^h) ≈ log p(D | θ̂_S, S^h) − (1/2) dim(S, θ̂_S) log(N),    (2)

where θ̂_S is the maximum likelihood (ML) value for the parameters of the network and dim(S, θ̂_S) is the dimension of S at the ML value for Θ_S. The dimension of a model can be interpreted in two equivalent ways. First, it is the number of free parameters needed to represent the parameter space near the maximum likelihood value. Second, it is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable (non-hidden) variables. In any case, the dimension depends on the value of θ̂_S, in contrast to Eq. 1, where the dimension is fixed throughout the parameter space.

In Section 5, we compute the dimensions of several network structures, including the naive Bayes model with a hidden class node. In Section 6, we demonstrate that the scoring function used in AutoClass sometimes diverges from p(S | D) asymptotically. In Sections 7 and 8, we describe how our heuristic approach can be extended to Gaussian and sigmoid networks.
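To make the two terms of Eq. 2 concrete for the fully observed case, here is a small sketch (our own illustration; the names are not from the paper) that computes the maximized log-likelihood of one multinomial family and the penalized score:

import numpy as np

def family_loglik(counts):
    # Maximized log-likelihood of one CPD column: sum_k N_k * log(N_k / N),
    # with 0 * log 0 treated as 0.
    counts = np.asarray(counts, dtype=float)
    nz = counts[counts > 0]
    return float(np.sum(nz * np.log(nz / counts.sum())))

def bic(loglik_at_mle, dim, n_samples):
    # Eq. 2-style score: log p(D | theta_hat, S) - (dim / 2) * log N.
    return loglik_at_mle - 0.5 * dim * np.log(n_samples)

# Toy example: one binary variable observed 60/40 in N = 100 cases.
print(bic(family_loglik([60, 40]), dim=1, n_samples=100))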
2. Background

We introduce the following notation. Let S be a Bayesian network structure over the discrete variables X = {X_1, ..., X_n}, and let r_i be the number of states of X_i. Let Pa_i denote the set of variables that are the parents of node X_i in S, and let pa_i^j denote the j-th instance of these parents, that is, the j-th assignment of states to the variables in Pa_i; the number of such instances is q_i = Π_{X_l ∈ Pa_i} r_l. We use θ_ijk to denote the parameter associated with the probability P(X_i = x_i^k | Pa_i = pa_i^j); note that Σ_k θ_ijk = 1, so that one parameter in each such set is redundant. When the intended node is unambiguous, we write θ_ijk = P(x_i^k | pa_i^j). In addition, we use θ_ij = ∪_k {θ_ijk}, θ_i = ∪_j θ_ij, and θ_S = ∪_i θ_i to denote the parameters associated with a family, with a node, and with the entire network structure S, respectively.

To compute the marginal likelihood p(D | S^h) in closed form, several assumptions are usually made (Cooper and Herskovits, 1992; Spiegelhalter and Lauritzen, 1990; Heckerman et al., 1995). First, each case in the data D is complete, that is, every variable is observed in every case. Second, the parameter sets θ_ij are mutually independent (global and local parameter independence). Third, each parameter set θ_ij has a Dirichlet distribution with exponents α_ijk > 0. Fourth, parameter modularity holds: if a node has the same parents in two network structures S_1 and S_2, then the prior distributions of the parameters associated with that family are identical in both. Fifth, likelihood equivalence holds: two network structures that represent the same sets of independence assumptions receive the same marginal likelihood for every data set.

Under the first three assumptions, Cooper and Herskovits (1992) obtained the following closed-form expression:

p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(α_ij) / Γ(α_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ],

where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j, N_ij = Σ_k N_ijk, and α_ij = Σ_k α_ijk. We call the resulting scoring function the CH (Cooper-Herskovits) scoring function.

The last two assumptions relate the priors assigned to different network structures. Note that even when two network structures S_1 and S_2 are equivalent, that is, they represent the same sets of independence assumptions, the events S_1^h and S_2^h remain distinct hypotheses. This treatment was also adopted by Heckerman et al. (1995), where arcs are interpreted as causal influences. Heckerman et al. (1995) characterize the priors for which likelihood equivalence holds together with the other assumptions. When N is finite the choice of exponents matters; as N grows large, however, the counts N_ijk dominate the exponents and the contribution of the prior washes away.
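The CH score above is easy to evaluate from the counts N_ijk. The following sketch computes its logarithm for complete discrete data (our own illustration: the helper names are hypothetical, and a single constant exponent alpha_ijk = alpha is assumed, so that alpha_ij = r_i * alpha):

from math import lgamma
from collections import Counter

def log_ch_score(data, structure, states, alpha=1.0):
    # data: list of dicts mapping variable name -> state.
    # structure: dict mapping variable name -> tuple of parent names.
    # states: dict mapping variable name -> list of its states.
    total = 0.0
    for x, parents in structure.items():
        r = len(states[x])
        # Counts N_ijk grouped by parent configuration j and child state k.
        n_ijk = Counter((tuple(case[p] for p in parents), case[x]) for case in data)
        n_ij = Counter()
        for (j, _), n in n_ijk.items():
            n_ij[j] += n
        for j, nj in n_ij.items():
            total += lgamma(r * alpha) - lgamma(r * alpha + nj)
            for k in states[x]:
                n = n_ijk.get((j, k), 0)
                total += lgamma(alpha + n) - lgamma(alpha)
    return total

Parent configurations that never occur in the data contribute factors of one and can safely be skipped, as done above.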
3. Asymptotics Without Hidden Variables

We shall now rederive Eq. 1 using Laplace's method, a standard technique of asymptotic Bayesian analysis. Our derivation bypasses the explicit closed-form expression for p(D_N | S^h) used by Herskovits (1991) and Bouckaert (1995), which is analyzed by expanding the Gamma functions Γ(·) with Stirling's approximation and letting N grow to infinity. Instead, we do not assume a Dirichlet prior at all; we only assume that the prior p(θ | S^h) is positive in a neighborhood of the maximum likelihood value. Intuitively, the likelihood grows with the sample size N while the prior does not, so the contribution of the prior, which encodes the user's initial confidence, washes away as N grows, and the qualitative result of Eq. 1 does not depend on the choice of prior. Keeping the assumptions of the derivation explicit is useful because the same argument is extended to networks with hidden variables in the next section.

We begin by defining f(θ) = log p(D_N | θ, S^h), where D_N denotes a data set of sample size N and θ are the parameters of the network. To approximate the marginal likelihood p(D_N | S^h), we expand f around its maximum likelihood value and use the normalization constant of a multivariate-normal distribution. That is,
p(D_N | S^h) = ∫ p(D_N | θ, S^h) p(θ | S^h) dθ = ∫ exp{ f(θ) } p(θ | S^h) dθ.    (3)

Assuming f(θ) has a maximum, the ML value θ̂, we have f'(θ̂) = 0. Using a Taylor-series expansion of f(θ) around the ML value, we get

f(θ) ≈ f(θ̂) + (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂),    (4)
where f''(θ) is the Hessian of f, the square matrix of second derivatives with respect to every pair of variables {θ_ijk, θ_i'j'k'}. Consequently, from Eqs. 3 and 4,

log p(D | S^h) ≈ f(θ̂) + log ∫ exp{ (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂) } p(θ | S^h) dθ.    (5)

We assume that −f''(θ̂) is positive definite, and that, as N grows to infinity, the peak in a neighborhood around the maximum becomes sharper. Consequently, if we ignore the prior, we get a normal distribution around the peak. Furthermore, if we assume that the prior p(θ | S^h) is not zero around θ̂, then as N grows it can be assumed constant and so removed from the integral in Eq. 5. The remaining integral is approximated by the formula for multivariate-normal distributions:

∫ exp{ (1/2) (θ − θ̂) f''(θ̂) (θ − θ̂) } dθ ≈ sqrt( (2π)^d / det[ −f''(θ̂) ] ),    (6)

where d is the number of parameters in θ, d = Σ_{i=1}^{n} (r_i − 1) q_i. As N grows to infinity, the above approximation becomes more precise because the entire mass becomes concentrated around the peak. Plugging Eq. 6 into Eq. 5 and noting that det[−f''(θ̂)] is proportional to N yields the BIC:

log p(D_N | S^h) ≈ log p(D_N | θ̂, S^h) − (d/2) log(N).    (7)

A careful derivation in this spirit shows that, under certain conditions, the relative error in this approximation is O_p(1) (Schwarz, 1978; Haughton, 1988). For Bayesian networks, the function f(θ) is known, so all the assumptions about this function can be verified. First, we note that f''(θ̂) is a block-diagonal matrix where each block A_ij corresponds to a variable X_i and a particular instance j of Pa_i, and is of size (r_i − 1) x (r_i − 1). Let us examine one such A_ij. To simplify notation, assume that X_i has three states, and let w_1, w_2, and w_3 denote θ_ijk for k = 1, 2, 3, where i and j are fixed. We consider only those cases in D_N where Pa_i = pa_i^j, and examine only the observations of X_i. Let D'_N denote the set of N values of X_i obtained in this process. With each observation we associate two indicator functions x_i and y_i: the function x_i is one if X_i gets its first value in case i and is zero otherwise; similarly, y_i is one if X_i gets its second value in case i and is zero otherwise.
The log-likelihood function of D_N is given by

λ(w_1, w_2) = log Π_{i=1}^{N} w_1^{x_i} w_2^{y_i} (1 − w_1 − w_2)^{1 − x_i − y_i}.    (8)

To find the maximum, we set the first derivatives of this function to zero. The resulting equations are called the maximum likelihood equations:

λ_{w_1}(w_1, w_2) = Σ_{i=1}^{N} [ x_i / w_1 − (1 − x_i − y_i) / (1 − w_1 − w_2) ] = 0
λ_{w_2}(w_1, w_2) = Σ_{i=1}^{N} [ y_i / w_2 − (1 − x_i − y_i) / (1 − w_1 − w_2) ] = 0.

The only solution to these equations is w_1 = x̄ = Σ_i x_i / N, w_2 = ȳ = Σ_i y_i / N, which is the maximum likelihood value. The Hessian of λ(w_1, w_2) at the ML value is given by

λ''(w_1, w_2) = [ λ''_{w_1 w_1}  λ''_{w_1 w_2} ; λ''_{w_2 w_1}  λ''_{w_2 w_2} ]
             = −N [ 1/x̄ + 1/(1 − x̄ − ȳ)      1/(1 − x̄ − ȳ)
                    1/(1 − x̄ − ȳ)             1/ȳ + 1/(1 − x̄ − ȳ) ].    (9)

The bracketed matrix decomposes into the sum of two matrices. One is a diagonal matrix with the positive numbers 1/x̄ and 1/ȳ on the diagonal. The second is a constant matrix in which all elements equal the positive number 1/(1 − x̄ − ȳ). Because these two matrices are positive definite and non-negative definite, respectively, the Hessian is positive definite. This argument also holds when X_i has more than three values. Because the maximum likelihood equations have a single solution, the Hessian is positive definite, and the peak becomes sharper as N increases (Eq. 9), all the conditions for the general derivation of the BIC are met. Plugging the maximum likelihood value into Eq. 7, which is correct to O(1), yields Eq. 1.

4. Asymptotics With Hidden Variables

Let us now consider the situation where S contains hidden variables. In this case, we cannot use the derivation in the previous section, because the log-likelihood function log p(D_N | S^h, θ) does not necessarily tend toward a peak as the sample size increases. Instead, the log-likelihood function can tend toward a ridge. Consider, for example, a network with one arc
H → X, where H is hidden, X is observed, and both variables are binary. Assume that the probability that X (unconditionally) attains its first value is w. The likelihood function for a data set D_N is then w^{Σ_i x_i} (1 − w)^{N − Σ_i x_i}, where x_i is an indicator function that is one when X attains its first value in case i and zero otherwise. In terms of the network parameters, w = θ_h θ_{x|h} + (1 − θ_h) θ_{x|h̄}, and any value of θ that solves the equation

θ_h θ_{x|h} + (1 − θ_h) θ_{x|h̄} = Σ_i x_i / N

will maximize the likelihood. The maximum is therefore attained not at a unique point but along a ridge of solutions.

This example suggests how the approximation of the previous section must be modified. Given a network structure S with hidden variables, let O ⊆ X be the observable variables and let W = {w_1, ..., w_m} be a set of non-redundant parameters describing the joint distribution of O. Corresponding to every value of the network parameters θ there is a value of W; that is, the network defines a smooth map W = g(θ) from the space of network parameters into R^m. The image of this map is, in general, a curved manifold M embedded in the Euclidean space R^m.[1] Around a regular point, g behaves locally like a linear transformation given by its Jacobian matrix J(θ) = [∂w_l/∂θ_j]: when rank J(θ) = k throughout a small region around θ, the image of a small ball around θ resembles a linear space of dimension k, and we can identify a set A = {φ_1, ..., φ_d} of local coordinates for M around the image point, where d is the rank of the Jacobian matrix.

Now consider the log-likelihood log p(D_N | θ, S^h). Because the data consist of observations of the variables in O only, it can be written as log p(D_N | g(θ), S^h); that is, the data influence the likelihood only through the observable parameters W = g(θ). As the sample size N increases, the log-likelihood becomes peaked around the maximum likelihood value of W, but it remains constant along directions of θ that do not change g(θ). Consequently, if we express the likelihood in the local coordinates A around the ML value and apply the argument of the previous section to these d non-redundant coordinates, only d parameters are penalized. This heuristic argument yields the approximation

log p(D_N | S^h) ≈ log p(D_N | θ̂, S^h) − (d/2) log N,    (10)

where θ̂ is the maximum likelihood value of the network parameters and d = dim(S, θ̂) is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables, evaluated at θ̂.

[1] For terminology and basic facts in differential geometry, see Spivak (1979).
Returning to our problem, the mapping from θ to W is a polynomial function of θ. Thus, as the next theorem shows, the rank of the Jacobian matrix [∂w_l/∂θ_j] is almost everywhere some fixed constant d, which we call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of S, that is, the dimension of S.

Theorem 1 Let θ be the parameters of a network S for variables X with observable variables O ⊆ X. Let W be the parameters of the true joint distribution of the observable variables. If each parameter in W is a polynomial function of θ, then rank[∂W/∂θ (θ)] = d almost everywhere, where d is a constant.

Proof: Because the mapping from θ to W is polynomial, each entry in the matrix J(θ) = [∂w_l/∂θ_j (θ)] is a polynomial in θ. When diagonalizing J, the leading elements of the first d lines remain polynomials in θ, whereas all other lines, which are dependent given every value of θ, become identically zero. The rank of J(θ) falls below d only for values of θ that are roots of some of the polynomials in the diagonalized matrix. The set of all such roots has measure zero. □

Our heuristic argument for Eq. 10 does not provide us with the error term. Researchers have shown that O_p(1) relative errors are attainable for a variety of statistical models (e.g., Schwarz, 1978, and Haughton, 1988). Although the arguments of these researchers do not directly apply to our case, it may be possible to extend their methods to prove our conjecture.

5. Computations of the Rank

We have argued that the second term of the BIC for Bayesian networks with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. In this section, we explain how to compute this rank, and demonstrate the approach with several examples.

Theorem 1 suggests a random algorithm for calculating the rank. Compute the Jacobian matrix J(θ) symbolically from the equation W = g(θ); this computation is possible since g is a vector of polynomials in θ. Then, assign a random value to θ and diagonalize the numeric matrix J(θ). Theorem 1 guarantees that, with probability 1, the resulting rank is the regular rank of J. For every network, select, say, ten values for θ, and determine r to be the maximum of the resulting ranks. In all our experiments, none of the randomly chosen values for θ accidentally reduced the rank.

We now demonstrate the computation of the needed rank for a naive Bayes model with one hidden variable H and two feature variables X_1 and X_2. Assume all three variables are binary. The set of parameters W = g(θ)
is given by

w_{x1 x2}  = θ_h θ_{x1|h} θ_{x2|h} + (1 − θ_h) θ_{x1|h̄} θ_{x2|h̄}
w_{x̄1 x2} = θ_h (1 − θ_{x1|h}) θ_{x2|h} + (1 − θ_h) (1 − θ_{x1|h̄}) θ_{x2|h̄}
w_{x1 x̄2} = θ_h θ_{x1|h} (1 − θ_{x2|h}) + (1 − θ_h) θ_{x1|h̄} (1 − θ_{x2|h̄}).

The 3 x 5 Jacobian matrix for this transformation is given in Figure 1, where θ_{x̄i|h} denotes 1 − θ_{xi|h} (i = 1, 2), and similarly for h̄. The columns correspond to differentiation with respect to θ_{x1|h}, θ_{x2|h}, θ_{x1|h̄}, θ_{x2|h̄}, and θ_h, respectively.

Figure 1. The Jacobian matrix for a naive Bayesian network with two binary feature nodes:

[  θ_h θ_{x2|h}    θ_h θ_{x1|h}    (1−θ_h) θ_{x2|h̄}    (1−θ_h) θ_{x1|h̄}    θ_{x1|h} θ_{x2|h} − θ_{x1|h̄} θ_{x2|h̄} ]
[ −θ_h θ_{x2|h}    θ_h θ_{x̄1|h}   −(1−θ_h) θ_{x2|h̄}    (1−θ_h) θ_{x̄1|h̄}   θ_{x̄1|h} θ_{x2|h} − θ_{x̄1|h̄} θ_{x2|h̄} ]
[  θ_h θ_{x̄2|h}  −θ_h θ_{x1|h}     (1−θ_h) θ_{x̄2|h̄}  −(1−θ_h) θ_{x1|h̄}     θ_{x1|h} θ_{x̄2|h} − θ_{x1|h̄} θ_{x̄2|h̄} ]

A symbolic computation of the rank of this matrix can be carried out, and it shows that the regular rank is equal to the dimension of the matrix, namely 3. Nonetheless, as we have argued, in order to compute the regular rank one can simply choose random values for θ and diagonalize the resulting numerical matrix. We have done so for naive Bayes models with one binary hidden root node and n ≤ 7 binary observable non-root nodes. The size of the associated matrices is (1 + 2n) x (2^n − 1). The regular rank for n = 3, ..., 7 was found to be 1 + 2n. We conjecture that 1 + 2n is the regular rank for all n > 2. For n = 1, 2, the rank is 1 and 3, respectively, which is the size of the full parameter space over one and two binary variables. The rank cannot be greater than 1 + 2n because this is the maximum possible dimension of the Jacobian matrix. In fact, we have proven a lower bound of 2n as well.

Theorem 2 Let S be a naive Bayes model with one binary hidden root node and n > 2 binary observable non-root nodes. Then 2n ≤ r ≤ 2n + 1, where r is the regular rank of the Jacobian matrix between the parameters of the network and the parameters of the feature variables.
The proof is obtained by diagonalizing the Jacobian matrix symbolically and showing that there are at least 2n independent lines. The computation for 3 ≤ n ≤ 7 shows that, for naive Bayes models with a binary hidden root node, there are no redundant parameters. Therefore, the best way to represent a probability distribution that is representable by such a model is to use the network representation explicitly.
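The random-point procedure described above is also easy to reproduce numerically. The sketch below (our own illustration, using a finite-difference Jacobian rather than symbolic differentiation) recovers the regular rank 3 for the two-feature example:

import numpy as np

def observable_params(theta):
    # Map network parameters to W = (w_{x1 x2}, w_{~x1 x2}, w_{x1 ~x2}).
    t_x1h, t_x2h, t_x1hb, t_x2hb, t_h = theta
    w11 = t_h * t_x1h * t_x2h + (1 - t_h) * t_x1hb * t_x2hb
    w01 = t_h * (1 - t_x1h) * t_x2h + (1 - t_h) * (1 - t_x1hb) * t_x2hb
    w10 = t_h * t_x1h * (1 - t_x2h) + (1 - t_h) * t_x1hb * (1 - t_x2hb)
    return np.array([w11, w01, w10])

def jacobian_rank(f, theta, eps=1e-6):
    # Numeric rank of the Jacobian of f at the point theta.
    base = f(theta)
    cols = []
    for i in range(len(theta)):
        t = np.array(theta, dtype=float)
        t[i] += eps
        cols.append((f(t) - base) / eps)
    return np.linalg.matrix_rank(np.column_stack(cols), tol=1e-4)

rng = np.random.default_rng(0)
theta = rng.uniform(0.05, 0.95, size=5)
print(jacobian_rank(observable_params, theta))   # expected regular rank: 3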
Nonetheless, this result does not hold for all models. For example, consider the following W structure:

A → C ← H → D ← B
where H is hidden . Assuming all five variables are binary , the space over the observables is representable by 15 parameters , and the number of parameters of the network is 11. In this example , we could not compute the rank symbolically . Instead , we used the following Mathematica code.
There are 16 functions (only 15 are independent) defined by W = g(θ). In the Mathematica code, we use fijkl for the true joint probability w_{a=i,b=j,c=k,d=l}, cij for the true conditional probability θ_{c=0|a=i,h=j}, dij for θ_{d=0|b=i,h=j}, a for θ_{a=0}, b for θ_{b=0}, and h0 for θ_{h=0}.
The first function is given by
f0000[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
  a * b * (h0 * c00 * d00 + (1 - h0) * c01 * d01)

and the other functions are similarly written. The Jacobian matrix is computed by the command Outer, which has three arguments. The first is D, which stands for the differentiation operator, the second is a set of functions, and the third is a set of variables.
J[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
  Outer[D, {f0000[a, b, h0, c00, c01, ..., d11],
            f0001[a, b, h0, c00, ..., c11, d00, ..., d11],
            ...,
            f1111[a, b, h0, c00, ..., c11, d00, ..., d11]},
           {a, b, h0, c00, c01, c10, c11, d00, d01, d10, d11}]

The next command produces a diagonalized matrix at a random point with a precision of 30 decimal digits. This precision was selected so that matrix elements equal to zero would be correctly identified as such.
N[RowReduce[J[a, b, h0, c00, ..., c11, d00, ..., d11] /.
  {a -> Random[Integer, {1, 999}]/1000,
   b -> Random[Integer, {1, 999}]/1000,
   ...,
   d11 -> Random[Integer, {1, 999}]/1000}], 30]

The result of this Mathematica program was a diagonalized matrix with 9 non-zero rows and 7 rows containing all zeros. The same counts were obtained in ten runs of the program. Hence, the regular rank of this Jacobian matrix is 9 with probability 1.
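As a cross-check of the Mathematica computation, the same rank can be obtained numerically; the sketch below (ours) uses central differences, which are exact here up to rounding because each joint probability is affine in every single network parameter, together with an explicit singular-value tolerance to separate the genuinely zero directions.

import numpy as np

rng = np.random.default_rng(1)

def joint(params):
    """Joint distribution over (A, B, C, D) for the W structure
    A -> C <- H -> D <- B with all variables binary and H hidden.
    params = [a0, b0, h0, c00..c11, d00..d11], with the same meaning
    as in the Mathematica code above."""
    a0, b0, h0 = params[0], params[1], params[2]
    c = params[3:7].reshape(2, 2)    # c[i, j] = P(C=0 | A=i, H=j)
    d = params[7:11].reshape(2, 2)   # d[i, j] = P(D=0 | B=i, H=j)
    pa, pb, ph = [a0, 1 - a0], [b0, 1 - b0], [h0, 1 - h0]
    w = np.zeros((2, 2, 2, 2))
    for i in range(2):
        for j in range(2):
            for k in range(2):
                for l in range(2):
                    w[i, j, k, l] = pa[i] * pb[j] * sum(
                        ph[h]
                        * (c[i, h] if k == 0 else 1 - c[i, h])
                        * (d[j, h] if l == 0 else 1 - d[j, h])
                        for h in range(2))
    return w.ravel()[:-1]            # 15 independent observable parameters

theta = rng.uniform(0.1, 0.9, size=11)
h = 1e-3
J = np.empty((15, 11))
for p in range(11):
    e = np.zeros(11); e[p] = h
    # affine dependence on each single parameter makes this exact
    J[:, p] = (joint(theta + e) - joint(theta - e)) / (2 * h)

print(np.round(np.linalg.svd(J, compute_uv=False), 6))   # two values collapse to ~0
print("rank:", np.linalg.matrix_rank(J, tol=1e-8))        # expected: 9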
The interpretation of this result is that, around almost every value of θ, one can locally represent the hidden W structure with only 9 parameters. In contrast, if we encode the distribution using the network parameters (θ) of the W structure, then we must use 11 parameters. Thus, two of the network parameters are locally redundant. The BIC approximation punishes this W structure according to its most efficient representation, which uses 9 parameters, and not according to the representation given by the W structure, which requires 11 parameters. It is interesting to note that the dimension of the W structure is 10 if H has three or four states, and 11 if H has 5 states. We do not know how to predict when the dimension changes as a result of increasing the number of hidden states without computing the dimension explicitly. Nonetheless, the dimension cannot increase beyond 12, because we can average out the hidden variable in the W structure (e.g., using arc reversals) to obtain another network structure that has only 12 parameters.
6. AutoClass The AutoClass clustering algorithm developed by Cheeseman and Stutz ( 1995) uses a naive Bayes model .2 Each state of the hidden root node H represents a cluster or class; and each observable node represents a measurable feature . The number of classes k is unknown a priori . AutoClass computes an approximation of the marginal likelihood of a naive Bayes model given the data using increasing values of k . When this probability reaches a peak for a specific k , that k is selected as the number of classes. Cheeseman and Stutz (1995) use the following formula to approximate the marginal likelihood :
log p(D|S) ≈ log p(D_c|S) + log p(D|S, θ̂_s) − log p(D_c|S, θ̂_s)
where D_c is a database consistent with the expected sufficient statistics computed by the EM algorithm. Although Cheeseman and Stutz suggested this approximation in the context of simple AutoClass models, it can be used to score any Bayesian network with discrete variables as well as other models (Chickering and Heckerman, 1996). We call this approximation the CS scoring function. Using the BIC approximation for p(D_c|S), we obtain
log p(D|S) ≈ log p(D|S, θ̂_s) − d′/2 log N

²The algorithm can handle conditional dependencies among continuous variables.
where d′ is the number of parameters of the network. (Given a naive Bayes model with k classes and n observable variables each with b states, d′ = nk(b − 1) + k − 1.) Therefore, the CS scoring function will converge asymptotically to the BIC and hence to p(D|S) whenever d′ is equal to the regular rank d of S. Given our conjecture in the previous section, we believe that the CS scoring function will converge to p(D|S) when the number of classes is two. Nonetheless, d′ is not always equal to d. For example, when b = 2, k = 3 and n = 4, the number of parameters is 14, but the regular rank of the Jacobian matrix is 13. We computed this rank using Mathematica as described in the previous section. Consequently, the CS scoring function will not always converge to p(D|S). This example is the only one that we have found so far, and we believe that incorrect results are obtained only for rare combinations of b, k and n. Nonetheless, a simple modification to the CS scoring function yields an approximation that will asymptotically converge to p(D|S):
log p(D|S) ≈ log p(D_c|S) + log p(D|S, θ̂_s) − log p(D_c|S, θ̂_s) − d/2 log N + d′/2 log N

Chickering and Heckerman (1996) show that this scoring function is often a better approximation for p(D|S) than is the BIC.

7. Gaussian Networks

In this section, we consider the case where each of the variables X = {X1, ..., Xn} is continuous. As before, let (S, θ_s) be a Bayesian network structure and the associated set of network parameters. A Gaussian network is one in which the joint distribution of the variables is multivariate Gaussian and the joint likelihood is the product of local likelihoods, each of which is the linear regression model

p(x_i | pa_i, θ_i, S) = N(m_i + Σ_{X_j ∈ Pa_i} b_ji x_j, v_i)

where N(μ, v), v > 0, is a normal (Gaussian) distribution with mean μ and variance v, m_i is a conditional mean of X_i, b_ji is a coefficient that represents the strength of the relationship between variable X_j and X_i, v_i is a variance,³ and θ_i is the set of parameters consisting of m_i, v_i, and the b_ji. The parameters θ_s of a Gaussian network with structure S is the set of all θ_i.

³m_i is the mean of X_i conditional on all parents being zero, b_ji corresponds to the partial regression coefficient of X_i on X_j given the other parents of X_i, and v_i corresponds to the residual variance of X_i given the parents of X_i.
To apply the techniques developed in this paper, we also need to specify the parameters of the observable variables. Given that the joint distribution is multivariate-normal and that multivariate-normal distributions are closed under marginalization, we only need to specify a vector of means for the observed variables and a covariance matrix over the observed variables. In addition, we need to specify how to transform the parameters of the network to the observable parameters. The transformation of the means and the transformation to obtain the observable covariance matrix can be accomplished via the trek-sum rule (for a discussion, see Glymour et al. 1987). Using the trek-sum rule, it is easy to show that the observable parameters are all sums of products of the network parameters. Given that the mapping W from θ_s to the observable parameters is a polynomial function of θ_s, it follows from Theorem 1 that the rank of the Jacobian matrix [∂W/∂θ_s] is almost everywhere some fixed constant d, which we again call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of S, that is, the dimension of S.

Let us consider two Gaussian models. We use Mathematica code similar to the code in Section 5 to compute their dimensions, because we can
not perform the computation symbolically. As in the previous experiments, none of the randomly chosen values of θ_s accidentally reduces the rank. Our first example is the naive Bayes model
H → X1, X2, X3, X4
in which H is the hidden variable and the X_i are observed. There are 14 network parameters: 5 conditional variances, 5 conditional means, and 4 linear parameters. The marginal distribution for the observed variables also has 14 parameters: 4 means, 4 variances, and 6 covariances. Nonetheless, the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This follows from the fact that this model imposes tetrad constraints (see Glymour et al. 1987). In this model the three tetrad constraints that hold in the distribution over the observed variables are

cov(X1, X2) cov(X3, X4) − cov(X1, X3) cov(X2, X4) = 0
cov(X1, X4) cov(X2, X3) − cov(X1, X3) cov(X2, X4) = 0
cov(X1, X4) cov(X2, X3) − cov(X1, X2) cov(X3, X4) = 0

two of which are independent. These two independent tetrad constraints lead to the reduction of dimensionality.
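The tetrad constraints are easy to verify numerically from the covariance matrix implied by the model (by the trek-sum rule, cov(X_i, X_j) = b_i b_j v_H for i ≠ j, where v_H is the variance of H and the b_i are the linear parameters). The following small sketch, with arbitrary parameter values of our choosing, checks that all three tetrad differences vanish.

import numpy as np

rng = np.random.default_rng(0)
vh = rng.uniform(0.5, 2.0)                   # variance of the hidden H
b = rng.uniform(-2.0, 2.0, size=4)           # linear parameters from H to X_i
v = rng.uniform(0.2, 1.0, size=4)            # conditional (residual) variances

# Covariance matrix over the observed X_1..X_4 implied by the model.
cov = np.outer(b, b) * vh + np.diag(v)

def c(i, j):
    return cov[i - 1, j - 1]

print(c(1, 2) * c(3, 4) - c(1, 3) * c(2, 4))   # ~0
print(c(1, 4) * c(2, 3) - c(1, 3) * c(2, 4))   # ~0
print(c(1, 4) * c(2, 3) - c(1, 2) * c(3, 4))   # ~0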
Our second example is the W structure described in Section 5 where each of the variables is continuous. There are 14 network parameters: 5 conditional means, 5 conditional variances, and 4 linear parameters. The marginal distribution for the observed variables has 14 parameters, whereas the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This coincides with the intuition that many values for the variance of H and the linear parameters for C ← H and H → D produce the same model for the observable variables, but once any two of these parameters are appropriately set, then the third parameter is uniquely determined by the marginal distribution for the observable variables.
8. Sigmoid Networks

Finally, let us consider the case where each of the variables {X1, ..., Xn} = X is binary (discrete), and each local likelihood is the generalized linear model
p(x_i | pa_i, θ_i, S) = sig(a_i + Σ_{X_j ∈ Pa_i} b_ji x_j)

where sig(x) is the sigmoid function sig(x) = 1/(1 + e^(−x)). These models, which we call sigmoid networks, are useful for learning relationships among discrete variables, because these models capture non-linear relationships among variables yet employ only a small number of parameters (Neal, 1992; Saul
et al., 1996) . Using techniques similar to those in Section 5, we can compute the
rank of the Jacobian matrix [∂W/∂θ_s]. We cannot apply Theorem 1 to conclude
that this rank is almost everywhere some fixed constant, because the local likelihoods are non-polynomial sigmoid functions. Nonetheless, the claim of Theorem 1 holds also for analytic transformations, hence a regular rank exists
for sigmoid networks as well (as confirmed by our experiments) . Our experiments show expected reductions in rank for several sigmoid networks . For example , consider the two -level network
H1, H2 → X1, X2, X3, X4 (each observed unit receives connections from both hidden units)
This network has 14 parameters. In each of 10 trials, we found the rank of the Jacobian matrix to be 14, indicating that this model has dimension 14. In contrast, consider the three-level network.
H3 → H1, H2;  H1, H2 → X1, X2, X3, X4
This network has 17 parameters, whereas the dimension we compute is 15. This reduction is expected, because we could encode the dependency between the two variables in the middle level by removing the variable in the top layer and adding an arc between these two variables, producing a network with 15 parameters.
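The rank computations for sigmoid networks can also be reproduced numerically. The sketch below (ours) assumes, consistent with the parameter counts quoted above, full connectivity between adjacent layers; it enumerates the observable distribution by summing over hidden configurations and uses complex-step differentiation (our choice, since the map is analytic rather than polynomial) to obtain an essentially exact Jacobian at a random point.

import itertools
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def obs_dist(params, layers):
    """Distribution over the bottom (visible) layer of a layered sigmoid
    belief net with full connectivity between adjacent layers.
    `layers` lists layer sizes top to bottom, e.g. [2, 4] or [1, 2, 4];
    each non-top unit has a bias plus one weight per parent."""
    params = np.asarray(params)
    idx, biases, weights = 0, [], []
    for li, size in enumerate(layers):
        n_par = 0 if li == 0 else layers[li - 1]
        biases.append(params[idx:idx + size]); idx += size
        weights.append(params[idx:idx + size * n_par].reshape(size, n_par)); idx += size * n_par

    n_vis = layers[-1]
    p_vis = np.zeros(2 ** n_vis, dtype=params.dtype)
    hidden_sizes = layers[:-1]
    for hconf in itertools.product(*[range(2)] * sum(hidden_sizes)):
        states, pos = [], 0
        for size in hidden_sizes:
            states.append(np.array(hconf[pos:pos + size])); pos += size
        prob = 1.0
        for li, s in enumerate(states):
            act = biases[li] if li == 0 else biases[li] + weights[li] @ states[li - 1]
            p1 = sig(act)
            prob = prob * np.prod(np.where(s == 1, p1, 1 - p1))
        p1 = sig(biases[-1] + weights[-1] @ states[-1])
        for k, vconf in enumerate(itertools.product(range(2), repeat=n_vis)):
            v = np.array(vconf)
            p_vis[k] += prob * np.prod(np.where(v == 1, p1, 1 - p1))
    return p_vis[:-1]            # 2**n_vis - 1 independent entries

def rank_at_random_point(layers, seed=0):
    n_params = sum(s * (0 if i == 0 else layers[i - 1]) + s
                   for i, s in enumerate(layers))
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_params)
    h = 1e-20
    J = np.empty((2 ** layers[-1] - 1, n_params))
    for p in range(n_params):
        t = theta.astype(complex); t[p] += 1j * h
        # complex-step derivative: accurate to machine precision
        J[:, p] = obs_dist(t, layers).imag / h
    return np.linalg.matrix_rank(J, tol=1e-10)

print(rank_at_random_point([2, 4]))     # two-level net: 14 parameters, rank 14
print(rank_at_random_point([1, 2, 4]))  # three-level net: 17 parameters, rank 15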
References

Bouckaert, R. (1995). Bayesian belief networks: From construction to inference. PhD thesis, University Utrecht.
Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225.
Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8:195-210.
Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatesky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining, pages 153-180. AAAI Press, Menlo Park, CA.
Chickering, D. and Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence, Portland, OR, pages 158-168. Morgan Kaufmann.
Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.
Geiger, D. and Heckerman, D. (1995). A characterization of the Dirichlet distribution with application to learning Bayesian networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 196-207. Morgan Kaufmann. See also Technical Report TR-95-16, Microsoft Research, Redmond, WA, February 1995.
Glymour, C., Scheines, R., Spirtes, P., and Kelly, K. (1987). Discovering Causal Structure. Academic Press.
Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16:342-355.
Heckerman, D. (1995a). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA. Revised November, 1996.
Heckerman, D. (1995b). A Bayesian approach for learning causal networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 285-295. Morgan Kaufmann.
Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.
Herskovits, E. (1991). Computer-based probabilistic network construction. PhD thesis, Medical Information Sciences, Stanford University, Stanford, CA.
Lam, W. and Bacchus, F. (1993). Using causal information and local measures to learn Bayesian networks. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 243-250. Morgan Kaufmann.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.
Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49:223-239 and 253-265.
Saul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461-464.
Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219-282.
Spiegelhalter, D. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605.
Spivak, M. (1979). A Comprehensive Introduction to Differential Geometry 1, 2nd edition. Publish or Perish, Berkeley, CA.
Suzuki, J. (1993). A construction of Bayesian networks from databases based on an MDL scheme. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 266-273. Morgan Kaufmann.
A HIERARCHICAL COMMUNITY OF EXPERTS

GEOFFREY E. HINTON, BRIAN SALLANS AND ZOUBIN GHAHRAMANI
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada M5S 3H5
{hinton, sallans, zoubin}@cs.toronto.edu
Abstract . We describe a directed acyclic graphical model that contains a hierarchy of linear units and a mechanism for dynamically selecting an appropriate subset of these units to model each observation . The non-linear selection mechanism is a hierarchy of binary units each of which gates the output of one of the linear units . There are no connections from linear units to binary units , so the generative model can be viewed as a logistic belief net (Neal 1992) which selects a skeleton linear model from among the available linear units . We show that Gibbs sampling can be used to learn the parameters of the linear and binary units even when the sampling is so brief that the Markov chain is far from equilibrium .
1. Multilayer networks of linear-Gaussian units
We consider hierarchical generative models that consist of multiple layers of simple, stochastic processing units connected to form a directed acyclic graph. Each unit receives incoming, weighted connections from units in the layer above and it also has a bias (see figure 1). The weights on the connections and the biases are adjusted to maximize the likelihood that the layers of "hidden" units would produce some observed data vectors in the bottom layer of "visible" units. The simplest kind of unit we consider is a linear-Gaussian unit. Following the usual Bayesian network formalism, the joint probability of the
Figure 1. Units in a belief network.

states of all the units in the network is the product of the local probabilities of each unit given the states of its parents. The units in the top layer of the network have a Gaussian distribution with a learned mean and a learned variance. Given the states, y_k, of the units in the layer above, each unit, j, in the next layer down computes its top-down input, ŷ_j:

ŷ_j = b_j + Σ_k w_kj y_k    (1)

where b_j is the bias of unit j and w_kj is the learned weight on the connection from unit k in the layer above to unit j. The state of unit j is then Gaussian distributed with mean ŷ_j and variance σ_j², which is also learned.

Linear-Gaussian generative models of this kind, which include factor analysis (Everitt, 1984), have two important advantages: given the visible data, it is straightforward to compute the posterior distribution over the states of the unobserved units, and it is easy to update all of the parameters with a tractable EM algorithm. Unfortunately, because they are linear, they ignore the higher-order statistical structure in the data that is crucial for tasks like vision. One sensible way to extend linear-Gaussian models so that they can capture this higher-order structure is to use a mixture of M such models (Ghahramani and Hinton, 1996; Hinton et al., 1997). This retains
tractability because the full posterior distribution can be found by computing the posterior across each of the M models and then normalizing. However, a mixture of linear models is not flexible enough to represent the kind of data that is typically found in images. If an image can have several different objects in it, the pixel intensities cannot be accurately modelled by a mixture unless there is a separate linear model for each possible combination of objects. Clearly, the efficient way to represent an image that contains n objects is to use a "distributed" representation that contains n separate parts, but this cannot be achieved using a mixture because the non-linear selection process in a mixture consists of picking one of the linear models. What we need is a non-linear selection process that can pick arbitrary subsets of the available linear-Gaussian units so that some units can be used for modelling one part of an image, other units can be used for modelling other parts, and higher level units can be used for modelling the redundancies between the different parts.

2. Multilayer networks of binary-logistic units
Multilayer networks of binary-logistic units in which the connections form a directed acyclic graph were investigated by Neal (1992). We call them logistic belief nets or LBN's. In the generative model, each unit computes its top-down input, ŝ_j, in the same way as a linear-Gaussian unit, but instead of using this top-down input as the mean of a Gaussian distribution it uses it to determine the probability of adopting each of the two states 1 and 0:
ŝ_j = b_j + Σ_k w_kj s_k    (2)

p(s_j = 1 | {s_k : k ∈ pa_j}) = σ(ŝ_j) = 1/(1 + e^(−ŝ_j))    (3)
where pa_j is the set of units that send generative connections to unit j (the "parents" of j), and σ(·) is the logistic function. A binary-logistic unit does not need a separate variance parameter because the single statistic ŝ_j is sufficient to define a Bernoulli distribution. Unfortunately, it is exponentially expensive to compute the exact posterior distribution over the hidden units of an LBN when given a data point, so Neal used Gibbs sampling: With a particular data point clamped on the visible units, the hidden units are visited one at a time. Each time hidden unit u is visited, its state is stochastically selected to be 1 or 0 in proportion to two probabilities. The first, P^(α\s_u=1) = p(s_u = 1, {s_k : k ≠ u}), is the joint probability of generating the states of all the units in the network (including u) if u has state 1 and all the others have the state defined by the current configuration of states, α. The second, P^(α\s_u=0), is the same
quantity computed with u having state 0 and all the other units having the states defined by α. When calculating these probabilities, the states of all the other units are held constant. It can be shown that repeated application of this stochastic decision rule eventually leads to hidden configurations being selected according to their posterior probabilities. Because the LBN is acyclic, it is easy to compute the joint probability P^α of a configuration α of states of all the units:

P^α = Π_i p(s_i^α | {s_k^α : k ∈ pa_i})    (4)

where s_i^α is the binary state of unit i in configuration α.

It is convenient to work in the domain of negative log probabilities, which are called energies by analogy with statistical physics. We define E^α to be −ln P^α:

E^α = −Σ_u ( s_u^α ln ŝ_u^α + (1 − s_u^α) ln(1 − ŝ_u^α) )    (5)

where s_u^α is the binary state of unit u in configuration α, ŝ_u^α is the top-down expectation generated by the layer above, and u is an index over all the units in the net.

The rule for stochastically picking a new state for u requires the ratio of two probabilities, and hence the difference of two energies:

ΔE_u^α = E^(α\s_u=0) − E^(α\s_u=1)    (6)

p(s_u = 1 | {s_k : k ≠ u}) = σ(ΔE_u^α)    (7)
All the contributions to the energy of configuration α that do not depend on s_j can be ignored when computing ΔE_j^α. This leaves a contribution that depends on the top-down expectation ŝ_j generated by the units in the layer above (see Eq. 3) and a contribution that depends on both the states, s_i, and the top-down expectations, ŝ_i, of units in the layer below (see figure 1):
ΔE_j^α = ln ŝ_j − ln(1 − ŝ_j) + Σ_i [ s_i^α ln ŝ_i^(α\s_j=1) + (1 − s_i^α) ln(1 − ŝ_i^(α\s_j=1))
         − s_i^α ln ŝ_i^(α\s_j=0) − (1 − s_i^α) ln(1 − ŝ_i^(α\s_j=0)) ]    (8)
Given samples from the posterior distribution, the generative weights of a LBN can be learned by using the online delta rule which performs gradient ascent in the log likelihood of the data:
Δw_ji = ε s_j (s_i − ŝ_i)    (9)
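A minimal sketch of Eqs. (2)-(9) for a two-layer logistic belief net is given below (our own illustration, not the authors' code); the network sizes, learning rate, and toy data are arbitrary, and the top-layer term in the energy gap reduces to the bias b_j because ln ŝ_j − ln(1 − ŝ_j) = b_j for a top-layer unit.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

n_hid, n_vis, eps = 4, 6, 0.05
b = np.zeros(n_hid)                 # biases of the hidden (top-layer) units
c = np.zeros(n_vis)                 # biases of the visible units
W = 0.1 * rng.normal(size=(n_hid, n_vis))   # generative weights, hidden -> visible

def gibbs_step(h, v):
    """One sweep of Gibbs sampling over the hidden units with v clamped,
    using the energy gap of Eq. (8) specialised to a two-layer net."""
    for j in range(n_hid):
        h1, h0 = h.copy(), h.copy()
        h1[j], h0[j] = 1, 0
        s1 = sig(c + h1 @ W)        # top-down expectations of the visibles if s_j = 1
        s0 = sig(c + h0 @ W)        # ... and if s_j = 0
        dE = (b[j]
              + np.sum(v * np.log(s1) + (1 - v) * np.log(1 - s1)
                       - v * np.log(s0) - (1 - v) * np.log(1 - s0)))
        h[j] = rng.random() < sig(dE)
    return h

# toy training data: random binary vectors standing in for observations
data = (rng.random((20, n_vis)) < 0.5).astype(float)

for epoch in range(100):
    for v in data:
        h = (rng.random(n_hid) < 0.5).astype(float)
        for _ in range(5):                    # brief Gibbs sampling
            h = gibbs_step(h, v)
        # online delta rule, Eq. (9)
        s_hat_vis = sig(c + h @ W)            # top-down expectations of visibles
        s_hat_hid = sig(b)                    # top-down expectations of hiddens
        W += eps * np.outer(h, v - s_hat_vis)
        c += eps * (v - s_hat_vis)
        b += eps * (h - s_hat_hid)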
Figure 2. Units in a community of experts , a network of paired binary and linear units . Binary units (solid squares ) gate the outputs of corresponding linear units (dashed circles ) and also send generative connections to the binary units in the layer below . Linear units send generative connections to linear units in the layer below (dashed arrows ) .
3. Using binary units to gate linear units
It is very wasteful to use highly non-linear binary units to model data that is generated from continuous physical processes that behave linearly over small ranges. So rather than using a multilayer binary network to generate data directly , we use it to synthesize an appropriate linear model by selecting from a large set of available linear units . We pair a binary unit with each hidden linear unit (figure 2) and we use the same subscript for both units within a pair . We use y for the real-valued state of the linear unit and s for the state of the binary unit . The binary unit gates the output of the linear unit so Eq . 1 becomes:
ŷ_j = b_j + Σ_k w_kj y_k s_k    (10)
It is straightforward to include weighted connections from binary units to linear units in the layer below , but this was not implemented in the examples we describe later . To make Gibbs sampling feasible (see below ) we prohibit connections from linear units to binary units , so in the generative model the states of the binary units are unaffected by the linear units and are chosen using Eq . 2 and Eq . 3. Of course, during the inference process the states of the linear units do affect the states of the binary units . Given a data vector on the visible units , it is intractable to compute the posterior distribution over the hidden linear and binary units , so an
approximate inference method must be used. This raises the question of whether the learning will be adversely affected by the approximation errors that occur during inference . For example , if we use Gibbs sampling for inference and the sampling is too brief for the samples to come from the equilibrium distribution , will the learning fail to converge? We show in section 6 that it is not necessary for the brief Gibbs sampling to approach equilibrium . The only property we really require of the sampling is that it get us closer to equilibrium . Given this property we can expect the learning to improve a bound on the log probability of the data .
3.1. PERFORMING GIBBS SAMPLING

The obvious way to perform Gibbs sampling is to visit units one at a time and to stochastically pick a new state for each unit from its posterior distribution given the current states of all the other units. For a binary unit we need to compute the energy of the network with the unit on or off. For a linear unit we need to compute the quadratic function that determines how the energy of the net depends on the state of the unit. This obvious method has a significant disadvantage. If a linear unit, j, is gated out by its binary unit (i.e., s_j = 0) it cannot influence the units below it in the net, but it still affects the Gibbs sampling of linear units like k that send inputs to it because these units attempt to minimize (y_j − ŷ_j)²/2σ_j². So long as s_j = 0 there should be no net effect of y_j on the units in the layer above. These units completely determine the distribution of y_j, so sampling from y_j would provide no information about their distributions. The effect of y_j on the units in the layer above during inference is unfortunate because we hope that most of the linear units will be gated out most of the time and we do not want the teeming masses of unemployed linear units to disturb the delicate deliberations in the layer above. We can avoid this noise by integrating out the states of linear units that are gated out. Fortunately, the correct way to integrate out y_j is to simply ignore the energy contribution (y_j − ŷ_j)²/2σ_j².

A second disadvantage of the obvious sampling method is that the decision about whether or not to turn on a binary unit depends on the particular value of its linear unit. Sampling converges to equilibrium faster if we integrate over all possible values of y_j when deciding how to set s_j. This integration is feasible because, given all other units, y_j has one Gaussian posterior distribution when s_j = 1 and another Gaussian distribution when s_j = 0. During Gibbs sampling, we therefore visit the binary unit in a pair first and integrate out the linear unit in deciding the state of the binary unit. If the binary unit gets turned on, we then pick a state for the linear unit from the relevant Gaussian posterior. If the binary unit is turned off
it is unnecessary to pick a value for the linear unit. For any given configuration of the binary units, it is tractable to compute the full posterior distribution over all the selected linear units. So one interesting possibility is to use Gibbs sampling to stochastically pick states for the binary units, but to integrate out all of the linear units when making these discrete decisions. To integrate out the states of the selected linear units we need to compute the exact log probability of the observed data using the selected linear units. The change in this log probability when one of the linear units is included or excluded is then used in computing the energy gap for deciding whether or not to select that linear unit. We have not implemented this method because it is not clear that it is worth the computational effort of integrating out all of the selected linear units at the beginning of the inference process when the states of some of the binary units are obviously inappropriate and can be improved easily by only integrating out one of the linear units. Given samples from the posterior distribution, the incoming connection weights of both the binary and the linear units can be learned by using the online delta rule which performs gradient ascent in the log likelihood of the data. For the binary units the learning rule is Eq. 9. For linear units the rule
is :
Δw_ji = ε y_j s_j (y_i − ŷ_i) s_i / σ_i²    (11)
The learning rule for the biases is obtained by treating a bias as a weight coming from a unit with a state of 1.¹ The variance of the local noise in each linear unit, σ_j², can be learned by the online rule:

Δσ_j² = ε s_j [(y_j − ŷ_j)² − σ_j²]    (12)
Alternatively, σ_j² can be fixed at 1 for all hidden units and the effective local noise level can be controlled by scaling the incoming and outgoing weights.

4. Results on the bars task
The noisy bars task is a toy problem that demonstrates the need for sparse distributed representations (Hinton et al., 1995; Hinton and Ghahramani, 1997). There are four stages in generating each K × K image. First a global orientation is chosen, either horizontal or vertical, with both cases being equally probable. Given this choice, each of the K bars of the appropriate orientation is turned on independently with probability 0.4. Next, each active bar is given an intensity, chosen from a uniform distribution. Finally,

¹We have used w_ji to denote both the weights from binary units to binary units and from linear units to linear units; the intended meaning should be inferred from the context.
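The first three generation stages can be sketched directly in code; the final stage is cut off in our copy of the text, so the sketch below (ours) assumes it adds independent Gaussian pixel noise, and the image size K and noise level are arbitrary choices.

import numpy as np

def noisy_bars(K=6, p_on=0.4, noise_std=0.1, rng=np.random.default_rng(0)):
    """Generate one K x K 'noisy bars' image following the stages described
    above; the noise stage is an assumption (hypothetical noise_std)."""
    img = np.zeros((K, K))
    horizontal = rng.random() < 0.5            # global orientation, 50/50
    for k in range(K):
        if rng.random() < p_on:                # each bar on independently
            intensity = rng.uniform()          # uniform intensity per active bar
            if horizontal:
                img[k, :] += intensity
            else:
                img[:, k] += intensity
    return img + noise_std * rng.normal(size=(K, K))   # assumed noise stage

batch = np.stack([noisy_bars() for _ in range(16)])
print(batch.shape)          # (16, 6, 6)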
5. Results on handwritten digits

We trained a three-layer network on handwritten twos and threes from the CEDAR CDROM database (Hull, 1994). The 256-gray-scale digits were scaled down to an 8 x 8 grid, and the pixel values were rescaled to lie within [0, 1]. The data were divided into a training set of 2000 digits and a test set of 600 digits, with twos and threes being equally represented in both sets. A subset of the training data is shown in figure 5(a).

Figure 5. a) A small subset of the handwritten twos and threes used to train the network. b) Images generated by the trained network.

The network was similar to the one used for the bars problem: 64 visible units, 24 pairs of units in the first hidden layer, and a single pair of units in the top layer. It was trained with a learning rate of 0.01, followed by 0.02, and a small amount of weight decay; during learning, a few Gibbs sampling iterations were performed for each training case, with the first iterations discarded. The features learned by the linear-Gaussian units in the first hidden layer are shown in figure 6.

Figure 6. The weights from the linear-Gaussian units in the first hidden layer to the visible units. Some of the features are global, while others are
highly localized. The top binary unit is selecting the linear units in the first hidden layer that correspond to features found predominantly in threes, by exciting the corresponding binary units. Features that are exclusively used in twos are being gated out by the top binary unit, while features that can be shared between digits are being only slightly excited or inhibited. When the top binary unit is off, the features found in threes are inhibited by strong negative biases, while features used in twos are gated in by positive biases on the corresponding binary units. Examples of data generated by the trained network are shown in figure 5(b). The trained network was shown 600 test images, and 10 Gibbs sampling iterations were performed for each image. The top level binary unit was found to be off for 94% of twos, and on for 84% of threes. We then tried to improve classification by using prolonged Gibbs sampling. In this case, the first 300 Gibbs sampling iterations were discarded, and the activity of the top binary unit was averaged over the next 300 iterations. If the average activity of the top binary unit was above a threshold of 0.32, the digit was classified as a three; otherwise, it was classified as a two. The threshold was found by calculating the optimal threshold needed to classify 10 of the training samples under the same prolonged Gibbs sampling scheme. With prolonged Gibbs sampling, the average activity of the top binary unit was found to be below threshold for 96.7% of twos, and above threshold for 95.3% of threes, yielding an overall successful classification rate of 96% (with no rejections allowed). Histograms of the average activity of the top level binary unit are shown in figure 7.
Figure 7. Histograms of the average activity of the top level binary unit , after prolonged Gibbs sampling , when shown novel handwritten twos and threes . a) Average activity for twos in the test set . b ) Average activity for threes in the test set .
F = Σ_α Q_α E_α − ( −Σ_α Q_α ln Q_α )    (13)
If Q is the posterior distribution over hidden configurations given the visible configuration, then F is equal to the negative log probability of the visible configuration under the model defined by E. Otherwise, F exceeds the negative log probability of the visible configuration by the Kullback-Leibler divergence between Q and P:

F = −ln p(visible) + Σ_α Q_α ln(Q_α / P_α)    (14)

The EM algorithm (Neal and Hinton, 1993) consists of a full M step, which minimizes F with respect to the parameters that determine the energy function E given the distribution Q, and a full E step, which minimizes F with respect to Q given E and is achieved by setting Q to the exact posterior distribution over the hidden configurations. Viewing the algorithm as coordinate descent in F justifies partial M steps and partial E steps, which merely improve F with respect to the parameters or with respect to Q without fully minimizing it; in particular, a few sweeps of Gibbs sampling can be used as a partial E step. To eliminate the sampling noise, imagine that we have an infinite ensemble of identical networks, so that we can compute the exact distribution Q produced by a few sweeps of Gibbs sampling. We define Q^t and E^t to be the distribution and the energy function used during the partial E step at time t. Provided we start each partial E step from the distribution reached at the end of the previous one, F^(t+1) ≤ F^t, because the partial M step ensures that:

Σ_α Q_α^t E_α^(t+1) ≤ Σ_α Q_α^t E_α^t    (15)
while Gibbs sampling, however brief, ensures that:
Σ_α ( Q_α^(t+1) E_α^(t+1) + Q_α^(t+1) ln Q_α^(t+1) ) ≤ Σ_α ( Q_α^t E_α^(t+1) + Q_α^t ln Q_α^t )    (16)
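The relationship between Eqs. (13) and (14) is easy to check numerically on a toy model in which the joint probabilities P_α of the hidden configurations (together with a fixed visible configuration) are written down explicitly; the construction below is ours.

import numpy as np

rng = np.random.default_rng(0)

# Toy model: P_alpha = p(alpha, visible) for 8 hidden configurations; the
# remaining probability mass belongs to other visible configurations.
P = rng.dirichlet(np.ones(8)) * 0.3          # p(visible) = P.sum() = 0.3
E = -np.log(P)                               # energies, as in the text
posterior = P / P.sum()

def free_energy(Q):
    return np.sum(Q * E) + np.sum(Q * np.log(Q))    # Eq. (13)

Q = rng.dirichlet(np.ones(8))                # an arbitrary distribution over alpha
kl = np.sum(Q * np.log(Q / posterior))

print(free_energy(Q), -np.log(P.sum()) + kl)        # equal: Eq. (14)
print(free_energy(posterior), -np.log(P.sum()))     # equal when Q is the posterior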
In practice, we try to approximate an infinite ensemble by using a very small learning rate in a single network so that many successive partial E steps are performed using very similar energy functions. But it is still nice to know that with a sufficiently large ensemble it is possible for a simple learning algorithm to improve a bound on the log probability of the visible configurations even when the Gibbs sampling is far from equilibrium. Changing the parameters can move the equilibrium distribution further from the current distribution of the Gibbs sampler. The E step ensures that the Gibbs sampler will chase this shifting equilibrium distribution. One worrisome consequence of this is that the equilibrium distribution may end up
very far from the initial distribution of the Gibbs sampler. Therefore, when presented with a new data point for which we don't have a previous remembered Gibbs sample, inference can take a very long time since the Gibbs sampler will have to reach equilibrium from its initial distribution. There are at least three ways in which this problem can be finessed:

1. Explicitly learn a bottom-up initialization model. At each iteration t, the initialization model is used for a fast bottom-up recognition pass. The Gibbs sampler is initialized with the activities produced by this pass and proceeds from there. The bottom-up model is trained using the difference between the next sample produced by the Gibbs sampler and the activities it produced bottom-up.

2. Force inference to recapitulate learning. Assume that we store the sequence of weights during learning, from which we can obtain the sequence of corresponding energy functions. During inference, the Gibbs sampler is run using this sequence of energy functions. Since energy functions tend to get peakier during learning, this procedure should have an effect similar to annealing the temperature during sampling. Storing the entire sequence of weights may be impractical, but this procedure suggests a potentially interesting relationship between inference and learning.

3. Always start from the same distribution and sample briefly. The Gibbs sampler is initialized with the same distribution of hidden activities at each time step of learning and run for only a few iterations. This has the effect of penalizing models with an equilibrium distribution that is far from the distributions that the Gibbs sampler can reach in a few samples starting from its initial distribution.² We used this procedure in our simulations.

²The free energy, F, can be interpreted as a penalized negative log likelihood, where the penalty term is the Kullback-Leibler divergence between the approximating distribution Q_α and the equilibrium distribution (Eq. 14). During learning, the free energy can be decreased either by increasing the log likelihood of the model, or by decreasing this KL divergence. The latter regularizes the model towards the approximation.
7. Conclusion
We have described a probabilistic generative model consisting of a hierarchical network of binary units that select a corresponding network of linear units. Like the mixture of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), the binary units gate the linear units, thereby choosing an appropriate set of linear units to model nonlinear data. However, unlike the mixture of experts, each linear unit is its own expert, and any subset of experts can be selected at once, so we call this network a hierarchical community of experts.

Acknowledgements

We thank Peter Dayan, Michael Jordan, Radford Neal and Michael Revow for many helpful discussions. This research was funded by NSERC and the Ontario Information Technology Research Centre. GEH is the Nesbitt Burns Fellow of the Canadian Institute for Advanced Research.

References
Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall, London.
Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1 [ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz], Department of Computer Science, University of Toronto.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158-1161.
Hinton, G. E., Dayan, P., and Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8(1):65-74.
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Roy. Soc. London B, 352:1177-1190.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3:79-87.
Jordan, M. I. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.
Neal, R. M. and Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript [ftp://ftp.cs.utoronto.ca/pub/radford/em.ps.z], Department of Computer Science, University of Toronto.
AN INFORMATION-THEORETIC ANALYSIS OF HARD AND SOFT ASSIGNMENT METHODS FOR CLUSTERING

MICHAEL KEARNS
AT&T Labs - Research
Florham Park, New Jersey

YISHAY MANSOUR
Tel Aviv University
Tel Aviv, Israel

AND

ANDREW Y. NG
Massachusetts Institute of Technology
Cambridge, Massachusetts

Abstract. Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, that is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
1. Introduction

Algorithms for density estimation, clustering and unsupervised learning are an important
tool in machine learning . Two classical algorithms are the K -
means algorithm (MacQueen, 1967; Cover and Thomas, 1991; Duda and Hart, 1973) and the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). These algorithms have been applied in a wide variety of settings, including parameter estimation in hidden Markov models for speech
recognition (Rabiner and Juang, 1993), estimation of conditional probability tables in belief networks for probabilistic inference (Lauritzen , 1995) , and various clustering problems (Duda and Hart , 1973) . At a high level , K -means and EM appear rather similar : both perform a two -step iterative optimization , performed repeatedly until convergence. The first step is an assignment of data points to "clusters " or density mod els, and the second step is a reestimation of the clusters or density models based on the current assignments . The K -means and EM algorithms differ
only in the manner in which they assign data points (the first step). Loosely speaking, in the case of two clusters¹, if P0 and P1 are density models for the two clusters, then K-means assigns z to P0 if and only if P0(z) ≥ P1(z); otherwise z is assigned to P1. We call this hard or Winner-Take-All (WTA) assignment. In contrast, EM assigns z fractionally, assigning z to P0 with weight P0(z)/(P0(z) + P1(z)), and assigning the "rest" of z to P1. We call this soft or fractional
assignment . A third natural alternative would be to
again assign z to only one of Po and Pl (as in K -means) , but to randomly assign it , assigning to Po with probability Po(z)/ (Po(z) + Pl (z)) . We call this posterior assignment . Each of these three assignment methods can be interpreted as classifying
points as belonging to one (or more) of two distinct populations, solely on the basis of probabilistic models (densities) for these two populations. An alternative interpretation
is that we have three different ways of inferring
the value of a "hidden" (unobserved) variable, whose value would indicate which
of two sources had generated
an observed
data point . How these
assignment methods differ in the context of unsupervised learning is the subject of this paper. In the context of unsupervised learning, EM is typically viewed as an algorithm for mixture density estimation. In classical density estimation, a finite training set of unlabeled data is used to derive a hypothesis density. The goal is for the hypothesis density to model the "true" sampling density as accurately as possible, typically as measured by the Kullback-Leibler
densities
.
(KL) divergence. The EM algorithm can be used to find a mixture density model of the form α0 P0 + (1 − α0) P1. It is known that the mixture model found by EM will be a local minimum of the log-loss (Dempster et al., 1977) (which is equivalent to a local maximum of the likelihood), the empirical analogue of the KL divergence. The K-means algorithm is often viewed as a vector quantization algo-
rithm (and is sometimes referred to as the Lloyd-Max algorithm in the vector quantization literature). It is known that K-means will find a local minimum of the distortion or quantization error on the data (MacQueen, 1967), which we will discuss at some length. Thus, for both the fractional and WTA assignment methods, there is a
natural and widely used iterative optimization heuristic (EM and K -means, respectively) , and it is known what loss function is (locally) minimized by each algorithm (log-loss and distortion , respectively) . However, relatively little seems to be known about the precise relationship between the two loss functions and their attendant heuristics . The structural similarity of EM and K - means often leads to their
being considered
closely related
or
even roughly equivalent. Indeed, Duda and Hart (Duda and Hart , 1973) go as far as saying that K -means can be viewed as " an approximate way to obtain maximum likelihood estimates for the means" , which is the goal of density estimation in general and EM in particular . Furthermore , K -means is formally equivalent to EM using a mixture of Gaussians with covariance
matrices εI (where I is the identity matrix) in the limit ε → 0. In practice, there is often some conflation of the two algorithms: K-means is sometimes used in density estimation applications due to its more rapid convergence, or at least used to obtain "good" initial parameter values for a subsequent execution of EM.
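For concreteness, the three assignment rules can be written down in a few lines; the sketch below (ours, not the authors') takes the two density values at a point z and returns the WTA, fractional, and posterior assignments.

import numpy as np

rng = np.random.default_rng(0)

def assignments(p0, p1, rng=rng):
    """The three assignment methods discussed above for a point z with
    density values p0 = P0(z) and p1 = P1(z)."""
    wta = 0 if p0 >= p1 else 1                        # hard / Winner-Take-All (K-means)
    frac = p0 / (p0 + p1)                             # soft: weight given to P0 (EM)
    post = 0 if rng.random() < p0 / (p0 + p1) else 1  # posterior: random hard assignment
    return wta, frac, post

print(assignments(p0=0.20, p1=0.05))   # e.g. (0, 0.8, usually 0)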
But there are also simple examples in which K-means and EM converge to rather different solutions, so the preceding remarks cannot tell the entire story. What quantitative statements can be made about the systematic differences between these algorithms and loss functions? In this work, we answer this question by giving a new interpretation of the classical distortion that is locally minimized by the K-means algorithm. We give a simple information-theoretic decomposition of the expected distortion that shows that K-means (and any other algorithm seeking to minimize the distortion) must manage a trade-off between how well the data are balanced or distributed among the clusters by the hard assignments, and the accuracy of the density models found for the two sides of this assignment. The degree to which the data are balanced among the clusters is measured by the entropy of the partition defined by the assignments. We refer to this trade-off as the information-modeling trade-off.

The information-modeling trade-off identifies two significant ways in
which K-means and EM differ. First, EM is explicitly concerned with modeling the sampling density Q: it seeks a mixture α0 P0 + (1 − α0) P1 that is a good model of Q. K-means, in contrast, is concerned with identifying distinct subpopulations of Q, and with finding density models P0 and P1 that separately model each of these subpopulations well; the sampling density itself may be modeled poorly by the P0 and P1 that K-means finds. Second, the entropy of the partition defined by the hard assignments has a strong influence on K-means, but is entirely absent from, and has little or no influence on, EM. The intuitive similarity between the two algorithms is apparent, but the decomposition identifies these less obvious mathematical differences, as we shall see.

In addition to letting us predict and explain the differing behavior of K-means and EM on simple, specific examples, the decomposition allows us to derive a rather general bias of K-means: it will tend to find density models P0 and P1 that have less "overlap" with each other than those found by EM. In certain settings, this bias gives K-means an incentive to output unequal weightings of the subpopulations, and it allows us to study the effect of the weighting on the density models that are actually found. We also study variants of the loss functions mentioned above and the iterative optimization heuristics they determine, and we analyze the posterior assignment method, showing that despite its algebraic similarity to the soft assignments of EM, it differs rather dramatically from both K-means and EM. Our results should be of some interest for essentially all of the unsupervised learning problems in which such assignment methods are used.

2. A Loss Decomposition for Hard Assignments

Suppose that we are given a pair of densities P0 and P1 over X, and a (possibly randomized) mapping F that maps each point z ∈ X to either 0 or 1; we think of F as "assigning" points to exactly one of P0 and P1, and F may flip coins to determine the assignment. In other words, F always makes a "hard" assignment for each z. We refer to the triple (F, {P0, P1}) as a partitioned density. In all of the hard assignment methods of interest to us, F will actually be determined by P0 and P1 (and perhaps some additional parameters),
but we will suppress the dependency of F on these quantities for notational brevity . As simple examples of such hard assignment methods , we have the
two methods discussed in the introduction: WTA assignment (used by K-means), in which z is assigned to P0 if and only if P0(z) ≥ P1(z), and what we call posterior assignment, in which z is assigned to P_b with probability P_b(z)/(P0(z) + P1(z)). The soft or fractional assignment method used by EM does not fall into this framework, since z is fractionally
assigned to
both Po and Pl .
Throughout the development , we will assume that unclassified data is drawn according to some fixed , unknown density or distribution Q over X that we will call the sampling density . Now given a partitioned density
(F, {P0, P1}), what is a reasonable way to measure how well the partitioned density "models" the sampling density Q? As far as the P_b are concerned, as we have mentioned, we might ask that the density P_b be a good model
of the sampling density Q conditioned on the event F (z) = b. In other words , we imagine that F partitions Q into two distinct subpopulations , and demand that Po and PI separately model these subpopulations . It is
not immediately clear what criteria (if any) we should ask F to meet; let us defer this question for a moment .
Fix any partitioned density (F, {P0, P1}), and define for any z ∈ X the partition loss

χ(z) = E[ −log(P_{F(z)}(z)) ]    (1)
where the expectation is only over the (possible) randomization in F. We have suppressed the dependence of χ on the partitioned density under consideration for notational brevity, and the logarithm is base 2. If we ask that the partition loss be minimized, we capture the informal measure of goodness proposed above: we first use the assignment method F to assign z to either P0 or P1; and we then "penalize" only the assigned density P_b
by the log loss −log(P_b(z)). We can define the training partition loss on a finite set of points S, and the expected partition loss with respect to Q, in the natural
ways .
Let us digress briefly here to show that in the special case that Po and
P1 are multivariate Gaussian (normal) densities with means μ0 and μ1, and identity covariance matrices, and the partition F is the WTA assignment method, then the partition loss on a set of points is equivalent to the well-known distortion or quantization error of μ0 and μ1 on that set of points (modulo some additive and multiplicative constants). The distortion of z with respect to μ0 and μ1 is simply (1/2) min(||z − μ0||², ||z − μ1||²) = (1/2)||z − μ_{F(z)}||², where F(z) assigns z to the nearer of μ0 and μ1 according to Euclidean distance (WTA assignment). Now for any z, if P_b is the d-dimensional Gaussian (1/(2π)^(d/2)) e^(−(1/2)||z − μ_b||²) and F is WTA assignment
with respect to the P_b, then the partition loss on z is
−log(P_{F(z)}(z)) = log( (2π)^(d/2) e^((1/2)||z − μ_{F(z)}||²) )    (2)
                  = (1/2)||z − μ_{F(z)}||² log(e) + (d/2) log 2π.    (3)
The first term in Equation (3) is the distortion times a constant, and the second term is an additive constant that does not depend on z, P0 or P1. Thus, minimization of the partition loss is equivalent to minimization of the distortion. More generally, if z and μ are equal dimensioned real vectors, and if we measure distortion using any distance metric d(z, μ) that can be expressed as a function of z − μ (that is, the distortion on z is the smaller of the two distances d(z, μ0) and d(z, μ1)), then again this distortion
is the special case of the partition
loss in which the density
P_b is P_b(z) = (1/Z) e^(−d(z, μ_b)), and F is WTA assignment. The property that d(z, μ) is a function of z − μ is a sufficient condition to ensure that the normalization factor Z is independent of μ; if Z depends on μ, then the partition loss will include an additional μ-dependent term besides the distortion, and we cannot guarantee in general that the two minimizations are equivalent.
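The equivalence asserted in Eqs. (2)-(3) is easy to confirm numerically for the identity-covariance Gaussian case; the check below (ours) uses an arbitrary point and means, and base-2 logarithms, as in the text.

import numpy as np

rng = np.random.default_rng(0)
d = 3
z, mu0, mu1 = rng.normal(size=(3, d))

def gauss(z, mu):
    # identity-covariance Gaussian density, as in the text
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum((z - mu) ** 2))

# WTA assignment: pick the density (equivalently, the nearer mean)
mu_F = mu0 if gauss(z, mu0) >= gauss(z, mu1) else mu1

partition_loss = -np.log2(gauss(z, mu_F))
distortion = 0.5 * min(np.sum((z - mu0) ** 2), np.sum((z - mu1) ** 2))

print(partition_loss)
print(distortion * np.log2(np.e) + (d / 2) * np.log2(2 * np.pi))   # equal, Eq. (3)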
Returning to the development , it turns out that the expectation of the partition loss with respect to the sampling density Q has an interesting decomposition and interpretation . For this step we shall require some basic but important definitions . For any fixed mapping F and any value b E
{ O, I } , let us define Wb== PrzEQ[F (z) == b], so Wo+ WI = 1. Then we define Qb by
Qb(Z) == Q(z) . Pr [F (z) == b]/ Wb
(4)
where here the probability is taken only over any randomization of the mapping F . Thus , Qb is simply the distribution Q conditioned on the event
F (z) == b, so F "splits" Q into Qo and QI : that is, Q(z) == woQo(z) + WIQI (z) for all z . Note that the definitions of Wb and Qb depend on the partition F (and therefore on the Pb, when F is determined by the Pb). Q:
Now we can write
the expectation
of the partition
loss with
respect to
EzEQ[X(Z)] WOEzoEQo [- log(Po(zo))] + wIEztEQt [- log(PI (ZI))]
(5)
wOEzoEQo [log ~ -IOg (Qo (zo )] +WlEZ1 EQI[log-Ql(Zl) 1~(-;;-~)- - log(Q1(z1))] -
woKL (QoIIPo) + wIKL (QIIIP1) + wo1l(Qo) + wl1l (QI ) woKL(QoIIPo) + wIKL (Ql //P1) + 1l (Q/F ).
(6) (7) (8)
HARD AND SOFT ASSIGNMENTSFOR CLUSTERING
501
Here KL (Qbll~ ) denotes the Kullback-Leibler divergencefrom Qb to Pb, and 1l (QIF ) denotes 1l (zIF (z)) , the entropy of the random variable z , distributed according to Q, when we are given its (possibly randomized) assignment F (z) . This decomposition will form the cornerstone of all of our subsequent arguments, so let us take a moment to examine and interpret it in some detail . First , let us remember that every term in Equation (8) depends on all of F , Po and Pl , since F and the Pb are themselvescoupled in a way that depends on the assignmentmethod. With that caveat, note that the quantity KL (QbIIPb) is the natural measure of how well Pb models its respective side of the partition defined by F , as discussedinformally above. Furthermore, the weighting of these terms in Equation (8) is the natural one. For instance, as Woapproaches0 (and thus, Wl approaches1) , it becomeslessimportant to make KL (QoIIPo) small: if the partition F assigns only a negligible fraction of the population to category 0, it is not important to model that category especially well, but very important to accurately model the dominant category 1. In isolation, the terms woKL(QoIIPo) + wlKL (QlIIPI ) encourage us to choose Pb such that the two sides of the split of Q defined by Po and PI (that is, by F ) are in fact modeled well by Po and Pl . But these terms are not in isolation. The term 1l (QIF ) in Equation (8) measuresthe informativeness of the partition F defined by Po and PI , that is, how much it reducesthe entropy of Q. More precisely, by appealing to the symmetry of the mutual information I (z , F (z)) , we may write (where z is distributed according to Q) : 1l (QIF ) = = = =
1l (zIF (z)) 1 (z) - I (z , F (z)) 1 (z) - (1 (F (z)) - 1 (F (z) lz)) 1 (z) - (1l2(wo) - 1 (F (z) lz))
(9) (10) (11) (12)
where 1l2(p) = - plog (p) - (l - p) log(l - p) is the binary entropy function . The term 1 (z) = 1 (Q) is independent of the partition F . Thus, we see from Equation (12) that F reducesthe uncertainty about z by the amount 1l2(WO ) - 1 (F (z) lz) . Note that if F is a deterministic mapping (as in WTA assignment) , then 1l (F (z) lz) = 0, and a good F is simply one that maximizes 1l (WO ). In particular , any deterministic F such that wo = 1/ 2 is optimal in this respect, regardlessof the resulting Qo and Ql . In the general case, 1l (F (z) Iz) is a measureof the randomnessin F , and a good F must trade off between the competing quantities 1l2(WO ) (which, for example, is maximized by the F that flips a coin on every z) and - 1l (F (z) lz) (which is always minimized by this same F ) . Perhaps most important , we expect that there may be competition between the modeling terms woKL (QoIIPo) + wlKL (Q11IP1 ) and the partition
502
MICHAELKEARNSET AL.
information term 1 (Q IF ). If Po and PI are chosenfrom some parametric class P of densities of limited complexity (for instance, multivariate Gaussian distributions ), then the demand that the KL (QbIIPb) be small can be interpreted as a demand that the partition F yield Qb that are "simple" (by virtue of their being well -approximated , in the KL divergence sense, by den-
sities lying in P ). This demand may be in tension with the demand that F be informative , and Equation (8) is a prescription for how to manage this competition , which we refer to in the sequel as the information -modeling trade - off .
Thus, if we view Po and PI as implicitly defining a hard partition (as in the case of WTA assignment) , then the partition loss provides us with one particular way of evaluating the goodness of Po and P1 as models of the sampling density Q . Of course, there are other ways of evaluating the
Pb, one of them being to evaluate the mixture (1/ 2)Po + (1/ 2)Pl via the KL divergence KL (QII(1/ 2)Po+ (1/ 2)Pl ) (we will discussthe more general case of nonequal mixture coefficients shortly ) . This is the expressionthat is (locally ) minimized by standard density estimation approachessuch as EM , and we would particularly
like to call attention to the ways in which
Equation (8) differs from this expression. Not only does Equation (8) differ by incorporating the penalty 1 (Q IF ) for the partition F , but instead of asking that the mixture (1/ 2)Po + (1/ 2)Pl model the entire population Q , each Pb is only asked to - and only given credit for - modeling its respective Qb. We will return to these differences in considerably more detail in Section
4.
We close this section by observing that if Po and Pl are chosen from a class P of densities , and we constrain F to be the WTA assignment method for the Pb, there is a simple and perhaps familiar iterative optimization algorithm for locally minimizing the partition loss on a set of points S over all choices 'of the Pb from P - we simply repeat the following two steps until
convergence
:
- (WTA Assignment) Set 80 to be the set of points z E 8 such that Po(z) ~ P1(z) , and set 81 to be S - So.
- (Reestimation ) ReplaceeachPbwith argminpEP { - EzESblog(P(z))} . As we have already noted , in the case that the ~
are restricted to be
Gaussian densities with identity covariance matrices (and thus, only the means are parameters) , this algorithm reduces to the classical K -means algorithm . Here we have given a natural extension for estimating Po and PI from a general parametric class, so we may have more parameters than just the means . With some abuse of terminology , we will simply refer to our generalized version as K -means. The reader familiar with the EM algorithm for choosing Po and PI from P will also recognize this algorithm as simply
HARD
a
" hard
"
or
mixture
AND
WTA
SOFT
assignment
coefficients It
is
easy
partition
to
us
fact
K
of
unweighted
will
result
EM
503
( that
is , where
the
) .
- means
chosen
from
special
P
case
that
K
trade
means
loss
- off
at
Equation
of
using
the
case
in
the
a
local
WTA
partition
minimum
of
assignment
loss
in
does
not
be
the
, for
, for
Finally
K
the
method
- means
each
, that
, Qb
that
we
can
can
,
.
loss
for
the
terms
also
change
( QblIPb
with
)
each
generalize
K
-
-
- means but
this
nonincreasing
iteration
Equation
is
litera
them are
in
this
by
to
K
terms
where
estimated
assigned
-
the
the
quantization
means
KL
of
examples
vector
, combined information
increase
each
see
the
, the points
loss the
not
that
will in
the
easily
will
mean
we
iteration of
- means manage
- means
not
observed
means
instance
example
, note
K
does indeed
at
true
K
implicitly
although
often
) that the
imply
( because
been
the
must
, this
increMej
h ~
, 1982
fact
that
not
It
minimizes
- means
iteration
will
.
( Gersho
must
locally K
. Note
any
( 8 ) the
- means
( 8 ) , implies
modeling
case
Pb
this
Equation
ture
equal
CLUSTERING
.
The
not
be that
over
rename .
convenience
with
verify
FOR
variant
must
loss
Let
ASSIGNMENTS
) .
( 8 )
to
the
K
- cluster
: K EQ
[X ( Z ) ]
== L
wiKL
( QiIIPi
)
+
1l
( QIF
) .
( 13
( 1l
( z ) ) -
)
i = l
Note
that
where
, as
z
now
is
an
3 .
O
in
Equation
( log
we
EM or
have
IlK
of
the
' P Po
ization
to
algorithm
aoPo
case
E
P
case
, of
( z )
( Reestimation
-
( Reweighting
:2::
( z ) that
for
( F general
densities
1l
K
)
S
a
) P1
Replace Replace
of
of
( F
, 1l
Set
So
to
( z ) , and each ao
set Pb
with
0
E
the
steps
be
the S1
with ISol
a
( z ) lz
( F
) ) ,
( z ) )
is
a
to
be
natural
space
X
and
outputs
gener
weighted
K
a ,
the
pair
' P
.)
and
ao
,
of
general
straightforward E
-
variant
,
( Again
1 / 2 ,
The
and
then
such
that
:
set to
be
argminpEP / ISI
Pb
unweighted
- assignment
[0 , 1 ] . is
the
forced
also
hard
points
weights for
three
a
over
K
choices
is
as
data
weight and
following
of are
) . There
densities
set
as
random
ao )
of
variant
coefficients
thought
' P
well
) -
densities
a
K
- assignment
K be
as
the
hard
mixture
class
with
( 1
a
of
input
Assignment
-
= = tl
, and
the
can
any as
executes
( WTA
is
that
. For
begins
repeatedly
- means
takes
, Pl
the
)
Q
.
is , where
- means
EM over
densities
-
, K
general
K
weighted
means
quantity
( that
of
( QIF to
- Means
noted
in
alization
) )
K
algorithm
) , 1l
according
( K
Weighted
As
( 11
distributed
.
of S
points -
So { -
z
E
S
. EZESb
log
( P
( z ) ) } .
504
MICHAELKEARNSET AL.
Now we can again ask the question: what loss function is this algorithm (locally ) minimizing ? Let us fix F to be the weighted WTA partition , given by F (z) = 0 if and only if aoPo(z) ~ (1 - aO)Pl (z) . Note that F is deterministic , and also that in general, ao (which is an adjustable parameter of the weighted K -means algorithm ) is not necessarily the same as Wo (which is defined by the current weighted WTA partition , and dependson Q) . It turns out that weighted K -meanswill not find Po and PI that give a local minimum of the unweighted K -means loss, but of a slightly different loss function whose expectation differs from that of the unweighted K means loss in an interesting way. Let us define the weightedK -means loss of Po and PIon z by
- log(a~-F(z)(l - ao)F(Z)PF(z)(Z))
(14)
where again , F is the weighted WTA partition determined by Po, PI and Qo. For any data set S , define Sb = { z E S : F (z ) = b} . We now show that weighted K -means will in fact not increase the weighted K -means loss on S with each iteration . Thus2
-zES L log (a.~-F(z)(1- ao )F(z)PF (z)(z)) - zESo L log (aoPo (z))- zESt L log ((1- aO )Pl(z)) - zESo L log (Po (Z))- zESt L log (P1 (Z)) -ISollog (ao )- ISlllog (l - ao ).
(15)
-
(16)
Now
- ISollog (ao) - IS111og (1- ao) -
ISol (ao) + TSl1og IS11 (1- ao)) - ISI ( TSl1og
(17)
which is an entropic expression minimized by the choice 0 = 1801/ 181. But this is exactly the new value of 00 computed by weighted K -means from the current assignments 80, 81. Furthermore , the two summations in Equation (16) are clearly reduced by reestimating Po from So and P1 from Sl to obtain the densities P~ and P{ that minimize the log-loss over So and 81 respectively , and these are again exactly the new densities computed 2Weare grateful to Nir Friedmanfor pointing out this derivationto us.
HARD
by
AND
weighted
K
means
loss
(
justifying
-
a
where
the
[
Wb
side
is
give
the
wo
)
=
-
[
PrzEx
(
ao
,
1
ao
estimate
ao
-
F
(
)
(
z
.
-
)
z
)
is
the
=
(
=
b
-
z
)
}
ao
(
z
)
)
as
)
=
)
in
( z
the
)
-
,
PI
}
the
)
on
S
weighted
at
K
each
-
iteration
,
PF
(
,
z
)
log
.
(
(
z
K
)
,
{
)
-
first
(
Po
F
,
{
wllog
(
term
Po
,
Pi
,
Pi
}
)
,
)
.
and
ISI
(
ISol
/
means
loss
for
this
ISI
large
=
=
Wo
,
)
-
two
Wi
)
=
is
(
we
know
that
K
simply
the
expect
)
hand
wo
much
we
18
terms
=
weighted
we
(
right
not
,
how
samples
the
wo
means
is
aD
last
is
-
-
The
there
K
1
on
}
(
weighted
/
-
]
distributions
F
of
)
ao
The
of
ISol
limit
weighted
have
Wo
for
=
iteration
,
F
]
(
but
ao
Po
We
binary
fixed
;
505
decreases
{
expected
before
the
a
means
,
?
loss
For
-
F
Q
1
PF
have
Thus
(
what
(
entropy
each
WOe
(
log
must
at
of
,
partition
-
of
CLUSTERING
.
between
we
)
loss
PI
~
F
cross
convergence
reassigns
a
entropy
this
K
14
density
-
[
-
weighted
this
expected
cross
about
(
FOR
(
and
log
the
,
sampling
EzEQ
just
and
say
=
=
Thus
Equation
Po
to
EzEQ
ASSIGNMENTS
of
fixed
respect
=
.
by
naming
for
with
means
given
our
Now
SOFT
-
,
1
-
can
at
means
empirical
Wo
-
+
wo
,
and
thus
-
Combining
Equation
sition
for
found
by
weighted
K
[
=
-
=
=
~
since
1 .
(
the
of
the
tion
mative
weighted
log
means
(
has
does
weight
K
-
means
wo
(
)
Z
)
wo
(
)
1
+
-
the
)
-
+
18
1l2
)
(
wo
and
gives
(
(
)
.
(
our
that
+
(
general
for
19
)
decompo
the
Po
,
Pi
-
and
ao
WI
QIIIPl
)
Pb
Q
(
,
1
will
)
1l
]
(
1l
no
+
QIF
(
2
Q
.
)
)
-
1l2
Q
ao
But
(
wo
,
we
may
finding
(
.
,
This
all
{
,
the
,
,
this
,
}
(
21
)
(
22
)
the
)
that
the
of
and
P1
thus
(
,
has
unweighted
the
introduc
finding
-
an
modeling
trade
the
)
introduction
Po
-
PI
from
F
fixed
20
of
Po
differs
towards
minimize
think
F
(
)
.
beyond
bias
to
)
partition
information
try
(
of
the
for
)
1l
)
First
the
is
)
QIIIPI
even
/
removed
there
Z
+
as
.
1
=
+
(
or
means
of
=
)
)
ways
ao
Z
)
wIKL
two
(
the
-
+
and
also
PF
)
definition
Qo
,
)
QIIIPl
(
)
Z
)
on
QoIIPo
fixing
algorithm
Wl
wIKL
in
Thus
(
QIIIP1
(
K
means
F
(
depend
has
;
)
)
wllog
)
of
.
ao
wIKL
our
Qo
8
wIKL
+
-
changed
F
-
wllog
)
woKL
to
partition
F
weighted
definition
the
"
-
not
corresponds
of
Wi
(
(
)
of
-
(
Equation
QoIIPo
goal
wllog
Equation
QoIIPo
(
-
with
QoIIPo
(
)
~
(
K
00
a
wolog
sum
the
-
(
)
,
(
unweighted
weight
changed
K
Q
)
goal
log
Wo
)
wo
means
woKL
generalization
minimizes
-
woKL
-
(
in
woKL
-
,
19
loss
=
Thus
log
(
partition
EZEQ
(
Wo
modeling
"
-
infor
off
-
for
terms
506
MICHAELKEARNSET AL.
woKL(QoIIPo) + wIKL (QIIIP1) only. Note, however, that this is still quite different from the mizture KL divergenceminimized by EM . 4. K -Means vs . EM : Examples In this section, we consider severaldifferent sampling densities Q, and compare the solutions found by K -means (both unweighted and weighted) and EM . In eachexample, there will be significant differencesbetweenthe error surfaces defined over the parameter space by the K -means lossesand the KL divergence. Our main tool for understanding these differenceswill be the loss decompositionsgiven for the unweighted K -meansloss by Equation (8) and for the weighted K -means loss by Equation (22) . It is important to remember that the solutions found by one of the algorithms should not be considered "better" than those found by the other algorithms: we simply have different loss functions, eachjustifiable on its own terms, and the choice of which loss function to minimize (that is, which algorithm to use) determines which solution we will find . Throughout the following examples, the instance spaceX is simply ~ . We compare the solutions found by (unweighted and weighted) EM and (unweighted and weighted) K -means when the output is a pair { Po, PI } of Gaussians over ~ - thus Po == N (ILo, O'o) and PI == N (JLI, 0' 1) , where JLo , 0'0, JLI, 0' 1 E ~ are the parameters to be adjusted by the algorithms . (The weighted versionsof both algorithms also output the weight parameter ao E [0, 1] .) In the caseof EM , the output is interpreted as representing a mixture distribution , which is evaluated by its KL divergencefrom the sampling density. In the case of (unweighted or weighted) K -means, the output is interpreted as a partitioned density, which is evaluated by the expected (unweighted or weighted) K -meanslosswith respect to the sampling density . Note that the generalization here over the classical vector quantization case is simply in allowing the Gaussiansto have non-unit variance. In each exampIe, the various algorithms were run on IO thousand examples from the sampling density; for these I -dimensional problems, this sample size is sufficient to ensure that the observed behavior is close to what it would be running directly on the sampling density. Example (A ) . Let the sampling density Q be the symmetric Gaussian mixt ure Q = o.5N (- 2, 1.5) + O.5N (2, 1.5) .
(23)
See Figure 1. Supposewe initialized the parameters for the algorithms as JLo== - 2, III == 2, and 0'0 == 0'1 == 1.5. Thus, eachalgorithm begins its search from the "true" parameter values of the sampling density. The behavior of unweighted EM is clear: we are starting EM at the global minimum of its expected loss function , the KL divergence; by staying where it begins, EM
HARD
can
enjoy
KL
divergence
a
of
The
term
= =
1
/
2
the
event
.
I F
)
(
z
)
Qo
the
O
.
)
of
best
choice
of
F
will
improve
the
same
us
the
of
provided
by
move
on
more
subtle
. 130
. 500
)
some
. 5
)
0 '
are
these
)
term
the
the
K
- means
,
2
added
of
. 5
as
,
the
We
)
,
.
)
the
the
Symmet
-
as
WTA
is
,
and
long
it
the
par
-
possible
to
degrad
make
optimal
. 5
mass
without
.
1
results
2
the
thus
)
,
towards
1
then
,
back
that
-
,
( QIF
as
-
0
( QoIIPo
conditions
1i
)
of
,
(
= =
moved
Furthermore
initial
N
moved
value
-
term
. 5
woKL
symmetric
)
on
reflected
1
.
:
( QIIIP1
is
0
,
)
yields
z
value
movements
from
for
.
it
has
initial
)
1 ,
2
terms
which
reflection
initial
the
-
tail
the
,
below
= =
(
tail
the
)
z
N
the
( QIIIP1
and
. 5
.
conditioned
,
1
since
( 8
wIKL
Q
,
parameters
0
and
Rather
,
on
only
~
is
above
( since
than
.
( 2
the
. 5
to
wIKL
the
= =
As
1
. 338
-
essentially
performance
is
K
the
means
- means
. 131
.
-
we
find
The
that
after
8
coarse
. 301
the
is
KL
that
the
the
various
of
the
easy
to
decomposi
that
to
of
that
been
behavior
approximation
behavior
out
.
have
is
)
divergence
to
superior
would
this
pushed
inferior
loss
it
( 24
been
,
point
of
1
have
means
justification
= =
is
,
.
1
Naturally
example
a
0 '
model
K
the
,
means
mixture
as
where
2
the
directly
a
K
= =
,
simple
provides
examples
III
expected
this
-
,
,
solution
reduced
to
its
in
sample
the
predicted
variances
while
finite
to
0 ' 0
Q
regarding
to
1
tail
than
0 ' 0
,
.
of
( 8
,
( since
1
on
the
behavior
2
converged
that
Equation
-
)
which
Q
the
of
z
,
Equation
choice
ofN
its
smaller
density
remark
2
in
if
is
or
of
of
given
( QoIIPo
-
-
choice
each
only
operation
weighted
and
,
(
with
this
initial
( that
presence
.
2
0
parameters
predict
Example
,
sampling
Let
directly
= =
)
by
has
-
for
and
tail
respect
experiment
= =
origin
the
starting
tion
. 5
the
N
the
smaller
( QbIIPb
means
Wo
the
from
0
not
of
value
the
Ilo
from
. 5
left
and
. 5
loss
if
the
here
examine
Qo
reflection
term
,
0
us
0
Q
:
irrelevant
woKL
below
the
for
yields
1
= =
that
is
,
with
wbKL
Performing
which
2
Qo
Ill
= =
with
unchanged
= =
0
terms
-
be
to
00
-
-
)
Notice
but
tail
optimal
K
0
,
be
prediction
,
(
should
terms
for
iterations
N
,
and
initially
achieved
~
0
( z
The
variance
remain
the
the
z
should
ILo
.
.
= =
the
Thus
0 ' 0
of
ing
.
Ilo
Let
EM
essentially
by
F
1
of
apply
movements
tition
,
mean
mean
choice
remarks
is
the
best
ric
or
it
reduces
final
= =
z
the
and
,
?
density
weighted
partition
simply
)
Clearly
moves
,
0
is
minimized
is
above
0
507
CLUSTERING
sampling
of
is
- means
expected
story
,
= =
the
already
( wo
= =
"
the
true
parameter
different
Equivalently
)
js
1l2
a
also
K
F
off
z
left
( Q
F
below
in
this
of
are
chopped
on
for
and
,
is
parameter
partition
however
models
same
unweighted
1
WTA
FOR
perfectly
The
decomposition
the
"
.
value
about
the
Wo
)
weighting
optimal
in
ASSIGNMENTS
that
the
What
SOFT
solution
0
absence
the
AND
-
cannot
EM
.
be
We
now
algorithms
is
.
(
competes
B
)
.
We
now
with
examine
the
KL
an
divergences
example
in
.
Let
which
the
the
sampling
term
density
1l
( QIF
Q
)
be
508
MICHAELKEARNSET AL.
the single unit -variance Gaussian Q(z) = N (O, 1); see Figure 2. Consider the initial
choice of parameters
/.Lo == 0 , 0' 0 == 1, and Pl at some very distant
location , say /.Lo= 100, 0'0 = 1. We first examine the behavior of un weighted
K -means. The WTA partition F defined by these settings is F ( z) = 0 if and only if z < 50. Since Q has so little mass above z = 50, we have Wo ~
1, and thus 1l (QIF ) ~ 1l (Q) : the partition is not informative . The term wlKL (QlIIPl ) in Equation (8) is negligible, since Wl ~ o. Furthermore, Qo ~ N (O, 1) becauseeven though the tail reflection described in Example (A) occurs again here, the tail ofN (O, 1) abovez == 50 is a negligible part of the density. Thus woKL(QoIIPo) ~ 0, so woKL(QoIIPo)+ WlKL (QlIIPl ) ~ o. In other words , if all we cared about were the KL divergence terms , these settings would be near-optimal . But the information -modeling trade -off is at work here: by moving Pl closer to the origin , our KL divergences may degrade , but we obtain a more informative partition . Indeed , after 32 iterations unweighted K -means converges
to
ILo== - 0.768, 0'0 == 0.602, III == 0.821, 0' 1 == 0.601
(25)
which yields Wo == 0 .509 .
The information -modeling tradeoff is illustrated nicely by Figure 3, where we simultaneously plot the unweighted K -means loss and the terms
woKL (QoIIPo) + wIKL (QIIIP1) and 1 2(Wo) as a function of the number of iterations during the run . The plot clearly shows the increase in 1 2(wo) (meaning a decreasein 1 (QIF )) , with the number of iterations , and an increase in woKL (QoIIPo) + wIKL (QIIIP1) . The fact that the gain in partition information is worth the increase in KL divergences is shown by the resulting decrease in the unweighted K -means loss. Note that it would be especially difficult to justify the solution found by unweighted K -means from the viewpoint of density estimation .
As might be predicted from Equation (22) , the behavior of weighted K -means is dramatically different for this Q , since this algorithm has no incentive to find an informative partition , and is only concerned with the KL divergence terms . We find that after 8 iterations it has converged to
ILo== 0.011, 0'0 == 0.994, ILl == 3.273, 0' 1 = 0.033 with
(26)
0 = Wo = 1.000. Thus , as expected , weighted K -means has chosen a
completely uninformative partition , in exchangefor making WbKL(QbIIPb) ~ o. The values of III and 0"1 simply reflect the fact that at convergence, P1 is assigned only the few rightmost points of the 10 thousand examples . Note that the behavior of both K -means algorithms is rather different
from that of EM , which will prefer Po = P1 = N (0, 1) resulting in the mixture (1/ 2)Po+ (1/ 2)PI = N (O, 1) . However, the solution found by weighted
HARDAND SOFTASSIGNMENTS FORCLUSTERING
509
K -means is "closer" to that of EM , in the sense that weighted K -means effectively eliminates one of its densities and fits the sampling density with a single Gaussian . Example ( C ) . A slight modification to the sampling distribution of Ex ample (B ) results in some interesting and subtle difference of behavior for our algorithms . Let Q be given by
Q == 0.9SN(0, 1) + 0.OSN(5, 0.1).
(27)
Thus, Q is essentially as in Example (B) , but with addition of a small distant "spike" of density; seeFigure 4. Starting unweighted K -meansfrom the initial conditions J1 ,0 = 0, 0"0 = 1, ILl == 0, 0"1 == 5 (which has Wo== 0.886, 1l (wo) == 0.513and woKL(QoIIPo)+ w1KL(Q11IP1 ) == 2.601) , we obtain convergenceto the solution ILo== - 0.219, 0'0 == 0.470, ILl == 0.906, 0' 1 == 1.979
(28)
which is shown in Figure 5 (and has Wo == 0.564, 1l (wo) == 0.988, and woKL (QolfPo) + wIKL (QIIIP1) == 2.850) . Thus, as in Example (B) , unweighted K -means starts with a solution that is better for the KL divergences, and worse for the partition information , and elects to degrade the former in exchangefor improvement in the latter . However, it is interesting to note that 1 (wo) == 1 (0.564) == 0.988 is still bounded significantly away from 1; presumably this is becauseany further improvement to the partition information would not be worth the degradation of the KL divergences. In other words, this solution found is a minimum of the K -means loss where there is truly a balanceof the two terms: movement of the parameters in one direction causesthe loss to increasedue to a decreasein the partition information , while movementof the parameters in another direction causes the loss to increasedue to an increasein the modeling error. Unlike Example (B) , there is also another (local) minimum of the unweighted K -means loss for this sampling density, at Po = 0.018, 0'0 = 0.997, III = 4.992, 0'1 = 0.097
(29)
with the suboptimal unweighted K -means loss of 1.872. This is clearly a local minimum where the KL divergenceterms are being minimized, at the expense of an uninformative partition (wo == 0.949) . It is also essentially the same as the solution chosenby weighted K -means (regardlessof the initial conditions) , which is easily predicted from Equation (22) . Not surprisingly, in this example weighted K -means convergesto a solution close to that of Equation (29) . Example (D ) . Let us examine a case in which the sampling density is a mixture of three Gaussians: Q = O.25N (- 10, 1) + o.5N (O, 1) + O.25N (10, 1).
(30)
510
MICHAELKEARNSET AL.
See Figure 6. Thus , there are three rather distinct subpopulations of the sampling density . If we run unweighted K -means on 10 thousand examples
from Q from the initial conditions J.Lo= - 5, J.Ll = 5, 0'0 = 0"1 = 1, (which has Wo= 0.5) we obtain convergenceto ILo= - 3.262, 0'0 = 4.789, ILl = 10.006, 0' 1 == 0.977
(31)
which has Wo == 0.751. Thus , unweighted K -means sacrifices the initial optimally informative partition in exchange for better KL divergences .
(Weighted K -means convergesto approximately the same solution, as we might have predicted from the fact that even the unweighted algorithm did
not chooseto maximize the partition information .) Furthermore, note that it has modeled two of the subpopulations of Q (N (- 10, 1) and N (O, 1)) using Po and modeled the other (N (10, 1)) using Pl . This is natural "clustering " behavior -
the algorithm prefers to group the middle subpopulation
N (O, 1) with either the left or right subpopulation, rather than "splitting " it . In contrast , unweighted EM from the same initial conditions converges to the approximately symmetric solution
ILo== - 4.599, iTo== 5.361, III == 4.689, iT1== 5.376.
(32)
Thus , unweighted EM chooses to split the middle population between Po and Pl . The difference between K -means and unweighted EM in this example is a simple illustration
of the difference
between the two quantities
woKL (QoIIPo) + wIKL (QlIIP1) and KL (Q//ooPo+ (1 - oo)P1) , and shows a natural case in which the behavior of K -means is perhaps preferable from the clustering point of view . Interestingly , in this example the solution found by weighted EM is again quite close to that of K -means.
5. K -Means ForcesDifferent Populations The partition lossdecomposition givenby Equation(8) hasgivenus a betterunderstanding of thelossfunctionbeingminimized by K -means , and allowedusto explainsomeof the differences between K -meansandEM on specific , simpleexamples . Arethereanygeneraldifferences wecanidentify? In this sectionwegivea derivationthat stronglysuggests a biasinherentin theK -meansalgorithm:namely , a biastowardsfindingcomponent densities that areas "different " aspossible , in a senseto bemadeprecise . Let V(Po, PI) denotethe variationdistance 3 between the densities Po andPI: IPo(z) - PI(z) Idz. (33)
V(Po ,PI)=1
3The ensuing argument actually holds for any distance metric on densities .
HARDANDSOFTASSIGNMENTS FORCLUSTERING
511
Note that V(Po, PI) ~ 2 always. Noticethat due to the triangleinequality, for any partitioned density (F, { Po, PI}), V(Qo, Qi ) .s V(Qo, Po) + V(PO,Pi) + V(Ql , PI).
(34)
Let us assumewithout lossof generalitythat Wo== Pr .-z:EQ[F (z) == 0] ~ 1/ 2. Now in the caseof unweightedor weightedK -means(or indeed, any other casewherea deterministicpartition F is chosen ) , V(Qo, Ql ) == 2, so from Equation (34) we may write V(Po, PI) ~ 2 - V(Qo, Po) - V (QI, PI) (35) == 2 - 2(woV(Qo, Po) + WIV(QI , PI) + ((1/ 2) - Wo) V(Qo, Po) + ((1/ 2) - WI) V(QI , PI)) (36) ~ 2 - 2(woV(Qo, Po) + WIV(QI , PI)) - 2((1/ 2) - WO ) V(Qo, Po137 ) ~ 2 - 2(woV(Qo, Po) + WIV(QI , PI)) - 2(1 - 2wo). (38) Let us examine Equation (38) in somedetail . First , let us assumeWo== 1/ 2, in which case2(1 - 2wo) == o. Then Equation (38) lower bounds V (Po, PI ) by a quantity that approachesthe maximum value of 2 as V (Qo, Po) + V (Ql ' PI ) approachesO. Thus, to the extent that Po and Pl succeedin approximating Qo and Ql , Po and Pl must differ from each other . But the
partition lossdecomposition of Equation (8) includes the terms KL (QbIIPb) , which are directly encouraging Po and Pl to approximate Qo and Ql . It is true that we are conflating two different technical senses of approximation
(variation distance KL divergence) . But more rigorously, since V (P, Q) ~
2 In2vKL (PIIQj holdsfor anyP andQ, andfor all z wehaveJZ ~ z+ 1/ 4, we may
write
V (PO, Pi ) ~ 2 - 4ln2 (woKL(QoIIPo) + wiKL (QiIIPl ) + 1/ 4) - 2(1 - 2woX39) == 2 - ln2 - 4ln2 (woKL(QoIIPo) + wiKL (QiIIPl )) - 2(1 - 2woX.40) Since the expression woKL (QoIIPo) + wIKL (QIIIP1) directly appears in Equation (8) , we see that K -means is attempting to minimize a loss function that encourages V (Po, Pi ) to be large, at least in the case that the algorithm finds roughly equal weight clusters (wo ~ 1/ 2) - which one might expect to be the case, at least for unweighted K -means, since there
is the entropic term - 1i2(wo) in Equation (12) . For weighted K -means, this entropic
term is eliminated
.
In Figure 7, we show the results of a simple experiment supporting the suggestion that K -means tends to find densities with less overlap than EM
512
MICHAEL KEARNS ET AL.
does
.
In
the
experiment
dimensional
,
means
(
the
between
tion
distance
solid
lines
)
(
middle
6
.
the
top
perhaps
Pb
with
true
)
nice
ment
is
PI
the
call
same
. "
)
=
F
,
a
Qb
,
if
Po
and
derivation
pected
think
density
Q
=
=
(
partition
1
/
2
line
grey
next
section
top
grey
line
)
)
)
.
.
Po
+
(
,
1
/
2
(
under
PI
(
,
the
Pb
prior
of
course
,
assign
from
that
Pb
Po
,
occurred
were
,
Q
one
=
(
1
we
can
/
2
)
"
Qo
+
-
the
but
tail
-
and
when
Gaussian
with
and
as
when
components
this
-
WTA
even
Gaussian
,
,
WTA
resulting
that
each
to
the
which
to
,
Recall
fixed
z
.
"
namely
any
assign
Pb
)
mixture
.
we
compared
-
on
method
assign
Thus
by
)
sampling
zero
assignments
partition
partition
Qb
=
Pb
.
(
(
(
=
given
z
Z
the
use
)
We
[
+
F
(
Ql
z
)
(
of
WTA
reflected
(
1
/
2
Z
)
b
)
]
jwb
Qo -
(
(
z
Qb )
+
(
Z Ql ~
(
Z
)
)
o
.
.
Q
=
(
Thus
,
1
/
2
the
)
Po
the
+
KL
Equation
(
of
F
posterior
(
see
/
2
)
PI
8
)
.
encourage
example
this
by
the
)
QI
'
to
However
lead
,
competing
a
moment
it
the
)
(
42
)
(
43
)
(
44
)
ex
-
tempt
closer
-
to
situation
is
for
.
41
the
is
us
constraint
the
model
,
(
above
in
reason
will
in
then
us
For
.
the
'
terms
partition
of
an
1
divergence
assignments
because
will
)
=
)
WTA
again
=
Z
definition
will
,
Pr
by
this
.
)
that
)
this
Z
Qo
=
than
than
grey
assignment
randomly
z
truncation
)
were
(
such
estimation
subtle
=
under
that
hard
generated
(
mixture
=
QbIIPb
loss
density
A
true
that
(
-
then
)
are
(
posterior
"
posterior
QI
Z
(
partition
to
(
PI
wbKL
sampling
=
PI
was
the
Gaussian
varia
three
the
(
hard
we
+
is
)
EM
means
making
partition
QbIIPb
Qb
the
is
PI
)
z
(
the
(
in
if
Qo
z
posterior
as
was
-
But
Po
(
potential
KL
of
natural
that
Po
Example
form
terms
(
F
the
the
resulted
and
this
in
density
back
informative
We
avoids
the
assignment
/
that
of
it
make
sampling
more
.
way
Suppose
)
the
top
in
-
the
and
the
discussed
-
distance
by
of
K
another
density
property
that
have
Z
sampling
mentioned
not
.
(
probability
be
signment
is
natural
the
found
is
one
between
,
lowest
two
variation
reference
weighted
one
there
a
(
which
and
of
Partition
is
But
,
as
means
Posterior
Pb
that
One
The
probability
posterior
not
.
more
with
the
may
ing
Pl
even
to
)
)
mixture
the
solutions
-
,
lines
:
and
the
K
grey
shows
line
for
method
Po
assumption
So
three
assignment
of
dark
descent
Algorithm
WTA
Pl
a
distance
axis
unweighted
was
varying
vertical
(
gradient
Q
with
The
and
for
loss
basis
-
and
posterior
.
Gaussians
Po
,
density
Gaussians
)
target
)
New
sampling
variance
between
,
the
axis
line
A
The
-
two
of
the
z
unit
horizontal
the
near
,
an
FORCLUSTERING HARDANDSOFTASSIGNMENTS Now
on
,
a
under
fixed
E
the
point
[ X
(
z
)
posterior
z
]
partition
here
we
will
= =
E
= =
-
the
call
[ -
log
Po
- side
on
of
(
)
( z
( z
)
A
a
)
(
z
partition
loss
( z
)
of
S
)
is
.
log
Po
( z
)
-
over
the
Po
z
Recall
E
that
P1
( z
and
that
K
from
if
we
= =
the
o
. 5N
(
-
This
is
F
2
,
1
(
2:
)
I2 : )
now
)
possible
at
the
F
the
. 5
)
(
z
)
+
1l
)
the
log
P1
( z
)
}
)
( 45
of
loss
summation
of
)
F
.
the
density
. 5N
( 2
2
. 55
. 03
,
divergence
initial
= =
as
,
,
1
(
. 5
;
The
right
-
in
Example
)
( 46
be
2
conditions
)
,
of
Po
. 140
,
0 " 0
/
is
)
2
z
.
)
is
)
the
1
2
ILl
= =
to
( 2
,
1
( z
-
)
(
1
( 1
2
)
= =
,
,
2
( WO
)
1
= =
the
holds
is
at
,
least
reducing
the
away
1l2
(
1
/
2
the
)
= =
1
while
stated
training
initial
posterior
arising
-
of
still
it
by
from
)
posterior
symmetrically
( wo
!
choice
the
/
the
can
means
Thus
,
improved
-
Under
)
F
Under
partition
K
2
starting
. 5
away
be
the
instance
on
the
.
1
means
:
in
in
loss
finding
a
local
solution
2
. 129
all
,
four
has
initial
the
)
In
issues
the
)
= =
N
means
partition
.
1
solution
to
)
for
case
respect
Pl
(
( wo
preserve
,
= =
improved
probabilistic
-
results
QI
the
.
descent
for
2
zIF
their
. 256
= =
move
cannot
Q
algorithmic
This
. 64
1
(
to
the
1
PI
deterministic
1
to
= =
,
of
1
moving
)
)
will
divergences
was
gradient
.
2
(
informativeness
F
with
+
KL
the
F
the
. 5
informative
the
indeed
loss
2
as
able
is
1
are
for
by
value
/
,
,
maximally
solution
or
This
to
1
F
better
,
of
-
2
divergences
because
steps
opposed
on
0
gradients
an
conditions
sampling
0 '
1
= =
. 233
( 47
parameters
are
expected
Of
course
has
)
smaller
posterior
.
density
1
loss
,
the
increased
KL
from
.
algorithm
loss
53
.
absolute
of
What
)
a
KL
conditions
0
a
-
according
may
)
the
in
#
(
)
expression
was
posterior
point
0
)
discussion
ILo
which
)
)
PI
I z
,
a
of
than
)
we
( z
for
minimum
lz
and
,
( F
below
: 1
is
Po
values
( see
terior
)
posterior
sampling
o
N
since
but
initial
lz
= =
distributed
)
there
parameter
the
PI
.
unweighted
the
of
stated
F
origin
reducing
Qo
-
is
( z
the
(
of
from
2:
(
that
variances
and
gener8
( here
1l
,
conditions
1
,
but
of
so
our
term
partition
at
,
-
and
definition
initial
because
the
= =
,
doing
partition
these
Po
weighted
symmetrically
by
posterior
at
( both
origin
from
(
start
- means
preserved
,
Po
is
then
F
{
randomization
the
-
1l
,
)
+
the
the
S
( z
)
loss
simply
all
( z
only
partition
then
over
Revisited
Q
is
( F
]
taken
of
sample
)
Pl
P1
is
( 45
)
the
)
+
case
Equation
Example
( A
( z
special
loss
hand
PF
expectation
this
posterior
,
is
Po
where
F
513
should
a
sample
one
?
Here
use
it
seems
in
order
worth
to
minimize
commenting
the
expected
on
the
pos
algebraic
-
514
MICHAELKEARNSET AL.
similarity by
between
EM
. In
and
sample
and
Pi
Equation
( unweighted data
( 45 ) and
) EM
S , then
, if we
our
the
have
next
solution
L zES
(
Po ( z ) PO ( Z ) + PI
While
the
summand
appear
quite
the
log
prefactors -
log
( Pt
we
must
use
way
of
log
their must
no
obvious
to
posterior An
us Po
109 =
to
to
gradient
+
{ 1 / 2 ) P { , where
,
P6
-
more
two PI log
and
get
is
minimize
let
' P be
in
the
a smoothly on
the
, we
loss )
the
Pt
( 45 ) ,
. An
use
our
current
the
~ together ( P6 , Pi
. Thus
, there
posterior class
and
of
PI
loss
densities
to
,
posteriors
evaluate
well
Pt
informal
the of
, to
as
parameterized Po
posterior log - losses
solution
( using
expected
of
( 48 )
Equation
the
log - losses
( PJ , Pi
parameters
fiz
can
~
to
the
Equation
. In
by
posterior
minimize
Equation
weighted
guess
each the
of
a potential
EM
for
according to
next
evaluate
that
then
algorithm
Pb
determined
labels
. For
. In
resulting
our to
- side
Pb ( Z ) / ( PO ( z ) + P1 ( z ) )
guesses the
, giving
random
P ~ , Pi
difference
current
order
( 48 )
- hand
prefactors
posteriors
labels
descent
mixture
and
Pt
: in
.
) is
. An , and
minimize
the
.
even
fix
the
right
crucial
minimize
difference
the
a
the
posterior
we
( z ) ) ) , and
iterative
loss
standard Let
Pl
labels
is to
to
generate
generate
alternative resort
{ 1 / 2 ) P6
(P { (z )))
( :z:-) log
( z ) ) : our
then
present
the
)
is
the
( Pt
- losses
explaining
we
log
is
Pb ( ~ ) / ( Po ( Z ) + with
-
respect
the
( Po , Pl
, there
z , and
( z ) ) with decoupling
guess
performed
{ 1 / 2 ) Po + { 1 / 2 ) Pl
I ( Po ( z ) )
( 48 ) and
between
each
such
Equation
similar
- losses
for
no
of
in
is a decoupling
and
is
( Z ) log
+ PO ( Z PI ) ) +( ZPI
( 45 )
minimization solution
minimize
-
there
iterative
a current
intriguing
difference
log - loss densities
as
Po
representing
can
be
and
revealed
Plover
the
between
mixture
by X
, and
the
posterior
examining a point
( 1 / 2 ) Po
+
( 1 / 2 ) PI
( Z ) ) to
be
the
their z
EX
( 1 / 2 ) PI
8L ' o g 1 -1 8Po (z) In (2)Po (z)+Pl (z).
( ( 1 / 2 ) Po ( z ) +
loss
mixture
and
the
derivatives . If
' and
log - loss
we we on
. think
define z , then
(49)
This derivative has the expected behavior . First , it is always negative , meaning that the mixture log-loss on z is always decreased by increasing Po(z ) , as this will give more weight to z under the mixture as well . Second, as Po(z ) + P1(z ) --+ 0, the derivative goes to - 00. In contrast , if we define the posterior loss on z LPO6t
-PO )(Z)10 )(Z)log (ZPo )+(zP1 gPO (Z)- PO (ZPl )+(ZPl Pl(Z)(50)
HARDANDSOFTASSIGNMENTS FORCLUSTERING
515
thenwe0btain 8Lpolt 8Po(z)
~;)~~(;)[-log Po (z)+P ;;(~~~(;ylog Po (z) (51) +Po )(z)1og P1() (zPI )+(ZP1 Z-~1].
-
This
derivative
shows
further
loss and the posterior of the derivative
curious
loss . Notice
is determined
differences
that
since
between
by the bracketed
expression
If we define Ro ( z ) == Po ( z ) / ( Po ( z ) + P1 ( z ) ) , then can be rewritten as
which
is a function
Equation
of Ro ( z ) only . Figure
( 52 ) , with
the value
8LpO6t / 8Po ( z ) can actually
can
a repulsive
force
0 .218 ) . The
explanation
have Equation it small
probability
It
is interesting
have
explicit
the literature From for
the likely
to
note
repulsive
that effects
on K - means preceding to lead
than , say , classical the
fact
that
K - means , this
phenomenon
the
Po and
PI
can be shown
value
ratio
centroids
, it might
posterior
once
as poorly
be natural
' P. This
data
manner
.
points
proposed
in
et al . , 1991 ) . class
" from
that , as ' P would
one another
intuition
in the sense given
general
as possibly
to expect
a density
we
is , gives
as possible
been
( Hertz
are " different
over
each other in a fairly
have
z
Ro ( z ) ==
( that
in which
maps
loss over
PI that
the plot
( approximately
poorly
z be modeled
on distant
estimation repel
the
of z to PI as deterministic algorithms
to Po and
when
in
, the point
is straightforward
clustering
discussion
axis . From namely
critical
and self - organizing
density
of the expression
-
z somewhat
that
the assignment
K - means , minimizing
be more
this
as Po models
) , it is preferable
by Po , so as to make
occurs
a certain
expression
( 52 )
a plot
be positive
below
for
( 8 ) : as long
bracketed
( 51 ) .
~ In ( 2 )
8 shows
on Po . This
Po ( z ) / ( Po ( z ) + Pi ( z ) ) falls
log -
in Equation
of Ro ( z ) as the horizontal
we see that exhibit
this
1 - Ro ( z ) Ro ( z ) -
( 1 - Ro ( z ) ) log
the mixture
l / ( Po ( z ) + Pl ( Z ) ) ~ 0 , the sign
derives
from
above . As for
( details
omitted
).
516
MICHAELKEARNSET AL. ~ ~ .
0
C\ I ~
0
0 ~
0 00 0 0
<0 0 0
~ a a
C\ I 0 0
0 0
-6
-4
-2
0
2
4
6
Figure 1: The sampling density for Example (A).
~ 0
M .
0
C' ! 0
or
-
0
0 0
-4
-2
0
2
Figure 2: The sampling density for Example (B) .
4
HARDANDSOFTASSIGNMENTS FORCLUSTERING
517
Figure 3: Evolution of the K -means loss (top plot ) and its decomposition for Example (B) : KL divergenceswoKL(QoIIPo) + wIKL (Ql \\P1) (bottom plot ) and partition information gain 1 2( wo) (middle plot ) , as a function of the iteration of unweighted K -means running on 10 thousand examplesfrom Q == N (0, 1) .
~ 0
C\ I 0
.
.
-
ci
0 0
-6
-4
-2
0
2
4
6
Figure 4: Plot of the sampling mixture density Q = O.95N (O, 1) + 0.OSN(5, 0.1) for Example (C) .
518
MICHAEL KEARNS ET AL.
aJ a
<0 .
a
-. : t
0
~ a
0 0
-6
-4
-2
0
2
4
6
Figure 5: Po and PI found by unweighted K -means for the pIing density of Example (C).
0
C\ ! 0
&l ) ~
0
0 ~
0
&! ) 0
a
0
ci
- 10
-5
0
5
Figure 6: The sampling density for Example (D).
10
sam -
HARDAND SOFTASSIGNMENTS FORCLUSTERING
0
N
.................................................................................
' . . . ' .
"
,
..
it ' )
.
.
......
-
. .
Q) CJ C
,.-
"
, "
..,
" ! h "
.."....
. .
. . ... .'
U)
C 0
"
.'
. .. '
.....
m ~
" C
. . , . . . . . . .
"
..
..'
.... .'
0
.." ,
. '
.
..-
.. .'
.. .'
-
tU ": : tU >
.....................,.. it ) .
0
0 ,
a
0
1
2 distance
between
3
4
means
Figure 7: Variation distance V (Po, Pi ) as a funlction of the distance betweenthe sampling meansfor EM (bottom grey line), unweighted K -means (lowest of top three grey lines) , posterior loss gradient descent (middle to top three grey lines), and weighted K -means (top grey line) . The dark line plots V (Qo, Ql ) . M
N
y -
o
.. .. . . . ... .. .. . . . .. . . . . .
. .. .. . .. .. .. . .. . .. .. . ... . .. ... .. . . . .. .. .. . ... .. .. .. . .. .. . . . . .. . . .. .. .. . . . . .. .. .. . .. ... . . . ... . .. .. . . .. . . . . .. . . . .. ... . . .. . . . . . .. . . . . . ... . ..
y I
0 .0
0 .2
0 .4
r
0 .6
0 .8
1 .0
Figure 8: Plot of Equation (52) (vertical axis) as a function of Ro = Ro(z ) (horizontal axis) . The line y = 0 is also plotted as a reference.
519
520
MICHAELKEARNSET AL.
References T .M . Cover and J .A . Thomas .
Element . 0/ In / ormation
Theory . Wiley - Interscience ,
1991 .
A .P. Dempster , N .M . Laird , and D .B . Rubin . Maximum -likelihood from incomplete data via the em algorithm . Journal 0/ the Royal Stati , tical Society B , 39:1- 39, 1977. R .O . Duda and P .E . Hart . Pattern Cla , ..ification and Scene Analy . i . . John Wiley and Sons , 1973 .
A . Gersho . On the structure
of vector quantizers . IEEE
Tran , action . on In / ormation
Theory, 28(2):157- 166, 1982. J . Hertz , A . Krogh , and R .G . Palmer . Introduction to the Theor 'JIof Neural Computation . Addison - Wesley , 1991. S. L . Lauritzen . The EM algorithm for graphical association models with missing data . Computational Stati ..tic . and Data Analy . i . , 19:191- 201, 1995. J . MacQueen . Some methods for classification and analysis of multivariate observations . In Proceeding . of the Fifth Berkeley Sympo . ium on Mathematic . , Stati . tic . and Prob ability , volume 1, pages 281- 296, 1967. L . Rabiner and B . Juang . Fundamentall of Speech Recognition . Prentice Hall , 1993.
LEARNING HYBRID BAYESIAN NETWORKS FROM DATA
STEFANO
MONTI
Intelligent Systems Program University of Pittsburgh 901M CL, Pittsburgh , PA - 15260 AND GREGORY
F . COOPER
Center for Biomedical Informatics University of Pittsburgh 8084 Forbes Tower ,
Pittsburgh, PA - 15261
Abstract . We illustrate two different methodologies for learning Hybrid Bayesian networks , that is, Bayesian networks containing both continuous and discrete variables , from data . The two methodologies differ in the way of handling continuous data when learning the Bayesian network structure . The first methodology uses discretized data to learn the Bayesian network structure , and the original non-discretized data for the parameterization of the learned structure . The second methodology uses non-discretized data both to learn the Bayesian network structure and its parameterization . For the direct handling of continuous data , we propose the use of artificial neural networks as probability estimators , to be used as an integral part of the scoring metric defined to search the space of Bayesian network structures . With both methodologies , we assume the availability of a complete dataset , with no missing values or hidden variables . We report experimental results aimed at comparing the two method ologies. These results provide evidence that learning with discretized data presents advantages both in terms of efficiency and in terms of accuracy of the learned models over the alternative approach of using non-discretized data .
521
522 1.
STEFANO MONTIANDGREGORY F. COOPER
Introduction
Bayesian belief networks (BN s) , sometimes referred to as probabilistic net works , provide a powerful formalism for representing and reasoning under uncertainty
. The construction
of BNs with domain
experts often is a diffi -
cult and time consuming task [16]. Knowledge acquisition from experts is difficult because the experts have problems in making their knowledge explicit . Furthermore , it is time consuming because the information needs to be collected manually . On the other hand , databases are becoming increasingly abundant in many areas. By exploiting databases, the construction time of BN s may be considerably decreased. In most approaches to learning BN structures from data , simplifying assumptions are made to circumvent practical problems in the implementa tion of the theory . One common assumption is that all variables are discrete
[7, 12, 13, 23], or that all variables are continuous and normally distributed [20]. We are interested in the task of learning BNs containing both continuous and discrete variables , drawn from a wide variety of probability distri butions . We refer to these BNs as Hybrid Bayesian networks . The learning task consists of learning
the BN structure , as well as its parameterization
.
A straightforward solution to this task is to discretize the continuous variables , so as to be able to apply one of the well established techniques available for learning BNs containing discrete variables only . This approach has the appeal of being simple . However , discretization can in general generate spurious dependencies among the variables , especially if "local " dis-
cretization strategies (i .e., discretization strategies that do not consider the interaction between variables) are used1. The alternative to discretization is the direct modeling of the continuous data as such. The experiments described in this paper use several real and synthetic databases to investi gate whether the discretization of the data degrades structure learning and parameter
estimation
when using a Bayesian network
representation
.
The useof artificial neural networks (ANN s) as estimators of probability distributions presents a solution to the problem of modeling probabilistic relationships involving mixtures of continuous and discrete data . It is par ticularly attractive because it allows us to avoid making strong parametric assumptions about the nature of the probability distribution governing the relationships among the participating variables . They offer a very general semi-parametric technique for modeling both the probability mass of dis1Most discretization techniques have been devised with the classification task in mind , and at best they take into consideration the interaction between the class variable and the feature variables individually . "Global " discretization for Bayesian networks learning , that is , discretization taking into consideration the interaction between all dependent variables , is a promising and largely unexplored topic of research , recently addressed in
the work described in [19].
523
LEARNING HYBRID BAYESIAN NETWORKS FROM DATA
crete variables and the probability density of continuous variables . On the other hand , as it was shown in the experimental evaluation in [28) (where only discrete data was used) , and ~ it is confirmed by the evaluation reported in this paper , the main drawback of the use of ANN estimators is the computational cost associated with their training when used to learn the BN structure . In this paper we continue the work initiated in [28), and further explore the use of ANNs as probability distribution estimators , to be used as an integral part of the scoring metric defined to search the space of BN struc tures . We perform an experimental evaluation aimed at comparing the new learning method with the simpler alternative of learning the BN structure based on discretized data . The results show that discretization is an efficient and accurate method of model selection when dealing with mixtures of continuous and discrete data . The
rest
of
introduce
to
learn
and
s
In
probability
.
- based
Section
4 ,
results
procedure
,
and
of
as
network
In
Section
we
3 ,
the
describe
.
use
,
the with
a
and
with
some
our
2
artificial
in
Section
5
space
we
exper
for
-
learning based
on
paper
with
the
suggestions
BN as
present
proposed
conclude
,
of
networks
alternative
We
how
method
the
the
briefly of
learning
neural
of
we
basics
search
of
simple .
some
Section
to
efficacy
variables
In and
describe
Finally
evaluating it
.
used
the
continuous
results
we
metric
comparing the
follows
formalism
scoring
at
at of
discussion
organized
estimators aimed
discretization
further
the a
research
.
Background
A
Bayesian
is
a
belief
directed variables
}
of
probability 7I " x
example
to
of
the
a
links
a
Furthermore
the
simple
set
set
,
,
arcs
E
of
we
see can
tumor
( G X
=
=
{
the
of
x
in
and
by
metastatic cause
X
, Xl
cause
papilledema
O .
, giving
increase
G
)
I
Xi
.
, Xj
E
; a
derived
1 , in
( Xl in
X
,
part
causal )
Xi
and
X
we
#
0
P
)
is
is E
a X
give
,
an
from
[ 11
] .
interpretation is
a
total
( xs
, E
;
node
Figure
a
( X
representing
variables
Given
In
==
}
variables2
cancer an
) , where
, . . . , xn
domain
structure ,
can
( Xi
in
parents
that
, P
{ Xl
domain
instantiations
also
, 0
among of
structure
it
brain
triple
nodes
network
network
that
a
of of
the
Bayesian
the
and
set
dependencies
over
displayed ) ,
by
a
instantiations
denote
at
( X3
) '
with
possible
to
looking
tumor
and
distribution
use
defined
with
probabilistic
space
we
is
graph ,
representing
the
By
network
acyclic
domain X j
( X2
data
ANN
distribution
imental
.
is belief
from
the .
paper
Bayesian
BN
define
structures
2
the
the
cause
of
serum
) ,
and
brain
calcium
both
brain
2An instantiation w of all n variables in X is an n-uple of values { xi , . . . , x~} such that Xi = x~ for i = 1 . . . n .
524
STEFANO MONTIANDGREGORY F. COOPER -
P (XI ) P (x2IXl ) P (x21 Xl ) P (x31xI ) P (x31 Xl ) P (x4Ix2 , X3) P (X4\ X2, X3) P (x41 X2, X3) P (X41X2, X3) P (x51x3) P (x51 X3)
0.2 0.7 0.1 0.6 0.2 0.8 0.3 0.4 0.1 0.4 0.1
Xl : X2: X3: X4: X5:
tumor
and
an increase
a coma The
key feature
is usually
calcium
of each variable
is their
events
( domain
to as the Markov
the
Bayesian
its parents
1T ' i , with
conditional
network
8i is represented corresponding
entry
in the
table
P ( X~ 11T ' ~, 8i ) for a given probability the
probabilities
in
the
by means to the
instantiation belief
example
all the
of the variable
complete
probability
for the distribu
network
needed
, with
refer are dis -
of a lookup
table ,
probability
Xi and its parents of X . In
-
P ( Xi l7ri , 8i )
variables
conditional
,
. This
can then fact , it
7ri .
be com has
been
shown [29 , 35 ] that the joint probability of any particular instantiation all n variables in a belief network can be calculated as follows :
of
(1)
-
n
-
-
-
from
instantiation
of any
its parents
, and it allows
. For
1 , where
of
) . In particular
9i the set of parameters
with
puted
to lapse
representation
distributions
probability
of Figure
given
joint
conditional
crete , each set of parameters
The
a patient
variables
property
of the multivariate
of the univariate
Xi given
explicit
of its non - descendants
referred
characterize
each
papilledema
can cause
networks
among
representation
over X in terms
ence to the
coma
set of nodes x = { Xl , X2, Xa, X4, X5} , and parent { X2, X3} , 7rx5 = { X3} . All the nodes represent domain { True , False} . We use the notation Xi tables give the values of p (Xi l7rxi ) only , since
serum
of Bayesian
is independent
parsimonious
to fully
in total
independence
each variable
tion
brain tumor
(X4 ) .
conditional property
total serum calcium
X5
Figure 1. A simple belief network , with sets 7rX} = 0, 7rx2 = 7rx3 = { Xl } , 7rx4 = binary variables , taking values from the to denote (Xi = False ) . The probability p (Xi l7rxi ) = 1 - p (Xi l7rxi ) .
into
metastatic cancer
-
-
-
-
-
-
P ( x ~ , . . . , x ~ ) = II P (x ~ 17r~i ' 8i ) . i ==l
-
-
guide to the literature -
3For a comprehensive
[6].
on learning probabilistic
networks , see
LEARNINrGHYBRIDBAYESIAN NETWORKS FROMDATA 2 .1. LEARNING
BAYESIAN
BELIEF
525
NETWORKS3
In a Bayesian framework , ideally classification and prediction would be performed by taking a weighted average over the inferences of every possible BN containing the domain variables4 . Since this approach is usually computationally infeasible , due to the large number of possible Bayesian networks , often an attempt has been made to select a high scoring Bayesian network of this
for classification
. We will assume this approach
in the remainder
paper .
The basic idea of the Bayesian approach is to maximize the probability
P (Bs I V ) = P (Bs , V )j P (V ) of a network structure Bs given a database of casesV . Becausefor all network structures the term P (V ) is the same, for the purpose of model selection it suffices to calculate P (Bs , ' D) for all Bs .
So far , the Bayesian metrics studied in detail typically rely on the fol -
lowing assumptions: 1) given a BN structure, all cases in V are drawn independently from the same distribution (random sample assumption); 2) there are no caseswith missing values (complete databaseassumption; some more recent studies have relaxed this assumption [1, 8, 10, 21, 37]); 3) the parameters of the conditional probability distribution of each variable
are independent (global parameter independenceassumption); and 4) for discrete variables
the parameters
associated
with each instantiation
of the
parents of a variable are independent (local parameter independence assumption ) . The last two assumptions can be restated more formally as
follows. Let 8Bs == { 8l , . . . , 8n} be the complete set of parameters for the BN structure Bs , with each of the 8i 's being the set of parameters that
fully characterize the conditional probability P (Xi l7ri). Also, when all the variables in 7ri are discrete, let 8i = { Oil' . . . ' Oiqi} ' where Oij is the set of parameters defining a distribution that corresponds to the j -th of the qi possible instantiations of the parents 7ri. From Assumption 3 it follows that P (8Bs I Bs ) = IIi P (8i I Bs ), and from assumption 4 it follows that
P(8i I Bs) = IIj P (8ij I Bs) [36]. The application of these assumptions allows for the following factoriza -
tion of the probability P (Bs , V ): n
P(Bs,V) = P(Bs)P(V IBs) = P(Bs) II S(Xi,7ri,V) ,
(2)
i= l
where each S(Xi, 7ri, V ) is a term measuring the contribution of Xi and its parents 7ri to the overall score of the network
structure
Es . The exact form
of the terms S(Xi 7ri , V ) slightly differs in the Bayesian scoring metrics de4Seethe work described in [24, 25] for interesting applications of the Bayesian model averaging approach .
526
STEFANO MONTIANDGREGORY F. COOPER
fined so far , and for the details we refer the interested reader to the relevant literature [7, 13, 23] . To date , closed-form expressions for S(Xi 7ri, V ) have been worked out for the cases when both Xi and 7ri are discrete variables , or when both Xi and 7ri are continuous (sets of ) variables normally distributed ; little work has been done in applying BN learning methods to domains not satisfying these constraints . Here , we only describe the metric for the discrete case defined by Cooper and Herskovits in [13], since it is the one we use in the experiments . Given a Bayesian network Bs for a domain X , let ri be the number of states of variable Xi , and let qi = I1XsE7rirs be the number of possible instantiations of 7ri . Let (}ijk denote the multinomial parameter correspond ing to the conditional probability P (Xi = k l7ri = j ), where j is used to index the instantiations of 7ri, with (}ijk > 0, and Ek (}ijk = 1. Also , given the database V , let Nijk be the number of cases in the database where Xi = k and 7ri = j , and let N ij = Ek N ilk be the number of cases in the database where 7ri = j , irrespective of the state of Xi . Given the assumptions described above, and provided all the variables in X are discrete , the probability P (V , Bs ) for a given Bayesian network structure Bs is given by
nqi r(ri) ri P(V,Bs )=P(Bs )gjllr(Nij +Ti )Elr(Nijk ),
(3)
where Γ is the gamma function.5 Once a scoring metric is defined, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 9]. Various heuristics have been proposed to find network structures with a high score. One such heuristic is known as K2 [13], and it implements a greedy forward-stepping search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes a non-informative prior over parameters and structures. In particular, the prior probability distribution over the network structures is assumed to be uniform, and thus it can be ignored in comparing network structures.
As previously stated, the Bayesian scoring metrics developed so far assume either discrete variables [7, 13, 23] or normally distributed continuous variables [20]. In the next section, we propose a generalization that allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions.
5 Cooper and Herskovits [13] defined Equation (3) using factorials, although the generalization to gamma functions is straightforward.
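As an illustration of how Equation (3) and the K2-style search fit together, the following is a minimal Python sketch. It is not the authors' implementation; the function and variable names are ours, it computes the log of the per-variable term in Equation (3), and it assumes `data` is an integer array whose columns are the discrete variables and `candidates` are the variables preceding X_i in the assumed ordering.

    import numpy as np
    from math import lgamma
    from itertools import product

    def local_score(data, i, parents, arity):
        """Log of the per-variable CH/K2 term of Equation (3),
        computed with log-gamma for numerical stability."""
        r_i = arity[i]
        score = 0.0
        # enumerate every joint instantiation j of the parents
        for j in product(*(range(arity[p]) for p in parents)):
            mask = np.ones(len(data), dtype=bool)
            for p, v in zip(parents, j):
                mask &= data[:, p] == v
            n_ijk = np.bincount(data[mask, i], minlength=r_i)
            n_ij = int(n_ijk.sum())
            score += lgamma(r_i) - lgamma(n_ij + r_i)
            score += sum(lgamma(n + 1) for n in n_ijk)
        return score

    def k2_parents(data, i, candidates, arity, max_parents=3):
        """Greedy forward-stepping (K2-style) selection of parents for X_i."""
        parents, best = [], local_score(data, i, [], arity)
        improved = True
        while improved and len(parents) < max_parents:
            improved = False
            gains = [(local_score(data, i, parents + [c], arity), c)
                     for c in candidates if c not in parents]
            if gains:
                s, c = max(gains)
                if s > best:
                    parents, best, improved = parents + [c], s, True
        return parents, best

Because the metric is decomposable, the search is run independently for each variable, and the network score is the sum of the selected local scores.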
3. An ANN-based scoring metric
In this section we describe in detail the use of artificial neural networks as probability distribution estimators, to be used in the definition of a decomposable scoring metric for which no restrictive assumptions need to be made on the functional form of the class, or classes, of the probability distributions of the participating variables. The first three of the four assumptions described in the previous section are still needed. However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the different instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data. Furthermore, the use of ANNs allows for the seamless representation of probability functions containing both continuous and discrete variables.
Let us denote with D_l = {C_1, ..., C_{l-1}} the set of the first l-1 cases in the database, and with x_i^(l) and π_i^(l) the instantiations of X_i and π_i in the l-th case, respectively. The joint probability P(B_S, D) can be written as
    P(B_S, D) = P(B_S) P(D | B_S) = P(B_S) \prod_{l=1}^{m} P(C_l | D_l, B_S) = P(B_S) \prod_{l=1}^{m} \prod_{i=1}^{n} P(x_i^{(l)} | \pi_i^{(l)}, D_l, B_S) .    (4)

If we assume uninformative priors over structures, that is, a uniform prior P(B_S) over the network structures, the prior can be neglected in comparing network structures. The probability P(B_S, D) in Equation (4) is clearly decomposable. In fact, we can interchange the two products in Equation (4) so as to obtain

    P(B_S, D) = P(B_S) \prod_{i=1}^{n} \Bigl[ \prod_{l=1}^{m} P(x_i^{(l)} | \pi_i^{(l)}, D_l, B_S) \Bigr] = P(B_S) \prod_{i=1}^{n} S(X_i, \pi_i, D) ,    (5)

where each S(X_i, π_i, D) is the term in square brackets in Equation (5). The form of Equation (5) suggests an interpretation of the score in terms of the prequential analysis discussed by Dawid [14, 15]. Each term S(X_i, π_i, D) can be interpreted as a measure of how successful the network structure B_S is at predicting the value of X_i given the value of its parents π_i, where the prediction for each case C_l is carried out sequentially, based only on the cases already seen (i.e., on D_l). The name prequential (predictive sequential) derives from this interpretation; the prequential approach is theoretically sound, and corresponds to a form of cross-validation of the model as the data become available.
From a Bayesian perspective, each of the P(x_i | π_i, D_l, B_S) terms should be computed as follows:

    P(x_i | \pi_i, D_l, B_S) = \int P(x_i | \pi_i, \theta_i, B_S) \, P(\theta_i | D_l, B_S) \, d\theta_i .
In most cases this integral does not have a closed-form solution; the following MAP approximation can be used instead:

    P(x_i | \pi_i, D_l, B_S) = P(x_i | \pi_i, \tilde{\theta}_i, B_S) ,    (6)

with θ~_i the posterior mode of θ_i, i.e., θ~_i = argmax_{θ_i} { P(θ_i | D_l, B_S) }. As a further approximation, we use the maximum likelihood (ML) estimator θ^_i instead of the posterior mode θ~_i. The two quantities are actually equivalent if we assume a uniform prior probability for θ_i, and are asymptotically equivalent for any choice of positive prior. The approximation of Equation (6) corresponds to the application of the plug-in prequential approach discussed by Dawid [14].
Artificial neural networks can be designed to estimate θ^_i in both the discrete and the continuous case. Several schemes are available for training a neural network to approximate a given probability distribution, or density. In the next section, we describe the softmax model for discrete variables [5], and the mixture density network model introduced by Bishop in [2] for modeling conditional probability densities.
Notice that even if we adopt the ML approximation, the number of terms to be evaluated to calculate P(D | B_S) is still very large (mn terms, where m is the number of cases, or records, in the database, and n is the number of variables in X), in most cases prohibitively so. The computational cost can be reduced by introducing a further approximation. Let θ^_i(l) be the ML estimator of θ_i with respect to the dataset D_l. Instead of estimating a distinct θ^_i(l) for each l = 1, ..., m, we can group consecutive cases in batches of cardinality t, and estimate a new θ^_i(l) for each addition of a new batch to the dataset D_l, rather than for each addition of a new case. Therefore, the same θ^_i(l), estimated with respect to the dataset D_l, is used to compute each of the t terms P(x_i^(l) | π_i^(l), θ^_i(l), B_S), ..., P(x_i^(l+t-1) | π_i^(l+t-1), θ^_i(l), B_S). With this approximation we implicitly make the assumption that, given our present belief about the value of each θ_i, at least t new cases are needed to revise this belief. We thus achieve a t-fold reduction in the computation needed, since we now need to estimate only m/t θ^_i's for each X_i, instead of the original m. In fact, application of this approximation to the computation of a given S(X_i, π_i, D) yields:
    S(X_i, \pi_i, D) = \prod_{l=1}^{m} P(x_i^{(l)} | \pi_i^{(l)}, \hat{\theta}_i(l), B_S) \approx \prod_{k=0}^{m/t - 1} \prod_{l=tk+1}^{t(k+1)} P(x_i^{(l)} | \pi_i^{(l)}, \hat{\theta}_i(tk+1), B_S) .    (7)

With regard to the selection of the batch cardinality t, we can choose a constant value, but the estimate θ^_i would then become increasingly insensitive to the addition of new cases as the training set D_l grows. A scheme in which t is a function of the number of cases already seen seems preferable. For example, assuming we set the increment to t = ceil(0.5 |D_l|), i.e., each new batch corresponds to a 50% increase of the training data already seen, the estimate θ^_i would be updated after cases 1, 2, 3, 5, 8, 12, 18, 27, 41, and so on. With this incremental scheme the estimate is very sensitive to new data when the training set is small, while an increasingly larger number of additional cases is needed to make a significant difference when the training set is large; for example, when the data set already contains 10,000 cases, it seems unlikely that the addition of a few new cases will make a significant difference, and a doubling of the data set would be required before the estimate is revised.

4. ANN estimators of probability distributions

In this section we describe the ANN estimators for the conditional probability distributions P(x_i | π_i): the softmax model for discrete variables [5], and the mixture density network model introduced by Bishop in [2], for modeling the conditional probability densities of continuous variables.

4.1. SOFTMAX MODEL FOR DISCRETE VARIABLES

Let X_i be a discrete variable taking one of the values v_k, k = 1, ..., r_i, and let π_i be its set of parents, with π_i^d denoting the discrete variables in π_i. The softmax model represents the conditional probability distribution P(X_i | π_i) with a neural network whose input units correspond to the parents π_i and whose r_i output units correspond to the r_i values of X_i. It is common practice to use an indicator-variable representation for the discrete parents, whereby a discrete parent X_j taking r_j values is encoded by r_j - 1 input units; the number of input units needed for the discrete parents is thus r_{π_i} = Σ_{X_j ∈ π_i^d} (r_j - 1), while each continuous parent requires a single input unit. The conditional probability of the k-th value of X_i is then defined as

    P(X_i = v_k | \pi_i) = \frac{ e^{f_k(\pi_i)} }{ \sum_{j=1}^{r_i} e^{f_j(\pi_i)} } ,    (8)

where f_k(π_i) is the value of the k-th output unit of the network when π_i is presented as input. The output of the k-th unit can thus be interpreted as the conditional probability of class membership P(X_i = v_k | π_i). It has been proved that a neural network trained by minimizing a sum-of-squares or cross-entropy
error function leads to network outputs that estimate the Bayesian a posteriori probabilities of class membership [3, 32].
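The following Python sketch illustrates the softmax parameterization of Equation (8). It is only an illustration under our own assumptions: the single tanh hidden layer is not prescribed by the text, and all names (one_hot, softmax_conditional, W1, b1, W2, b2) are hypothetical.

    import numpy as np

    def one_hot(value, arity):
        """Indicator encoding of a discrete parent value (arity - 1 inputs)."""
        v = np.zeros(arity - 1)
        if value < arity - 1:
            v[value] = 1.0
        return v

    def softmax_conditional(parent_values, parent_arities, W1, b1, W2, b2):
        """P(X_i = v_k | pi_i) for k = 1..r_i, as in Equation (8): a feed-forward
        network whose r_i outputs f_k are passed through a softmax."""
        x = np.concatenate([one_hot(v, r) for v, r in zip(parent_values, parent_arities)])
        h = np.tanh(W1 @ x + b1)        # hidden layer (an assumption of this sketch)
        f = W2 @ h + b2                 # one output f_k per state of X_i
        e = np.exp(f - f.max())         # numerically stable softmax
        return e / e.sum()

Given trained weights, the returned vector sums to one and its k-th entry is the estimated conditional probability of the k-th state.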
4.2. MIXTURE DENSITY NETWORKS FOR CONTINUOUS VARIABLES

The approximation of probability distributions by means of finite mixture models is a well established technique, widely studied in the statistics literature [17, 38]. Bishop [2] describes a class of network models that combine a conventional neural network with a finite mixture model, so as to obtain a general tool for the representation of conditional probability distributions. The probability P(x_i | π_i, D_l, B_S) can be approximated by a finite mixture of normals, as illustrated in the following equation (where we drop the conditioning on D_l and B_S for brevity):

    P(x_i | \pi_i) = \sum_{k=1}^{K} \alpha_k(\pi_i) \, \phi_k(x_i | \pi_i) ,    (9)

where K is the number of mixture components, 0 <= α_k <= 1 and Σ_k α_k = 1, and where each of the kernel functions φ_k(x_i | π_i), k = 1, ..., K, is a normal density of the form

    \phi_k(x_i | \pi_i) = c \, \exp\Bigl\{ - \frac{ (x_i - \mu_k(\pi_i))^2 }{ 2 \sigma_k(\pi_i)^2 } \Bigr\} ,    (10)

with c the normalizing constant, and μ_k(π_i) and σ_k(π_i)^2 the conditional mean and variance, respectively. The parameters α_k(π_i), μ_k(π_i), and σ_k(π_i)^2 can be considered as continuous functions of π_i. They can therefore be estimated by a properly configured neural network. Such a neural network will have three outputs for each of the K kernel functions in the mixture model, for a total of 3K outputs. The set of input units corresponds to the variables in π_i. It can be shown that a Gaussian mixture model such as the one given in Equation (9) can, with an adequate choice for K, approximate to an arbitrary level of accuracy any probability distribution. Therefore, the representation given by Equations (9) and (10) is completely general, and allows us to model arbitrary conditional distributions. More details on the mixture density network model can be found in [2].
Notice that the mixture density network model assumes a given number K of kernel components. In our case, this number is not given, and needs to be determined. The determination of the number of components of a mixture model is probably the most difficult step, and a completely general solution strategy is not available. Several strategies are proposed in [31, 33, 38]. However, most of these techniques are computationally expensive, and given our use of mixture models, minimizing the computational cost of the selection process becomes of paramount importance.
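A minimal sketch of the mixture density network output of Equations (9) and (10) is given below. The softmax link for the mixing coefficients and the exponential link for the variances follow Bishop's usual MDN construction, but are assumptions here, as are all the names used.

    import numpy as np

    def mdn_density(x_i, parent_vec, W1, b1, W2, b2, K):
        """Conditional density of Equations (9)-(10): a network with 3K outputs
        giving mixing weights alpha_k, means mu_k and variances sigma_k^2."""
        h = np.tanh(W1 @ parent_vec + b1)
        out = W2 @ h + b2                      # shape (3K,)
        a = np.exp(out[:K] - out[:K].max())
        alpha = a / a.sum()                    # mixing coefficients, sum to 1
        mu = out[K:2 * K]                      # conditional means
        sigma2 = np.exp(out[2 * K:])           # positive conditional variances
        phi = np.exp(-(x_i - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
        return float(alpha @ phi)

Training adjusts the weights so that the negative log of this density, summed over the training cases, is minimized; the same network is shared by all instantiations of the parents.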
Given a set of alternative model orders K = {1, ..., K_max}, we consider two alternative strategies for the selection of the best model order. The first strategy is based on a test on a held-out dataset ("hold-out"): the training data D is split into a training set D^train and a test set D^test; for each K in K, a mixture density network model M_K is trained on D^train, and the model order K that maximizes P(D^test | θ^_K, M_K) is selected, where θ^_K is the ML estimator of the parameters of M_K. The second strategy is based on the Bayesian Information Criterion (BIC) [30, 34], which provides the following asymptotic approximation to the marginal likelihood of the model M_K:

    \log P(D | M_K) \approx \log P(D | \hat{\theta}_K, M_K) - \frac{d}{2} \log N ,    (11)

where d is the number of parameters of the model (in our case, the number of weights of the ANN), and N is the size of the dataset. The model order K that maximizes Equation (11) is selected.

4.3. ANN TRAINING

Given a network structure B_S, the computation of each score S(X_i, π_i, D) requires the estimation of the ANN parameters θ^_i(l) (the network weights and, through them, the outputs α_k(π_i), μ_k(π_i), and σ_k(π_i) of the mixture density network) for each prequential term P(x_i^(l) | π_i^(l), θ^_i(l), B_S). Since training a new ANN from scratch for each prequential term would be computationally too costly, the weights of the ANN trained for one prequential term are used to initialize the ANN for the subsequent term; the weights of the first ANN are randomly initialized with values in the interval (-.5, .5). Since only a relatively small number of new cases is added to the training set from one term to the next, this initialization usually results in much faster convergence of the optimization. The number of hidden units of each ANN is fixed in advance, and the optimization of the network weights is carried out with a conjugate-gradient backpropagation algorithm [27]. Currently, we do not use any regularization technique or other explicit control of over-fitting in the training of the ANNs. For each variable X_i, the ANN is re-estimated only when a new batch of cases is added to D_l (following the batch-
updating scheme described at the end of Section 3). This strategy will be particularly beneficial for a large-sized D_l, where the addition of a new case (or a few new cases) will not change significantly the estimated probability.

5. Experimental evaluation

In this section we describe the experimental evaluation we conducted to test the viability of the ANN-based scoring metric. We first describe the experimental design. We then present the results and discuss them.
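Algorithm A1, described in Section 5.1 below, relies on discretizing each continuous variable into bins that hold roughly the same number of data points. A minimal sketch of such an "equal-density" discretizer, under our own naming and using quantile cut points, is:

    import numpy as np

    def equal_density_bins(values, n_bins):
        """Assign each continuous value to one of n_bins bins chosen so that
        every bin holds approximately the same number of data points."""
        cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.searchsorted(cuts, values, side="right")

The returned integer codes (0, ..., n_bins - 1) replace the continuous values before the discrete scoring metric of Equation (3) is applied.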
5.1. EXPERIMENTAL DESIGN

The experimental evaluation is aimed at determining whether the use of the new ANN-based scoring metric, which is applicable to both discrete and continuous data, offers any advantage over the simpler approach of discretizing the continuous variables and applying a scoring metric for discrete variables. To this end, we compare two learning algorithms. Algorithm A1 first discretizes the continuous variables, and then searches for the BN structure with the highest score using the metric of Equation (3). Algorithm A2 searches for the BN structure with the highest score using the ANN-based scoring metric defined by Equations (7), (8), and (9), which is applied to the original, untransformed data.
With regard to the discretization, we used a simple "equal-density" discretization technique, whereby the range of each continuous variable is partitioned into a given number of contiguous bins, each bin containing an approximately equal number of data points, and each value is assigned the discrete value corresponding to the bin it belongs to. We would expect algorithm A1 to be faster than algorithm A2, due to the closed form of Equation (3) and to the cost of training the ANN estimators, but possibly less accurate, due to the information loss caused by the discretization.
The evaluation is aimed at two goals: testing the predictive accuracy of the models learned by the two algorithms, and testing their capability of discovering structure, in particular the set of parents of a given response variable. With regard to the first goal, the use of real data is the most appropriate. With regard to the second goal, the use of simulated data, generated from BNs whose structure and parameterization are fully known, is more appro-
priate, since the generating BN represents the gold standard with which we can compare the model(s) selected by the learning procedure. To assess the predictive accuracy of the two algorithms, we measured the mean square error (MSE) and the log-score (LS) with respect to the class variable y on a test set distinct from the training set. The mean square error is computed with the formula
    MSE = \frac{1}{L} \sum_{D^{test}} \bigl[ y^{(l)} - \hat{y}(\pi_y^{(l)}) \bigr]^2 ,    (12)
where D^test is a test set of cardinality L, y^(l) is the value of y in the l-th case of D^test, and y^(π_y^(l)) is the value of y predicted by the learned BN for the given instantiation π_y^(l) of y's parents. More specifically, y^(π_y^(l)) is the expectation of y with respect to the conditional probability P(y | π_y^(l)). Similarly, the log-score LS is computed with the formula
    LS = - \log p(D^{test}) = - \sum_{D^{test}} \log p\bigl( y^{(l)} \mid \pi_y^{(l)} \bigr) ,    (13)
where p(y^(l) | π_y^(l)) is the conditional probability of the observed y^(l) in the learned BN.
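A minimal sketch of these two test-set measures, under our own naming (the arrays passed in are assumptions, not part of the original text):

    import numpy as np

    def mse_and_log_score(y_true, y_pred, p_true):
        """MSE (Equation (12)) and log-score LS (Equation (13)) over a test set.
        y_pred[l] is the expectation of y under P(y | pi_y) for the l-th case;
        p_true[l] is the probability the learned BN assigns to the observed y."""
        y_true, y_pred, p_true = map(np.asarray, (y_true, y_pred, p_true))
        mse = float(np.mean((y_true - y_pred) ** 2))
        ls = float(-np.sum(np.log(p_true)))
        return mse, ls

Lower values of both measures indicate a better model.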
both For
MSE
the
and LS the lower
evaluation
with
real
the score the better databases
, we
used
the model . databases
from
the
data repository at UC Irvine [26]. In particular , we used the databases AUTO - MPG and
variable
ABALONE
can be treated
the class variable
. These
databases
were
as a continuous
variable
is miles - per - gallon
selected
because
. In the database
their
class
AUTO - MPG
. In the database ABALONE the class
variable is an integer proportional
to the age of the mollusk , and can thus
be treated
. The
as a continuous
variable
database
AUTO - MPG has a total
of
392 cases over eight variables , of which two variables are discrete , and six are
continuous
. The
variables , of which were
normalized
database
ABALONE
only one variable
has
a total
of 4177
cases
is discrete . All continuous
over
nine
variables
.
Since we are only interested
in selecting
the set of parents
of the re -
sponse variable , the only relevant ordering of the variables needed for the search algorithm is the partial ordering that has the response variable as the successor
of all the other
variables
.
All the statistics reported are computed over ten simulations. In each simulation, 10% of the cases are randomly selected as the test set, and the learning algorithms use the remaining 90% of the cases for training. Notice that in each simulation the test set is the same for the two algorithms.
For the evaluation with simulated databases, we designed the experiments with the goal of assessing the capability of the scoring metrics to correctly identify the set of parents of a given variable. To this purpose,
Figure 2. General structure of the synthetic BNs used in the experiments, with π_y denoting the set of parents of the response variable y, and X' the set of remaining variables.

we used randomly generated synthetic BNs, structured as follows. Let X = {X_1, ..., X_n} denote the set of variables in the domain, and let y denote the designated response variable. A set π_y of n_π variables is randomly selected from X \ {y} as the set of parents of y, and the remaining variables X' = X \ ({y} U π_y) are used as parents of the variables in π_y; that is, the variables in X' can influence y only indirectly, through π_y, and y is conditionally independent of X' given π_y. Figure 2 shows the prototypical structure of the synthetic BNs used in these experiments. All the variables are continuous, and the conditional probability of each variable given its parents is modeled as a finite mixture of linear models, as follows:

    P(x_i | \pi_i) = \sum_{k=1}^{K} \alpha_k \, N\bigl( \mu_k(\pi_i), \sigma_k \bigr) , \qquad \mu_k(\pi_i) = \beta_{0k} + \sum_{X_j \in \pi_i} \beta_{jk} x_j ,    (14)

where N(μ, σ) denotes a Normal distribution with mean μ and standard deviation σ. The standard deviations σ_k and the regression parameters β_jk are real numbers randomly drawn from uniform distributions. The interval from which the regression parameters are drawn deserves some care: with regression parameters drawn from a fixed interval, the resulting conditional distribution would be unlikely to depart significantly from a singly peaked curve as the number of mixture components
increases. Therefore, we choose to increase the magnitude of the regression parameters with the number of mixture components, in an attempt to obtain a multimodal shape for the corresponding conditional probability function. The α_k are real numbers randomly drawn from a uniform distribution over the interval (0, 1], then normalized to sum to 1.
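As an illustration of this generative process, the following Python sketch draws values of a child variable from the mixture of linear-Gaussian models of Equation (14). The exact intervals from which the β and σ parameters are drawn are partly unreadable in our copy of the text, so the ranges below are illustrative assumptions, as are all the names used.

    import numpy as np

    def sample_mixture_of_linear_models(parents, K, n_cases, rng):
        """Draw n_cases values of a child whose conditional density is a
        K-component mixture of linear-Gaussian models (Equation (14)).
        `parents` is an (n_cases, p) array of already-sampled parent values."""
        p = parents.shape[1]
        alpha = rng.uniform(0.0, 1.0, size=K)
        alpha /= alpha.sum()                              # mixing weights, sum to 1
        beta = rng.uniform(1.0, K + 1.0, size=(K, p + 1)) # illustrative range only
        sigma = rng.uniform(0.1, 1.0, size=K)             # illustrative range only
        comp = rng.choice(K, size=n_cases, p=alpha)
        mu = beta[comp, 0] + np.einsum('ij,ij->i', parents, beta[comp, 1:])
        return mu + sigma[comp] * rng.standard_normal(n_cases)

    # usage: rng = np.random.default_rng(0); x = sample_mixture_of_linear_models(parent_data, 3, 300, rng)

Increasing the spread of the β coefficients with K is what produces the multimodal conditional shapes described above.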
Several simulations
were run for different
combinations
of parameter
settings. In particular : i) the number n1r of parents of y in the generating synthetic network , was varied from 1 to 4; ii ) the number K of mixture
components used in Equation (14) was varied from 3 to 7 (Note: this is the number
of linear models included
in the mixtures
used to generate the
database ) ; and iii ) the number of bins used in the discretization was either 2 or 3. Furthermore , in the algorithm A2 , the strategy of order selection for
the mixture density network (MDN ) model was either hold-out or BIG (see Section 4) . Finally , we chose the maximum admissible MDN model order
(referred to as Kmax in Section 4) to be 5. That is, the best model order was selected from the range 1, . . . , 5. Finally , we ran simulations
with
datasets
of different cardinality . In particular , we used datasets of cardinality 600 , and
300,
900 .
For each parameter setting , both algorithms were run five times , and in each run 20% of the database cases were randomly selected as test set, and the remaining 80% of the database cases were used for training . We ran several simulations
, whereby a simulation
consists of the follow -
ing steps :
- a synthetic Bayesian network B_S is generated as described above;
- a database D of cases is generated from B_S by Markov simulation;
- the two algorithms A1 and A2 are applied to the database D, and relevant statistics on the algorithms' performance are collected.
The collected statistics for the two algorithms were then compared by means of standard statistical tests. In particular, for the continuous statistics, namely the MSE and the LS, we used a simple t-test, and the Welch-modified two-sample t-test for samples with unequal variance. For the discrete statistics, namely the number of arcs added and omitted, we used a median test (the Wilcoxon test).

5.2.
RESULTS
Figure 3 and Figure 4 summarize the results of the simulations with real and synthetic databases, respectively . As a general guideline , for each discrete measure (such as the number of arcs added or omitted ) we report the 3-tuple
(min, median, max) . For each continuous measure (such as the log-score) we report
mean and standard
deviation .
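A small helper that produces exactly these two kinds of summaries over a set of simulation runs (the function name and interface are ours):

    import numpy as np

    def summarize_runs(values, discrete=False):
        """(min, median, max) for discrete measures such as arc counts;
        (mean, standard deviation) for continuous measures such as MSE or LS."""
        v = np.asarray(values, dtype=float)
        if discrete:
            return float(v.min()), float(np.median(v)), float(v.max())
        return float(v.mean()), float(v.std(ddof=1))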
    DB         P1           P2            +            -
    auto-mpg   (2, 2, 2)    (2, 2.5, 4)   (0, 1, 2)    (0, 0, 1)
    abalone    (3, 4, 4)    (0, 3, 4)     (0, 1, 2)    (1, 2, 4)

    DB         MSE1          MSE2          LS1           LS2           time-ratio
    auto-mpg   0.13 (0.01)   0.14 (0.02)   0.25 (0.04)   0.28 (0.07)   40 (5)
    abalone    0.92 (0.06)   1.09 (0.3)    1.31 (0.06)   1.41 (0.14)   42 (12)
Figure 3. Comparison of algorithms A1 and A2 on real databases. The first table reports statistics about the structural differences of the models learned by the two algorithms. In particular, P1 and P2 are the numbers of parents of the response variable discovered by the two algorithms A1 and A2, respectively. The numbers of arcs added (+) and omitted (-) by A2 with respect to A1 are also shown. For each measure, the 3-tuple (min, median, max) is shown. The second table reports mean and standard deviation of the mean square error (MSE), of the log-score (LS), and of the ratio of the computational time of A2 to the computational time of A1 (time-ratio).
With regard to the performance of the two algorithms when coupled with the two alternative MDN model selection criteria (namely, the BIC-based model selection and the "hold-out" model selection), neither of the two criteria is significantly superior. Therefore, we report the results corresponding to the use of the BIC-based model selection only. In Figure 3 we present the results of the comparison of the two learning algorithms based on the real databases. In particular we report: the number of parents P1 and P2 discovered by the two algorithms A1 and A2; the corresponding mean square error MSE and log-score LS; and the ratio of the computational time for A2 to the computational time for A1, denoted with time-ratio.
In Figure 4 we present the results of the comparison of the two learning algorithms based on the simulated databases. In particular , the first table
of Figure 4 reports the number of arcs added (+ ) and the number of arcs omitted (- ) by each algorithm with respect to the gold standard Bayesian network GS (i .e., the synthetic BN used to generate the database ). It also reports the number of arcs added (+ ) and omitted (- ) by algorithm A2 with respect to algorithm AI . The second table of Figure 4 reports the measures MSE and LS for the two algorithms Al and A2 , and the time -ratio . Notice that the statistics shown in Figure 4 are computed over different
    # of cases   GS vs A1: +   GS vs A1: -   GS vs A2: +   GS vs A2: -   A1 vs A2: +   A1 vs A2: -
    300          (0, 0, 2)     (0, 1, 4)     (0, 0, 2)     (0, 1, 3)     (0, 0, 2)     (0, 0, 2)
    600          (0, 0, 1)     (0, 0.5, 3)   (0, 0, 3)     (0, 1, 3)     (0, 0, 3)     (0, 0, 2)
    900          (0, 0, 1)     (0, 0, 3)     (0, 1, 3)     (0, 0, 3)     (0, 1, 3)     (0, 0, 2)

    # of cases   MSE1          MSE2          LS1           LS2           time-ratio
    300          0.72 (0.33)   0.73 (0.33)   0.93 (0.35)   0.96 (0.33)   29 (7)
    600          0.73 (0.32)   0.74 (0.33)   0.77 (0.34)   0.80 (0.37)   36 (10)
    900          0.78 (0.30)   0.78 (0.32)   0.91 (0.27)   0.92 (0.28)   35 (8)
Figure 4. Comparison of algorithms A1 and A2 on simulated databases. In the first table, the comparison is in terms of the structural differences of the discovered networks; each entry reports the 3-tuple (min, median, max). In the second table, the comparison is in terms of predictive accuracy; each entry reports mean and standard deviation of the quantities MSE and LS. It also reports the time-ratio, given by the ratio of the computational time of A2 to the computational time of A1.
settings of the parameters determined by the experimental design described in the previous section.
The statistical analysis of the results reported in Figure 3 and Figure 4 shows that the difference between the two algorithms in terms of prediction accuracy, measured by both the mean square error and the log-score, is not statistically significant (p > .01), either for the real databases or for the simulated databases, while the difference in terms of computational time, measured by the ratio of the computational time of A2 to the computational time of A1, is statistically significant (p < .01). With regard to the structure of the discovered BNs, the two algorithms select a comparable number of parents for the response variable, and on the simulated databases both algorithms recover networks close to the gold standard (remember that the gold standard is the BN used to generate the simulated databases), with algorithm A2 showing a larger variability in the discovered structures than algorithm A1. An unexpected result is that the prediction accuracy of both algorithms decreased when using the datasets of 900 cases with respect to the datasets of 600 cases. The analysis of the experimental design and of the collected statistics suggests that
this is due to an anomaly from sampling. However, the point warrants further verification by testing the algorithms on larger datasets.
5.3. DISCUSSION

The results shown in Figures 3 and 4 support the hypothesis that discretization of continuous variables does not decrease the accuracy of recovering the structure of BNs from data. They also show that using discretized continuous variables to construct a BN structure (algorithm A1) is significantly faster (by a factor ranging from about 30 to 40) than using untransformed continuous variables (algorithm A2). Also, the predictions based on A1 are at least as accurate as (and often more accurate than) the predictions based on A2.
Another important aspect differentiating the two learning methods is the relative variability of the results for algorithm A2 compared with the results for algorithm A1, especially with regard to the structure of the learned models. In Figures 3 and 4 the number of parents of the class variable discovered by algorithm A1 over multiple simulations remains basically constant (e.g., 2 parents in the database AUTO-MPG, 4 parents in the database ABALONE). This is not true for algorithm A2, where the difference between the minimum and maximum number of arcs discovered is quite high (e.g., when applied to the database ABALONE, A2 discovers a minimum of 0 parents and a maximum of 4 parents). These results suggest that the estimates based on the ANN-based scoring metric are not very stable, probably due to the tendency of ANN-based search to get stuck in local maxima in the search space.

6. Conclusions
In this paper, we presented a method for learning hybrid BNs, defined as BNs containing both continuous and discrete variables. The method is based on the definition of a scoring metric that makes use of artificial neural networks as probability estimators. The use of the ANN-based scoring metric allows us to search the space of BN structures without the need for discretizing the continuous variables. We compared this method to the alternative of learning the BN structure based on discretized data. The main purpose of this work was to test whether discretization would or would not degrade the accuracy of the discovered BN structure and of the parameter estimation. The experimental results presented in this paper suggest that discretization of variables permits the rapid construction of relatively high fidelity Bayesian networks when compared to a much slower method that uses continuous variables. These results do not of course rule out the possibility that we can develop faster and more accurate continu-
ous variable learning methods than the one investigated here. However, the results do lend support to discretization as a viable method for addressing the problem of learning hybrid BNs.

Acknowledgments
We thank Chris Bishop and Moises Goldszmidt for their useful comments on a preliminary version of this manuscript. This work was funded by grant IRI-9509792 from the National Science Foundation.
References
1. J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 1997. To appear.
2. C. Bishop. Mixture density networks. Technical Report NCRG/4288, Neural Computing Research Group, Department of Computer Science, Aston University, Birmingham B4 7ET, U.K., February 1994.
3. C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
4. R. Bouckaert. Properties of Bayesian belief network learning algorithms. In Proceedings of the 10th Conference on Uncertainty in AI, pages 102-109, San Francisco, California, 1994. Morgan Kaufmann.
5. J. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neuro-computing: Algorithms, Architectures and Applications. Springer-Verlag, 1989.
6. W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the 7th Conference on Uncertainty in AI, pages 52-60, 1991.
7. W. L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(3), 1996.
8. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
9. D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. In Proceedings of the 5th Workshop on Artificial Intelligence and Statistics, pages 112-128, January 1995.
10. D. M. Chickering and D. Heckerman. Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the 12th Conference on Uncertainty in AI, 1996.
11. G. F. Cooper. NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge. Technical Report HPP-84-48, Dept. of Computer Science, Stanford University, Stanford, California, 1984.
12. G. F. Cooper and E. Herskovits. A Bayesian method for constructing Bayesian belief networks from databases. In Proceedings of the 7th Conference of Uncertainty in AI, pages 86-94, Los Angeles, CA, 1991.
13. G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
14. A. Dawid. Present position and potential developments: Some personal views. Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society A, 147:278-292, 1984.
15. A. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In J. M. Bernardo et al., editors, Bayesian Statistics 4, pages 109-125. Oxford University Press, 1992.
16. M. Druzdzel, L. C. van der Gaag, M. Henrion, and F. Jensen, editors. Building probabilistic networks: where do the numbers come from?, IJCAI-95 Workshop, Montreal, Quebec, 1995.
17. B. Everitt and D. Hand. Finite Mixture Distributions. Chapman and Hall, 1981.
18. U. Fayyad and R. Uthurusamy, editors. Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Quebec, 1995. AAAI Press.
19. N. Friedman and M. Goldszmidt. Discretization of continuous attributes while learning Bayesian networks. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning, pages 157-165, 1996.
20. D. Geiger and D. Heckerman. Learning Gaussian networks. In R. Lopez de Mantaras and D. Poole, editors, Proceedings of the 10th Conference of Uncertainty in AI, San Francisco, California, 1994. Morgan Kaufmann.
21. D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed networks with hidden variables. Technical Report MSR-TR-96-07, Microsoft Research, May 1996.
22. W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
23. D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
24. D. Madigan, S. A. Andersson, M. D. Perlman, and C. T. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Communications in Statistics - Theory and Methods, 25, 1996.
25. D. Madigan, A. E. Raftery, C. T. Volinsky, and J. A. Hoeting. Bayesian model averaging. In AAAI Workshop on Integrating Multiple Learned Models, 1996.
26. C. Merz and P. Murphy. Machine learning repository. University of California, Irvine, Department of Information and Computer Science, 1996. http://www.ics.uci.edu/mlearn/MLRepository.html.
27. M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993.
28. S. Monti and G. F. Cooper. Learning Bayesian belief networks with neural network estimators. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9: Proceedings of the 1996 Conference, 1997.
29. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1988.
30. A. E. Raftery. Bayesian model selection in social research (with discussion). Sociological Methodology, pages 111-196, 1995.
31. A. E. Raftery. Hypothesis testing and model selection. In Gilks et al. [22], chapter 10, pages 163-188.
32. M. Richard and R. Lippman. Neural network classifiers estimate Bayesian a-posteriori probabilities. Neural Computation, 3:461-483, 1991.
33. C. Robert. Mixtures of distributions: Inference and estimation. In Gilks et al. [22], chapter 24, pages 441-464.
34. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
35. R. Shachter. Intelligent probabilistic inference. In L. N. Kanal and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence 1, pages 371-382, Amsterdam, North-Holland, 1986.
36. D. Spiegelhalter, A. Dawid, S. Lauritzen, and R. Cowell. Bayesian analysis in expert systems. Statistical Science, 8(3):219-283, 1993.
37. B. Thiesson. Accelerated quantification of Bayesian networks with incomplete data. In Fayyad and Uthurusamy [18], pages 306-311.
38. D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
A MEAN FIELD LEARNING ALGORITHM FOR UNSUPERVISED NEURAL NETWORKS
LAWRENCE SAUL
AT&T Labs - Research, 180 Park Ave D-130, Florham Park, NJ 07932
AND
MICHAEL JORDAN
Massachusetts Institute of Technology, Center for Biological and Computational Learning, 79 Amherst Street, E10-034D, Cambridge, MA 02139

Abstract. We introduce a learning algorithm for unsupervised neural networks based on ideas from statistical mechanics. The algorithm is derived from a mean field approximation for large, layered sigmoid belief networks. We show how to (approximately) infer the statistics of these networks without resort to sampling. This is done by solving the mean field equations, which relate the statistics of each unit to those of its Markov blanket. Using these statistics as target values, the weights in the network are adapted by a local delta rule. We evaluate the strengths and weaknesses of these networks for problems in statistical pattern recognition.
1. Introduction

Multilayer neural networks trained by backpropagation provide a versatile framework for statistical pattern recognition. They are popular for many reasons, including the simplicity of the learning rule and the potential for discovering hidden, distributed representations of the problem space. Nevertheless, there are many issues that are difficult to address in this framework. These include the handling of missing data, the statistical interpretation
of hidden units , and the problem of unsupervised learning , where there are no explicit error signals . One way to handle these problems is to view these networks as probabilistic models . This leads one to consider the units in the network as random variables , whose statistics are encoded in a joint probability distribution . The learning problem , originally one of function approximation , now becomes one of density estimation under a latent variable model ; the objective function is the log-likelihood of the training data . The probabilis tic semantics in these networks allow one to infer target values for hidden units , even in an unsupervised setting . The Boltzmann machine [l ] was the first neural network to be endowed with probabilistic semantics . It has a simple Hebb-like learning rule and a fully probabilistic interpretation as a Markov random field . A serious problem for Boltzmann machines, however, is computing the statistics that appear in the learning rule . In general , one has to rely on approximate methods , such as Gibbs sampling or mean field theory [2] , to estimate these statistics ; exact calculations are not tractable for layered networks . Experience has shown , however , that sampling methods are too slow , and mean field approximations too impoverished [3] , to be used in this way. A different approach has been to recast neural networks as layered belief networks [4] . These networks have a fully probabilistic interpretation as directed graphical models [5, 6] . They can also be viewed as top-down generative models for the data that is encoded by the units in the bot tom layer [7, 8, 9] . Though it remains difficult to compute the statistics of the hidden units , the directionality of belief networks confers an impor tant advantage . In these networks one can derive a simple lower bound on the likelihood and develop learning rules based on maximizing this lower bound . The Helmholtz machine [7, 8] was the first neural network to put this idea into practice . It uses a fast , bottom -up recognition model to compute the statistics of the hidden units and a simple stochastic learning rule , known as wake-sleep, to adapt the weights . The tradeoff for this simplicity is that the recognition model cannot handle missing data or support certain types of reasoning , such as explaining away[5] , that rely on top -down and bottom -up processing. In this paper we consider an algorithm based on ideas from statistical mechanics . Our lower bound is derived from a mean field approximation for sigmoid belief networks [4] . The original derivation [lO] of this approx imation made no restrictions on the network architecture or the location of visible units . The purpose of the current paper is to tailor the approx imation to networks that represent hierarchical generative models. These are multilayer networks whose visible units occur in the bottom layer and
whose topmost layers contain large numbers of hidden units . The mean field approximation that emerges from this specialization is interesting in its own right . The mean field equations, derived by maxi mizing the lower bound on the log-likelihood , relate the statistics of each unit to those of its Markov blanket . Once estimated , these statistics are used to fill in target values for hidden units . The learning algorithm adapts the weights in the network by a local delta rule . Compact and intelligi ble, the approximation provides an attractive computational framework for probabilistic modeling in layered belief networks . It also represents a vi able alternative to sampling , which has been the dominant paradigm for inference and learning in large belief networks . While this paper builds on previous work , we have tried to keep it self-contained . The organization of the paper is as follows . In section 2, we examine the modeling problem for unsupervised networks and give a succinct statement of the learning algorithm . (A full derivation of the mean field approximation is given in the appendix .) In section 3, we aBsessthe strengths and weaknesses of these networks based on experiments with handwritten digits . Finally , in section 4, we present our conclusions , as well as some directions for future research. 2.
Generative models
Suppose we are given a large sample of binary (0/1) vectors, then asked to model the process by which these vectors were generated. A multilayer network (see Figure 1) can be used to parameterize a generative model of the data in the following way. Let S_i^l denote the ith unit in the lth layer of the network, h_i^l its bias, and J_ij^{l-1} the weights that feed into this unit from the layer above. We imagine that each unit represents a binary random variable whose probability of activation, in the data-generating process, is conditioned on the units in the layer above. Thus we have:
    P(S_i^l = 1 \mid S^{l-1}) = \sigma\Bigl( \sum_j J_{ij}^{l-1} S_j^{l-1} + h_i^l \Bigr) ,    (1)
where σ(z) = [1 + e^{-z}]^{-1} is the sigmoid function. We denote by σ_i^l the squashed sum of inputs that appears on the right hand side of eq. (1). The joint distribution over all the units in the network is given by:
    P(S) = \prod_{li} \bigl( \sigma_i^l \bigr)^{S_i^l} \bigl( 1 - \sigma_i^l \bigr)^{1 - S_i^l} .    (2)
A neural network, endowed with probabilistic semantics in this way, is known as a sigmoid belief network [4]. Layered belief networks were proposed as hierarchical generative models by Hinton et al [7].
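Because the model of eqs. (1)-(2) is a top-down generative process, sampling from it is straightforward. The following sketch is ours, not the authors'; it assumes `weights[l]` maps layer l to layer l+1 (the top layer is l = 0) and `biases[l]` matches layer l.

    import numpy as np

    def sample_sigmoid_belief_net(weights, biases, rng):
        """Top-down ancestral sampling from a layered sigmoid belief network."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        s = (rng.random(len(biases[0])) < sigmoid(biases[0])).astype(float)
        layers = [s]
        for W, b in zip(weights, biases[1:]):
            p = sigmoid(W @ layers[-1] + b)      # Equation (1)
            layers.append((rng.random(len(b)) < p).astype(float))
        return layers                            # last entry = visible (bottom) units

    # usage: rng = np.random.default_rng(0); sample = sample_sigmoid_belief_net(Js, hs, rng)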
Figure 1. A multilayer sigmoid belief network that parameterizes a generative model for the units in the bottom layer.
The goal of unsupervised learning is to model the data by the units in the bottom layer. We shall refer to these units as visible (V) units, since the data vectors provide them with explicit target values. For the other units in the network, the hidden (H) units, appropriate target values must be inferred from the probabilistic semantics encoded in eq. (2).
2.1. MAXIMUM LIKELIHOOD ESTIMATION

The problem of unsupervised learning, in this framework, is essentially one of density estimation. One can parameterize a family of distributions over the visible units by the weights J_ij^l and biases h_i^l of the network; the learning problem is to find weights and biases that make the marginal distribution over the visible units match the statistics of the training data.1 A simple approach to this problem is maximum likelihood estimation, in which one attempts to maximize the log-likelihood of a large sample of training vectors. The marginal likelihood of each data vector is obtained by summing the joint distribution over the hidden units:

    P(V) = \sum_{H} P(H, V) ,    (3)

where by definition P(H, V) = P(S), and H and V denote the hidden and visible units of the network. We can derive local learning rules by computing the gradients of the log-likelihood with respect to the weights and biases. For each data vector, this gives the on-line updates:

    \Delta J_{ij}^{l} \propto E\bigl[ (S_i^{l+1} - \sigma_i^{l+1}) \, S_j^{l} \bigr] ,    (4)
    \Delta h_i^{l} \propto E\bigl[ S_i^{l} - \sigma_i^{l} \bigr] ,    (5)
1 For simplicity of exposition, we do not consider forms of regularization (e.g., penalized likelihoods, cross-validation) that may be necessary to prevent overfitting.
where E[...] denotes an expectation with respect to the conditional distribution P(H|V). Note that the updates take the form of a delta rule, with unit activations σ_i^l being matched to target values S_i^l. Many authors [4, 18] have noted the associative, error-correcting nature of gradient-based learning rules in belief networks.
2.2. MEAN FIELD LEARNING

In general, it is intractable [11, 12] to calculate the likelihood in eq. (3) or the statistics in eqs. (4-5). It is also time-consuming to estimate them by sampling from P(H|V). One way to proceed is based on the following idea [7]. Suppose we have an approximate distribution, Q(H|V) ~ P(H|V). Using Jensen's inequality, we can form a lower bound on the log-likelihood from:
    \ln P(V) \geq \sum_{H} Q(H|V) \, \ln\!\left[ \frac{P(H, V)}{Q(H|V)} \right] .    (6)
If this bound is eaEYto compute , then we can derive learning rules baEed on maximizing the bound . Though one cannot guarantee that such learn ing rules always increase the actual likelihood , they provide an efficient alternative to implementing the learning rules in eqs. (4- 5) . Our choice of Q (H IV ) is motivated by ideas from statistical mechanics. The mean field approximation [13] is a general method for estimating the statistics of large numbers of correlated variables . The starting point of the mean field approximation is to consider factorized distributions of the form :
    Q(H|V) = \prod_{li \in H} \bigl( \mu_i^l \bigr)^{S_i^l} \bigl( 1 - \mu_i^l \bigr)^{1 - S_i^l} .    (7)
The parameters 1.L1are the mean values of sf under the distribution Q (HIV ) , and they are chosen to maximize the lower bound in eq. (6) . A full derivation of the mean field theory for these networks , starting from eqs. (6) and (7) , is given in the appendix . Our goal in this section , however , is to give a succinct statement of the learning algorithm . In what follows , we therefore present only the main results , along with a number of useful intuitions . For these networks , the mean field approximation works by keeping track of two parameters , { 1.L1,~f } for each unit in the network . Roughly speaking , these parameters are stored as approximations to the true statis tics of the hidden units : 1.L1~ E[Sf ] approximates the mean of sf , while ~f ~ E [O'f] approximates the average value of the squashed sum of inputs . Though only the first of these appears explicitly in eq. (7) , it turns out that both are needed to compute a lower bound on the log-likelihood . The values of { J.Lf, ~f } depend on the states of the visible units , ~ well ~ the weights and biases of the network . They are computed by solving the mean
field equations :
    \mu_i^l = \sigma\Bigl[ \sum_j J_{ij}^{l-1} \mu_j^{l-1} + h_i^l + \sum_j J_{ji}^{l} \bigl( \mu_j^{l+1} - \xi_j^{l+1} \bigr) - \tfrac{1}{2} \bigl( 1 - 2\mu_i^l \bigr) \sum_j \bigl( J_{ji}^{l} \bigr)^2 \xi_j^{l+1} \bigl( 1 - \xi_j^{l+1} \bigr) \Bigr] ,    (8)

    \xi_i^l = \sigma\Bigl[ \sum_j J_{ij}^{l-1} \mu_j^{l-1} + h_i^l + \tfrac{1}{2} \bigl( 1 - 2\xi_i^l \bigr) \sum_j \bigl( J_{ij}^{l-1} \bigr)^2 \mu_j^{l-1} \bigl( 1 - \mu_j^{l-1} \bigr) \Bigr] .    (9)
These equations couple the parameters of each unit to those in adjacent layers . The terms inside the brackets can be viewed as effective influences (or " mean fields " ) on each unit in the network . The reader will note that sigmoid belief networks have twice as many mean field parameters as their undirected counterparts [2] . For this we can offer the following intuition . Whereas the parameters JL~ are determined by top -down and bottom -up influences, the parameters ~f are determined only by top -down influences . The distinction - essentially , one between parents and children - is only meaningful for directed graphical models. The procedure for solving these equations is fairly straightforward . Ini tial guesses for { JLf, ~f } are refined by alternating passes through the network , in which units are updated one layer at a time . We alternate these passes in the bottom -up and top -down directions so that information is propagated from the visible units to the hidden units , and vice versa. The visible units remain clamped to their target values throughout this process. Further details are given in the appendix . The learning rules for these networks are designed to maximize the bound in eq. (6) . An expression for this bound , in terms of the weights and biases of the network , is derived in the appendix ; see eq. (24) . Gradient ascent in Jfj and hf leads to the learning rules:
    \Delta J_{ij}^{l} \propto \bigl( \mu_i^{l+1} - \xi_i^{l+1} \bigr) \mu_j^{l} - J_{ij}^{l} \, \xi_i^{l+1} \bigl( 1 - \xi_i^{l+1} \bigr) \mu_j^{l} \bigl( 1 - \mu_j^{l} \bigr) ,    (10)
    \Delta h_i^{l} \propto \mu_i^{l} - \xi_i^{l} .    (11)
Comparing these learning rules to eqs. (4- 5) , we see that the mean field parameters fill in for the statistics of sf and o-f . This is, of course, what makes the learning algorithm tractable . Whereas the statistics of P (HIV ) cannot be efficiently computed , the parameters { J.Lf, ~f } can be found by solving the mean field equations . We obtain a simple on-line learning algorithm by solving the mean field equations for each data vector in the training set, then adjusting the weights by the learning rules , eqs. (10) and (11) . The reader may notice that the rightmost term of eq. ( 10) has no counterpart in eq. (4) . This term , a regularizer induced by the mean field approximation , causes Jfj to be decayed according to the mean-field statistics
of σ_i^{l+1} and S_j^l. In particular, the weight decay is suppressed if either ξ_i^{l+1} or μ_j^l is saturated near zero or one; in effect, weights between highly correlated units are burned in to their current values.
digits to evaluate the strengths
and weaknessesof these networks. The database[16] was constructed from NIST Special Databases 1 and 3. The examples in this database were deslanted , downsampled , and thresholded to create 10 X 10 binary images. There were a total of 60000 examples for training and 10000 for testing ; these were divided roughly equally among the ten digits ZEROto NINE. Our
experiments had several goals: (i) to evaluate the speed and performance of the mean field learning algorithm ; (ii) to assessthe quality of multilayer networks as generative models; (iii ) to seewhether classifiersbasedon generative models work in high dimensions; and (iv) to test the robustnessof these classifiers with respect to missing data . We used the mean field algorithm from the previous section to learn generative models for each digit . The generative models were parameterized by four -layer networks with 4 x 12 x 36 x 100 architectures . Each network was
trained by nine passes2through the training examples. Figure 2 shows a typical plot of how the scorecomputed from eq. (6) increasedduring train ing . To evaluate the discriminative
capabilities
of these models , we trained
ten networks , one for each digit , then used these networks to classify the images in the test set . The test images were labeled by whichever
network
assigned them the highest likelihood score, computed from eq. (6) . Each of these experiments required about nineteen CPO hours on an SGI R10000 , or roughly 0.12 seconds of processing time per image per network . We con-
ducted five such experiments; the error rates were 4.9%(x2 ), 5.1%(x2 ), and 5.2% . By comparison , the error rates3 of several k-nearest neighbor al-
goriths were: 6.3% (k == 1), 5.8% (k == 3), 5.5% (k == 5), 5.4% (k == 7) , and 5.5% (k == 9) . These results show that the networks have learned noisy but essentially accurate models of each digit class. This is confirmed by look ing at images sampled from the generative model of each network ; some of these are shown in figure 3. One advantage of generative models for classification is the seamless handling of missing data . Inference in this case is simply performed on the 2The first pass through the training examples was used to initialize the biases of the bottom layer ; the rest were used for learning . The learning rate followed a fixed schedule : 0.02 for four epochs and 0 .005 for four epochs . 3All the error rates in this paper apply to experiments with 10 x 10 binary images .
The best backpropagation networks(16) , which exploit prior knowledge and operate on 20 x 20 greyscale images , can obtain error rates less than one percent .
Figure 2. Plot of the lower bound on the log-likelihood , averagedover training patterns , versus the number of epochs, for a 4 x 12 x 36 x 100 network trained on the dig-it TWO. The score has been normalized
by 100 x ln 2.

Figure 3. Synthetic images sampled from each digit's generative model.
pruned network in which units corresponding to missing pixels have been removed (i .e., marginalized ) . We experimented by randomly labeling a certain fraction , f , of pixels in the test set as missing , then measuring the number of classification errors versus f . The solid line in figure 4 shows a plot of this curve for one of the mean field classifiers. The overall perfor mance degrades gradually from 5% error at f == a to 12% error at f == 0.5. One can also compare the mean field networks to other types of generative models . The simplest of these is a mixture model in which the pixel values (within each mixture component ) are conditionall .y distributed as independent binary random variables . Models of this type can be trained by an Expectation -Maximization (EM ) algorithm [19] for maximum likelihood estimation . Classification via mixture models was investigated in a separate set of experiments . Each experiment consisted of training ten mixture models , one for each digit , then using the mixture models to classify the
Figure 4. Plot of classification error rate versus fraction of missing pixels in the test set. The solid curve gives the results for the mean field classifier; the dashed curve, for the mixture model classifier.
digits in the test set. The mixture models had forty mixture components and were trained by ten iterations of EM. The classification error rates in five experiments were 5.9% (x3) and 6.2% (x2); the robustness of the best classifier to missing data is shown by the dashed line in figure 4. Note that while the mixture models had roughly the same number of free parameters as the layered networks, the error rates were generally higher. These results suggest that hierarchical generative models, though more difficult to train, may have representational advantages over mixture models.

4. Discussion
The trademarks of neural computation are simple learning rules , local message-passing, and hierarchical distributed representations of the prob lem space. The backpropagation algorithm for multilayer networks showed that many supervised learning problems were amenable to this style of computation . It remains a challenge to find an unsupervised learning algorithm with the same widespread potential . In this paper we have developed a mean field algorithm for unsuper vised neural networks . The algorithm captures many of the elements of neural computation in a sound , probabilistic framework . Information is propagated by local message-passing, and the learning rule- derived from a lower bound on the log-likelihood - combines delta -rule adaptation with weight decay and burn -in . All these features demonstrate the advantages of tailoring a mean field approximation to the properties of layered networks . It is worth comparing our approach to methods based on Gibbs sampling [4] . One advantage of the mean field approximation is that it enables one to compute a lower bound on the marginal likelihood , P (V ) . Estimat -
550
LAWRENCE SAULANDMICHAEL JORDAN
ing these likelihoods by sampling is not so straightforward ; indeed , it is considerably harder than estimating the statistics of individual units . In a recent study , Frey et al [15] reported that learning by Gibbs sampling Wag an extremely slow process for sigmoid belief networks . Mean field algorithms are evolving . The algorithm in this paper is considerably faster and easier to implement than our previous one[10, 15] . There are several important areas for future research. Currently , the overall computation time is dominated by the iterative solution of the mean field equations . It may be possible to reduce processing times by fine tun ing the number of mean field updates or by training a feed-forward network (i .e., a bottom -up recognition model [7, 8] ) to initialize the mean field parameters close to solutions of eqs. (8- 9) . In the current implementation , we found the processing times (per image) to scale linearly with the number of weights in the network . An interesting question is whether mean field algorithms can support massively parallel implementations . With better algorithms come better architectures . There are many possible elaborations on the use of layered belief TLetworksas hierarchical generative models . Continuous -valued units , as opposed to binary -valued ones, would help to smooth the output of the generative model . We have not exploited any sort of local connectivity between layers , although this structure is known to be helpful in supervised learning [16] . An important consider ation is how to incorporate prior knowledge about the data (e.g., transla tion / rotation invariance [20] ) into the network . Finally , the synthetic images in figure 3 reveal an inherent weakness of top -down generative models ; while these models require an element of stochasticity to model the variability in the data , they lack a feedback mechanism (i .e., relaxation [21]) to clean up noisy pixels . These extensions and others will be necessary to realize the full potential of unsupervised neural networks . ACKNOWLEDGEMENTS The authors acknowledge useful discussions with T . Jaakkola , H . Seung, P. Dayan , G . Hinton , and Y . LeCun and thank the anonymous reviewers for many helpful suggestions .
A . Mean field approximation In this appendix we derive the mean field approximation for large , layered sigmoid belief networks . Starting from the factorized distribution for Q (HIV ) , eq. (7) , our goal is to maximize the lower bound on the log-
A MEAN FIELD ALGORITHM FOR UNSUPERVISED LEARNING
551
likelihood , eq. (6) . This bound consists of the difference between two terms :
InP (V ) ~ [ - ~ Q (HIV ) lnQ (HIV )] - [ - ~
Q (HIV ) lnp (H , V )] . (12) The first term is simply the entropy of the mean field distribution . Because Q (HIV ) is fully factorized , the entropy is given by :
- LHQ(HIV )InQ(HIV ) = - ilEH L [JlfInJLf +(1- JLf )In(l - J1f )J
(13)
We identify the second term in eq. (12) as (minus ) the mean field energy ; the name arises from interpreting P (H , V ) == eln P(H,V) as a Boltzmann distribution . Unlike the entropy , the energy term in eq. (12) is not so straight forward . The difficulty in evaluating the energy stems from the form of the joint distribution , eq. (2) . To see this , let
z~ 1.==~ LI J~ 1.)~lS~ )-l + h~ 1. )
(14)
denote theweighted sumof inputsintounitsf. Fromeqs.(1) and(2), we canwritethejoint distribution in sigmoid beliefnetworks as: InP(S)
-
-
- Lfi {sf In[1+ e-zf] + (1- Sf) In[1+ ezf ]}
(15)
Lii {Sfz ! - In[1+ezf ]} .
(16)
The difficulty in evaluating the mean field energy is the logarithm on the right hand side of eq. (16) . This term makes it impossible to perform the averaging of In P (S) in closed form , even for the simple distribution , eq. (7) . Clearly , another approximation is needed to evaluate (In [1 + ezfJ) , averaged over the distribution , Q (HIV ) . We can make progress by studying the sum of inputs , zl , as a random variable in its own ri .ght. Under the distribu tion Q (HIV ) , the right hand side of eq. (14) is a weighted sum of indepen dent random variables with means J.L1- 1 and variances J.L1- 1(1 - J.L1- 1) . The number of terms in this sum is equal to the number of hidden units in the (f - l )th layer of the network . In large networks , we expect the statistics of this sum- or more precisely, the mean field distribution Q (zfIV )- to be governed by a central limit theorem . In other words , to a very good approx imation , Q (zf IV ) assumes a normal distribution with mean and variance :
(z~ '") == ~ L..., J.~ '"J~1,/J .,.J~-1+ h~ '", J
(17)
LAWRENCE SAUL AND MICHAEL JORDAN
552
(18)
((8zf)2) = ~J (Jfj-l)2J .L~-1(1- J.L~-1).
In what follows, we will use the approximation that Q(zfIV ) is Gaussian to simplify the mean field theory for sigmoid belief networks. The approximation is well suited to layered networks where each unit receivesa large number of inputs from the (hidden) units in the preceding layer. The asymptotic form of Q (zfIV ) and the logarithm term in eq. (16) motivate us to consider the following lemma. Let z denote a Gaussian random variable with mean (z) and variance (8z2) , and consider the expected value, (In[I + eZ]) . For any real number ~, we can form the upper bound[22]:
-~Z(l + eZ)]), (In[1 + eZ]) == (In[e~Ze
(19) (20) (21)
== ~(z) + (In[e-~Z+ e(l-~)Z]), ::; ~(Z) + In(e-~Z+ e(l-~)Z),
where the last line follows from Jensen's inequality . Since z is Gaussian , it is straightforward to perform the averages on the right hand side. This gives . us an upper bound on (In [1 + eZ]) expressed in terms of the mean and varIance :
( In [ l +
In
what
follows
( In [ l
+
that
appear
the
, we
will
ez1 ] ) . Recall
value
side
eZ ] ) ~
of
in of
~
the
~ ~ 2 ( 8z2
use
eq .
mean
field
that
makes
eq . ( 22 ) is
minimized
Eq
.
( 23 ) (z ) 0'
has
and
a
unique
( 8z2 ) ,
~
f -
in
eq . ( 22 ) . We
[(z ) +
! (1 can
it
; we
bound
2~)
5z2 ) ]
understand
as
~ (1 -
in solved is
approximate are
are tight
the the
exact
value
intractable
therefore as
( 22 )
motivated
possible
of
averages
. The
to right
find hand
:
[(z ) +
easily
to
e { Z } + ( 1 - 2 ~ ) { 8z2 } / 2 ] .
these
energy
solution is
[1 +
that
when
0'
In
bound
( 16 )
the
~ =
for
this
from
) +
the
2 ~ ) ( 8z2 ) ]
interval by
guaranteed eq . ( 23 ) as
~
iteration to a self
(23)
.
E ;
in
tighten - consistent
[ 0 , 1 ] . Given fact the
,
the
values iteration
upper
bound
approximation
for computing ~ ~ (a (z)) where z is a Gaussian random variable . To see this , consider the limiting behaviors : (a (z)) - + a ((z)) as (8z2) - + 0 and (a (z)) - + ! as (8z2) - + 00. Eq . (23) captures both these limits and interpo lates smoothly between them for finite (t5z2) . Equipped with the lemma , eq. (22) , we can proceed to deal with the intractable terms in the mean field energy. Unable to compute the average
A MEAN FIELD ALGORITHM FOR UNSUPERVISEDLEARNING 553 over Q(HIV ) exactly, we instead settle for the tightest possible bound. This is done by introducing a new mean field parameter, ~f , for each unit in the network, then substituting ~f and the statistics of zf into eq. (22). Note that these terms appear in eq. (6) with an overall minus sign; thus, to the extent that Q (zi IV ) is well approximated by a Gaussian distribution , the upper bound in eq. (22) translates4 into a lower bound on the log likelihood. Assembling all the terms in eq. (12) , we obtain an objective function for the mean field approximation:
InP(V) ~ +
-
l::: [J1 ,f In/.l1+(1- /.l7)In(1- /.l7)] + l::: ) ifEH ijfJfj-1/.l7/.l~-1(24 l::: /.l1- _ 2l::: 1 (~f)2(Jfj-l)2/.l~-l (1- /.l~-1) if h1 ijf Llin t+'"L.."",J[JtJ ,Ji-I+!.(1 2 -2~t~)(JtJ ,J~-1(1-J1 ,J1-1)]} ~if { 1+eh~ ~~1J1 ~~1)2J1
The mean field parameters are chosen to maximize eq. (24) for different settings of the visible units . Equating their gradients to zero gives the mean field equations , eqs. (8- 9) . Likewise , computing the gradients for J& and h1 gives the learning rules in eqs. (10- 11) . The mean field equations are solved by finding a local maximum of eq. (24) . This can be done in many ways. The strategy we chose was cyclic stepwise ascent- fixing all the parameters except one, then locating the value of that parameter that maximizes eq. (24) . This procedure for solving the mean field equations can be viewed as a sequence of local messagepassing operations that " match " the statistics of each hidden unit to those of its Markov blanket [5] . For the parameters ~f , new values can be found by iterating eq. (9) ; it is straightforward to show that this iteration always leads to an increase in the objective function . On the other hand , iterating eq. (8) for fl1 does not always lead to an increase in eq. (24) ; hence, the optimal values for flf cannot be found in this way. Instead , for each update , one must search the interval 1L1E [0, 1] using some sort of bracketing procedure [17] to find a local maximum in eq. (24) . This is necessary to ensure that the mean field parameters converge to a solution of eqs. (8- 9) .
4Our earlier work [lO] showed how to obtain a strict lower bound on the log likelihood ; i .e., the earlier work made no appeal to a Gaussian approrimation for Q(zfIV ) . For the networks considered here, however, we find the difference between the approrimate bound and the strict bound to be insignificant in practice. Moreover, the current algorithm has advantages in simplicity and interpretability .
554
LAWRENCE SAULAND MICHAELJORDAN
References 1.
D . Ackley , G . Hinton , and T . Sejnowski . A learning algorithm
for Boltzmann
ma -
chines. Cognitive Science 9: 147- 169 (1985). 2. 3.
C . Peterson and J . R . Anderson . A mean field theory learning algorithm for neural networks . Complex Systems 1:995- 1019 ( 1987) . C . Galland . The limitations of deterministic Boltzmann machine learning . Network 4 :355 - 379 .
4 .
learning of belief networks . Artificial Reasoning in Intelligent
Intelligence 56 :71- 113
Systems . Morgan
Kaufmann : San
Mateo , CA ( 1988). 6. S. Lauritzen . Graphical Models. Oxford University Press: Oxford (1996).
II _
II
5.
R . Neal . Connectionist ( 1992) . J . Pearl . Probabilistic
7.
. a N
8.
for unsuper -
vised neural networks. Science 268 :1158- 1161 (1995). P. Dayan , G . Hinton , R . Neal , and R . Zemel . The Helmholtz
machine . Neural Com -
putation 7:889- 904 ( 1995). M . Lewicki and T . Sejnowski . Bayesian unsupervised learning of higher order struc ture . In M . Mozer , M . Jordan , and T . Petsche , eds. Advances in Neural Information
Processing Systems 9: . MIT Press: Cambridge (1996).
10 .
L . Saul , T . Jaakkola , and M . Jordan . Mean field theory for sigmoid belief networks .
. t~
. '). 00 ~ 0 ~
9.
G . Hinton , P. Dayan , B . Frey , and R . Neal . The wake-sleep algorithm
G . Cooper . Computational complexity of probabilistic inference using Bayesian belief networks . Artificial Intelligence 42 :393-405 ( 1990) . P. Dagum and M . Luby . Approximately probabilistic reasoning in Bayesian belief
Journal of Artificial Intelligence Research4:61- 76 (1996). 11 .
. ~ ~
12 . 13 . 14 .
15.
networks is NP-hard. Artificial Intelligence 60 :141- 153 (1993) . G. Parisi. Statistical Field Theory. Addison-Wesley: Redwood City (1988) . J . Hertz , A . Krogh , and R .G . Palmer . Introduction
to the Theory of Neural Com -
putation . Addison-Wesley: Redwood City (1991). B . Frey , G . Hinton , and P. Dayan . Does the wake-sleep algorithm produce good density estimators ? In D . Touretzky , M . Mozer , and M . Hasselmo , eds. Advances in Neural Information Processing Systems 8 :661-667 . MIT Press : Cambridge , MA ( 1996) . Y . LeCun
, L . Jackel
, L . Bottou
, A . Brunot
, C . Cortes
, J . Denker
, H . Drucker
, I.
Guyon , U . Muller , E . Sackinger , P. Simard , and V . Vapnik . Comparison of learning algorithms for handwritten digit recognition . In Proceedings of / CANN '95. W . Press , B . Flannery , S. Teukolsky , and W . Vetterling . Numerical Recipes . Cam -
bridge University Press: Cambridge (1986). S. Russell , J . Binder , D . Koller , and K . Kanazawa . Local learning in probabilistic networks with hidden variables . In Proceedings of I J CA / - 95. A . Dempster , N . Laird , and D . Rubin . 1977. Maximum likelihood from incomplete data via the EM algorithm . Journal of the Royal Statistical Society B39 : 1- 38. P. Simard , Y . LeCun , and J . Denker . Efficient pattern recognition using a new transformation
distance
Neural Information
. In
S . Hanson
, J . Cowan
, and
C . Giles
Processing Systems 5 :50- 58. Morgan
, edge
Advances
in
Kaufmann : San Mateo ,
CA (1993) . S.
Geman
and
D . Geman
Bayesian restoration
. Stochastic
of images . IEEE
relaxation
Transactions
,
Gibbs
distributions
on Pattern
Analysis
,
and
the
and Ma -
chine Intelligence 6:721- 741 ( 1984). H . Seung . Annealed theories of learning . In J .-H . Dh , C . Kwon , and S. Cho , eds. Neural Networks : The Statistical Mechanics Perspective , Proceedings of the CTP -
PRSRI Joint Workshop on Theoretical Physics. World Scientific: Singapore (1995).
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS
PETER
W
. F . SMITH
Department The
of
Social
University
,
Statistics
Southampton
, ,
S09
5NH
UK
.
AND JOEWHITTAKER Department
of
University email
of .. joe
Mathematics Lancaster
. whittaker
and ,
LAl
@ lancaster
4 YF . ac
Statistics ,
UK
, .
. uk
Summary Testing that an edge can be excluded from a graphical Gaussian model is an important step in model fitting and the form of the generalised like lihood ratio test statistic for this hypothesis is well known . Herein the modified profile likelihood test statistic for this hypothesis is obtained in closed form and is shown to be a function of the sample partial correla tion . Related expressions are given for the Wald and the efficient score statistics . Asymptotic expansions of the exact distribution of this correla tion coefficient under the hypothesis of conditional independence are used to compare the adequacy of the chi-squared approximation of these and Fisher 's Z statistics . While no statistic is uniformly best approximated , it is found that the coefficient of the 0 (n - l ) term is invariant to the dimension of the multivariate Normal distribution in the case of the modified profile likelihood and Fisher 's Z but not for the other statistics . This underlines the importance of adjusting test statistics when there are large numbers of variables , and so nuisance parameters in the model . Similar comparisons are effected for the Normal approximation signed square-rooted versions of these statistics .
to the
Keywords .- Asymptotic expansions ; Conditional independence ; Edge exclusion ; Efficient score; Fisher 's Z ; Graphical Gaussian models ; Modified profile likelihood ; Signed square-root tests ; Wald statistic . 555
PETER W. F. SMITH ANDJOEWHITTAKER
556 1. Introduction
Dempster (1972) introduced graphical Gaussian models where the structure
of the
inverse
variance
matrix
, rather
than
the variance
matrix
itself ,
is modelled . The idea is to simplify the joint Normal distribution of p continuous random variables by testing if a particular element Wij of the p by p inverse variance matrix n can be set to zero. The remaining elements are nuisance parameters and the hypothesis is composite . Wermuth (1976) showed that fitting these models is equivalent to testing for conditional in dependence between the corresponding elements of the random vector X .
Speedand Kiiveri (1986) showedthat the test correspondsto testing if the edge connecting the vertices corresponding to Xi and X j in the conditional independence graph can be eliminated . Hence such tests are known as edge
exclusion tests. For an introduction to this material see Lauritzen (1989) or Whittaker
(1990) .
Many graphicalmodelselectionproceduresstart by makingthe (~) single edge exclusion tests , evaluating the (generalised) likelihood ratio statis tic and comparing
it to a chi -squared
distribution
. However , this is only
asymptotically correct , and may be poor , as is the case for models with discrete observations (Kreiner , 1987; Frydenburg and Jensen, 1989) . One
approach taken by Davison, Smith and Whittaker (1991) is to use the exact conditional
distribution
of a test statistic
, where
available
. However
, the
exact conditional test for edge exclusion for the graphical Gaussian case is equivalent
to the unconditional
test , and is based on the square of the sam -
ple partial correlation coefficient whose null distribution is a beta (Davison , Smith and Whittaker , 1991) . Thus in practice the exact test should be used. This statistic is the same as would be derived from a t- test for testing for a zero coefficient in multiple regression . Fisher 's Z transformation is also based on the sample partial correlation coefficient and allows Normal ta bles to be used to a reasonable degree of approximation . It is of interest to assess which
of several competing
best approximated
statistics
has an exact distribution
by the chi-squared over a varying number of nuisance
parameters .
To achieve this aim , explicit expressions for the modified profile likeli hood
ratio , Wald
and
efficient
inverting the information evant
submatrix
. These
score
statistics
are obtained
. This
involves
matrix and calculating the determinant of a reltest
statistics
turn
out
to be functions
of the sam -
ple partial correlation coefficient and it is natural to compare them with Fisher
' s Z transformation
In Section
.
2, after inverting
the information
matrix , the Wald and ef-
ficient score test statistics for excluding a single edge from a graphical Gaussian
model
are constructed
. In Section
3 a test
based
on the modified
558
PETERW. F. SMITHANDJOEWHITTAKER
where a = vec(E) is (a linear function of) the mean value parameter. The observed (and expected) information matrix is
(} 1 (} I = - 8wT U(w) = - -J a. 2 8wT
(3)
The inverse , K , of this information matrix is required to compute the test statistics
.
Consider more generally a linear exponential family model with pdimensional canonical parameter (), canonical statistics t and log-likelihood (}Tt (x ) - K((}) - h (x ) (Barndorff -Nielsen , 1988, p87). The information matrix for the canonical
parameter , Io , can be expressed as ()
If } = ""ijijf T(()) ,
(4)
wherer ==/eK ,(O) is themean -valuemapping . Asthemean -valuemapping is bijective (Barndorff-Nielsen, 1978, pI21),
8T.8TT 8f}=I' 80T I-I - aa(}T T 8--
the identity matrix . Hence the inverse of the information matrix can be computed from
This
(
result
1978
( 5
)
and
corollary
,
is
)
does
in
a
not
appear
well
different
- known
expression
To
( 3
find
)
to
The
the
partial
A
( IO
)
though
a
Amari
(
form
1982
appears
,
1985
in
,
Efron
p106
)
.
One
inverse
K
=
I
-
I
=
for
the
-
l
.
graphical
Gaussian
model
apply
( 6
)
( 5
)
give
K
trix
in
that
IT
to
,
,
(5)
derivatives
=
{
ail
}
of
with
=
the
respect
known (seeGraybill, 1983, as follows:
-
2
~
WTJ
elements
to
Corollary
the
of
-
a
elements
10
1
.
p -
dimensional
of
. 8
. 10
or
its
symmetric
inverse
McCullagh
B
,
ma
=
1987
{
)
brs
and
}
-
are
are
if r = s (airajs+ aisajr ) 1 .f r -.L ~ ={ -- airajs -;- s
.. t,J,r,s= l ,...,p. Hence theinverse information matrixofwcanbeobtained explicitly . So, in a sample ofsizen, thecovariance between themaximum likelihood estimates (mles ) ofanytwoelementsof the inverse variance matrix , asymptotically , is cov (GJijWrs) = .!. n (WirWjS + WisWjr ) .
(7)
EDGEEXCLUSION TESTSFORGRAPHICAL GAUSSIAN MODELS559 Cox
and
Wermuth
for graphical
( 1990 ) obtained
Gaussian
models
the
inverse
by another
of the
method
information
using
a result
matrix of Isserlis
( 1918 ) . Excluding model
the edge connecting
corresponds
alternative
vertices
accepting
the
is H A : W12 unspecified
are nuisance single
to
parameters
edge from
. The
a graphical
1 and 2 in a graphical
null
. The
hypothesis remaining
likelihood
ratio
Gaussian
distinct
test
model
Gaussian
H 0 : CJ )12 =
statistic
O. The
elements
of 0
for excluding
a
is
Tz = - n log ( 1 - ri2Irest ) ,
(8 )
where r121rest is the sample partial correlation of Xl the remainder X3 , . . . , Xp . The latter can be expressed of elements of the inverse variance matrix as
and X2 adjusted for in terms of the mles
-- ( W11W22 -- -- ) - 1/ 2 , r121rest = - W12 see for example efficient
score
2 .1. THE The
, Whittaker test
WALD
Wald
for the null
statistic
Tw
quadratic
approximation a single
asymptotic
( Cox
and
of the
edge from
variance
so leads
Hinkley
the Wald
and
.
of (;)12 from
to the closed
, 1974 , p314 , 323 ) based
likelihood
a graphical
varW12 and
are derived
hypothesis
TEST
excluding The
( 1990 , p189 ) . Below
statistics
(9 )
function Gaussian
equation
at
its
on
maximum
a
, for
model , is WI2 / var (W12) .
( 7 ) is
= ~ (WIIW22 + Wr2 ) , n
( 10 )
form
~
- 2 nw12 WIIW22 + Wf2
-
w
nr2 =
using
( 11 )
( 9 ) above .
2 .2 . THE The
l ~ rest 1 + r121rest
EFFICIENT
efficient
the conditional
score
SCORE test
TEST
Ts ( Cox
distribution
and
Hinkley
, 1974 , p315 , 324 ) is based
of the score statistic
for the interest
parameter
on
560
PETERW. F. SMITHANDJOEWHITTAKER
given the score statistic for the nuisance parameter, evaluated under the null hypothesis. From (2) the score is U(W12 ) = 0:12- 812, where . tilde denotes evaluation under the null hypothesis, with conditional varIance 1 2 1 -n (WIIW22+ W12 )- . (12) Evaluation of these requires estimates of W12 , Wll , W22and 0"12 under the null hypothesis. Under H 0 : W12= 0 the mles of n and ~ are W12 = 0, -- (1 - r121rest 2 ) Wii = Wii ~. = 1, 2 ail = 8ij i , j # 1, 2 -. - rl2lrest a12 = (-WIIW22 - )1/2(1 - r 2 ) + 812. 121rest
(13) (14) (15) (16)
Equation ( 13) restates the null hypothesis . Speed and Kiiveri (1986) showed that for all unconstrained Wij the files of the corresponding a ij are equal to Sij , hence (15). The other two equations are new and their proofs are given in Appendix C . The corollary of interest is the closed form expression for the score Ts
-
)2.W11W22 ... .... (1- r121rest 2 )2 n (-a12- 812 2 nr 12lrest '
(17)
using (16). 3.
The
Modified
Profile
Likelihood
Ratio
Test
The modified profile likelihood function (Barndorff-Nielsen, 1983, 1988), explicitly designed to take account of nuisance parameters , can be analyt ically expressed in the Gaussian context , and is an obvious candidate on which to base a test statistic . A number of authors have developed this work and for linear exponential
families
have calculated
the modified
pro -
file log-likelihood function : equations (10) of Cox and Reid (1987), (5.2) of Davison (1988) and (6) of Pierce and Peters (1992). Taking this as the starting point , the corresponding test statistic , in the canonical parameter isation
, is .
-
IIbbl Tm = Tl + log - . IIbbl
(18)
EDGE
The
EXCLUSION
term
Tt
is
information
parameter
{
12
}
)
and
log
been
used
is
rather
likelihood
and
;
Reid
The
(
the
)
matrix
and
to
Gaussian
model
prove
this
note
that
with
taking
III
The
last
from
p59
term
n
)
to
the
~
equals
The
ratio
on
and
test
is
for
this
-
:
,
the
,
-
p
+
-
P
~
l ) /
:
if
lp
+
,
+
l
.
l
)
I
= =
0
,
,
W
=
( wa
,
by
Wb
ordinary
)
have
modified
see
the
profile
comments
of
determinant
the
information
of
the
matrix
,
then
(
( 3
)
19
)
gives
ale
the
transformation
Theorem
~
3
. 7
;
the
,
result
modified
= =
Muirhead
the
the
where
a
.
mles
the
establishes
that
the
Here
.
is
~
1951
is
.
.
2IJII
This
Ibb
parameterisation
of
is
W12
1992
canonical
PI
( P +
aIkin
lp
so
,
of
section
Ho
( n
-
)
Jacobian
1~
this
Normal
)
the
and
of
statistic
=
2
value
dimensional
prove
( -
right
result
Tm
To
=
absolute
the
2
,
for
determinants
( Deemer
in
main
p -
the
(
18
and
Wb
the
form
(
=
b
the
Peters
evaluate
)
indexed
maximising
explicit
III
To
by
difficult
and
an
( 8
parameter
statistic
from
Pierce
at
561
parameter
interest
test
MODELS
nuisance
analytically
gives
required
graphical
the
is
lemma
information
the
obtained
latter
test
the
indexed
of
those
1987
following
a
into
version
than
directly
Cox
ratio
to
parameter
this
GAUSSIAN
- likelihood
partitioned
nuisance
in
,
GRAPHICAL
corresponding
the
that
FOR
ordinary
W
Note
for
the
submatrix
the
( =
TESTS
~
n
~
1982
,
.
profile
likelihood
underlying
distribution
is
is
1
following
)
log
(
1
-
r
identities
~ 2Irest
)
+
log
( Whittaker
(
1
1990
+
,
r
~ 2Irest
p149
,
)
169
.
( 20
)
are
)
used
..-..
lEI IIbbl
=
IIIIJCaal
and
2
~
=
1
-
r12lrest
,
lEI
where
Kaa
sponding
is
to
Equation
applying
the
the
(
lemma
submatrix
of
interest
18
)
the
inverse
parameter
is
evaluated
19
)
Wa
by
,
firstly
of
the
information
matrix
to
-
,
using
the
first
identity
. --lEI IKaal log ~ l'Ibbl -- (p+1)log~ +logjK:I -(
corre
,
followed
by
give
-
2 I ~ (p+1)log (1- T12lrest )+ ogIKaal
(21)
562
PETER W. F. SMITH AND JOE WHITTAKER
using the secondidentity and noting that the log 2Pterms cancel. The last term on the right is simplified by Kaa = WIIW22+ W?2 as already utilised in the expressions(10) and (12). So
.'"-. Kaa Kaa -
WllW22 + Wr2
-.)11(,1 -.)22 (1 - r 2 (,1 121rest)2
-
1+ r~2lrest (1 - r2121rest)28
(22)
Finally , combining (21) and (22) and substituting into (18) gives Tm
- n log(1 - r~2Irest ) + (p + 1) log(1 - r~2Irest ) + log
{ I+ri21rest } 2 2 (1 - r 12lrest)
== - (n - p - 1) log (1 - r12lrest ) + log { (1I -+ r121rest )2 } 2 r2~2lrest
(23)
== - (n - p + 1) log (1 - r ~2Irest) + log (1 + r ~2Irest). Note that (23) shows that the modified test statistic is the ordinary likelihood ratio test statistic where n has been replaced by n - p - 1, a multiplicative correction, plus another term which is the log of the ratio of the asymptotic variancesof the interest parameter evaluated under Ho and H A. The adjustment to the multiplicative term is not directly related to dimension of the nuisance parameters, although this might be expected if adjusting for the number of degreesof freedom used up by estimating these parameters. Instead this term is reduced by one more than the number of variables for which the samplepartial correction coefficient is adjusted. This is similar to the caseof linear regressionwhere the degreesof freedom are reduced: one for the mean and one for each variable included in the model. This modified test is a function of the sample partial correlation coefficient alone, which is a maximal invariant for the problem (see Davison, Smith and Whittaker , 1991); as are the tests derived in the previous section. 4. The Null Distributions Moran (1970) (seealso Hayakawa, 1975and Harris and Peers, 1980) showed that in general the likelihood ratio , Wald and efficient scorestatistics have the same asymptotic power. It has been shown here that for excluding a single edge from a graphical Gaussianmodel the four statistics: the generalised likelihood ratio , the Wald, the efficient scoreand the modified profile
EDGE
EXCLUSION
TESTS
FOR
GRAPHICAL
GAUSSIAN
MODELS
563
likelihood , are functions of the sample partial correlation coefficient , as is the test based on Fisher 's Z transformation . Consequently the tests have exactly the same power provided the null distributions are correct . Under the null hypothesis , the square of the sample partial correlation coefficient
from a sample of size n from a p- dimensional
Normal
distribution
hasa Beta(~, ~ ) distribution. Forexample , seeMuirhead(1982 , pI88). That
is 1
fu (u ) =
1 U- l / 2(1 - U)(n- p- 2)/ 2 B ( -2 ' ~.2=E) "
0 < U< 1
where u = ri21rest andB(.,.) is thebetafunction . By usingtherelevant transformation
, the
exact
null
distribution
of the
statistics
above
can
be
obtained . Following Barndorff -Nielsen and Cox (1989, Chapter 3) and expanding the log of the density functions in powers of n - l , the adequacy of the asymptotic chi-squared approximation can be assessed. Consider first the likelihood ratio test . With t == - n log (1 - u ) the density
Ii (t)
of Tl under the null hypothesis
is
1 -1/2( ( B(!2'1~.2 = .E .)u 1-U )n-p-2)/2.1-nU'
-
= nB(!2'!!.= (-t/n)}-1/2exp {-t(n- p)/2n }' 2p.){I - exp
The leadingterm of this expansioncorresponds to the XI distribution and so
1 fl (t) = 9x(t)[1+ 4(t - 1)(2p+ 1) n- lJ + O(n- 2),
t >
0,
where 9x (t ) is the density function of a XI random variable . Finally inte grating , from 0 to x , term by term , gives the expansion for the cumulative distribution function as Fl (X)
= = =
1 (X Gx (x ) + 4 (2p + 1) n - l Jo (t - l )gx (t )dt + O (n - 2) Gx (x ) - 21 (2p + 1) n - 1 ( 2X -;: ) 1/ 2 exp (- x / 2) + O (n - 2) 1 Gx (x ) - 2 (2p + 1) xgx (x ) n - 1 + O (n - 2) , x > 0,
564
PETER
W . F . SMITH
AND
JOE
WHITTAKER
TABLE 1. Coefficients of the O(n- l ) term in the asymptotic expansions of the density and distribution test
statistics
functions
of the five
.
a(t )
likelihood modified Wald score Fisher
b(x )
~(t - 1)(2p+ 1) ~(t - 1) i (t - 1)(2p+ 1) + 3t(3 - t) ~(t - 1)(2p+ 1) + t(3 - t) ~ (3 - 6t + t2)
-
~(2p+ 1) x !x ~(2p+ 1 - 3x) x ~(2p+ 1 - x) x i (x - 3) x
whereGx(x) is the distribution function of a xi randomvariable. These expressions
are of the form
fl (t ) = Fl (X) = with
9x(t ) + al(t ) 9x(t )n- l + O(n- 2) Gx(x) + bl(X) 9x(x) n- l + O(n- 2),
a and b appropriately defined . The approximations for the five test statistics
(24)
considered are of a similar
form and are displayed in Table 1. The expansions for the Wald and efficient score tests are similarly derived and details are included in Appendix A . In the case of the modified profile likelihood test , where 1
fm (t ) =
1
B(~, y )
it is not possible
u - l / 2(I - u )(n- p- 2)/ 2. to obtain
2
- u
n - p + 2+ (n - p)u"
a transformation
u in terms
of t
0< u < 1
(25) explicitly.
However, a function u*(t ), can be found such that u = u*(t ) + O(n- 3) and leads to the evaluation of the coefficients am and bm displayed in the table above; see Appendix B for details . Fisher
' s statistic
, 1
Zf == '2 log can
be
used
to
T121rest ( 11+ T12lrest ),
test for a zero partial
correlation
coefficient . Under
the
null hypothesis, it has expectation zero and variance 1/ [(n - p + 2) - 3] = l / (n - p - 1). SeeFisher (1970 , Ex. 30, p204) ,. -p160 and . . or Muirhead (1982 ~ Theorem 5.3.1) . Usually Z f is standardised and compared with a standard
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS 565 Normal distribution . However here, for comparison , the null distribution
of
T f = (n - p - 1) ZJ is considered and the terms in the asymptotic expansion are given . Details are included in Appendix A as the derivation is similar to that of the likelihood ratio above. The second term b. (x ) 9x (x ) n - 1 in the distribution function expansion gives F . (x ) - G x (x ) . If this coefficient is negative then the test will reject too few hypotheses (a conservative test ) whereas if the coefficient is positive too many will be rejected (a liberal test ). The striking feature of Table 1 is that to order n - 1 the distribution of the modified profile likelihood ratio test and Fisher 's Z f statistics do not depend on p ; hence to this accuracy their distributions do not depend on the number of nuisance parameters . The expansions of the modified profile likelihood and the likelihood ratio density functions are the same when p = 2. In general the actual sizes of the other three tests are increasing with p and hence when the coefficient of n - 1 becomes negative the adequacy of the chi-squared approximation decreases with p . When the number of nuisance parameters is large , that is large p , all the tests are on the conservative side. Inspection of Table 2 reveals that to order n - 1, for a 5% test , the like lihood ratio and efficient score tests are always conservative , whereas the Wald test rejects too few hypotheses for p less than 6 and too many for larger p . For a 1% test the likelihood ratio test is again always conservative and so is the efficient score test apart from when p = 2. The Wald test is liberal until p equals 10. As expected the modified profile statistic does well , but surprisingly so does Fisher 's statistic . 5.
Discussion
The asymptotic expansions of the null distribution functions given in Table 1 above (i ) allow a comparison of test accuracy among the five statistics considered here, (ii ) make explicit how nuisance parameters affect the tests to varying degrees, and (iii ) indicate the effect of sample size. The closed form expressions , of interest in their own right , enable the detailed calcula tion of these expansions . The main conclusion is that the modified profile likelihood test (and Fisher 's Z f test ) do not depend on the dimension p of the vector X . From Table 2 these two tests are in general more accurate than the others . For large p this superiority is uniform , since the accuracy of the others deteriorates . The implication is that it is important to modify the statistic when p is large and n is small . Interestingly the adjusting factor depends linearly on p rather than the number of nuisance parameters in the model which is 0 ( P2) . The signed square root version of the test statistics discussed in A p-
566
PETERW. F. SMITHAND JOEWHITTAKER
TABLE 2. Achievedsignificancelevelsfor the Wald, efficient Score, Likelihood, Modified profile likelihood and Fisher . 's test statistics, with varying samplesizenand dimenslonp. Nominal level 5% n
10
2 0.0126 0.0566 0.0786 0.0786 0.0516 6 0.0585 0.1025 0.1245 10 0.1043 0.1483 0.1703
20
2 0.0313 0.0533 0.0643 0.0643 0.0508 6 0.0542 0.0762 0.0872 10 0.0771 0.0991 0.1101
30
2 0.0375 0.0522 0.0595 0.0595 0.0505 6 0.0528 0.0675 0.0748 10 0.0681 0.0828 0.0901
50
2 6 10
200
0.0425 0.0513 0.0557 0.0517 0.0605 0.0649 0.0609 0.0697 0.0741
0.0557 0.0503
2 0.0481 0.0503 0.0514 0.0514 0.0501 6 0.0504 0.0526 0.0537 10 0.0527 0.0549 0.0560 Nominal level 1%
n
10
p 2
w
s
L
M
F
0.0178 0.0070 0.0193 0.0193 0.0123
6 0.0029 0.0219 0.0342 10 0.0120 0.0368 0.0491
20
2 0.0039 0.0085 0.0147 0.0147 0.0111 6 0.0036 0.0159 0.0221 10 0.0110 0.0234 0.0296
30
2 0.0007 0.0090 0.0131 0.0131 0.0108 6 0.0057 0.0140 0.0181 10 0.0107 0.0189 0.0230
50
2 0.0044 0.0094 0.0119 0.0119 0.0105 6 0.0074 0.0124 0.0148 10 0.0104 0.0154 0.0178
200 2 0.0086 6 0.0094 10 0.0101
0 .0098 0 .0106 0 .0113
0.0105 0.0105 0.0101 0.0112 0.0120
EDGE EXCLUSION TESTS FOR GRAPHICAL GAUSSIAN MODELS 567 pendix D lead to much the same conclusions . These conclusions will generalise to situations in the wider context of graphical model selection : in particular the cases of excluding several edges simultaneously , single edge exclusion from a non-saturated model , and of models involving discrete and mixed variables . All but Fisher 's test generalise conceptually , but it is hard to find closed form expressions. The results of the paper suggest that modifying the profile likelihood is most rewarding , and this can always be done numerically . A small simulation study for excluding two edges indicated this to be the case, Smith (1990) . In particular these show that the modified profile give the most accurate p-values and its accuracy is least affected by an increase in the number of nuisance parameters . Acknowledgements : We should like to thank Antony Davison for some extremely valuable comments on an earlier version of this paper . The work of the first author was supported by a SERa studentship .
6. References Amari , S-I . (1982). Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika, 69, 1- 17. Amari , S-I . (1985). Differential -Geometric Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag: Heidelberg. Barndorff-Nielsen, a .E. (1978). Information and Exponential Families in Statistical Theory. Wiley : New York. Barndorff-Nielsen, a .E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70, 343- 365. Barndorff-Nielsen, a .E. (1986). Inference on full or partial parameters basedon the standardized signed log likelihood ratio . Biometrika, 73, 307- 322. Barndorff-Nielsen, a .E. (1988). Parametric Statistical Models and Likeli hood. Lecture Notes in Statistics 50, Springer-Verlag: Heidelberg. Barndorff-Nielsen, a .E. (1990a). A note on the standardised signed log likelihood ratio . Scand. J. Statist., 17 157- 160. Barndorff-Nielsen, a .E. (1990b). Approximate probabilities . J. R. Statist. Soc.B, 52, 485- 496. Barndorff-Nielsen, a .E. and Cox, D.R. (1989). Asymptotic Techniquesfor Use in Statistics. Chapman and Hall : London. Cox, D.R. and Hinkley, D. V . (1974). TheoreticalStatistics. Chapman and Hall : London. Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference, (with discussion). J. R. Statist. Soc. B, 49, 1- 39. Cox, D.R. and Wermuth, N. (1990). An approximation to maximum like-
568
PETERW. F. SMITHANDJOEWHITTAKER
lihood estimates in reduced models . Biometrika , 77 , 747- 761. Davison , A . C . (1988). Approximate conditional inference in generalised linear models . J . R . Statist . Soc. B -. 50 ~ . 445- 462. Davison , A . C ., Smith , P.W .F . and Whittaker , J . ( 1991) . An exact condi tional test for covariance selection . Austral . J . Statist ., 33 , 313- 318. Deemer , W .L . and aikin , O . (1951). The jacobians of certain matrix trans formations useful in multivariate analysis . Biometrika , 38 , 345- 367. Dempster , A .P. (1972) . Covariance selection . Biometrics , 28 , 157- 175. Efron , B . (1978) . The geometry of exponential families . Ann . Statist ., 6, 362- 376. Fisher , R .A . (1970) . Statistical Methods for Research Workers , 14th Edi tion . Hafner Press: New York . Fraser , D .A .S. (1991) . Statistical inference : likelihood to significance . J . Amer . Statist . Soc., 86 , 258- 265. Frydenburg , M . and Jensen, J .L . (1989). Is the 'improved likelihood ratio statistic ' really improved for the discrete case? Biometrika , 76 , 655661. Graybill , F .A . (1983). Matrices with Applications in Statistics . 2nd Edi tion . Wadsworth : California . Harris , P. and Peers, H .W . (1980) . The local power of the efficient score test statistic . Biometrika , 67 , 525- 529. Hayakawa , T . ( 1975). The likelihood ratio criterion for a composite hypothesis under a local alternative . Biometrika , 62 , 451- 460. Isserlis , L . (1918) . On a formula for the product -moment coefficient of anv ., order of a normal frequency distribution in any number of variables . Biometrika , 12 , 134- 139. Kreiner , S. (1987) . Analysis of multi -dimensional contingency tables by exact conditional tests : techniques and strategies . Scand. J. Stat ., 14 . Lauritzen , B.L . ( 1989) . Mixed graphical association models . Scand. J . Statist ., 16 , 273- 306. McCullagh , P. (1987) . Tensor Methods in Statistics . Chapman and Hall : London . Moran , P. A . P. (1970) . On asymptotically hypotheses . Biometrika , 57 , 45- 55. Muirhead , R .J . (1982) . Aspects of Multivariate New York .
optimal tests of composite Statistical Theory . Wiley :
Pierce , D .A . and Peters , D . (1992). Practical use of higher order asymp totics for multiparameter exponential families (with discussion ) . J. R . Statist . Soc. B , 54 , 701- 737. Smith , P.W .F . (1990) . Edge Exclusion Tests for Graphical Models. Unpub lished Ph .D . thesis . Lancaster University . Speed, T .P. and Kiiveri , H . (1986) . Gaussian Markov distributions over
EDGE EXCLUSIONTESTS FOR GRAPHICAL GAUSSIANMODELS 569 finite graphs. Ann . Statist., 14, 138- 150. Wermuth, N. (1976). Analogies between multiplicative models in contingency tables and covarianceselection. Biometrics, 32, 95- 108. Whittaker , J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley : Chichester.
Appendix A : Expansions Wald and Efficient Score
of the Density Functions for Test Statistics in Section 4
the
The Wald test statistic for excluding a single edge from a graphical Gaussian models is
nu, Tw ==1+u
where u = TI2lrest" Under the null hypothesis Ho : W12= 0 1
fr (u) =
1 U- 1/2(1 - U)(n- p- 2)/2 B (-2 ' !!.= ) , 2
0 < u < 1.
Putting t = n u/ (1 + u) gives the density function of Twas
1 -1/2( B(~,y)u 1-u)(n-p-2)/2.i~_.~n_~2:' n-t)(n-p-2)/2 tt)-1/2(1-~ B~:-1~ (~-=
O< u < l
fw(t) -
n
.
2E
(n - t)2'
0 < t < ~, since u = t / (n - i ). Using Stirling 's approximation, taking logs and expanding gives
logfl (t)=-~{t+log (27rt )}+~{(t- l)(2P - l)+3t(3-t)}n-l+0(n-2). The leadingterm of this expansioncorresponds to the XI distribution and so
1 n fw(t) = gx(t)[l + 4{ (t - 1)(2p+ 1) + 3t (3 - i )} n- l ) + O(n- 2), 0 < t < 2 ' wheregx(t) is the density function of a xi randomvariable. Finally integrating from 0 to x, term by term, givesthe expansionfor the cumulative
570
PETER W . F . SMITH AND JOE WHITTAKER
distribution
function
Fw (x )
1 (X Gx (x ) + 4n - l Jo { (t - 1) (2p + 1) + 3t (3 - t ) } gx (t )dt + O (n - 2)
=
M
=
Gx (x ) - 2n 1 - 1(2p + 1 - 3x ) ( 2X; ) 1/ 2 exp ( - x / 2) + O (n - 2) 1
=
Gx (x ) - 2 (2p + 1 - 3x ) xgx (x ) n - 1 + O (n - 2) ,
where Gx (x ) is the distribution
function
of a X~ random
x > 0,
variable .
The efficient score test statistic for excluding a single edge from a graph ical Gaussian model is
Ts = nu . Putting t = n u gives the density function of Ts under the null hypothesis as
f s(t)
-
-
1 -1/2 B(~,T)u (1-u)(n-p-2)/2..n!.' nB(!,1T)(;t:;:)-1/2(1-~)(n-p-2)/2
0 < u < 1,
0 < t < n,
since u = tin . . gIves
expanding
1 1 logf s(t) = - 2{ t + log(27rt)} + 4{ (t - 1)(2p+ 1) + t (3 - t)} n- l + O(n- 2). Now this expansionis identical to that for the Wald statistic, apart from 3 t (3 - t) is replacedby t (3 - f). Hence 1 f s(t) = 9x(t)[1 + 4{ (t - 1)(2p+ 1) + t (3 - t)} n- l ) + O(n- 2), 0 < t < n, 1 Fs(x) = Gx(x) - 2(2p+ 1 - x) X9x(x) n- l + O(n- 2), x > o.
Under the null hypothesis the density function of R = R121rest is 1
fR (r ) =
B (l2 ' !!=2E. )
(1 - r2)(n- p- 2)/2
,
- 1 < r < 1,
EDGEEXCLUSION TESTSFORGRAPHICAL GAUSSIAN MODELS571 Muirhead (1982, pI88 ). So the density function of Tf is
1 2(n-p-2}/2 2(1- r2) B(~~ )(l - r ) . (n- p- 1) log(~ )
-
ff (t )
2(1 - r2)(n- p)/2
1
-
B(~~-~ (n- p- 1) log(~ ) ' where
[2{tj(n- p- 1)}1/2] - 1 r - exp - exp [2{tj(n - p- 1)}1/2] + 1. Expanding thelogoff f(t) asbefore gives 1
Hence
f
f
( t )
=
gx
( t )
+
12
( 3
-
6t
+
t2
) n
-
) n
-
l
+
O
( n
-
2 ) .
-
log
ff (t) Ff(x)
Appendix
1
=
gx
( t ) [ l
+
12
( 3
-
6t
+
t2
l )
+
O
( n
-
2 ) ,
t
l
+
O
( n
-
2 ) ,
x
>
0 ,
1
=
Gx
( x
)
-
6
( x
-
B : The Derivation
3 )
xgx
( x
)
n
-
>
O .
of u*(t ) in Section 4
Recallt = - (n - p + 1) log (1 - u) + log (1 + u), whereu = r ~2Irest . Put
Ul - (1 - Ul) log(1 + ul )n- l
U2
U+ ~n- 2(1 - u) log (1 + u) ( 2P+ log (1 + u) - ~
) + O(n- 3).
So U = U2 + 0 (n - 2) . Finally put U3
-
12
(
U2- -2n- (1 - U2) log(1 + U2) 2p+ log (1 + U2) -
4)
1 + U2
U + O (n - 3) .
Recursively substituting for U2 and then Ul gives U3 as a function of t which is a equal to u to order n - 2 and hence is the required u* (t ) . Then am and bm are obtained by substituting u* (t ) in (25) and expanding as for the other test statistics .
572
PETER
W . F . SMITH
AND
JOE WHITTAKER
Appendix C : Proofs of ( 14) and (16) Sincen = ~- 1 (and files are invariant under continuous transformations) -
(. Vii
--JEiiJ IEJ '
where l~ iiJ is the of O'ii and so does not contain 0' 12 when i = 1 -- co-factor .... or 2. By (15) , IEiil = IEiil , for i = 1, 2, and hence ..-J L ' ii J I L /.1.. -\.AJ - 'f 1.1.
lEI lEI =
Wii( 1 - ri2 /rest) ,
which proves (14). For (16) note that -0 = (;;12(= (;;21) = ~ lEI
==: >
JE211= 0,
(26)
...where I~ 211is the cofactor of a21. Expanding 1~ 211about the first row of -L' 21 gives p IE211= - L alkl [E.21] lkl , k=2
(27)
where I[~ 21] 1kI is the cofactor of O'--lk from the submatrix of E without the first column and second row...... So I[E21] lkl does not contain 0: 12 (= 0:21) and hence, by (15) , is equal to I[~ 21] 1kI for all k . Combining (26) and (27) gives
0 = - 1~211 -
since alk = alk , k # 2
-
-
By rearranging,
a12 -
_ ~3-]I1-21 1+21 a12 I[JE
EDGE
EXCLUSION
TESTS
FOR
GRAPHICAL
.-
I~ I -
I~ I
=
Substituting
r121rest
for
MODELS
573
.-
1~ 211 .-
=
GAUSSIAN
.0' 12
+
1[ ~ 21 ] 121 (,1) 12
..(,1)11 (,1)22 -
.-
- 2 + (,1) 12
a12 .
- (; ) 12 ( (; ) 11 (;) 22 ) - 1/ 2 and
812
for
0: 12
gives
the
result
( 16 ) .
Appendix
D
There
is recent
with
null
1992
part
applied
signed ratio
( 11 )
that
,
deriving
,
1990a
, one
- root
of by
199Gb
;
- side
hypotheses
versions are
modification
of
given
Fraser
and
square ( 1992
Z
,
signed
square
- root
tests
the
Normal
distribution
1991
;
and
Pierce
have
Wald
=
proposed
modifying
Peters
the
by
statistic
directly
and
modifications
especial
Peters
,
relevance
.
( 17 ) . The
by
Tests
approximated
because
statistics
and
Pierce
1986 is
square
test
calculated
of
in
- root
better
, this
studies
The
(8 ) ,
Square
interest
- Nielsen
) ; in
hood
Signed
distributions
( Barndorff
in
:
the
( r12Irest
by
Barndorff
square
- rooting
do
, efficient
sgn
- root
not
score
and
likeli
) T1 / 2 together - Nielsen
test
( 1986
statistic
commute
. ) From
,
-
with
Zl .
)
is
( Note
equation
(8 )
) .1
Zm
=
Zl
+
log 2Zl
The
relevant
values
( 22 ) ; giving partial
a closed
correlation Asymptotic
score
and
for
can
and
efficient
be
to given
the
by be
score required
are
that
for
have
the by
lower
order
, so
to
from
those
. The
bounds a comparison
is
test
Section the
not for
in
sample
Wald
, efficient
. The
resulting
4 with
the
argu
modified
to
that invert statistic
-
density
distribution mind
possible the
the
chi - square
cumulative
bearing . It
the
's Z f in
of
and
.
compare
correspondingly
Normal
( 8 ) , ( 11 ) , ( 21 )
here
Fisher
integration
( 28 )
is a function
write to
with
. Zw
again to
similar , and
Z l log
obtained
calculated
along
squares
obtained
Zl
which
be
1
func the
-
Wald
Zm
, not
cannot
.
Numerical reveals
can
their
ILbb I
complicated
tests
densities
by
then
the
too
-
are
expression is
ratio
replaced
tions
even
form
expansions
replaced
function
substitution
, but
likelihood
expansion ments
for
ILbb I "' " -
nothing
results new
are .
given
in
Table
3 , but
comparison
with
Table
2
HEPATITIS B:
A
CASE
STUDY
IN
MCMC
D. J. SPIEGELHALTER MRCBiostatisticsUnit Instituteof PublicHealth Cambridge CB22SR UK N . G . BEST
Dept Epidemiology and Public H faith Imperial College School of Medicine at St Mary 's London
W2
1PG
UK W . R . GILKS
M RC
Biostatistics
Unit
Institute of Public Health Cambridge
CB2 2SR
UK AND
H . INSKIP
MRC Environmental Epidemiology Unit Southampton General Hospital Southampton
5016
6YD
575
DAVIDJ. SPIEGELHALTER ET AL.
576
1.
Introduction
This chapter features a worked example using Bayesian graphical modelling and the most basic of MCMC techniques , the Gibbs samplei', and serves to introduce ideas that are developed more fully in other chapters . This case study first appeared in Gilks , Richardson and Spiegelhalter ( 1996), and frequent reference is made to other chapters in that book . Our data for this exercise are serial antibody -titre measurements , obtained from Gambian infants after hepatitis B immunization . We begin our analysis with an initial statistical model , and describe the use of the Gibbs sampler to obtain inferences from it , briefly touching upon issues of convergence, presentation of results , model checking and model criticism . We then step through some elaborations of the initial model , emphasizing the comparative ease of adding realistic complerity to the traditional , rather simplistic , statistical assumptions ; in particular , \ve illustrate the accommodation of covariate measurernellt error . The Appendix conta,ins some details of a freely available software package (BUGS, Spiegelhalter et al ., 1994) , within which all the analyses in this chapter were carried out . We emphasize that the analyses presented here cannot be considered the definitive approach to this or any other da.taset , but merely illustrate some of the possibilities afforded by computer -intensive MCMC methods . Further details are provided in other chapters in this volume . 2. 2 .1.
Hepatitis
B immunization
BACKGROUND
Hepatitis B ( HB ) is endemic in many parts of the world . In highly endemic areas such as West Africa , almost everyone is infected with the HB virus during childhood . About 20% of those infected , and particularly those who acquire the infection very early in life , do not completely clear the infection and go on to become chronic carriers of the virus . Such carriers are at increased risk of chronic liver disease in adult life and liver cancer is a major cause of death in this region . The Gambian Hepatitis Intervention Study (GHIS ) is a national pro gramme of vaccination against HB , designed to reduce the incidence of HB carriage (Whittle et at., 1991) . The effectiveness of this programme will depend on the duration of immunity that HB vaccination affords . To study this , a cohort of vaccinated GHIS infants was followed up . Blood samples were periodically collected from each infant , and the amount of surface-antibody was measured . This measurement is called the anti -HBs titre , and is measured in milli -lnternational Units (mIU ). A similar study
HEPATITIS
in
neighbouring
where
is
Senegal
t
of
B
denotes
since
may
equivalent
to
a
where
y
anti
- HBs
the
infant
vary
linear
denotes
vaccine
Here
et
we
ale
a
would
In
,
via
2 .2 .
Figure
1
from - HBs
the the
titre
of
titre
288
post ,
to
Of
titre
the
et
titre
al
and
. ,
constant
1991
log
) .
time
This
:
t ,
Q' i
is
( 2 )
constant
validate
after
the
the 1 ,
as
in
the
findings
final
of
plausibility
( 2 ) .
predicting
76
dose
Coursaget
of
This
individuals
relationship
individual
a
or
with
a .
at
)
log
,
if
protectiol1
- log
These
the
true
,
against
,
of
the
a
made
of a
,
these
and
infants
infants
- monthly
106
baseline
vaccination
For ( 30
six
subset had
final .
approximately
for
each
subsequently were
at
scale
infants
time
taken
three
note
is
the
of
the
1
data
,
with
a
two
intervals
for
infant 1
each
labelled
mIU
at
over
measurements
,
being
.t
infant
,
but
and
possibly
days be
time
tha
with
826
could
titre
suggests
intercepts
behaviour
change of
the
Figure
different
from
atypical the
in
to
have
rose
after
a to
an
' * '
in
these
Figure
mIU as
outlying to
be
different
of
subject
might that
1329
thought
i .e .
it
1 , at
an
day
outlier
gradient
;
extraneous
or
error
,
. exploratory
line
[ Yij
]
expectation infant
analysis
,
for
each
infant
in
Figure
1
we
:
E
for
data
lines to
preliminary
denotes
the
straight allowed
both
straight
E
on study
measurements
observations a
plotted - up
of
to
one
As
,
taken
somewhat
outlying
where
log
investigate
for
data
particular
respect
observation
~ and
log
x
to
minus
follow
apparently
This
to
fitted
raw
- baseline
be
.
whose
i .e .
vaccination
( Coursaget
and
measurements
fit
should
gradients
due
we
tool
examination
reasonable
with
1
( 1 )
.
Initial
.
,
GHIS
vaccination
1077
-
data
measurement
two
measurements
lines
final
,
ANALYSIS
shows
least
final
~ t
( 1 ) .
infants
total
Q' i
titre
of
simple
PRELIMINARY
anti
' s
577
infants
infants
=
GHIS
particular
a
cx
MCMC
i .
gradient
provide
HB
at
) .
all
between
- REs
the
common
IN
for
between
infant
analyse
( 1991
having
anti
each
STUDY
titre
relationship
log
for
CASE
that
Y
of
A
concluded
time
proportionality
:
=
ai
+
and i .
We
standardized
, Bi
( log
tij
-
log
subscripts
ij log
t
730
index around
)
,
( 3 )
the
jth
log
730
post for
- baseline numerical
578
DAVIDtJ. SPIEGELHALTER ET AL.
0 0 0 0 0 .
-
0
8 0 .
-
0 0 0 -
0 0 -
0 -
-
300
400
500
600
time since final vaccination
700
800
900
1000
(days )
Figure 1. Itaw data for a subset of 106 GHIS infants: straight lines connect anti-HBS measurements for each infant .
(nIW) eJ~!~S8H~!~ue
stability ; thus the intercept Qi represents estimated log titre at two years post -baseline . The regressions were performed independently for each infant using ordinary least squares, and the results are shown in Figure 2. The distribution of the 106 estimated intercepts { ai } in Figure 2 appears reasonably Gaussian apart from the single negative value associated with " infant ' * ' mentioned above. The distribution of the estimated gradients { ,6i} also appears Gaussian apart from a few high estimates , particularly that for infant ' * ' . Thirteen ( 12%) of the infants have a positive estimated gradient , while four (4%) have a ' high ' estimated gradient greater than 2.0. Plotting estimated intercepts against gradients suggests independence of G:i and fJi, apart from the clear outlier for infant ' * ' . This analysis did not explicitly take account of baseline log titre , YiO: the final plot in Figure 2 suggests a positive relationship between YiOand Qi, indicating that a high baseline titre predisposes towards high subsequent titres . Our primary interest is in the population from which these 106 infants were drawn , rather than in the 106 infants themselves. Independently applying the linear regression model (3) to each infant does not provide a basis for inference about the population ; for this , we must build into our model assumptions about the underlying population distribution of Qi and ,6i. Thus we are concerned with 'random -effects growth -curve ' models . If
HEPATITIS B: A CASESTUDYIN MCMC
579
0 N a
-4 -2 0 2 4 6 8 10 Intercepts :logtitre at2years
tl) N 0 rl {) .
0
20
10 Gradients
CX ) ..q-
... I .~~ . 1\'8' " . -2 0 2 4 6 8 10 Intercept
. ~
.
"
~
.
.
. . ,- J;~
. ' .. ~ . .
.
.
)- .
81 '. . .
. .
.
.
NI
2
8
10
Baseline log titre
Figure 2. Results of independently fitting straight lines to the data for each of the infants in Figure 1.
\\'e are willing to make certain simplifying assumptions and asymptotic approximations , then a variety of techniques are available for fitting such models , such as restricted maximum likelihood or penalized quasi-likelihood (Breslow and Clayton , 1993) . Alternatively , we can take the more general approach of simulating 'exact ' solutions , where the accuracy of the solution depends only the computational care taken .
3. Modelling
Specification of model quantities and their qualitative conditional in dependence structure : we and other authors in this volume find it convenient to use a graphical representation at this stage. Specification of the parametric form of the direct relationships between these quantities : this provides the likelihood terms in the model . Each of these terms may have a standard form but , by connecting them together according to the specified conditional -independence structure , models of arbitrary complexity may be constructed . Specification of prior distributions for parameters : see Gilks et at. ( 1996) for a brief introduction to Bayesian inference .
-
-
I
_
-
I
11111111111I1111I111
This section identifies three distinct components in the construction of a full probability model , and applies them in the analysis of the GHIS data :
580
DAVIDJ. SPIEGELHALTER ET AL.
3.1. STRUCTURAL MODELLING We make the following minimal structural assumptions based on the exploratory analysis above. The Yij are independent conditional on their mean J.Lij and on a parameter 0" that governs the sampling error . For an individual i , each mean lies on a 'growth curve ' such that J.Lij is a deterministic func tion of time tij and of intercept and gradient parameters ai and {Ji. The ai are independently drawn from a distribution parameterized by ao and 0"Q, while the ,Bi are independently drawn from a distribution parameterized by ,Boand O"(j . Figure 3 shows a directed acyclic graph (DAG ) representing these assumptions ( directed because each link between nodes is an arrow ; acyclic because, by following the directions of the arrows , it is impossible to return to a node after leaving it ) . Each quantity in the model appears as a node in the graph , and directed links correspond to direct dependencies as specified above: solid arrows are probabilistic dependencies, while dashed arrows show functional (deterministic ) relationships . The latter are included to simplify the graph but are collapsed over when identifying probabilis tic relationships . Repetitive structures , of blood -samples within infants for example , are shown as stacked 'sheets' . There is no essential difference between any node in the graph in that each is con-sidered a random quantity , but it is convenient to use some graphical notation : here we use a double rectangle to denote quantities assumed fixed by the design (i .e. sampling times tij ) , single rectangles to indicate observed data , and circles to represent all unknown quantities . To interpret the graph , it will help to introduce some fairly self-explanat ory definitions . Let v be a node in the graph , and V be the set of all nodes. We define a ' parent ' of v to be any node with an arrow emanating from it pointing to v , and a ' descendant ' of v to be any node on a directed path starting from v . In identifying parents and descendants, deterministic links are collapsed so that , for example , the parents of Yij are ai , ,Bi and 0". The graph represents the following formal assumption : for any node v , if we know the value of its parents , then no other nodes would be inform ative concerning v except descendants of v . The genetic analogy is clear : if we know your parents ' genetic structure , then no other individual will gi ve any additional information concerning your genes except one of your descendants. Thomas and Gauderman ( 1996) illustrate the use of graphical models in genetics . Although no probabilistic model has yet been specified , the conditional independencies expressed by the above assumptions permit many prop erties of the model to be derived ; see for example Lauritzen et ale ( 1990) , Whittaker ( 1990) or Spiegelhalter et al. ( 1993) for discussion of how to read
HEPATITIS B: A CASESTUDYIN MCMC
Figure 9 .
off
independence
the
graph
served data initially
the example
Our cation we
now
joint
of
the show
a graph
, it
forms of
3 .2 .
PROBABILITY
The
preceding
pretation ( Lauritzen
, such
are
model
will
that
when
, Yi2 , Yi3
essentials
distribution
ditional
retained
is important
full
change
have
no
Cti
and
as
understand any
when
upon
, dependence
that
data
. is
0 b -
conditioning
common
' ancestor
fJi , this
conditioning
observed
to before
on ' will
be
independence
other
will
quantities
between
Cti
. For
and
, Bi ma .y
. use
of
Yil
. It
the
nodes
independent
, when
induced
DAGs
properties
, although
be
model for hepatitis B data .
of
independence
necessarily
example
from
properties
marginally
not
be
properties represents
, and . For
Graphical
581
in
this
of
the
a
all
convenient
model
discussion
we et
is
primarily
without basis
quantities
to
needing for
the
facilitate algebra
cornmuni . Howevel
specification
of
~, as
the
full
.
MODELLING
independence . If
example model
wish
al . , 1990
of
graphical
properties to
construct ) that
models
without a full a
DAG
has
been
necessarily pro
model
in
terms
of
a probabilistic
babili
ty
is
equivalent
model
, it to
can
con inter
be
assuming
shown that
-
582
DAVIDJ. SPIEGELHALTER ET AL.
the joint distribution of all the randomquantitiesis fully specifiedin terms of the conditionaldistribution of eachnodegivenits parents: P(1f) = II P(v I parents[v]), VfV
(4)
whel'e P ( .) denotes a probability distribution . This factorization not only allows extremely complex models to be built up from local components , but also provides an efficient basis for the implementation of some forms of MCMC
methods
.
For our example , we therefore need to specify exact forms of 'parent child ' relationships on the graph shown in Figure 3. We shall make the initial assumption of normality both for within - and between-infant variability , although this will be relaxed in later sections. We shall also assume a simple
linear relationship between expected log titre and log time , as in (3). The likelihood
terms
in the model
Yij
are therefore
I"V N(J1 ,ij , 0"2) ,
(5)
JLij == Qi + !3i(log tij - log 730),
(6)
ai rv N(ao, a; ),
(7)
,Bi I"V N(,Oo ,O "J),
(8)
where 'rv' means 'distributed as', and N(a, b) generically denotesa normal distribution with mean a and variance b. Scaling log t around log 730 makes the assumed prior independence of gradient and intercept more plausible , as suggested in Figure 2.
3.3. PRIORDISTRIBUTIONS To complete
the specification
of a full probability
model , we require prior
distributionson the nodeswithoutparents : 0'2, ao, 0'; , ,80andO '~. These nodes are known as 'founders ' in genetics. In a scientific context , we would often
like these priors
to be not too influential
in the final
conclusions ,
al though if there is only weak evidence from the data concerning some secondary aspects of a model , such as the degree of smoothness to be expected in a set of adjacent observations , it may be very useful to be able to include
external information in the form of fairly informative prior distributions . In hierarchical models such as ours , it is particularly important to avoid casual use of standard
improper
priors since these may result in improper
posterior
distributions (DuMouchel and Waternaux, 1992); see also Clayton (1996) and Carlin ( 1996) . The priors chosen for our analysis are
0:0, ,80 rv N(O, 10000),
a- 2, a~2, a~2 rv Ga(O.Ol, 0.01),
(9)
(10)
HEPATITIS B: A CASESTUDYIN MCMC
where
Ga
and
(
a
,
b
variance
might
generically
/
expect
have
(
)
a
b2
the
precisions
to
)
magnitude
We
at
all
.
a
estimate
our
,
of
1994
;
,
starting
-
any
full
total
the
,
:
In
,
(
'
or
forget
all
the
order
of
deviations
BUGS
.
at
.
(
software
1996
)
(
for
a
Gilks
et
description
sampling
unobserved
:
nodes
uno
(
parameters
of
the
of
length
of
the
a
be
of
the
,
for
con
burn
-
in
'
-
required
;
calculated
a
-
compu
from
unobserved
sampling
statistics
'
more
is
must
values
be
;
algorithm
interest
Gibbs
ust
upon
whether
MCMC
true
summary
the
identify
or
III
decided
on
to
about
node
them
decide
perhaps
quantities
examine
bserved
from
to
or
inference
each
of
volume
the
choice
any
other
'
widely
nodes
fifth
step
evidence
of
.
should
lack
of
fit
of
these
steps
briefly
;
further
details
are
provided
.
its
to
extreme
initial
the
the
useful
,
this
possibility
of
of
a
the
the
that
the
Gelman
very
,
long
posterior
of
-
in
However
(
the
is
the
no
guarantee
not
,
Raftery
,
main
1996
very
)
.
support
by
other
runs
are
.
aggravated
On
for
number
)
towards
.
Gibbs
enough
conclusions
1996
burn
converge
the
long
a
being
posterior
run
perform
check
to
since
be
to
(
to
fail
tails
mode
to
unimportant
should
values
may
,
the
is
lead
sampler
extreme
at
It
values
could
the
is
)
starting
distribution
in
simulation
.
of
values
,
posterior
values
sampler
states
choice
cases
starting
starting
the
starting
severe
instability
of
MCMC
dispersed
sensitive
the
to
this
principle
with
of
an
INITIALIZATION
to
In
for
for
discuss
in
sampler
it
80
.
now
.
of
least
Gibbs
each
implementation
added
We
1
et
parameterization
satisfactory
elsewhere
.
Examination
at
implement
sampling
,
statistics
model
the
Gilks
for
monitored
length
efficient
be
See
for
be
run
output
a
,
;
for
must
the
summary
also
.
to
distributions
tationally
For
)
provided
)
methods
output
the
and
components
standard
using
required
be
data
and
and
b
sampling
1996
are
must
conditional
the
-
steps
missing
structed
-
four
values
and
,
.
are
posterior
sampling
.
ao
variance
10
/
we
.
general
-
at
a
,
since
the
deviation
Gibbs
et
analysis
of
deviations
Gibbs
mean
distributions
the
inverse
corresponding
by
with
probability
on
the
standard
the
model
' oper
standard
using
sampling
In
and
prior
model
distribution
pl
effect
,
prior
Spiegelhalter
Gibbs
4
have
than
Fitting
gamma
are
minimal
these
greater
.
have
100
shows
a
these
deviation
results
4
Although
them
standard
final
denotes
.
583
numerical
hand
,
of
success
starting
if
584
DAVIDJ. SPIEGELHALTER ET AL.
the sampler is not miring well , i .e. if it is not moving fluidly around the support of the posterior . We performed three runs with starting values shown in Table 1. The first run starts at values considered plausible in the light of Figure 2, while the second and third represent substantial deviations in initial values. In particular , run 2 is intended to represent a situation in which there is low measurement error but large between-individual variability , while run :3 repl'esents very similar individuals with very high measurement error . From these parameters , initial values for for Qi and ,f3i were indepen dently generated from ( 7) and (8) . Such 'forwards sampling ' is the default strategy in the BUGSsoftware . Parameter
Run 1
Run2 Run 3
5.0
20.0
-10.00
- 1.0
-5.0
5.00
O" a
2.0
20.0
0.20
U {3
0.5
5.0
0.05
1.0
0.1
10.00
ao
,80
G'
TABLE 1. Starting valuesfor parameters in three runs of the Gibbs sampler
4.2. SAMPLING FROM FULL CONDITIONAL DISTRIBUTIONS Gibbs sampling works by iteratively drawing samples from the full condi tional distributions of unobserved nodes in the graph . The full conditional distribution for a node is the distribution of that node given current or known val lies for all the other nodes in the graph . For a directed graphical model , we can exploit the structure of the joint distribution given in (4). For any node v , we may denote the remaining nodes by V- v, and from (4) it follows that the full conditional distribution P (vIV - v) has the form
P(v I V-v)
cx
P(v, V-v)
cx
wEcht .
P (w I parents[w]),
( 11)
where cx means 'proportional to '. (The proportionality constant, which ensures that the distribution integrates to 1, will in general be a function of
HEPATITIS B: A CASESTUDYIN MCMC the
remaining
nodes
tribution
for
v
components
'
only
co
For
(
to
11
-
'
,
tells
the
us
(
5
,
6
)
,
)
We
see
prior
(
of
v
of
intercept
,t
v
I
tile
hill
pal
the
.
ai
the
,
given
v
The
number
of
(
7
)
,
)
dis
and
and
for
and
co
-
any
parents
,
.
prescription
ai
ni
is
proportional
likelihood
observations
-
likelihood
general
for
by
]
,
of
ai
v
conditional
children
children
term
[
full
,
conditional
' ents
distribution
for
is
tlla
(
' rhus
the
collditional
prior
)
parents
of
the
full
ni
11
.
its
parents
the
where
' om
P
child
other
the
fI
component
values
consider
of
by
.
the
are
that
product
given
v
each
on
parents
example
)
-
from
depends
where
I
a
arising
node
of
'
contains
585
OIl
terms
the
ith
,
infant
.
Thus
P
(
ai
I
.
)
cx
exp
-
2
{
(
2aa -
ai
exp
where
the
except
of
'
ai
(
12
)
,
,
it
.
(
'
i
in
. e
can
P
.
V
be
(
-
ai
CXi
shown
[
n =
I
)
.
"
X
~
O
)
2
}
Yij
-
Q '
i
-
13i
{
l
.
)
all
data
nodes
completing
that
(
log 2
tij
-
log
730
)
the
P
(
ai
I
.
)
is
., 2
and
square
a
,
]
2 a
denotes
By
(12)
-
111
j
(
normal
all
for
}
parameter
ai
distribution
in
nodes
the
exponent
with
mean
; f + ~ ~ j ~l Yij - j3i(logtij - log730) a
1
-0- 2
n "
+
~ 0- 2
a
and variance
1 l0 - 2
+
.!.!.L . 0- 2
a
The full conditionals for ~i , ao and !3ocan similarly be shown to be normal distributions
.
The full conditional distribution
for the precision parameter a~ 2 can
also be easily worked out . Let Ta denote a;;2. The general prescription ( 11) tells us that the full conditional for TCY . is proportional to the product of the
prior for TO ' , given by ( 10), and the 'likelihood' terms for TO ' , given by (7) for each i . These are the likelihood are the only children
P(Tcx I .)
<X
tel 'ills for Ta because the Q'i parameters
of TO ' . Thus we have
106 1 {I
}
T~.01- 1e- 0.O1Ta II TJexp - "iTQ ' (ai - ao)2 1= 1
-
cx
0 01 + 1Q . - 1
1
2
TO '. 2 exp {-TQ ' (0.01+'2~ 106 (ai- ao ) )} Ga ' 0.01+'2 ~ (ai- ao )2). (0.01+2106 1106
586
DAVIDJ. SPIEGELHALTER ET AL.
Tilus
the
The
full
full
to
be
In
ley
,
1987
plify
) .
so
MONITORING
The
values
must
be
many
three
method
of two
monitored
stabilize
of
length
runs
1
2
to
can
distribution
similarly
normal
.
be
. for
shown
gamma
are
Gilks
( 1996
dis
example
distributions
techniques see
or
( see
do
available
-
sim
-
con
-
not
for
-
Rip
efficiently
) .
generated
use
iterations
Rubin
are
of
of of
,
quickly
given
down
,
in
Gelman
10
called
CODA
values
of 3
1 .
took
) .
,
for
around
)
statistics
Details
of
Each
and
the
( Best , 80
sampler and
( 1992
Table
BUGS
Gibbs miring
( 1996
using
run
the
check Rubin
as
by
sampled
settled
and
started
S - functions the
to
Gelman
SPARCstation
suite
by
summarized
000
a
trace 2
gamma
-
conditional
quantities
5
the the
and
;
the
on
using shows
full
statistically
a .nd
minutes
4
,
illustrate
Gelman
a
straightforward
, several
unknown
we
another
OUTPUT
and
I ' tIllS
around
while
THE
Here
is
and
reduce is
distributions
the
2
conditionals
applications
such
for
.
full sampling
. However
graphically
vergence
to
In
from
4 .3 .
all
which
conveniently
sampling
Figure
,
from
T Q
a ~
.
exarnple ,
for
for
distributions
tllis
butions
distribution
distributions
galnrna
tri
on
conditional
conditional
et the
run runs
aI
. ,
three 700
the
took were
1995
) .
runs
:
iterations
.
Table 2 shows Gelman - Rubin statistics for four parameters being mon itored (b4 is defined below ) . For each parameter , the statistic estimates the reduction in the pooled estimate of its variance if the runs were continued indefinitely . The estimates and their 97.5% points are near 1, indicating that reasonable convergence has occurred for these parameters . Parameter
Estimate
,80
1.03
1 . 11
0" / 3
1.01
1 . 02
0-
1.00
1 .00
b4
1.00
1 . 00
TABLE
2 .
97.5% quantile
Gelman- Rubin statistics for
four parameters
The Gelman - Rubin statistics can be calculated sequentially as the runs proceed , and plotted as in Figure 5: such a display would be a valuable tool in parallel implementation of a Gibbs sampler . These plots suggest discarding the first 1 000 iterations of each run and then pooling the remaining 3 x 4000 samples.
587
mil
HEPATITIS B: A CASESTUDYIN MCMC
0
1000
2000
3000
4000
5000
Iteration
Figure 4. Sampled values for .80 from three runs of the Gibbs sampler applied to the model of Section 3; starting values are given in Table 1: run 1, solid line ; run 2, dotted line ; run 3, broken line .
588
DAVIDJ. SPIEGELHALTER ET AL.
b4
betaO 0
0 0 < 0 0 ~
' . : ' -,
. "
. .
.
. .
.
. .
.
I.()
.
: .
:1
~
0
:
C\ J
~
0j~ ,
,
.
0
0
2000
0
4000
iteration
O!Js!JeJs~-~
O!lS!lBlS ~-E)
0
0
2000
4000
iteration
.
sigma.beta
sigma 0
0
0
0
0
~
. . -
.. ..
: : :
. .. Ii 0 t
O!lS!le~S ~ -E)
a
. .
.
. .
: : :
. ..
.
.
.
.
.
.
.
.
.
.
0
:
U)
: . . .
.
.
.
.
l _-
0
2000
4000
iteration
O!Js!Jets~ -~
.
.
.
0 0 l ! )
L ~ 2000
_4000
iteration
Figure 5. Gelman - Rubin statistics for four parameters from three parallel runs . At each iteration the median and 97.5 centile of the statistic are calculated and plotted , based on the sampled values of the parameter up to that iteration . Solid lines : medians ; broken lines : 97.5 centiles . Convergence is suggested when the plotted values closely approach 1.
HEPATITIS B: A CASESTUDYIN MCMC
5S9
4.4. INFERENCEFROMTHE OUTPUT Figure
6
shows
evidence of
the
is
not
gradient
are
4 .5 .
,
provided
maximum - or
- fit
ically
to
classical sumptions ,
not
care
.
and
.
Gelfand
for ,
) ,
( 1996 assessing
we
measuring
simply
again
a
that
that
'd
) ,
our
' .
in
example
is
no
.
In
-
con
-
large theol
' Y
particular
( 1996
describe
) ~ a
George
variety
of
. standard
may
by
as
with
Meng )
which
model
consider
likelihood
and
-
- subject
require
( 1996
-
compared
models
adequacy
manner
good specif
also
subjects
- level
,n
be )
within
therefore
Smith
assumed
may
asymptotic
Gellna
for are
( 1986
multi
model
the
standardized
of this
' GG
basis
between
a ,nd
an
is for
estimates
standard
of
improving
from
emphasize
define
( 1996
illustrate
The
value
headed
natural
Solomon
comparison
Phillips
and
departures
a
and
standal and
and
this
column
deviance
Cox
fitting
which
Raftery )
0 .3 .
statistics
the
indepelldence the
criticism
( 1996
1 :
parameter
[ I ' om
assuming
for
) ~ in
the
of .
allow
Model
McCulloch
Here
,
minus
clear size
around
Summary
provide
since
departures
parameters
hold
techniques
,
measures
data
be
is
absolute
- FIT
models
methods of
does
18
There
the to
2 . 1 .
page
.
although
estimated
methods
detecting
such
MCMC
numbers
( see
- OF
minimize
for in
trast
3
nested
tests
0' { 3
values
~
Section
comparison
alternative
sampled
around
in
Table
model
designed
between
with
likelihood
and
the
gradients
concentrated
GOODNESS
Standard ness
,
noted
in
ASSESSING
is
as
of
the
great
, 80
interest
model
plots
between
variability
particular
We
density
variability
underlying
we
kernel
of
be
means
statistics
calculated a
,
definitive
for although analysis
.
residual
Yij Tij
-
/ lij
= a
with
mean
zero
and
uals
can
be
calculated
and
can
be
used
culate
a
normal
model
,
and
here
mean
and
fourth
is Meng
truly
Various
we
to
( 1996
)
.
values For
These
of
IIij
example
deviations
of
.
,
from
standardized
we
the
resid and can
a
) ,
cal
-
assumed
residuals
could
be
calculate
of
then for
model
current
detect
functions
moment
normal
( using
error
statistics
intended
4
the
assumed
iteration
b
bution
the
summary is
.
under
each
construct that
error
1
at to
statistic
considered
variance
more
the
this formal
-
1 288
~
r . . Z)
. . 4 Z) ,
standardized
statistic assessment
residual
should
be of
such
.
close
If
the
to summary
error
3 ;
see
distri
-
Gelman statistics
.
590
DAVIDJ. SPIEGELHALTER ET AL.
C\ I
N 0 ~
0
0
5
10
-1.5
-1.0
-0.5
betaO
b4
l {)
C\ J
0
0
0.0
0.5
sigma
1.0
0.0
0.5
1.0
1.5
sigma.beta
Figure 6. Kernel density plots of sampled valuesfor parameters of the model of Section 3 based on three pooled runs, each of 4 000 iterations after 1000 iterations burn-in . Results are shown for ,80, the mean gradient in the population ; u,B, the standard deviation of gradients in the population ; 0' , the sampling error; and b4, the standardized fourth moment of the residuals.
HEPATITISB: A CASESTUDYIN MCMC
591
Figure 6 clearly shows that sampled values of b4 are substantially greater
than 3 (mean = 4.9; 95% interval from 3.2 to 7.2). This strongly indicates that the residuals are not normally distributed . 5.
Model
5 .1.
elaboration
HEAVY
- TAILED
DISTRIBUTIONS
The data and discussion
in Section 2 suggest we should take account
of
apparent outlying observations . One approach is to use heavy-tailed distri butions , for example
t distributions
, instead of Gaussian distributions
for
the intercepts ai , gradients fJi and sampling errors Yij - lLij . Many
researchers
duced within
have shown how t distributions
can be easily intro -
a Gibbs sampling framework by representing the precision
(the inverse of the variance) of each Gaussian observation as itself being a random quantity
with a suitable gamma distribution ; see for example
Gelfand et al. (1992). However, in the BUGSprogram, a t distribution on v degrees of freedom can be specified directly as a sampling distribution , with priors being specified for its scale and location , as for the Gaussian distributions
in Section 3. Which value of v should we use? In BUGS, a prior
distribu tion can be placed over v so that the data can indicate the degree of support for a heavy- or Gaussian-tailed distribution . In the results shown below , we have assumed a discrete uniform
prior distribution
for 1/ in the
set
{
1, 1.5, 2, 2.5, . . . , 20, 21, 22, . . . , 30, 3.5, 40, 45, 50, 75, 100, 200, . . . , 500, 750, 1000 } .
We fitted the following models : GG GT
Gaussian sampling errors Yij - Jlij ; Gaussian intercepts Qi and gradients ,Bi, as in Section 3; Gaussian sampling errors ; t-distributed intercepts and gra dients ;
TG
i -distributed
samping
errors ; Gaussian intercepts
and gra -
dients ;
TT
t -distributed
samping
errors ; t -distributed
intercepts
and
gradients . ReGul ts for these models are given in Table 3 , each based on 5 000 i tera -
tions after a 1000-iteration correlations
in the parameters
burn -in . Strong auto -correlations and crossof the t distributions
were observed
, but
the
,80sequencewas quite stable in each model (results not shown). We note that the point estimate of ,8ois robust to secondary assumptions about distributional shape, although the width of the interval estimate is
592
DAVIDJ. SPIEGELHALTER ET AL. GG
Parameter /30
mean
O'fJ
- 1.05
- 1. 13
- 1.06
- 1. 11
- 1. 33 , - 0 . 80
- 1. 95 , - 0. 93
- 1. 24 , - 0 . 88
- 1 . 26 , - 0 . 99
mean
0.274 0.070, 0.698
0.033 0.004, 0.111
0 . 028 0. 007 , 0. 084
mean
95 % c.i .
3.
00
19 5 , 30
1 1, 1
mean
TABLE
2 .5 2, 3.5
;' , 20
95 % c.i .
VfJ
0 . 007 , 0 . 176
3.5
12
mean
0 .065
62 .5, 3.5
00
95 % c. i . Va
TT
95 % c . i .
95 % c. i . v
TG
GT
Results
of fitting
models
00
16 8 . 5 , 26
GG , GT , TG
and TT
to the GHIS
data :
posterior means and 95 % credible intervals ( c . i ) . Parameters 1/ , I/a and 1/13are the degrees of freedom in t distributions for sampling errors , intercepts and gradients , respectively . Degrees of freedom = 00 corresponds to a Gaussian distribution .
reduced dom
by for
errors
35 % when
both
and
error of
, due
alone
sampling and
belief
to
a heavy
butions
population
to
the
have
- tailed
at
all
levels
sampling
distributions coefficients
( va. ~
heavy
( model
TG
TT
) supports
2 . 5 ) and
11/3 ~
) leads
(v ~
a fairly
the
degrees
. Gaussian ( model
) tails ( v {3 ~
distribution
(v ~ 19 ,
( Cauchy
individuals
tails
( model
unknown
regression
outlying
heavy
of
for very
sampling
distribution
gradients
in
t distributions
and
t distributions
overwhelming gradients
allowing
for
the
sampling
GT ) leads
the
a confident
judgement
allowing
assumption
of a heavy shape
of
sampling
3 .5 ) , while
Gaussian
to
distribution
1 ) . Allowing to
of free -
for
t distri
-
- tailed
intercepts
16 ) .
5.2. INTRODUCING A COVARIATE
.~~ .-
.. ..
As noted in Section 2.2, the observed baseline log titre measurement , YiO, is correlated with subsequent titres . The obvious way to adjust for this is to replace the regression equation (6) with
_ . .~. .- ~
J1 ,ij == ai + , (YiO- Y.O) + fii(logtij - log 730),
(13)
where Y.o is the mean of the observations { YiO} . In ( 13), the covariate YiOis ' centred ' by subtracting Y.o: this will help to reduce posterior correlations between 'Y and other parameters , and consequently to improve miring in the Gibbs sampler . See Gilks and Roberts ( 1996) for further elaboration of this point .
HEPATITISB: A CASESTUDYIN MCMC As for all anti - RES titre
measurements , YiO is subject
593
to measllrernent
error . We are scientifically interested in the relationship between the ' true ' underlying log titres JLioand !lij , where ~iO is the unobserved ' trtle ' Jog titl 'e on the ith infant
at baseline . Therefore , instead of the obvious
regression
model ( 13), we should use the 'errors-in-variables' regressionmodel /1ij == ai + , (/1iO- Y.O) + ,6i(log tij - log 730). Information ment
YiO . We
( 14)
about the unknown }.LiDin ( 14) is provided by the mea,suremodel
this
with
YiOI"V N ( /-LiO, a 2).
( 15)
Note that we have assigned the same variance a2 to both Yij in (5) and YiO
in ( 15), becausewe believe that Yij and YiOare subject to the samesources of measurement(sampling) error. We must also specify a prior for /liD. We choose
lLiDr"'-I N((),
( 16)
where the hyperparameters 8 and r,p- 2 are assigned vague but proper C~aussian and gamma prior
distributions
.
Equations (14- 16) constitute a measurement-error model, as discussed further by Richardson ( 1996). The measurement-error model ( 14- 16) forms one component of our complete model , which includes equations (5) and ( 710) . The graph for the complete model is shown in Figure 7, and the results from fitting both this model and the simpler model with a fixed baseline ( 13)
instead of ( 14- 16), are shown in Table 4. Fixed
baseline
Parameter .80
0' / 3
- 1.08
- 1.32 , - 0.80
- 1.35 , - 0.81
0 . 31
95 % c. i .
the GHIS
4.
Results of fitting
data : posterior
0 . 24
0. 07 , 0. 76
0. 07 , 0. 62
0 .68
1 . 04
0.51 , 0.85
0. 76, 1.42
mean
95 % c. i . TABLE
baseline
- 1.06
mean
I
in
( 14- 16)
mean 95 % c. i .
Errors
( 13)
alternative
means
regression models to
and 95 % credible
intervals
We note the expected result : the coefficient I attached to the covari ate measured with error increases dramatically when that error is prop erly taken into account . Indeed , the 95% credible interval for / under the
594
DAVIDJ. SPIEGELHALTER ET AL.
- 1 -
- - -- - - - -
-
-
- - --
- - -
-
- - -.-
- - -- - .-
- -- -
- --
\ \ \ - - - -- \ . - - -
- ~
- -
- - -
--
-
- - -
.-
-
-
--
-
- -
--
- - --
\ \ \ \ \
I
""
\ \
~ \
I
~ ~
\ \ I
I
i I
------------------
I
-----
t ij
I I
I
I
!I
YiO
!
!I
Yij
blood -sample j
I
L - - - - --- - -- -
Figure 7.
infant i
,-
Graphical model for the GHIS data, showing dependenceon baseline titre ,
measured with error .
errors -in -baseline model does not contain the estimate for l' under the fixed baseline model . The estimate for 0' {3 from the errors -in -baseline model suggests that population variation in gradients probably does not have a major impact on the rate of loss of antibody , and results (not shown) from an analysis of a much larger subsample of the GHIS data confirm this . Setting 0' .0 == 0, with the plausible values of I == 1, f3o == - 1, gives the satisfyingly simple model :
titre at time t
1
ti tre at time 0 <X t ' which is a useful elaboration of the simpler model given by ( 1). 6. Conclusion We have provided a brief overview of the issues involved in applying MCMC to full probability modelling . In particular , we emphasize the possibility for constructing increasingly elaborate statistical models using 'local ' associations which can be expressed graphically and which allow strajghtforward
HEPATITISB: A CASESTUDYIN MCMC
595
implementation using Gibbs sampling . However , this possibility for complex modelling brings associated dangers and difficulties ; we refer the reader to other chapters in Gilks , Richardson and Spiegelhalter ( 1996) for deeper discussion of issues such as convergence monitoring and improvement , model checking and model choice. Acknowledgements
. IS
References
Best, N. G., Cowles, M . K . and Vines, S. K . (1995) CODA: ConvergenceDiagnosis and Output Analysis software for Gibbs Sampler output : Version 0.3. Cambridge : Medical Research Council Biostatistics Unit .
Breslow, N. E. and Clayton , D. G. (1993) Approximate inference in generalized linear
mixed
models
. J . Am . Statist
. Ass . , 88 , 9 - 25 .
Carlin , B. P. (1996) Hierarchical longitudinal modelling. In Markov Chain Monte Carlo in Practice (eds W . R. Gilks, S. Richardson and D. J. Spiegelhalter), pp . 303- 320. London : Chapman & Hall .
Clayton , D. G. (1996) Generalized linear mixed models. In Markov Chain Monte Carlo in Practice (eds W . R. Gilks, S. Richardson and D. J. Spiegelhalter), pp . 275- 302. London : Chapman & Hall . Coursaget , P., Yvonnet , B ., Gilks , W . R ., Wang , C . C ., Day , N . E ., Chiron , J . P.
and Diop-Mar , I . (1991) Scheduling of revaccinations against Hepatitis B virus
. Lancet
, 337 , 1180 - 3 .
Cox , D . R . and Solomon , P. J . ( 1986) Analysis of variability of small samples . Biometrika , 73 , 543- 54.
with large numbers
DuMouchel, W . and Waternaux, C. (1992) Discussionon hierarchical models for combining information and for meta-analyses (by C. N. Morris and S. L. Normand). In Bayesian Statistics 4 (eds J. M . Bernardo, J. O. Berger, A . P. Dawid and A . F. M . Smith), pp. 338- 341. Oxford: Oxford University Press. Gelfand, A . E. (1996) Model determination using sampling-based methods. In Markov Chain Monte Carlo in Practice (eds W . R . Gilks , S. Richardson and
D. J. Spiegelhalter), pp. 145- 162. London: Chapman & Hall . Gelfand, A . E., Smith , A . F. M . and Lee, T .-M . (1992) Bayesiananalysis of constrained parameter and truncated data problems using Gibbs sampling . J. Am . Statist
. Ass . , 87 , 523 - 32 .
596
DAVIDJ. SPIEGELHALTER ET AL.
Gelman, A. (1996) Inferenceand monitoringconvergence . In MarkovChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 131- 144. London: Chapman& Hall. Gelman, A. and Meng, X.-L. (1996) Model checkingand modelimprovement . In Markov ChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 189- 202. London: Chapman& Hall. Gelman, A. and Rubin, D. B. (1992) Inferencefrom iterative simulation using multiple sequences (with discussion ). Statist. Sci., 7, 457- 511. George , E. I. and McCulloch, R. E. (1996) Stochasticsearchvariableselection . In Markov ChainMonte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 203- 214. London: Chapman& Hall. Gilks, W. R. (1996) Full conditionaldistributions. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 75- 88. London: Chapman& Hall. Gilks, W. R. S. Richardsonand D. J. Spiegelhalter ). (1996) Strategiesfor improving MCMC. In Markov Chain Monte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 89- 114. London: Chapman& Hall. Gilks, W. R. and Roberts, G. O. (1996) Strategiesfor improving MCMC. In Markov ChainMante Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegelhal ter), pp. 89- 114. London: Chapman& Hall. Gilks, W. R., Richardson , S. and Spiegelhalter , D. J. (1996) Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ) , pp. 1- 20. London: Chapman & Hall. Gilks, W. R., Thomas, A. and Spiegelhalter , D. J. (1994) A languageand program for complexBayesianmodelling. TheStatistician, 43, 169- 78. Lauritzen, S. L., Dawid, A. P., Larsen, B. N. and Leimer, H.-G. (1990) Independ encepropertiesof directedMarkovfields. Networks , 20, 491- 505. Phillips, D. B. and Smith, A. F. M. (1996) Bayesianmodelcomparisonvia jump diffusions. In Markov Chain Monte Carlo in Practice(eds W. R. Gilks, S. Richardsonand D. J. Spiegelhalter ), pp. 215- 240. London: Chapman& Hall. Raftery, A. E. (1996) Hypothesistesting and modelselection . In Markov Chain Monte Carlo in Practice(edsW. R. Gilks, S. Richardsonand D. J. Spiegel halter), pp. 163- 188. London: Chapman& Hall. Richardson , S. (1996) Measurement error. In MarkovChainMonte Carlo in Practice (edsW. R. Gilks, S. Richardsonand D. J. Spiegelhalter ) , pp. 401- 418. London: Chapman& Hall. Ripley, B. D. (1987) StochasticSimulation. NewYork: Wiley. Spiegelhalter , D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. G. (1993) Bayesiananalysisin expertsystems(with discussion ) Statist. Sci., 8, 219- 83.
HEPATITIS B: A CASESTUDYIN MCMC
597
Spiegelhalter, D. J., Thomas, A . and Best, N. C; . ( 1996) Computatioll on Bayesian graphical models. In Bayesian Statistics 5, (eds J. M. Bernardo, J. O. Berger, A . P. Dawid and A . F. M . Smith ,), -pp - 407- 425. Oxford: Oxford UniversityPress. Spiegelhalter, D. J., Thomas, A ., Best, N. G. and Gilks , W . R . ( 1994) BUGS : Bayesian inference Using GibbsSampling, Version 0.30. Cambridge : Medical ResearchCouncil Biostatistics Unit . Thomas
, In
D
. C
. and
Markov
and
D
Whittaker
.
,
Whittle
J .
J .
chester
:
, H
Gauderman
Chain
( 1990
. , lnskip
.
Appendix
a
:
is
a
can
be
Figure
Lancet
.
419
)
Gibbs
Practice
- 440
Models
. , Hall
, A
.
sampling ( eds
London
in
. J . , Mendy
Hepatitis ,
W :
337
,
which
language
747
provides
for from
- variables
Applied
B - 750
, M
and
. , Downes
protection
, &
S .
Hall
genetics
.
llichardson .
Analysis
.
Chi
-
,
R
. and
Hoare
viral
, S . ( 1991
carriage
in
)
The
.
running
the
model
model described
a
syntax
Gibbs
for sampling
description in
specifying
graphical
sessions shown
equations
( i
in
l
: I
beta [ iJ
7 -
*
} covariate
yO [ i ] muO [ i ]
with measurement error - dnorm ( muO[ i ] , tau ) ; - dnorm ( theta , phi ) ;
lines
beta [ i ] alpha
[i ]
dnorm ( betaO , tau . beta ) ; dnorm ( alphaO , tau . alpha ) ;
} # prior distributions tau - dgamma ( O.O1, O. O1) ; gamma alphaO
dnorm ( O, O.OOO1) ; dnorm ( O, O. OOO1) ;
.
below ( ~1 ,
) {
+
random
in
Gilks
against
for ( j in 1 : n [ i ] ) { y [i ,j ] - dnorm ( mu [ i , j ] , tau ) ; mu [ i , j ] <- alpha [ i ] + gamma*
#
.
Multivariate
{
#
R
Chapman
7 .
for
methods .
BUGS
obtained - in
, H
program
command
errors
Graphical
against
Gambia
pp
in
.
Vaccination
BUGS
) ,
)
. J . ( 1996
Carlo
Spiegelhalter
Wiley
. C
, W
Monte
An ,
10
,
models
idea
of
the
and syntax
corresponding 14
-
16
)
to a .nd
shown
the in
DAVIDJ. SPIEGELHA .LTERET AL.
598 betaO tau . beta tau . alpha theta phi
sigma ( - l / sqrt (tau ) ; sigma .beta <- 1/ sqrt (tau .beta ) ; sigma . alpha <- 1/ sqrt (tau . alpha ) ;
} The essential correspondencebetween the syntax and the graphical representation should be clear: the relational operator "'" correspondsto 'is distributed as' and <- to 'is logically defined by'. Note that BUGS parameterizesthe Gaussian distribution in terms of mean and precision (= l / variance). The program then interprets this declarative model description and constructs an internal representation of the graph, identifying relevant prior and likelihood terms and selecting a sampling method. Further details of the program are given in Gilks et at. (1994) and Spiegelhaltel' et ai. ( 1996). The software will run under UNIX and DOS, is available for a number of computer platforms , and can be freely obtained, together with a manual and extensive examples from http : / / www.mrc- bsu . cam. ac . uk/ bugs/ , or contact the authors at bugs~mrc- bsu . cam. ac .uk.
PREDICTION WITH GAUSSIAN PROCESSES : FROMLINEARREGRESSION TOLINEARPREDICTION ANDBEYOND C. K. I. WILLIAMS Neural ComputingResearchGroup, Aston University BirminghamB4 7ET, UK Abstract . The main aim of this paper is to provide a tutorial on regression with Gaussian processes. We start from Bayesian linear regression , and show how by a change of viewpoint one can see this method as a Gaussian process predictor based on priors over functions , rather than on priors over parameters . This leads in to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical mod elling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems .
1. Introduction In the last decade neural networks have been used to tackle regression and classification problems , with some notable successes. It has also been widely recognized that they form a part of a wide variety of non-linear statistical techniques that can be used for these tasks ; other methods include , for example , decision trees and kernel methods . The books by Bishop (1995) and Ripley ( 1996) provide excellent overviews . One of the attractions of neural network models is their flexibility , i .e. their ability to model a wide variety of functions . However , this flexibil ity comes at a cost , in that a large number of parameters may need to be determined from the data , and consequently that there is a danger of "overfitting " . Overfitting can be reduced by using weight regularization , but this leads to the awkward problem of specifying how to set the regularization parameters (e.g. the parameter Q in the weight regularization term QwT w for a weight vector w .) The Bayesian approach is to specify an hierarchical model with a prior distribution over hyperparameters such as Q, then to specify the prior distri 599
600
CI K. II WILLIAMS
bution of the weights relative to the hyperparameters . This is connected to data via an "observations " model ; for example , in a regression context , the value of the dependent variable may be corrupted by Gaussian noise. Given an observed dataset , a posterior distribution over the weights and hyperpa rameters (rather than just a point estimate ) will be induced . However , for neural network models this posterior cannot usually be obtained analyti cally ; computational methods used include approximations (MacKay , 1992) or the evaluation of integrals using Monte Carlo methods (Neal , 1996) . In the Bayesian approach to neural networks , a prior on the weights of a network induces a prior over functions . An alternative method of putting a prior over functions is to use a Gaussian process (GP ) prior over functions . This idea has been used for a long time in the spatial statistics community under the name of "kriging " , although it seems to have been largely ignored as a general-purpose regression method . Gaussian process priors have the advantage over neural networks that at least the lowest level of a Bayesian hierarchical model can be treated analytically . Recent work (Williams and Rasmussen, 1996, inspired by observations in Neal , 1996) has extended the use of these priors to higher dimensional problems that have been tradition ally tackled with other techniques such as neural networks , decision trees etc and has shown that good results can be obtained . The main aim of this paper is to provide a tutorial on regression with Gaussian processes. The approach taken is to start with Bayesian linear regression , and to show how by a change of viewpoint one can see this method as a Gaussian process predictor based on priors over functions , rather than performing the computations in parameter -space. This leads in to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical modelling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems . 2. Bayesian
regression
To apply the Bayesian method to a data analysis problem , we first specify a set of probabilistic models of the data . This set may be finite , countably infinite or uncountably infinite in size. An example of the latter case is when the set of models is indexed by a vector in ~m. Let a member of this set be denoted by 1l0 , which will have a prior probability P {1lo ) . On observing some data V , the likelihood of hypothesis 1la is P {VI1la ) . The posterior probability of 1 0 is then given by posterior
<X prior x likelihood
(1)
P (1laIV )
<X P (1la )P (VI1la ) .
(2)
PREDICTIONWITH GAUSSIAN PROCESSES
601
d
Figure 1. The four possiblecurves(labelled a, b, c and d) and the two data points (shownwith + signs).
The
P
an
proportionality
(
V
)
=
can
Eo
P
integration
VI1la
we
models
individual
models
prediction
is
P
(
turned
1la
)
into
(
where
)
now
say
asked
we
the
to
are
to
(
make
)
=
L
be
dividing
may
a
prediction
through
be
by
interpreted
using
some
for
y
equality
summation
as
.
predict
prediction
P
an
the
appropriate
are
;
be
)
where
Suppose
abilistic
(
P
y
(
yl1lo
is
quantity
given
)
by
P
(
1laIV
this
y
P
(
)
.
yl1la
.
set
Under
of
each
)
.
The
prob
of
-
the
combined
(
3
)
0
In this paper we will discuss the Bayesian approach to the regression problem, i .e. the discovery of the relationship between input (or independent) variables x and the output (or dependent) variable y. In the rest of this section we illustrate the Bayesianmethod with only a finite number of hypotheses; the caseof an uncountably infinite set is treated in section'3. We are given four different curves denoted fa (x ), fb(X), fc (x ) and fd (X) which correspond to the hypotheseslabelled by 1la, 1lb, 1lc and rid ; seethe illustration in Figure 1. Each curve 1la has a prior probability P (1la ); if there is no reason a priori to prefer one curve over another, then eachprior probability is 1/ 4. The data V is given as input -output pairs (Xl , tl ), (X2, t2), . . . , (xn , tn ). Assuming that the targets ti are generated by adding independent Gaussian noise of variance u~ to the underlying function evaluated at Xi, the likelihood of model1la (for a E { a, b, c, d} ) when the data point (Xi, ti ) is
602
C. K. I. WILLIAMS
observed is
P(tilxi,1{a) = (27r0 1v)1/22 exp - (ti - fa "2 O "v2(Xi))2.
(4)
The likelihood of each hypothesis given all n data points is simply lli P (ti lXi , 1 0). Let us assume that the standard deviation of the noise is much smaller (say less than 1/ 10) than the overall y-scale of Figure 1. On observing one data point (the left -most + in Figure 1), the likelihood (or data fit ) term given by equation 4 is much higher for curves a, band c than for curve d. Thus the posterior distribution after the first data point will now have less weight on 1ld and correspondingly more on hypotheses 1la , 1lb and 1lc . A second data point is now observed (the right -most + in Figure 1) . Only curve c fits both of of these data points well , and thus the posterior will have most of its mass concentrated on this hypothesis . If we were to make predictions at a new x point , it would be curve c that had the largest contribution to this prediction . From this example it can be seen that Bayesian inference is really quite straightforward , at least in principle ; we simply evaluate the posterior prob ability of each alternative , and combine these to make predictions . These calculations are easily understood when there are a finite number of hypotheses , and they can also be carried out analytically in some cases when the number of hypotheses is infinite , as we shall see in the next section . 3 . From
linear
regression
...
Let us consider what may be called "generalized linear regression" , which we take to mean linear regression using a fixed set of m basis functions { (x ) , for some vector of "weights " w . If there is a Gaussian prior distribution on the weights and we assume Gaussian noise then there are two equivalent ways of obtaining the regression function , (i ) by performing the computations in weight -space, and (ii ) by taking a Gaussian process view . In the rest of this section we will develop these two methods and demon strate their equivalence .
3.1. THE WEIGHT-SPACEVIEW Let the weights have a prior distribution which is Gaussian and centered
on the origin, w rv N(O, ~w), i.e.
P(w)=(27r 1:Ewll {-2w IT }J;1w}. )m /21 /2exp
(5)
604 We
C. K. I. WILLIAMS
can
also
mean
is
obtain
gi
ven
O' ~
To
( x
)
to
In
a
the
JRP
( Xl
)
,
of
J - Lws
.
for
the
variance
about
the
)
( W
-
WMP
) TJC
/ > ( x
*
)
*
)
it
is
necessary
due
to
to
the
add
noise
,
(
14
)
(
15
)
(
16
)
O' ~
to
since
the
.
regression
is
and
in
the
discussed
Tiao
(
in
1973
.
problem
.
to
the
)
most
texts
.
( x
)
is
Y
( Xk
)
)
in
a
covariance
matrix
and
.elow
x
- space
be
any
consider
A
finite
only
vari
usually
.
general
-
finite
processes
by
giving
of
Gaussian
a
stochas
any
Gaussian
subset
-
be
of
specified
-
view
random
will
.
way
can
for
x
di
points
- space
of
,
deal
the
function
distributions
consistent
that
at
collection
b
probability
processes
to
values
a
of
through
possible
or
considered
the
described
also
function
dimensionality
,
is
process
Y
cases
was
It
the
stochastic
giving
stochastic
the
weights
respect
the
.
( x
Box
the
'
)
.
t
process
,
) 2 ]
variance
linear
is
is
)
WMP
uncertainty
In
)
.
.
VIEW
by
( X2
-
/ > ( x
example
This
.
( x
var
stochastic
x
function
- space
that
can
.
random
bination
the
equation
5
view
be
Y
.
Y
( x
stochastic
at
)
=
only
points
.
processes
the
In
fact
which
have
is
are
is
clear
shown
set
Wj
< pj
point
( x
)
,
is
indicates
,
the
a
I ' V
weights
and
with
input
simply
W
the
mean
kinds
functions
in
where
that
the
the
basis
x
which
W
consider
of
space
linear
N
are
covariance
com
( O
,
~
w
-
)
as
viewed
as
functions
for
,
Ew
it
we
fixed
particular
calculate
Ew
it
j
W
can
regression
a
variables
notation
We
process
since
~
a
random
( The
. )
linear
from
value
Gaussian
variables
of
generated
The
variable
of
random
functions
-
lc
over
p
weights
variables
-
to
specialize
the
fact
;
.
random
In
)
uncorrelated
the
A
Y
.
[ ( w
A
with
further
function
the
prediction
additional are
and
mean
In
the
specified
vector
shall
this
variance
where
subset
mean
in
)
,
.
( Y
a
a
.
in
is
subset
is
( x
by
in
of
c/ >T
uncertainty
process
zero
=
section
problem
vector
for
( x
- SPACE
indexed
we
) Ewlv
interested
the
are
.
distribution
are
tic
( x
variation
previous
ables
[ ( y
approach
with
of
c/ >T
for
probability
we
=
FUNCTION
rectly
"
Ewlv
statistics
THE
bars
=
Bayesian
Bayesian
.
error
predictive
of
The
. 2
)
account
sources
on
.
the
*
two
3
( x
obtain
O' ~
"
by
[ Y
( x
[ Y
( x
) Y
derived
( x
as
that
Y
in
is
Figure
) ]
=
0
' ) ]
=
ct > T
a
a
linear
)
wct
> ( x
)
,
using
' )
of
process
( b
~
combination
Gaussian
2
( x
.
the
Some
basis
.
Gaussian
examples
functions
(
17
)
(
18
)
random
of
shown
sample
in
PREDICTION WITHGAUSSIAN PROCESSES
605
(b )
(a)
Figure 2. (a) shows four basis functions , and (b) shows three sample functions generated by taking different linear combinations
of these basis functions .
Figure 2(a). The sample functions are obtained by drawing samplesof W from N (O, ~ w). 3 .3 .
PREDICTION
USING
GAUSSIAN
PROCESSES
We will first consider the general problem of prediction using a Gaussian process, and then focus on the particular Gaussian process derived from linear regression . The key observation is that rather than deal with com plicated entities such as priors on function spaces , we can consider just the function values at the x points that concern us , namely the training points Xl , . . . , Xn and the test point x * whose y -value we wish to predict . Thus we need only consider finite - dimensional objects , namely covariance matrices .
Consider n + 1 random variables (Zl , Z2, . . . , Zn , Z * ) which have a joint Gaussian
distribution
with
mean
0 and
covariance
matrix
K + . Let
the
(n + 1) x (n + 1) matrix K + be partitioned into a n x n matrix K , a n x 1 vector
k and
a scalar
k*
K+=( :r ~)
(19 )
If particular values are observed for the first n variables , i .e. Zl = Zl , Z2 = Z2, . . . , Zn = Zn, then the conditional distribution for Z * is Gaussian (see, e.g. von Mises , 1964, section 9.3) with
E [Z*] = var[Z*] =
kTK - lz k* - kTK - lk
(20) (21)
where zT = (Zl , Z2, . . . , Zn). Notice that the predicted mean (equation 20) is a linear
combination
of the z 's .
C. K. I. WILLIAMS
606
3.4. LINEAR REGRESSIONUSING THE FUNCTION-SPACEVIEW Returning to the linear regressioncase, we are interested in calculating the distribution of Y (x *) = Y. , given the noisy observations (tl ' . . . ' in ). To do this we calculate the joint distribution of (Tl , T2, . . . , Tn, Y*) and then condition on the specific values Tl = tl , T2 = t2, . . . , Tn = tn to obtain the desired distribution P (Y. It ). Of coursethis distribution will be a Gaussian so we need only compute its mean and variance. The distribution for (TI , T2, . . . , Tn, Y. ) is most easily derived by first considering the joint distribution of Y + = (YI , Y2, . . . , Yn, Y. ). Under the linear regression model this is given by Y + ~ N (O, +Ew~ ), where <1>+ is an extended matrix with an additional bottom row c!>. = c!>(x . ) = PI (x . ), m(x . ) ). Under the partitioning used in equation 19, the structure of + ~ w~ can be written
~wct >* ct>; EwT ct >; ~wct >*
< I>+Ew < I>~=( < I>Ew < I>T The T
joint ' s
are
x
n
and n
distribution
hence
that
matrix
( The
a
reason
,
corrupting
( Tl
, . . . Tn
~ In
Tl
var
where
we
basis
functions
it
the will
the
a P
By
,
then
=
tl
N
a
~ In
, T2
=
t2
be
( O , +
the
not
can
found
corresponding
r....I
~
right
+
l
is
w
~
and that =
by Y
' s
+
E
realizing
with
+
tn
E
+
bands
predicting
and
the noise
where
by
are
that
Gaussian
) ,
bottom
we
, . . . , Tn
=
J. L / s ( x
[Y * ]
=
c / >;
P
)
is
be
T
multiplying can
with
= = T
+
through be
substituted
zero
c/ >;
+
f3 T Ew
A into
w
~
T
w
is
of Y * ,
using
so .
u ~ In
) . of
that
,
the
zeros
not
equations
T 20
. * . )
and
-
l <1> ~
wc
/ >*
that
if
the
( n
) ,
points will
be
rank
probability ,
the
addition
( 23
)
( 24
)
number as
of
will
often
- deficient of
points
of
O' ~ In
,
i .e .
lying ensures
. these
results
obtained 13
equation
p
- l 23
we
for in
we
= = ( 3 T ( O' ~ I
and
lt
Note
the
expressions
equation
p
T
in
- I
-
data
However
that
T
p
T
Ew
definite show
as
~
number
, be
the
by
-
matrix
to
mean
cJ >~
T
the
will
now
=
/ >*
I > Ew
than
positive
the
wc
eigenvalues
strictly is
for
~
* )
covariance
zero
consistent
formula
=
less
subspace
challenge are
[Y * ]
the
some linear
will
AEw
which
, Y * ) on
is
defined ( m
have
The ance
have
case
outside that
+
, Y *
the
(22)
obtain
E
be
, . . . Tn
bordered E
on
we
Tl
by
that
Conditioning 21
for
obtained
).
note
+
obtain to
yield
Ew
the
mean
section
and
3 . 1 .
To
vari
-
obtain
that
T )
=
~ w T p the
desired
{ 3 T P .
- l
=
, aA
result
( 25
)
- l T , .
PREDICTIONWITH GAUSSIAN PROCESSES The by
equivalence
using
1992
,
the
of
section
A
- I
=
+
YZ
( ~ ~ I
+
may
be
Given it
is
the
that
.
In
one
less
time
weight
tenable
4 .
. . .
As
we
just
a
linear
m
a
zero
- mean
covariance definite
on
One
an
lIn tions functions
be
isotropic
fact
Bochner
C ( h ) which .
x
) - IZX
to
give
et
- 1
equation
- space
is
m
n
x
al ,
( 26
)
( 27
)
P
thus
and
,
the
- space
inversion
of
which
takes
m
n
of
function
invert
function
method
kinds
the
form
. As
the
other
equivalent more
the
problems
for
and
to
choose
regression
are
for
matrix
.
computationally
necessary
to
24
views is
. Similarly n
. But
section
3 .3 ,
Gaussian of and
how opens
is
x ' .
In
in class
=
C
( lx
and
linear
so
a
the
prediction
- space
derived
is
as
the
probability
' s theorem are
the
x ' l)
continuous
as
view
is
of
E
points
or
a
the
y - value said
as
a
linear the
.
-
(x , x ' )
of
( x ' ) ] . Formally a
non
, . . . , Xk
condition
In
function
covariance C
generate , X2
be
regression
function (x ) Y
to
predictor
from
specifying
[Y
Gaussian
is
linear
seen ,
( Xl
this
general predicted
method
be
will
obey
a
, the - negative
) .
It
, although
is
non
-
several
.
C
and
( h ) , where
characteristic .
( see , e . g . Wong 0
and
isotropic
For
, 1971
satisfy
covariance
1 . 1 denotes function
densityl
at
of
that
stationary =
the )
a
the
covariance
defined
set
;
1990
way
function
that
is
can
the
any
prior , then
possibilities one
literature
-
,
further
( x ) is
functions the
t - values
just
any for
with
the
general Y
be
the
model
regression
up
weights
and
if
noise
Tibshirani
linear
matrix up
(x , x ' )
- Iy
methods
natural
linear
in a
process
- known
can
proved
. . .
can
well
These
infinite
function
known
C
be
Gaussian
are
where
can
the x
come
families
preferred
This
covariance to
be
seen .
points
trivial
simple
will
( Hastie
have
prior
between
be , Press
~ wcI >T ) - IcI > ~ w ,
it
the is
combination
viewpoint
with
can
formula
ZX
function two
m
invert
seen
smoother
space
and
( l3 ) , it
assume
linear
we
+
approach
for
already
3
+
16
the
dimension
0
(I
equation
of
and
- IY
~ w T ( O' ~ I
prediction
we
some
section
variance
.
linear
and
is
the
Woodbury
obtain
- space
time
one
have
-
X
- space
of
form
4 )
to
process
for ( the
-
to
which
is
view
- I
- 1
weight ask
. Usually
only
X
~ w
weight
to
- space section
=
into
which
takes
( see
of
the
has
1 matrix
=
the
A
1 x
- I
to
matrix
view
expressions identity
substituted
interesting
efficient
) - I
{ 3 <1>T
A
which
two
matrix
2 .7 )
( X
on
the
following
607
( or
example
) states 0 (0 ) =
the
C
that 1 are
functions
Euclidean
Fourier
( h )
the exactly
norm transform
=
exp
positive the
( -
. )
( h / u ) V )
definite characteristic
func
-
C. K. I. WILLIAMS
608 is
a
valid
covariance
corresponding v
=
1
and
length
v
=
- scale
those
2
of
, =
paths
function
mean
- square
20
and
up
of
to
21
;
to
come
from
or
a
as
Wo
)
=
e -
some
<
v
~
correlation
functions
may
2 ,
when
the
have
( e .g .
no
preferred
splines
widely
are
Cressie
splines power
lhl
.
On
=
)
/J( x
that
)
with
sample
the
that
,
very other
- line
noted
I - d
has the
wTc
straight
paths
a
the
, ;
is
is
covari
are
infinitely
use
equations
-
is
not
on
her )
Gaussian
.
a
processes
"
with
,
-
and
although
of
spline
Kimeldorf
and
overview
a
) .
- dimensional use
to
useful
; and
pro
( J ournel
three
back
' s ( 1963
" kriging
and
.
topic
1940
Gaussian
the
dates
provides
recent the
field
-
below
Whittle
promoting
work
5 .2
very
in
as
assumed
estimation
models
two
in
is
section
a
made
covariance
but
in
known
mostly
( 1990
in
geostatistics
it
the
)
discussed
the
If
data
be
likelihood
process
where
;
.
practice
Kolmogorov are
the
will
term
in
discussed
Gaussian
problems
simply
certainly
influential
Wahba
noise
and
focussed
characterize
maximum
this
in )
to
; we
function
case
then
Wiener
been
to
points
and
are
has
prediction in
.
A
- law
WM
the in
covariance function
a
be
in
e -
.
particular
Essen
choice
of
3 .
- free
3Technically
y
to
covariance
taken
1993
has
although
stationary
covariance
assumed
test
be
known ,
although
noise
is
processes
well
used ) ,
sample
regression
process
1989
example
from
leads
also
overall
be
series
also
function
Gaussian
the
to
correspond
covariance
arises )
to
new
class
regression ) ,
rise
properties
For
differentiable
should
always
back
naturally
( 1970
different .
function
( which
which
can
Wahba
for
It
for
will
time
;
'
gives
multivariate
1978
.
techniques
have
h2
Gaussian
is
spaces
2For
sets
very
- square
function
goes
literature
tions
)
have
( 0 ' 5 , O' r ) ) .
parametric
for
,
is
0'
covariance
covariance
O' rxx
WIX
3 .4
( as
to
,
0
.
with
Huijbregts
+
( O , diag
covariance
prediction
aI
case
function
mean
function
theory
can
not
+
section
models
and
for distributions
densities2
( with
0' 5 N
=
unknown
applications
tially
= "' -'
approach
basic
Wahba
this
covariance
predictions
in
Prediction
input
' )
y
make
Bayesian
this
, x w
( h
prior
is
ARMA
and
Gaussian
other
spectral
the
are
covariance
easy
the
cess
( x
and
C
function
the
in
although
process
form
a
is
,
p
and
processes
differentiable
Given
, it
field
of
which
C ) T
the
ance
dimensions
that
- law
choice
paths
( 1 , x of
Note
power
- Uhlenbeck
choosing
)
.
Gaussian the
sample
l/J( x
they
random
from
Ornstein
hand
the
respectively
the
on
rough
all Cauchy
.
Samples
the
for
multivariate
to
scale
depending
et
the
corresponding
length
V
function
to
analysis this
also of
suggested computer
application
connection
to
functions
it
neural
the
by
spectral
O
' Hagan
experiments is
assumed
( e .g that
networks
was
density
is
the
( 1978
the
made
Fourier
) ,
Sacks
observa
by
-
Poggio
transform
of
. require spectral
generalized density
covariance S ( c,,; )
cx: c,,; - { 3 with
functions .B >
( see O .
Cressie
5 .4 ) ,
and
WITHGAUSSIAN PROCESSES PREDICTION
609
and Girosi (1990) and Girosi, Jones and Poggio (1995) with their work on Regularization Networks. When the covariancefunction C(x , x ' ) depends only on h == Ix - xii , the predictor derived in equation 20 has the form ~ i ciC (lx - xii ) and may be called a radial basisfunction (or RBF ) network. 4.1. COVARIANCE FUNCTIONS AND EIGENFUNCTIONS It turns out that general Gaussian processes can be viewed as Bayesian linear regression with an infinite number of basis functions . One possible basis set is the eigenfunctions of the covariance function . A function
C(x , x')l/J(x)dx = Al/J(X')
(28)
is called an eigenfunction of C with eigenvalue A. In general there are an infinite number of eigenfunctions, which we label
00 C(x, X') ==L Ai< i=l />i(X)>i(X').
(29)
-
This decomposition is just the infinite -dimensional analogue of the diagonalization of a real symmetric matrix . Note that if n is JRP , then the summation in equation 29 can become an integral . This occurs , for example , in the spectral representation of stationary covariance functions . However , it can happen that the spectrum is discrete even if n is IRPas long as C (x , x ' ) decays fast enough . The equivalence with Bayesian linear regression can now be seen by tak ing the prior weight matrix ~ w to be the diagonal matrix A == diag (Al , A2, . . .) and choosing the eigenfunctions as the basis functions ; equation 29 and the equivalence of the weight -space and function -space views demonstrated in section 3 completes the proof . The fact that an input vector can be expanded into an infinite -dimensional space PI (x ) ,
610
C. K. I. WILLIAMS
to Gaussian noise) , a modified version of the 11error metric Iti - Yi I is used, called the e-insensitive loss function . Finding the maximum a posteriori (or MAP ) y-values for the training points and test point can now be achieved using quadratic programming (see Vapnik , 1995 for details ). 5 . . . . and beyond
...
In this section some further details are given on the topics of modelling issues, adaptation of the covariance function , computational issues, the covariance function for neural networks and classification with Gaussian processes.
5.1. MODELLING ISSUES As we have seen above, there is a wide variety of covariance functions that can be used. The use of stationary covariance functions is appealing as usually one would like the predictions to be invariant under shifts of the origin in input space. From a modelling point of view we wish to specify a covariance function so that nearby inputs will give rise to similar predictions . Experiments in Williams and Rasmussen (1996) and Rasmussen (1996) have demonstrated that the following covariance function seems to work well in practice :
C(X(i),X(j))
Voexp{- ~ tl=1Ql(X~i) - x~j))2} p +ao+ a1~ X~i)X~j) + v1r5 (i ,j ), l=1
(30)
where Od,;} (log vo, log VI , log aI , . . . , log ap , log ao, log al ) is the vector of adjustable parameters . The parameters are defined to be the log of the vari ables in equation (30) since they are positive scale-parameters . The covariance function is made up of three parts ; the first term , a linear regression term (involving ao and al ) and a noise term VI8(i , j ). The first term expresses the idea that cases with nearby inputs will have highly correlated outputs ; the a [ parameters allow a different distance measure for each input dimension . For irrelevant inputs , the corresponding al will become small , and the model will ignore that input . This is closely related to the Automatic Relevance Determination (ARD ) idea of MacKay and Neal (MacKay , 1993; Neal 1996) . The Vo variable gives the overall scale of the local correlations , ao and al are variables controlling the scale of the bias and linear contributions to the covariance . A simple extension of the linear regression part of the covariance function would allow a different
5.2. ADAPTATION OF THE COVARIANCE FUNCTION

It is also possible to express the partial derivatives of the log likelihood l with respect to the parameters analytically,

∂l/∂θ = -1/2 tr(K^{-1} ∂K/∂θ) + 1/2 t^T K^{-1} (∂K/∂θ) K^{-1} t,   (32)

which makes it straightforward to feed l and its partial derivatives to a standard optimization package in order to obtain a (local) maximum of the likelihood; each evaluation of the likelihood and its derivatives takes time O(n^3). The maximum likelihood estimation of covariance parameters in spatial regression is discussed, for example, by Mardia and Marshall (1984) and Handcock and Stein (1993). An alternative to maximum likelihood is to estimate the parameters by cross-validation (CV) or generalized cross-validation (GCV), as discussed in Wahba (1990). However, these methods may not work well when a large number of parameters are involved, the estimates may be poorly determined if the amount of data is small relative to the number of parameters, and the likelihood surface may have local maxima.

The Bayesian approach is attractive for these reasons. It proceeds by defining a prior distribution over the parameters θ and, given the data D, making predictions for a new test point x_* by averaging over the posterior distribution P(θ|D), i.e.

P(y_*|D) = ∫ P(y_*|θ, D) P(θ|D) dθ.   (33)

In general the integral in equation 33 cannot be evaluated analytically. If θ is low-dimensional then numerical integration techniques based on gridding θ-space (or importance sampling) may be used, but when the number of parameters is large this is not feasible, and the integral is usually approximated using samples from P(θ|D) obtained by Markov chain Monte Carlo (MCMC) methods. These work by constructing a Markov chain whose equilibrium distribution is the desired posterior P(θ|D); once the chain has reached equilibrium, the samples it generates are used in place of the integral in equation 33. Two standard MCMC techniques are the Metropolis-Hastings algorithm and Gibbs sampling (see, e.g., Gelman et al., 1995). Gibbs sampling is not attractive here because the conditional distributions of the parameters do not have a standard form, and the Metropolis-Hastings algorithm does not utilize derivative information and tends to display random-walk behaviour, which makes it inefficient. Following the work of Neal (1996) on the Bayesian treatment of neural networks, Williams and Rasmussen (1996) and Rasmussen (1996) have used the Hybrid Monte Carlo method of Duane et al. (1987) to obtain samples from P(θ|D). Rasmussen (1996) carried out a careful comparison of the Bayesian treatment of Gaussian process regression with several other state-of-the-art methods on a number of problems and found that its performance is comparable to that of Bayesian neural networks as developed by Neal (1996), and consistently better than the other methods.
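A minimal sketch of how equation 32 can be used in practice is given below. It is not code from the chapter: the covariance K_ij = v0 exp(-0.5 w |x_i - x_j|^2) + v1 δ_ij and its three log-parameters are illustrative assumptions, chosen only to show the pattern of handing the log likelihood and its gradient to an off-the-shelf optimizer. Working with log-parameters keeps the scale parameters positive, as in equation 30.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(log_params, X, t):
    """Negative GP log likelihood, l = -1/2 log|K| - 1/2 t^T K^{-1} t - n/2 log(2 pi),
    and its gradient via equation (32), for the illustrative covariance
    K = v0 exp(-0.5 w |x_i - x_j|^2) + v1 I with parameters (log v0, log w, log v1)."""
    v0, w, v1 = np.exp(log_params)
    n = len(t)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    E = np.exp(-0.5 * w * sq)
    K = v0 * E + v1 * np.eye(n)
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ t
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * logdet + 0.5 * t @ alpha + 0.5 * n * np.log(2 * np.pi)
    # dK/d(log theta) = theta * dK/dtheta for the log parameterization
    dK = [v0 * E, v0 * E * (-0.5 * sq) * w, v1 * np.eye(n)]
    grad = np.array([0.5 * np.trace(Kinv @ D) - 0.5 * alpha @ D @ alpha for D in dK])
    return nll, grad

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (30, 1))
t = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_likelihood_and_grad, x0=np.zeros(3), args=(X, t), jac=True)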
5.3. COMPUTATIONAL ISSUES
Equations 20 and 21 require the inversion of an n x n matrix. When n is of the order of a few hundred this is quite feasible with modern computers. However, once n ~ O(1000) these computations can become quite time consuming, especially if the calculation must be carried out many times in an iterative scheme, as discussed above in section 5.2. It is therefore of interest to consider approximate methods.
One possible approach is to approximate the matrix inversion step needed for prediction, i.e. the computation of K^{-1}z in equation 20. Gibbs and MacKay (1997a) have used the conjugate gradients (CG) algorithm for this task, based on the work of Skilling (1993). The algorithm iteratively computes an approximation to K^{-1}z; if it is allowed to run for n iterations it takes time O(n^3) and computes the exact solution to the linear system, but by stopping the algorithm after k < n iterations an approximate solution is obtained. Note also that when adjusting the parameters of the covariance matrix, the solution of the linear system that was obtained with the old parameter values will often be a good starting point for the new CG iteration.
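As a rough illustration of this idea (not code from Gibbs and MacKay), one can hand the linear system K a = t to an off-the-shelf conjugate gradient solver and cap the number of iterations; the covariance used here is just a placeholder squared exponential plus noise.

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.standard_normal((n, p))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Placeholder covariance matrix: squared exponential plus noise.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 0.01 * np.eye(n)

# Approximate K^{-1} t by stopping CG after a fixed number of iterations;
# x0 could be set to the solution obtained with the previous parameter values.
a_approx, info = cg(K, t, maxiter=50)
a_exact = np.linalg.solve(K, t)
print(np.linalg.norm(a_approx - a_exact) / np.linalg.norm(a_exact))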
When adjusting the parameters one also needs to be able to calculate quantities such as tr(K^{-1} ∂K/∂θ_i). For large matrices this computation can be approximated using the "randomized trace method". Observe that if d ~ N(0, I_n), then E[d^T M d] = tr M, and thus the trace of a matrix M may be estimated by averaging d^T M d over several draws of d. This method has been used in the splines literature by Hutchinson (1989) and Girard (1989), and also by Gibbs and MacKay (1997a) following the independent work of Skilling (1993). Similar methods can be brought to bear on the calculation of log det K.
An alternative approximation scheme for Gaussian processes is to project the covariance kernel onto a finite number of basis functions, i.e. to find the Σ_w in equation 5 which leads to the best approximation of the covariance function. This method has been discussed by Silverman (1985), Wahba (1990), Zhu and Rohwer (1996) and Hastie (1996).
However, if we can also choose which m eigenfunctions to use in a truncated expansion, then equation 29 suggests using those with the largest eigenvalues; this makes sense when the eigenvalues λ_i decay to zero fast enough, and it is an infinite-dimensional analogue of principal components analysis, as discussed by Zhu et al. (1997). Other methods speed up the computations by ignoring data points which are far away from the test point, thereby producing a smaller matrix to be inverted.

5.4. THE COVARIANCE FUNCTION OF NEURAL NETWORKS

My own interest in Gaussian processes was sparked by the observation of Radford Neal (Neal, 1996) that, under certain weight priors, the prior over functions produced by a Bayesian treatment of a neural network with a single hidden layer tends to a Gaussian process as the number of hidden units tends to infinity. In the remainder of this section this result is outlined and some of its consequences are noted.

Figure 4. The architecture of a network with a single hidden layer.

Consider a network which takes an input x, has one hidden layer with H units and then linearly combines the outputs of the hidden units with a bias b to obtain f(x). The mapping can be written

f(x) = b + Σ_{j=1}^{H} v_j h(x; u_j),   (34)

where h(x; u) is the hidden unit transfer function (which we shall assume is bounded) and u_j denotes the input-to-hidden weights for hidden unit j. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let b and the v_j's have independent zero-mean distributions of variance σ_b^2 and σ_v^2 respectively, and let the weights u_j for each hidden unit be independently and identically distributed. Denoting all of the weights collectively by w, we obtain (following Neal, 1996)

E_w[f(x)] = 0,   (35)
E_w[f(x) f(x')] = σ_b^2 + Σ_j σ_v^2 E_u[h_j(x; u) h_j(x'; u)]   (36)
               = σ_b^2 + H σ_v^2 E_u[h(x; u) h(x'; u)],   (37)

where equation 37 follows because all of the hidden units are identically distributed. The final term in equation 37 becomes ω^2 E_u[h(x; u) h(x'; u)] by letting σ_v^2 scale as ω^2/H.

The sum in equation 34 is then a sum over H independent, identically distributed and bounded random variables, so as H tends to infinity the Central Limit Theorem applies and the joint distribution of f at any finite set of input points converges to a Gaussian; that is, the prior over functions converges to a Gaussian process with covariance function given by equation 37. For certain weight priors and transfer functions the expectation E_u[h(x; u) h(x'; u)], and hence the covariance function of the limiting Gaussian process, can be calculated analytically; Williams (1997a, 1997b) gives explicit expressions for Gaussian weight priors with (i) a sigmoidal transfer function of the error-function type and (ii) Gaussian transfer functions.

A Bayesian treatment of a finite neural network requires integration over both the weights and the (hyper)parameters. The posterior over the weights is a complicated distribution in a high-dimensional weight space, and in general these integrals can only be tackled with MCMC
methods. With GPs, in effect the integration over the weights can be done exactly (using equations 20 and 21), so that only the integration over the parameters remains. This should lead to improved computational efficiency for GP predictions over neural networks, particularly for problems where n is not too large.
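The convergence described in section 5.4 is easy to check numerically. The sketch below is not from the chapter: the Gaussian weight priors, the erf transfer function and all variances are illustrative choices. It compares the empirical covariance of many wide random networks at two inputs with the limit σ_b^2 + ω^2 E_u[h(x; u) h(x'; u)], the latter estimated by Monte Carlo.

import numpy as np
from scipy.special import erf

rng = np.random.default_rng(4)
H, n_nets = 1000, 2000                 # hidden units per network, number of random networks
sigma_b2, omega2, sigma_u2 = 1.0, 1.0, 10.0
x = np.array([0.3, -0.5])
xp = np.array([0.1, 0.4])

# Draw many independent single-hidden-layer networks, equation (34),
# with v_j ~ N(0, omega^2 / H) so that H * sigma_v^2 = omega^2.
U = rng.normal(0.0, np.sqrt(sigma_u2), size=(n_nets, H, 2))   # input-to-hidden weights
V = rng.normal(0.0, np.sqrt(omega2 / H), size=(n_nets, H))    # hidden-to-output weights
b = rng.normal(0.0, np.sqrt(sigma_b2), size=n_nets)           # output biases
f_x = b + (V * erf(U @ x)).sum(axis=1)
f_xp = b + (V * erf(U @ xp)).sum(axis=1)
empirical_cov = np.mean(f_x * f_xp)

# Limiting covariance sigma_b^2 + omega^2 * E_u[h(x;u) h(x';u)], estimated by Monte Carlo.
u = rng.normal(0.0, np.sqrt(sigma_u2), size=(200000, 2))
limit_cov = sigma_b2 + omega2 * np.mean(erf(u @ x) * erf(u @ xp))

print(empirical_cov, limit_cov)   # these should agree closely for large H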
5.5. CLASSIFICATION PROBLEMS

Given an input x, the aim of a classifier is to produce an estimate of the posterior probabilities for each class, P(k|x), where k = 1, ..., C indexes the C classes. Naturally we require that 0 ≤ P(k|x) ≤ 1 for all k and that Σ_k P(k|x) = 1. A naive application of the regression method for Gaussian processes using, say, targets of 1 when an example of class k is observed and 0 otherwise will not obey these constraints. For the two-class classification problem it is only necessary to represent
P(1|x), since P(2|x) = 1 - P(1|x). An easy way to ensure that the estimate π(x) of P(1|x) lies in [0, 1] is to obtain it by passing an unbounded value y(x) through the logistic transfer function σ(z) = 1/(1 + e^{-z}), so that π(x) = σ(y(x)). The input y(x) to the logistic function will be called the activation. In the simplest method of this kind, logistic regression, the activation is simply computed as a linear combination of the inputs plus a bias, i.e. y(x) = w^T x + b. Using a Gaussian process or other flexible methods allows y(x) to be a non-linear function of the inputs. An early reference to this approach is the work of Silverman (1978). For the classification problem with more than two classes, a simple extension of this idea using the "softmax" function (Bridle, 1990) gives the predicted probability for class k as
π(k|x) = exp y_k(x) / Σ_m exp y_m(x).   (38)

For the rest of this section we shall concentrate on the two-class problem; extension of the methods to the multi-class case is relatively straightforward.
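A small numerical illustration of equation 38 (not from the chapter), which also shows that for two classes the softmax reduces to the logistic function applied to the difference of the activations:

import numpy as np

def softmax(y):
    """pi(k|x) = exp(y_k) / sum_m exp(y_m), computed stably."""
    e = np.exp(y - y.max())
    return e / e.sum()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.7, -0.4])        # activations y_1(x), y_2(x)
print(softmax(y))                # two class probabilities
print(logistic(y[0] - y[1]))     # equals softmax(y)[0]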
Defining a Gaussian process prior over the activation y(x) automatically induces a prior over π(x), as illustrated in Figure 5. To make predictions for a test input x_* when using fixed parameters θ in the GP we would like to compute π̄_* = ∫ π_* P(π_*|t, θ) dπ_*, which requires us to find P(π_*|t, θ) = P(π(x_*)|t, θ) for the new input x_*. This can be done by finding the distribution P(y_*|t, θ) (y_* is the activation of π_*) as given by
P(y_*|t, θ) = ∫ P(y_*, y|t, θ) dy = (1/P(t|θ)) ∫ P(y_*, y|θ) P(t|y) dy   (39)
Figure 5. π(x) is obtained from y(x) by "squashing" it through the sigmoid function σ.
and then using the appropriate Jacobian to transform the distribution. When P(t|y) is Gaussian the integral in equation 39 can be computed exactly, giving equations 20 and 21. However, for classification data (where the t's take on values of 0 or 1) the usual expression P(t|y) = Π_i π_i^{t_i} (1 - π_i)^{1-t_i} means that the marginalization to obtain P(y_*|t, θ) is no longer analytically tractable.
Faced with this problem there are two routes that we can follow: (i) to use an analytic approximation to the integral in equation 39, or (ii) to use Monte Carlo methods, specifically MCMC methods, to approximate it. These two methods will be considered in turn.
The first analytic method we shall consider is Laplace's approximation, where the integrand P(y_*, y|t, θ) is approximated by a Gaussian distribution centered at a maximum of this function with respect to y_*, y, with an inverse covariance matrix given by -∇∇ log P(y_*, y|t, θ). Finding a maximum can be carried out using the Newton-Raphson (or Fisher scoring) iterative method on y, which then allows the approximate distribution of y_* to be calculated. This is also the method used to calculate the maximum a posteriori estimate of y_*. For more details see Green and Silverman (1994), section 5.3, and Barber and Williams (1997).
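A compact sketch of the mode-finding step just described is given below. It is illustrative rather than the chapter's algorithm: it assumes a logistic likelihood with 0/1 targets t and a covariance matrix K over the training activations y, and iterates the Newton-Raphson update for the (unnormalized) log posterior log P(t|y) - 1/2 y^T K^{-1} y. The conditioning step that then gives the approximate Gaussian over y_* is omitted.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, t, num_iter=20):
    """Newton-Raphson search for the mode of P(y | t, theta) under a GP prior
    with covariance K over the activations and a logistic (Bernoulli) likelihood.
    Returns the mode y_hat and the diagonal weight matrix W of the Laplace fit."""
    n = len(t)
    Kinv = np.linalg.inv(K)
    y = np.zeros(n)
    for _ in range(num_iter):
        pi = logistic(y)
        W = np.diag(pi * (1.0 - pi))
        grad = (t - pi) - Kinv @ y               # gradient of the log posterior
        y = y + np.linalg.solve(Kinv + W, grad)  # Newton step; Hessian is -(Kinv + W)
    return y, W

# Toy example: 1-d inputs, squared exponential covariance (an illustrative choice).
rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(40)
t = (x + 0.3 * rng.standard_normal(40) > 0).astype(float)
y_hat, W = laplace_mode(K, t)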
An alternative analytic approximation is due to Gibbs and MacKay (1997b). Instead of using the Laplace approximation they use variational methods to find approximating Gaussian distributions that bound the marginal likelihood P(t|θ) above and below, and then use these approximate distributions to predict P(y_*|t, θ) and thus π̄(x_*).

For the analytic approximation methods there is also the question of what to do about the parameters θ. Maximum likelihood and GCV approaches can again be used as in the regression case (e.g. O'Sullivan et al., 1986). Barber and Williams (1997) used an approximate Bayesian scheme based on the Hybrid Monte Carlo method, whereby the marginal likelihood P(t|θ) (which is not available analytically) is replaced by the Laplace approximation of this quantity. Gibbs and MacKay (1997b) estimated θ by maximizing their lower bound on P(t|θ).
Recently Neal (1997) has developed an MCMC method for Gaussian process classification in which samples are drawn from the joint posterior P(y, θ|D) over the latent activations and the parameters. For fixed θ, each of the individual y_i's is updated in turn using Gibbs sampling; a sweep through all n of the y_i's can be computed in time O(n^2) once K^{-1} is available (which takes O(n^3) time), so it makes sense to perform a number of Gibbs sweeps for each update of the parameters θ. Neal (1997) has also used this machinery for a robust regression model in which the noise is t-distributed rather than Gaussian ("robust" in the sense of being less sensitive to outliers), and a related hierarchical model, in which the noise variance is itself modelled as a function of the input by a second Gaussian process, has been used for regression with input-dependent noise (Goldberg et al., 1997).

6. Discussion

In this paper I have shown how Gaussian processes can be used for regression and classification problems, and how they are related to neural networks. There are a number of ways in which this work can be taken further. Firstly, the assumption of a stationary covariance function means, roughly speaking, that the same length-scales are assumed to apply everywhere in the input space, which may be too strong an assumption for some problems. One way to relax it is to warp the original input space x into another space with co-ordinates ξ_i(x), i = 1, ..., p, in which a stationary covariance function is then used, as proposed by Sampson and Guttorp (1992).4 Secondly, further hierarchical elaborations of the basic model are possible; Gaussian processes are themselves a very simple kind of prior over functions.

4 It may be desirable to impose the condition that the x to ξ mapping should be bijective.
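A minimal sketch of the input-warping idea mentioned above is given below; the particular warp ξ(x) and base covariance are arbitrary illustrative choices, not anything prescribed in the text.

import numpy as np

def warped_covariance(X1, X2, warp, base_cov):
    """Nonstationary covariance built by warping the inputs before a
    stationary base covariance: C(x, x') = base_cov(xi(x), xi(x'))."""
    return base_cov(warp(X1), warp(X2))

def squared_exponential(Z1, Z2):
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

# Illustrative warp: stretch the first co-ordinate more strongly far from the
# origin, so that the effective length-scale varies across the input space.
warp = lambda X: np.column_stack([np.sinh(X[:, 0]), X[:, 1]])

X = np.random.default_rng(6).standard_normal((5, 2))
K = warped_covariance(X, X, warp, squared_exponential)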
It is also interesting to consider the differences between finite neural network and Gaussian process priors. One difference is that in functions generated from finite neural networks the effects of individual basis functions can be seen. For example, with sigmoidal units a steep step may be observed where one basis function (with large weights) comes into play. A judgement about whether this type of behaviour is appropriate or not should depend on prior beliefs about the problem at hand. Of course it is also possible to compare finite neural networks and Gaussian process predictions empirically. This has been done by Rasmussen (1996), where GP predictions (using MCMC for the parameters) were compared to those from Neal's MCMC Bayesian neural networks. These results show that for a number of problems the predictions of GPs and Bayesian neural networks are similar, and that both methods outperform several other widely used regression techniques.

ACKNOWLEDGEMENTS
I thank David Barber, Chris Bishop, David MacKay, Radford Neal, Manfred Opper, Carl Rasmussen, Richard Rohwer, Francesco Vivarelli and Huaiyu Zhu for helpful discussions about Gaussian processes over the last few years, and David Barber, Chris Bishop and David MacKay for comments on the manuscript.
References
Aizerman, M. A., E. M. Braverman, and L. I. Rozoner (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821-837.
Barber, D. and C. K. I. Williams (1997). Gaussian processes for Bayesian classification via Hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9. MIT Press.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Reading, Mass.: Addison-Wesley.
Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fougelman-Soulie and J. Herault (Eds.), NATO ASI Series on Systems and Computer Science. Springer-Verlag.
Cressie, N. A. C. (1993). Statistics for Spatial Data. New York: Wiley.
Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195, 216-222.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data Analysis. London: Chapman and Hall.
Gibbs, M. and D. J. C. MacKay (1997a). Efficient implementation of Gaussian processes. Draft manuscript, available from http://vol.ra.phy.cam.ac.uk/mackay/homepage.html.
Gibbs, M. and D. J. C. MacKay (1997b). Variational Gaussian process classifiers. Draft manuscript, available from http://vol.ra.phy.cam.ac.uk/mackay/homepage.html.
Girard, D. (1989). A fast 'Monte-Carlo cross-validation' procedure for large least squares problems with noisy data. Numerische Mathematik 56, 1-23.
Girosi, F., M. Jones, and T. Poggio (1995). Regularization theory and neural networks architectures. Neural Computation 7(2), 219-269.
Goldberg, P. W., C. K. I. Williams, and C. M. Bishop (1997). Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems 10. MIT Press.
Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
Handcock, M. S. and M. L. Stein (1993). A Bayesian analysis of kriging. Technometrics 35(4), 403-410.
Hastie, T. (1996). Pseudosplines. Journal of the Royal Statistical Society B 58, 379-396.
Hastie, T. and R. J. Tibshirani (1990). Generalized Additive Models. London: Chapman and Hall.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.
Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics: Simulation and Computation 18, 1059-1076.
Journel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. London: Academic Press.
Kimeldorf, G. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics 41, 495-502.
MacKay, D. J. C. (1993). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, and K. Schulten (Eds.), Models of Neural Networks III. New York: Springer.
Mardia, K. V. and R. J. Marshall (1984). Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71, 135-146.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. New York: Springer.
Neal, R. M. (1997). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Dept. of Statistics, University of Toronto. Available from http://www.cs.toronto.edu/~radford/.
O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society B 40(1), 1-42.
O'Sullivan, F., B. S. Yandell, and W. J. Raynor (1986). Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association 81, 96-103.
Poggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE 78, 1481-1497.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C (second ed.). Cambridge University Press.
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. Ph.D. thesis, Dept. of Computer Science, University of Toronto. Available from http://www.cs.utoronto.ca/~carl/.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Sacks, J., W. J. Welch, T. J. Mitchell, and H. P. Wynn (1989). Design and analysis of computer experiments. Statistical Science 4(4), 409-423.
Sampson, P. D. and P. Guttorp (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association 87, 108-119.
Silverman, B. W. (1978). Density ratios, empirical likelihood and cot death. Applied Statistics 27(1), 26-33.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B 47(1), 1-52.
Skilling, J. (1993). Bayesian numerical analysis. In W. T. Grandy and P. Milonni (Eds.), Physics and Probability. Cambridge University Press.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
Williams, C. K. I. (1997a). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9. MIT Press.
Williams, C. K. I. (1997b). Computation with infinite neural networks. Technical report, Neural Computing Research Group, Aston University, Birmingham, UK.
Williams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. MIT Press.
Zhu, H. and R. Rohwer (1996). Bayesian invariant measurements of generalisation. Neural Processing Letters 4, 89-95.
Zhu, H., C. K. I. Williams, R. Rohwer, and M. Morciniec (1997). Gaussian regression and optimal finite dimensional linear models. Technical report, Neural Computing Research Group, Aston University, Birmingham, UK.
Contributors

Nicky G. Best, Department of Epidemiology and Public Health, Imperial College School of Medicine, London, UK
Christopher M. Bishop, Microsoft Research, Cambridge, UK
Joachim M. Buhmann, Institut für Informatik III, Universität Bonn, Germany
Gregory F. Cooper, University of Pittsburgh, Pittsburgh, PA, USA
Robert G. Cowell, School of Mathematics, Actuarial Science and Statistics, City University, London, UK
Rina Dechter, Information and Computer Science, University of California, Irvine, CA, USA
Nir Friedman, Computer Science Division, University of California, Berkeley, CA, USA
Dan Geiger, Computer Science Department, Technion, Haifa, Israel
Zoubin Ghahramani, Department of Computer Science, University of Toronto, Canada
Wally R. Gilks, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK
Moises Goldszmidt, SRI International, Menlo Park, CA, USA
David Heckerman, Microsoft Research, Redmond, WA, USA
Geoffrey E. Hinton, Department of Computer Science, University of Toronto, Canada
Hazel Inskip, MRC Environmental Epidemiology Unit, Southampton General Hospital, UK
Tommi S. Jaakkola, Department of Computer Science, University of California, Santa Cruz, CA, USA
Michael I. Jordan, Massachusetts Institute of Technology, Cambridge, MA, USA
Michael J. Kearns, AT&T Labs Research, Florham Park, NJ, USA
Uffe Kjærulff, Department of Computer Science, Aalborg University, Denmark
David J. C. MacKay, Cavendish Laboratory, Cambridge, UK
Yishay Mansour, Department of Computer Science, Tel-Aviv University, Israel
Christopher Meek, Microsoft Research, Redmond, WA, USA
Stefano Monti, Intelligent Systems Program, University of Pittsburgh, PA, USA
Radford M. Neal, Departments of Statistics and Computer Science, University of Toronto, Canada
Andrew Y. Ng, Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
Thomas S. Richardson, Department of Statistics, University of Washington, Seattle, WA, USA
Brian Sallans, Department of Computer Science, University of Toronto, Canada
Lawrence K. Saul, AT&T Labs Research, Florham Park, NJ, USA
Peter W. F. Smith, Department of Social Statistics, University of Southampton, UK
David J. Spiegelhalter, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK
Milan Studený, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic
Jirina Vejnarová, Laboratory of Intelligent Systems, University of Economics, Prague, Czech Republic
Joe Whittaker, Department of Mathematics and Statistics, Lancaster University, UK
Christopher K. I. Williams, Neural Computing Research Group, Aston University, Birmingham, UK