Adam M. Johansen and Ludger Evers 2007 Edited by Nick Whiteley 2008
Monte Carlo Methods Lecture Notes October 27, 2008
Department of Mathematics
2
Table of Contents
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.1 What are Monte Carlo Methods? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.2 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.3 A Brief History of Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
1.4 Pseudo-random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.
Fundamental Concepts: Transformation, Rejection, and Reweighting . . . . . . . . . . . . . . 15 2.1 Transformation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.
Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Discrete State Space Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 General State Space Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4 Selected Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.
The Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 The Hammersley-Clifford Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Convergence of the Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.5 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.
The Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Convergence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.3 The random walk Metropolis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.4 Choosing the proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.5 Composing kernels: Mixtures and Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4
6.
TABLE OF CONTENTS
The Reversible Jump Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.1 Bayesian multi-model inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2 Another look at the Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.3 The Reversible Jump Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.
Diagnosing convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 7.1 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 7.2 Tools for monitoring convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.
Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 8.1 A Monte-Carlo method for finding the mode of a distribution . . . . . . . . . . . . . . . . . . . . . . . . . 83 8.2 Minimising an arbitrary function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 8.3 Using annealing strategies for imporving the convergence of MCMC algorithms . . . . . . . . . . 87
9.
Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 9.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 9.2 State Estimation: Optimal Filtering, Prediction and Smoothing . . . . . . . . . . . . . . . . . . . . . . . 93 9.3 Static Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10. Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 10.1 Importance Sampling Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 10.2 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 10.3 Sequential Importance Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 10.4 Resample-Move Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 10.5 Auxiliary Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 10.6 Static Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 10.7 Extensions, Recent Developments and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
1. Introduction
1.1 What are Monte Carlo Methods? This lecture course is concerned with Monte Carlo methods, which are sometimes referred to as stochastic simulation (Ripley (1987) for example only uses this term). Examples of Monte Carlo methods include stochastic integration, where we use a simulation-based method to evaluate an integral, Monte Carlo tests, where we resort to simulation in order to compute the p-value, and Markov-Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution of interest. A formal definition of Monte Carlo methods was given (amongst others) by Halton (1970). He defined a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.”
1.2 Introductory examples Example 1.1 (A raindrop experiment for computing π). Assume we want to compute a Monte Carlo estimate of π using a simple experiment. Assume that we could produce “uniform rain” on the square [−1, 1] × [−1, 1], such that the probability of a raindrop falling into a region R ⊂ [−1, 1]2 is propor-
tional to the area of R, but independent of the position of R. It is easy to see that this is the case iff
the two coordinates X, Y are i.i.d. realisations of uniform distributions on the interval [−1, 1] (in short i.i.d.
X, Y ∼ U[−1, 1]).
Now consider the probability that a raindrop falls into the unit circle (see figure 1.1). It is RR 1 dxdy π π area of the unit circle {x2 +y 2 ≤1} RR = = = P(drop within circle) = area of the square 2·2 4 1 dxdy {−1≤x,y≤1}
In other words,
π = 4 · P(drop within circle), i.e. we found a way of expressing the desired quantity π as a function of a probability. Of course we cannot compute P(drop within circle) without knowing π, however we can estimate the probability using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle is a binomial random variable:
6
1. Introduction 1
−1 −1
1
Fig. 1.1. Illustration of the raindrop experiment for estimating π
Z ∼ B(n, p),
with p = P(drop within circle).
Thus we can estimate p by its maximum-likelihood estimate pˆ =
Z , n
and we can estimate π by π ˆ = 4ˆ p=4·
Z . n
Assume we have observed, as in figure 1.1, that 77 of the 100 raindrops were inside the circle. In this case, our estimate of π is π ˆ=
4 · 77 = 3.08, 100
which is relatively poor. However the law of large numbers guarantees that our estimate π ˆ converges almost surely to π. Figure 1.2 shows the estimate obtained after n iterations as a function of n for n = 1, . . . , 2000. You can see that the estimate improves as n increases. We can assess the quality of our estimate by computing a confidence interval for π. As we have X ∼
B(100, p), we can obtain a 95% confidence interval for p using a Normal approximation: # " r r 0.77 · (1 − 0.77) 0.77 · (1 − 0.77) , 0.77 + 1.96 · = [0.6875, 0.8525], 0.77 − 1.96 · 100 100
As our estimate of π is four times the estimate of p, we now also have a confidence interval for π: [2.750, 3.410] In more general, let π ˆn = 4ˆ pn denote the estimate after having observed n raindrops. A (1−2α) confidence interval for p is then
"
pˆn − z1−α
r
pˆn (1 − pˆn ) , pˆn + z1−α n
r
# pˆn (1 − pˆn ) , n
thus a (1 − 2α) confidence interval for π is " # r r π ˆn (4 − π ˆn ) π ˆn (4 − π ˆn ) π ˆn − z1−α ,π ˆn + z1−α n n
⊳
1.2 Introductory examples
7
3.0 2.0
2.5
Estimate of π
3.5
4.0
Monte Carlo estimate of π (with 90% confidence interval)
0
500
1000
1500
2000
Sample size
Fig. 1.2. Estimate of π resulting from the raindrop experiment
Let us recall again the different steps we have used in the example: – We have written the quantity of interest (in our case π) as an expectation.1 – Second, we have replaced this algebraic representation of the quantity of interest by a sample approximation to it. The law of large numbers guaranteed that the sample approximation converges to the algebraic representation, and thus to the quantity of interest. Furthermore we used the central limit theorem to assess the speed of convergence. It is of course of interest whether the Monte Carlo methods offer more favourable rates of convergence than other numerical methods. We will investigate this in the case of Monte Carlo integration using the following simple example. Example 1.2 (Monte Carlo Integration). Assume we want to evaluate the integral Z 1 1 · −65536x8 + 262144x7 − 409600x6 + 311296x5 − 114688x4 + 16384x3 f (x) dx with f (x) = 27 0 using a Monte Carlo approach.2 Figure 1.3 shows the function for x ∈ [0, 1]. Its graph is fully contained
in the unit square [0, 1]2 .
Once more, we can resort to a raindrop experiment. Assume we can produce uniform rain on the unit square. The probability that a raindrop falls below the curve is equal to the area below the curve, which of course equals the integral we want to evaluate (the area of the unit square is 1, so we don’t need to rescale the result). A more formal justification for this is, using the fact that f (x) = Z
0
1 2
1
f (x) dx =
Z
0
1
Z
0
f (x)
1 dt dx =
Z Z
R f (x) 0
1dt dx =
{(x,t):t≤f (x)}
A probability is a special case of an expectation as P(A) = E(IA ). 4096 As f is a polynomial we can obtain the result analytically, it is 8505 =
1 dt, RR
1 dt dx
{(x,t):t≤f (x)}
RR
1dt dx
{0≤x,t≤1}
212 35 ·5·7
≈ 0.4816.
8
1. Introduction
The numerator is nothing other than the dark grey area under the curve, and the denominator is the area of the unit square (shaded in light grey in figure 1.3). Thus the expression on the right hand side is the probability that a raindrop falls below the curve. We have thus re-expressed our quantity of interest as a probability in a statistical model. Figure 1.3 shows the result obtained when observing 100 raindrops. 52 of them are below the curve, yielding a Monte-Carlo estimate of the integral of 0.52. If after n raindrops a proportion pˆn is found to lie below the curve, a (1 − 2α) confidence interval for the value of the integral is
pˆn (1 − pn ) pˆn (1 − pn ) , pˆn + z1−α pˆn − z1−α n n
Thus the speed of convergence of our (rather crude) Monte Carlo method is OP (n−1/2 ).
⊳
1
x 0
1
Fig. 1.3. Illustration of the raindrop experiment to compute
R1 0
f (x)dx
When using Riemann sums (as in figure 1.4) to approximate the integral from example 1.2 the error is of order O(n−1 ).3,4 Recall that our Monte Carlo method was “only” of order OP (n−1/2 ). However, it is easy to see that its speed of convergence is of the same order, regardless of the dimension of the support of f . This is not the case for other (deterministic) numerical integration methods. For a two-dimensional function f the error made by the Riemann approximation using n function evaluations is O(n−1/2 ).
5
This makes the Monte Carlo methods especially suited for high-dimensional problems. Furthermore the Monte Carlo method offers the advantage of being relatively simple and thus easy to implement on a computer.
3
4
5
The error made for each “bar” can be upper bounded by
∆2 2
max |f ′ (x)|. Let n denote the number evaluations
of f (and thus the number of “bars”). As ∆ is proportional to n1 , the error made for each bar is O(n−2 ). As there are n “bars”, the total error is O(n−1 ). The order of convergence can be improved when using the trapezoid rule and (even more) by using Simpson’s rule. Assume we partition both axes into m segments, i.e. we have to evaluate the function n = m2 times. The error made for each “bar” is O(m−3 ) (each of the two sides of the base area of the “bar” is proportional to m−1 , so is the upper bound on |f (x) − f (ξmid )|, yielding O(m−3 )). There are in total m2 bars, so the total error is only O(m−1 ), or equivalently O(n−1/2 ).
1.3 A Brief History of Monte Carlo Methods
9
1 ∆ |f (x) − f (ξmid )| <
∆ 2
· max |f ′ (x)| for
|x−ξmid |≤ ∆ 2
x 0
ξmid
1
Fig. 1.4. Illustration of numerical integration by Riemann sums
1.3 A Brief History of Monte Carlo Methods Experimental Mathematics is an old discipline: the Old Testament (1 Kings vii. 23 and 2 Chronicles iv. 2) contains a rough estimate of π (using the columns of King Solomon’s temple). Monte Carlo methods are a somewhat more recent discipline. One of the first documented Monte Carlo experiments is Buffon’s needle experiment (see example 1.3 below). Laplace (1812) suggested that this experiment can be used to approximate π. Example 1.3 (Buffon’s needle). In 1733, the Comte de Buffon, George Louis Leclerc, asked the following question (Buffon, 1733): Consider a floor with equally spaced lines, a distance δ apart. What is the probability that a needle of length l < δ dropped on the floor will intersect one of the lines? Buffon answered the question himself in 1777 (Buffon, 1777). δ
δ
δ
θ l sin θ (a) Illustration of the geometry behind Buffon’s needle
(b) Results of the Buffon’s needle experiment using 50 needles. Dark needles intersect the thin vertical lines, light needles do not.
Fig. 1.5. Illustration of Buffon’s needle
Assume the needle landed such that its angle is θ (see figure 1.5). Then the question whether the needle intersects a line is equivalent to the question whether a box of width l sin θ intersects a line. The probability
10
1. Introduction
of this happening is P(intersect|θ) =
l sin θ . δ
Assuming that the angle θ is uniform on [0, π) we obtain Z π Z π Z π l sin θ 1 l 2l 1 · dθ = · sin θ dθ = . P(intersect) = P(intersect|θ) · dθ = π δ π πδ 0 πδ 0 0 | {z } =2
When dropping n needles the expected number of needles crossing a line is thus 2nl . πδ Thus we can estimate π by 2nl , π≈ Xδ where X is the number of needles crossing a line.
The Italian mathematician Mario Lazzarini performed Buffon’s needle experiment in 1901 using a needle of length l = 2.5cm and lines d = 3cm apart (Lazzarini, 1901). Of 3408 needles 1808 needles crossed a line, so Lazzarini’s estimate of π was 17040 355 2 · 3408 · 2.5 = = , 1808 · 3 5424 133 which is nothing other than the best rational approximation to π with at most 4 digits each in the π≈
denominator and the numerator.6
⊳
Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out. Physical random experiments were difficult to perform and so was the numerical processing of their results. This however changed fundamentally with the advent of the digital computer. Amongst the first to realise this potential were John von Neuman and Stanislaw Ulam, who were then working for the Manhattan project in Los Alamos. They proposed in 1947 to use a computer simulation for solving the problem of neutron diffusion in fissionable material (Metropolis, 1987). Enrico Fermi previously considered using Monte Carlo techniques in the calculation of neutron diffusion, however he proposed to use a mechanical device, the so-called “Fermiac”, for generating the randomness. The name “Monte Carlo” goes back to Stanislaw Ulam, who claimed to be stimulated by playing poker and whose uncle once borrowed money from him to go gambling in Monte Carlo (Ulam, 1983). In 1949 Metropolis and Ulam published their results in the Journal of the American Statistical Association (Metropolis and Ulam, 1949). Nonetheless, in the following 30 years Monte Carlo methods were used and analysed predominantly by physicists, and not by statisticians: it was only in the 1980s — following the paper by Geman and Geman (1984) proposing the Gibbs sampler — that the relevance of Monte Carlo methods in the context of (Bayesian) statistics was fully realised.
1.4 Pseudo-random numbers For any Monte-Carlo simulation we need to be able to reproduce randomness by a computer algorithm, which, by definition, is deterministic in nature — a philosophical paradox. In the following chapters we will assume that independent (pseudo-)random realisations from a uniform U[0, 1] distribution7 are readily 6
7
That Lazzarini’s experiment was that precise, however, casts some doubt over the results of his experiments (see Badger, 1994, for a more detailed discussion). We will only use the U(0, 1) distribution as a source of randomness. Samples from other distributions can be derived from realisations of U(0, 1) random variables using deterministic algorithms.
1.4 Pseudo-random numbers
11
available. This section tries to give very brief overview of how pseudo-random numbers can be generated. For a more detailed discussion of pseudo-random number generators see Ripley (1987) or Knuth (1997). A pseudo-random number generator (RNG) is an algorithm for whose output the U[0, 1] distribution is a suitable model. In other words, the number generated by the pseudo-random number generator should have the same relevant statistical properties as independent realisations of a U[0, 1] random variable. Most importantly: – The numbers generated by the algorithm should reproduce independence, i.e. the numbers X1 , . . . , Xn that we have already generated should not contain any discernible information on the next value Xn+1 . This property is often referred to as the lack of predictability. – The numbers generated should be spread out evenly across the interval [0, 1]. In the following we will briefly discuss the linear congruential generator. It is not a particularly powerful generator (so we discourage you from using it in practise), however it is easy enough to allow some insight into how pseudo-random number generators work. Algorithm 1.1 (Congruential pseudo-random number generator). 1. Choose a, M ∈ N, c ∈ N0 , and the initial value (“seed”) Z0 ∈ {1, . . . M − 1}.
2. For i = 1, 2, . . .
Set Zi = (aZi−1 + c) mod M , and Xi = Zi /M . The integers Zi generated by the algorithm are from the set {0, 1, . . . , M − 1} and thus the Xi are in
the interval [0, 1).
It is easy to see that the sequence of pseudo-random numbers only depends on the seed X0 . Running the pseudo-random number generator twice with the same seed thus generates exactly the same sequence of pseudo-random numbers. This can be a very useful feature when debugging your own code. Example 1.4. Cosider the choice of a = 81, c = 35, M = 256, and seed Z0 = 4. Z1
=
(81 · 4 + 35)
Z2
=
(81 · 103 + 35)
mod 256 = 8378
Z3
=
(81 · 186 + 35)
mod 256 = 15101
mod 256 = 359
mod 256 = 103 mod 256 = 186 mod 256 = 253
... The corresponding Xi are X1 = 103/256 = 0.4023438, X2 = 186/256 = 0.72656250, X1 = 253/256 = 0.98828120.
⊳
The main flaw of the congruential generator its “crystalline” nature (Marsaglia, 1968). If the sequence of generated values X1 , X2 , . . . is viewed as points in an n-dimension cube8 , they lie on a finite, and often very small number of parallel hyperplanes. Or as Marsaglia (1968) put it: “the points [generated by a congruential generator] are about as randomly spaced in the unit n-cube as the atoms in a perfect crystal at absolute zero.” The number of hyperplanes depends on the choice of a, c, and M . An example for a notoriously poor design of a congruential pseudo-random number generator is RANDU, which was (unfortunately) very popular in the 1970s and used for example in IBM’s System/360 and System/370, and Digital’s PDP-11. It used a = 216 + 3, c = 0, and M = 231 . The numbers generated by RANDU lie on only 15 hyperplanes in the 3-dimensional unit cube (see figure 1.6). 8
The (k + 1)-th point has the coordinates (Xnk+1 , . . . , Xnk+n−1 ).
12
1. Introduction
0.0
0.2
0.4
X2k
0.6
−2 log(X2k−1 ) sin(2πX2k ) -5 0 5
0.8
1.0
Fig. 1.6. 300,000 realisations of the RANDU pseudo-random number generator plotted in 3D. A point corresponds to a triplet (x3k−2 , x3k−1 , x3k ) for k = 1, . . . , 100000. The data points lie on 15 hyperplanes.
0.0
0.2
0.4 0.6 X2k−1
0.8
1.0
(a) 1,000 realisations of this congruential generator plotted in 2D.
-10
-5 0 5 −2 log(X2k−1 ) cos(2πX2k )
(b) Supposedly bivariate Gaussian pseudo-random numbers obtained using the pseudo-random numbers shown in panel (a).
Fig. 1.7. Results obtained using a congruential generator with a = 1229, c = 1, and M = 211
1.4 Pseudo-random numbers
13
Figure 1.7 shows another cautionary example (taken from Ripley, 1987). The left-hand panel shows a plot of 1,000 realisations of a congruential generator with a = 1229, c = 1, and M = 211 . The random numbers lie on only 5 hyperplanes in the unit square. The right hand panel shows the outcome of the Box-Muller method for transforming two uniform pseudo-random numbers into a pair of Gaussians (see example 2.2). Due to this flaw of the congruential pseudo-random number generator, it should not be used in Monte Carlo experiments. For more powerful pseudo-random number generators see e.g. Marsaglia and Zaman (1991) or Matsumoto and Nishimura (1998). GNU R (and other environments) provide you with a large choice of powerful random number generators, see the corresponding help page (?RNGkind) for details.
14
1. Introduction
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
2.1 Transformation Methods In section 1.4 we have seen how to create (pseudo-)random numbers from the uniform distribution U[0, 1]. One of the simplest methods of generating random samples from a distribution with cumulative distribution function (CDF) F (x) = P(X ≤ x) is based on the inverse of the CDF.
1 F (x) u
F − (u)
x
Fig. 2.1. Illustration of the definition of the generalised inverse F − of a CDF F
The CDF is an increasing function, however it is not necessarily continuous. Thus we define the generalised inverse F − (u) := inf{x : F (x) ≥ u}. Figure 2.1 illustrates its definition. If F is continuous,
then F − (u) = F −1 (u).
Theorem 2.1 (Inversion Method). Let U ∼ U[0, 1] and F be a CDF. Then F − (U ) has the CDF F . Proof. It is easy to see (e.g. in figure 2.1) that F − (u) ≤ x is equivalent to u ≤ F (x). Thus for U ∼ U[0, 1] P(F − (U ) ≤ x) = P(U ≤ F (x)) = F (x), thus F is the CDF of X = F − (U ).
Example 2.1 (Exponential Distribution). The exponential distribution with rate λ > 0 has the CDF Fλ (x) = 1 − exp(−λx) for x ≥ 0. Thus Fλ− (u) = Fλ−1 (u) = − log(1 − u)/λ. Thus we can generate
random samples from Expo(λ) by applying the transformation − log(1 − U )/λ to a uniform U[0, 1] random variable U . As U and 1 − U , of course, have the same distribution we can use − log(U )/λ as well.
⊳
16
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
The Inversion Method is a very efficient tool for generating random numbers. However very few distributions possess a CDF whose (generalised) inverse can be evaluated efficiently. Take the example of the Gaussian distribution, whose CDF is not even available in closed form. Note however that the generalised inverse of the CDF is just one possible transformation and that there might be other transformations that yield the desired distribution. An example of such a method is the Box-Muller method for generating Gaussian random variables. Example 2.2 (Box-Muller Method for Generating Gaussians). Using the transformation of density fori.i.d.
mula one can show that X1 , X2 ∼ N(0, 1) iff their polar coordinates (R, θ) with X1 = R · cos(θ),
X2 = R · sin(θ) i.i.d.
are independent, θ ∼ U[0, 2π], and R2 ∼ Expo(1/2). Using U1 , U2 ∼ U[0, 1] and example 2.1 we can generate R and θ by
p −2 log(U1 ),
R= and thus X1 =
p
−2 log(U1 ) · cos(2πU2 ),
θ = 2πU2
X2 =
are two independent realisations from a N(0, 1) distribution.
p −2 log(U1 ) · sin(2πU2 )
⊳
The idea of transformation methods like the Inversion Method was to generate random samples from a distribution other than the target distribution and to transform them such that they come from the desired target distribution. In many situations, we cannot find such a transformation in closed form. In these cases we have to find other ways of correcting for the fact that we sample from the “wrong” distribution. The next two sections present two such ideas: rejection sampling and importance sampling.
2.2 Rejection Sampling The basic idea of rejection sampling is to sample from an instrumental distribution1 and reject samples that are “unlikely” under the target distribution. Assume that we want to sample from a target distribution whose density f is known to us. The simple idea underlying rejection sampling (and other Monte Carlo algorithms) is the rather trivial identity f (x) =
Z
0
f (x)
1 du =
Z
0
1
10
Thus f (x) can be interpreted as the marginal density of a uniform distribution on the area under the density f (x) {(x, u) : 0 ≤ u ≤ f (x)}. Figure 2.2 illustrates this idea. This suggests that we can generate a sample from f by sampling from the area under the curve. Example 2.3 (Sampling from a Beta distribution). The Beta(a, b) distribution (a, b ≥ 0) has the density f (x) = 1
Γ (a + b) a−1 (1 − x)b−1 , x Γ (a)Γ (b)
for 0 < x < 1,
The instrumental distribution is sometimes referred to as proposal distribution.
2.2 Rejection Sampling
17
u 2.4
x 0
1
Fig. 2.2. Illustration of example 2.3. Sampling from the area under the curve (dark grey) corresponds to sampling from the Beta(3, 5) density. In example 2.3 we use a uniform distribution of the light grey rectangle as proposal distribution. Empty circles denote rejected values, filled circles denote accepted values.
where Γ (a) =
R +∞ 0
ta−1 exp(−t) dt is the Gamma function. For a, b > 1 the Beta(a, b) density is unimodal
with mode (a − 1)/(a + b − 2). Figure 2.2 shows the density of a Beta(3, 5) distribution. It attains its maximum of 1680/729 ≈ 2.305 at x = 1/3.
Using the above identity we can draw from Beta(3, 5) by drawing from a uniform distribution on the area under the density {(x, u) : 0 < u < f (x)} (the area shaded in dark gray in figure 2.2).
In order to sample from the area under the density, we will use a similar trick as in examples 1.1 and 1.2. We will sample from the light grey rectangle and and only keep the samples that fall in the area under the curve. Figure 2.2 illustrates this idea. Mathematically speaking, we sample independently X ∼ U[0, 1] and U ∼ U[0, 2.4]. We keep the pair (X, U ) if U < f (X), otherwise we reject it.
The conditional probability that a pair (X, U ) is kept if X = x is P(U < f (X)|X = x) = P(U < f (x)) = f (x)/2.4 As X and U were drawn independently we can rewrite our algorithm as: Draw X from U[0, 1] and accept X with probability f (X)/2.4, otherwise reject X.
⊳
The method proposed in example 2.3 is based on bounding the density of the Beta distribution by a box. Whilst this is a powerful idea, it cannot be directly applied to other distributions, as the density might be unbounded or have infinite support. However we might be able to bound the density of f (x) by M · g(x), where g(x) is a density that we can easily sample from. Algorithm 2.1 (Rejection sampling). Given two densities f, g with f (x) < M · g(x) for all x, we can generate a sample from f by 1. Draw X ∼ g
2. Accept X as a sample from f with probability f (X) , M · g(X) otherwise go back to step 1. Proof. We have P(X ∈ X and is accepted) =
Z
X
g(x)
f (x) M · g(x) | {z }
=P(X is accepted|X=x)
dx =
R
X
f (x) dx , M
(2.1)
18
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
and thus2 P(X is accepted) = P(X ∈ E and is accepted) =
1 , M
(2.2)
yielding R
P(X ∈ X and is accepted) = P(x ∈ X |X is accepted) = P(X is accepted)
X
f (x) dx/M = 1/M
Z
f (x) dx.
(2.3)
X
Thus the density of the values accepted by the algorithm is f (·).
Remark 2.1. If we know f only up to a multiplicative constant, i.e. if we only know π(x), where f (x) = C · π(x), we can carry out rejection sampling using π(X) M · g(X) as probability of rejecting X, provided π(x) < M · g(x) for all x. Then by analogy with (2.1) - (2.3) we
have
P(X ∈ X and is accepted) = P(X is accepted) = 1/(C · M ), and thus
Z
X
π(x) dx = g(x) M · g(x)
P(x ∈ X |X is accepted) =
R
X
R
X
π(x) dx = M
f (x) dx/(C · M ) = 1/(C · M )
Z
R
X
f (x) dx , C ·M
f (x) dx
X
Example 2.4 (Rejection sampling from the N(0, 1) distribution using a Cauchy proposal). Assume we want to sample from the N(0, 1) distribution with density 2 x 1 f (x) = √ exp − 2 2π using a Cauchy distribution with density g(x) =
1 π(1 + x2 )
as instrumental distribution.3 The smallest M we can choose such that f (x) ≤ M g(x) is M = exp(−1/2).
√
2π ·
Figure 2.3 illustrates the results. As before, filled circles correspond to accepted values whereas open circles correspond to rejected values.
M · g(x) f (x)
−6
−5
−4
−3
−2
−1
1
2
3
4
5
6
Fig. 2.3. Illustration of example 2.3. Sampling from the area under the density f (x) (dark grey) corresponds to sampling from the N(0, 1) density. The proposal g(x) is a Cauchy(0, 1). 2 3
We denote by E the set of all possible values X can take. There is not much point is using this method is practise. The Box-Muller method is more efficient.
2.3 Importance Sampling
19
Note that it is impossible to do rejection sampling from a Cauchy distribution using a N(0, 1) distribution as instrumental distribution: there is no M ∈ R such that 2 x 1 1 <M·√ exp − ; 2 π(1 + x2 ) 2 2πσ the Cauchy distribution has heavier tails than the Gaussian distribution.
⊳
2.3 Importance Sampling In rejection sampling we have compensated for the fact that we sampled from the instrumental distribution g(x) instead of f (x) by rejecting some of the values proposed by g(x). Importance sampling is based on the idea of using weights to correct for the fact that we sample from the instrumental distribution g(x) instead of the target distribution f (x). Importance sampling is based on the identity Z Z Z f (x) dx = g(x)w(x) dx P(X ∈ X ) = f (x) dx = g(x) g(x) X X X | {z }
(2.4)
=:w(x)
for all g(·), such that g(x) > 0 for (almost) all x with f (x) > 0. We can generalise this identity by considering the expectation Ef (h(X)) of a measurable function h: Z Z Z f (x) h(x) dx = g(x)w(x)h(x) dx = Eg (w(X) · h(X)), Ef (h(X)) = f (x)h(x) dx = g(x) g(x) | {z }
(2.5)
=:w(x)
if g(x) > 0 for (almost) all x with f (x) · h(x) 6= 0.
Assume we have a sample X1 , . . . , Xn ∼ g. Then, provided Eg |w(X) · h(X)| exists, n
a.s. 1X n→∞ w(Xi )h(Xi ) −→ Eg (w(X) · h(X)) n i=1
and thus by (2.5)
n
a.s. 1X n→∞ w(Xi )h(Xi ) −→ Ef (h(X)). n i=1
In other words, we can estimate µ := Ef (h(X)) by n
µ ˜ := Note that whilst Eg (w(X)) =
R
f (x) g(x) E g(x)
1X w(Xi )h(Xi ) n i=1 dx =
R
E
f (x) = 1, the weights w1 (X), . . . , wn (X) do not
necessarily sum up to n, so one might want to consider the self-normalised version n
X 1 w(Xi )h(Xi ). i=1 w(Xi ) i=1
µ ˆ := Pn
This gives rise to the following algorithm: Algorithm 2.2 (Importance Sampling). Choose g such that supp(g) ⊃ supp(f · h). 1. For i = 1, . . . , n: i. Generate Xi ∼ g.
ii. Set w(Xi ) =
f (Xi ) g(Xi ) .
20
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
2. Return either
Pn w(Xi )h(Xi ) Pn µ ˆ = i=1 i=1 w(Xi )
or
µ ˜=
Pn
i=1
w(Wi )h(Xi ) n
The following theorem gives the bias and the variance of importance sampling. Theorem 2.2 (Bias and Variance of Importance Sampling). (a) Eg (˜ µ) = µ Varg (w(X) · h(X)) (b) Varg (˜ µ) = n µVarg (w(X)) − Covg (w(X), w(X) · h(X)) + O(n−2 ) (c) Eg (ˆ µ) = µ + n Varg (w(X) · h(X)) − 2µCovg (w(X), w(X) · h(X)) + µ2 Varg (w(X)) (d) Varg (ˆ + O(n−2 ) µ) = n ! n n 1X 1X Proof. (a) Eg w(Xi )h(Xi ) = Eg (w(Xi )h(Xi )) = Ef (h(X)) n i=1 n i=1 ! n n 1X 1 X Varg (w(X)h(X)) (b) Varg w(Xi )h(Xi ) = 2 Varg (w(Xi )h(Xi )) = n i=1 n i=1 n (c) and (d) see (Liu, 2001, p. 35)
Note that the theorem implies that contrary to µ ˜ the self-normalised estimator µ ˆ is biased. The selfnormalised estimator µ ˆ however might have a lower variance. In addition, it has another advantage: we only need to know the density up to a multiplicative constant, as it is often the case in hierarchical Bayesian modelling. Assume f (x) = C · π(x), then
Pn C·π(Xi ) Pn π(Xi ) Pn f (Xi ) Pn w(Xi )h(Xi ) i=1 g(Xi ) h(Xi ) i=1 g(X ) h(Xi ) i=1 g(Xi ) h(Xi ) i=1 Pn = Pn C·π(X ) = Pn π(X i) , µ ˆ= = Pn f (X ) i i i w(Xi ) w(Xi ) w(Xi ) i=1 w(Xi ) i=1 g(Xi )
i=1
g(Xi )
i=1 g(Xi )
i.e. the self-normalised estimator µ ˆ does not depend on the normalisation constant C.4 On the other hand, as we have seen in the proof of theorem 2.2 it is a lot harder to analyse the theoretical properties of the self-normalised estimator µ ˆ. Although the above equations (2.4) and (2.5) hold for every g with supp(g) ⊃ supp(f · h) and the
importance sampling algorithm converges for a large choice of such g, one typically only considers choices of g that lead to finite variance estimators. The following two conditions are each sufficient (albeit rather restrictive) for a finite variance of µ ˜: – f (x) < M · g(x) and Varf (h(X)) < ∞.
– E is compact, f is bounded above on E, and g is bounded below on E. So far we have only studied whether an g is an appropriate instrumental distribution, i.e. whether the variance of the estimator µ ˜ (or µ ˆ) is finite. This leads to the question which instrumental distribution is optimal, i.e. for which choice Var(˜ µ) is minimal. The following theorem answers this question: Theorem 2.3 (Optimal proposal). The proposal distribution g that minimises the variance of µ ˜ is |h(x)|f (x) . |h(t)|f (t) dt E
g ∗ (x) = R
4
By complete analogy one can show that is enough to know g up to a multiplicative constant.
2.3 Importance Sampling
21
Proof. We have from theroem 2.2 (b) that n·Varg (˜ µ) = Varg (w(X) · h(X)) = Varg
h(X) · f (X) g(X)
= Eg
h(X) · f (X) g(X)
2 ! !2 h(X) · f (X) . − Eg g(X) | {z } =Eg (˜ µ)=µ
Thus we only have to minimise Eg
Eg⋆
h(X) · f (X) g ⋆ (X)
2 !
Z
h(X)·f (X) g(X)
2
. When plugging in g ⋆ we obtain:
h(x)2 · f (x)2 dx = g ⋆ (x) E Z 2 = |h(x)|f (x) dx =
Z
E
Z h(x)2 · f (x)2 dx · |h(t)|f (t) dt |h(x)|f (x) E
E
On the other hand, we can apply the Jensen inequality to Eg
Eg
h(X) · f (X) g(X)
2 !
≥
Eg
|h(X)| · f (X) g(X)
h(X)·f (X) g(X)
2
=
Z
E
2
yielding
|h(x)|f (x) dx
2
An important corollary of theorem 2.3 is that importance sampling can be super-efficient, i.e. when using the optimal g ⋆ from theorem 2.3 the variance of µ ˜ is less than the variance obtained when sampling directly from f : h(X1 ) + . . . + h(Xn ) = Ef (h(X)2 ) − µ2 n · Varf n Z 2 2 ≥ (Ef |h(X)|) − µ2 = |h(x)|f (x) dx − µ2 = n · Varg⋆ (˜ µ)
E
by Jensen’s inequality. Unless h(X) is (almost surely) constant the inequality is strict. There is an intuitive explanation to the super-efficiency of importance sampling. Using g ⋆ instead of f causes us to focus on regions of high probability where |h| is large, which contribute most to the integral Ef (h(X)).
Theorem 2.3 is, however, a rather formal optimality result. When using µ ˜ we need to know the nor-
malisation constant of g ⋆ , which is exactly the integral we are looking for. Further we need to be able to draw samples from g ⋆ efficiently. The practically important corollary of theorem 2.3 is that we should choose an instrumental distribution g whose shape is close to the one of f · |h|. Example 2.5 (Computing Ef |X| for X ∼ t3 ). Assume we want to compute Ef |X| for X from a t-distribution
with 3 degrees of freedom (t3 ) using a Monte Carlo method. Three different schemes are considered – Sampling X1 , . . . , Xn directly from t3 and estimating Ef |X| by 1X n|Xi |. n i=1
– Alternatively we could use importance sampling using a t1 (which is nothing other than a Cauchy distribution) as instrumental distribution. The idea behind this choice is that the density gt1 (x) of a t1 distribution is closer to f (x)|x|, where f (x) is the density of a t3 distribution, as figure 2.4 shows. – Third, we will consider importance sampling using a N(0, 1) distribution as instrumental distribution. Note that the third choice yields weights of infinite variance, as the instrumental distribution (N(0, 1)) has lighter tails than the distribution we want to sample from (t3 ). The right-hand panel of figure 2.5 R illustrates that this choice yields a very poor estimate of the integral |x|f (x) dx.
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
|x| · f (x) (Target) f (x) (direct sampling) gt1 (x) (IS t1 ) gN(0,1) (x) (IS N(0, 1))
0.0
0.1
0.2
0.3
0.4
22
-4
-2
0
2
4
x Fig. 2.4. Illustration of the different instrumental distributions in example 2.5.
Sampling directly from the t3 distribution can be seen as importance sampling with all weights wi ≡ 1,
this choice clearly minimises the variance of the weights. This however does not imply that this yields R an estimate of the integral |x|f (x) dx of minimal variance. Indeed, after 1500 iterations the empirical standard deviation (over 100 realisations) of the direct estimate is 0.0345, which is larger than the empirical standard deviation of µ ˜ when using a t1 distribution as instrumental distribution, which is 0.0182. So using a t1 distribution as instrumental distribution is super-efficient (see figure 2.5). Figure 2.6 somewhat explains why the t1 distribution is a far better choice than the N(0, 1) distributon. As
the N(0, 1) distribution does not have heavy enough tails, the weight tends to infinity as |x| → +∞. Thus
large |x| get large weights, causing the jumps of the estimate µ ˜ shown in figure 2.5. The t1 distribution
has heavy enough tails, so the weights are small for large values of |x|, explaining the small variance of the estimate µ ˜ when using a t1 distribution as instrumental distribution.
⊳
2.3 Importance Sampling
IS using t1 as instrumental distribution
IS using N(0, 1) as instrumental distribution
1.5 1.0 0.0
0.5
IS estimate over time
2.0
2.5
3.0
Sampling directly from t3
23
0
500
1000
1500 0
500
1000
1500 0
500
1000
1500
Iteration
Fig. 2.5. Estimates of E|X| for X ∼ t3 obtained after 1 to 1500 iterations. The three panels correspond to the three different sampling schemes used. The areas shaded in grey correspond to the range of 100 replications.
IS using t1 as instrumental distribution
IS using N(0, 1) as instrumental distribution
1.5 1.0 0.5 0.0
Weights Wi
2.0
2.5
3.0
Sampling directly from t3
-4
-2
0
2
4
-4
-2
0
2
4
-4
-2
0
Sample Xi from the instrumental distribution
Fig. 2.6. Weights Wi obtained for 20 realisations Xi from the different instrumental distributions.
2
4
24
2. Fundamental Concepts: Transformation, Rejection, and Reweighting
3. Markov Chains
What is presented here is no more than a very brief introduction to certain aspects of stochastic processes, together with those details which are essential to understanding the remainder of this course. If you are interested in further details then a rigorous, lucid and inexpensive reference which starts from the basic principles is provided by (Gikhman and Skorokhod, 1996). We note, in particular, that a completely rigorous treatment of this area requires a number of measure theoretic concepts which are beyond the scope of this course. We will largely neglect these issues here in the interests of clarity of exposition, whilst attempting to retain the essence of the concepts which are presented. If you are not familiar with measure theory then please ignore all references to measurability.
3.1 Stochastic Processes For our purposes we can define an E-valued process as a function ξ : I → E which maps values in some
index set I to some other space E. The evolution of the process is described by considering the variation of ξ(i) with i. An E-valued stochastic process (or random process) can be viewed as a process in which, for each i ∈ I, ξ(i) is a random variable taking values in E.
Although a rich literature on more general situations exists, we will consider only the case of discrete
time stochastic processes in which the index set I is N (of course, any index set isomorphic to N can be
used in the same framework by simple relabeling). We will use the notation ξi to indicate the value of the
process at time i (note that there need be no connection between the index set and real time, but this terminology is both convenient and standard). We will begin with an extremely brief description of a general stochastic process, before moving on to discuss the particular classes of process in which we will be interested. In order to characterise a stochastic process of the sort in which we are interested, it is sufficient to know all of its finite dimensional distributions, the joint distributions of the process at any collection of finitely many times. For any collection of times i1 , i2 , . . . , it and any measurable collection of subsets of E, Ai1 , Ai2 , . . . , Ait we are interested in the probability: P (ξi1 ∈ Ai1 , ξi2 ∈ Ai2 , . . . , ξit ∈ Ait ) . For such a collection of probabilities to define a stochastic process, we require that they meet a certain consistency criterion. We require the marginal distribution of the values taken by the process at any collection of times to be the same under any finite dimensional distribution which includes the process at those time points, so, defining any second collection of times j1 , . . . , js with the property that jk 6= il for
26
3. Markov Chains
any k ≤ t, l ≤ s, we must have that: P (ξi1 ∈ Ai1 , ξi2 ∈ Ai2 , . . . , ξit ∈ Ait ) =P (ξi1 ∈ Ai1 , ξi2 ∈ Ai2 , . . . , ξit ∈ Ait , ξj1 ∈ E, . . . , ξjt ∈ E) . This is just an expression of the intuitive concept that any finite dimensional distribution which describes the process at the times of interest should provide the same description if we neglect any information it provides about the process at other times. Or, to put it another way, they must all be marginal distributions of the same distribution. In the case of real-valued stochastic processes, in which E = R, we may express this concept in terms of the joint distribution functions (the multivariate analogue of the distribution function). Defining the joint distribution functions according to: Fi1 ,...,it (x1 , x2 , . . . , xt ) = P (ξi1 ≤ x1 , ξi2 ≤ x2 , . . . , ξit ≤ xt ) , our consistency requirement may now be expressed as: Fi1 ,...,it ,j1 ,...,jt (x1 , x2 , . . . , xt , ∞, . . . , ∞) = Fi1 ,...,it (x1 , x2 , . . . , xt ). Having established that we can specify a stochastic process if we are able to specify its finite dimensional distributions, we might wonder how to specify these distributions. In the next two sections, we proceed to describe a class of stochastic processes which can be described constructively and whose finite dimensional distributions may be easily established. The Markov processes which we are about to introduce represent the most widely used class of stochastic processes, and the ones which will be of most interest in the context of Monte Carlo methods.
3.2 Discrete State Space Markov Chains 3.2.1 Basic Notions We begin by turning our attention to the discrete state space case which is somewhat easier to deal with than the general case which will be of interest later. In the case of discrete state spaces, in which |E|
is either finite, or countably infinite, we can work with the actual probability of the process having a particular value at any time (you’ll recall that in the case of continuous random variables more subtlety is generally required as the probability of any continuous random variable defined by a density (with respect to Lebesgue measure, in particular) taking any particular value is zero). This simplifies things considerably, and we can consider defining the distribution of the process of interest over the first t time points by employing the following decomposition: P (ξ1 = x1 , ξ2 = x2 , . . . , ξt = xt ) =P (ξ1 = x1 , ξ2 = x2 , . . . , ξt−1 = xt−1 ) P (ξt = xt |ξ1 = x1 , . . . , ξt−1 = xt−1 ) . Looking at this decomposition, it’s clear that we could construct all of the distributions of interest from an initial distribution from which ξ1 is assumed to be drawn and then a sequence of conditional distributions for each t, leading us to the specification: P (ξ1 = x1 , ξ2 = x2 , . . . , ξt = xt ) = P (ξ1 = x1 )
t Y
i=2
P (ξi = xi |ξ1 = x1 , . . . , ξi−1 = xi−1 ) .
(3.1)
3.2 Discrete State Space Markov Chains
27
From this specification we can trivially construct all of the finite dimensional distributions using no more than the sum and product rules of probability. So, we have a method for constructing finite distributional distributions for a discrete state space stochastic process, but it remains a little formal as the conditional distributions seem likely to become increasingly complex as the time index increases. The conditioning present in decomposition (3.1) is needed to capture any relationship between the distribution at time t and any previous time. In many situations of interest, we might expect interactions to exist on only a much shorter time-scale. Indeed, one could envisage a memoryless process in which the distribution of the state at time t + 1 depends only upon its state at time t, ξt , regardless of the path by which it reached ξt . Formally, we could define such a process as: P (ξ1 = x1 , ξ2 = x2 , . . . , ξt = xt ) =P (ξ1 = x1 )
t Y
i=2
P (ξi = xi |ξi−1 = xi−1 ) .
(3.2)
It is clear that (3.2) is a particular case of (3.1) in which this lack of memory property is captured explicitly, as: P (ξt = xt |ξ1 = x1 , . . . , ξt−1 = xt−1 ) = P (ξt = xt |ξt−1 = xt−1 ) . We will take this as the defining property of a collection of processes which we will refer to as discrete time Markov processes or, as they are more commonly termed in the Monte Carlo literature, Markov chains. There is some debate in the literature as to whether the term “Markov chain” should be reserved for those Markov processes which take place on a discrete state space, those which have a discrete index set (the only case we will consider here) or both. As is common in the field of Monte Carlo simulation, we will use the terms interchangeably. When dealing with discrete state spaces, it is convenient to associate a row vector1 with any probability distribution. We assume, without loss of generality, that the state space, E, is N. Now, given a random variable X on E, we say that X has distribution µ, often written as X ∼ µ for some vector µ with the
property that:
∀x ∈ E : P (X = x) = µx . Homogeneous Markov Chains. The term homogeneous Markov Chain is used to describe a Markov process of the sort just described with the additional caveat that the conditional probabilities do not depend explicitly on the time index, so: ∀m ∈ N : P (ξt = y|ξt−1 = x) ≡ P (ξt+m = y|ξt+m−1 = x) . In this setting, it is particular convenient to define a function corresponding to the transition probability (as the probability distribution at time i + 1 conditional upon the state of the process at time i) or kernel as it is often known, which may be written as a two argument function or, in the discrete case as a matrix, K(i, j) = Kij = P (ξt = j|ξt−1 = i). Having so expressed things, we are able to describe the dynamic structure of a discrete state space, discrete time Markov chain in a particularly simple form. If we allow µt to describe the distribution of the chain at time t, so that µt,i = P (ξt = i), then we have by applying the sum and product rules of probability, that: µt+1,j =
X
µt,i Kij .
i
1
Formally, much of the time this will be an infinite dimensional vector but this need not concern us here.
28
3. Markov Chains
We may recognise this as standard vector-matrix multiplication and write simply that µt+1 = µt K and, proceeding inductively it’s straightforward to verify that µt+m = µt K m where K m denotes the usual mth matrix power of K. We will make some use of this object, as it characterises the m-step ahead condition distribution: m Kij := (K m )ij = P (ξt+m = j|ξt = i) .
In fact, the initial distribution µ1 , together with K tells us the full distribution of the chain over any finite time horizon: P (ξ1 = x1 , . . . , ξt = xt ) = µ1,x1
t Y
Kxi−1 xi .
i=2
A general stochastic processes is said to possess the weak Markov property if, for any deterministic time, t and any finite integers p, q, we may write that for any integrable function ϕ : E q → R: E [ϕ(ξt+1 , . . . , ξt+q )|ξ1 = x1 , . . . ξt = xt ] = E [ϕ(ξ2 , . . . , ξq+1 )|ξ1 = xt ] . Inhomogeneous Markov Chains. Note that it is perfectly possible to define Markov Chains whose behaviour does depend explicitly upon the time index. Although such processes are more complex to analyse than their homogeneous counterparts, they do play a rˆole in Monte Carlo methodology – in both established algorithms such as simulated annealing and in more recent developments such as adaptive Markov Chain Monte Carlo and the State Augmentation for Maximising Expectations (SAME) algorithm of Doucet et al. (2002). In the interests of simplicity, what follows is presented for homogeneous Markov Chains. Examples. Before moving on to introduce some theoretical properties of discrete state space Markov chains we will present a few simple examples. Whilst there are innumerable examples of homogeneous discrete state space Markov chains, we confined ourselves here to some particular simple cases which will be used to illustrate some properties below, and which will probably be familiar to you. We begin with an example which is apparently simple, and rather well known, but which exhibits some interesting properties Example 3.1 (the simple random walk over the integers). Given a process ξt whose value at time t + 1 is ξt + 1 with probability p+ and ξt−1 with probability p− = 1 − p+ , we obtain the familiar random walk. We
may write this as a Markov chain by setting E = Z and noting that the transition kernel may be written as:
p− Kij = p+ 0 t−2
p+ p−
t−1
p+ p−
if j = i − 1
if j = i + 1
otherwise.
t
p+ p−
t+1
p+
t+2
p−
Fig. 3.1. A simple random walk on Z.
⊳ Example 3.2. It will be interesting to look at a slight extension of this random walk, in which there is some probability p0 of remaining in the present state at the next time step, so p+ +p− < 0 and p0 = 1−(p+ +p− ). In this case we may write the transition kernel as:
3.2 Discrete State Space Markov Chains
p− p 0 Kij = p+ 0 p0
p0
p+
t−2
p−
t−1
29
if j = i − 1
if j = i
if j = i + 1 otherwise.
p0
p+
p0
p+
t
p−
t+1
p−
p0
p+
t+2
p−
Fig. 3.2. A random walk on Z with Ktt > 0.
⊳ Example 3.3 (Random Walk on a Triangle). A third example which we will consider below could be termed a “random walk on a triangle”. In this case, we set E = {1, 2, 3} and define a transition ker-
nel of the form:
0
K= p− p+
p+
p+ 0 p−
p−
p+ . 0
2
p+
p−
p− p− 1
3
p+
Fig. 3.3. A random walk on a triangle.
⊳ Example 3.4 (One-sided Random Walk). Finally, we consider the rather one-sided random walk on the positive integers, illustrated in figure 3.4, and defined by p0 Kij = p+ = 1 − p0 0
transition kernel: if j = i if j = i + 1 otherwise. ⊳
30
3. Markov Chains
p0
t
p0
p+
t+1
p0
p+
t+2
p0
p+
t+3
p0
p+
t+4
Fig. 3.4. A random walk on the positive integers.
3.2.2 Important Properties In this section we introduce some important properties in the context of discrete state space Markov chains and attempt to illustrate their importance within the field of Monte Carlo simulation. As is the usual practice when dealing with this material, we will restrict our study to the homogeneous case. As you will notice, it is the transition kernel which is most important in characterising a Markov chain. We begin by considering how the various states that a Markov chain may be reached from one another. In particular, the notion of states which communicate is at the heart of the study of Markov chains. Definition 3.1 (Accessibility). A state y is accessible from a state x, sometimes written as x → y if,
for a discrete state space Markov chain,
inf {t : P (ξt = y|ξ1 = x) > 0} < ∞.
t We can alternatively write this condition in terms of the transition matrix as inf t : Kxy > 0 < ∞.
This concept tells us which states one can reach at some finite time in the future, if one starts from a particular state and then moves, at each time, according to the transition kernel, K. That is, if x → y,
then there is a positive probability of reaching y at some finite time in the future, if we start from a state x and then “move” according to the Markov kernel K. It is now useful to consider cases in which one can traverse the entire space, or some subset of it, starting from any point. Definition 3.2 (Communication). Two states x, y ∈ E are said to communicate (written, by some authors as x ↔ y) if each is accessible from the other, that is:
x ↔ y ⇔ x → y and y → x. We’re now in a position to describe the relationship, under the action of a Markov kernel, between two states. This allows us to characterise something known as the communication structure of the associated Markov chain to some degree, noting which points its possible to travel both to and back from. We now go on to introduce a concept which will allow us to describe the properties of the full state space, or significant parts of it, rather than individual states. Definition 3.3 (Irreducibility). A Markov Chain is said to be irreducible if all states communicate, so ∀x, y ∈ E : x → y. Given a distribution φ on E, the term φ-irreducible is used to describe a Markov
chain for which every state with positive probability under φ communicates with every other such state: ∀x, y ∈ supp(φ) : x → y
where the support of the discrete distribution φ is defined as supp(φ) = {x ∈ E : φ(x) > 0}. It is said to
be strongly irreducible if any state can be reached from any point in the space in a single step and strongly φ-irreducible if all states (except for a collection with probability 0 under φ) may be reached in a single step.
This will prove to be important for the study of Monte Carlo methods based upon Markov chains as a chain with this property can explore the entire space rather than being confined to some portion of it, perhaps one which depends upon the initial state. It is also important to consider the type of routes which it is possible to take between a state, x, and itself as this will tell us something about the presence of long-range correlation between the states of the chain.

Definition 3.4 (Period). A state x in a discrete state space Markov chain has period d(x) defined as:

d(x) = \gcd \{ s \geq 1 : K^s_{xx} > 0 \},

where gcd denotes the greatest common divisor. A chain possessing such a state is said to have a cycle of length d.

Proposition 3.1. All states which communicate have the same period and hence, in an irreducible Markov chain, all states have the same period.

Proof. Assume that x ↔ y. Let there exist paths of lengths r, s and t from x → y, y → x and y → y, respectively.
There are paths of length r + s and r + s + t from x to x, hence d(x) must be a divisor of r + s and
r + s + t and consequently of their difference, t. This holds for any t corresponding to a path from y → y
and so d(x) is a divisor of the length of any path from y → y: as d(y) is the greatest common divisor of all such paths, we have that d(x) ≤ d(y).
By symmetry, we also have that d(y) ≤ d(x), and this completes the proof.
In the context of irreducible Markov chains, the term periodic is used to describe those chains whose states have some common period greater than 1, whilst those chains whose period is 1 are termed aperiodic. One further quantity needs to be characterised in order to study the Markov chains which will arise later: some way of describing how many times a state is visited if a Markov chain is allowed to run for infinite time. In order to do this it is useful to define an additional random quantity, the number of times that a state is visited:

\eta_x := \sum_{k=1}^{\infty} \mathbb{I}_x(\xi_k).
We will also adopt the convention, common in the Markov chain literature, that, given any function of the path of a Markov chain, ϕ, Ex[ϕ] is the expectation of that function under the law of the Markov chain initialised with ξ1 = x. Similarly, if µ is some distribution over E, then Eµ[ϕ] should be interpreted as the expectation of ϕ under the law of the process initialised with ξ1 ∼ µ.

Definition 3.5 (Transience and Recurrence). In the context of discrete state space Markov chains, we describe a state, x, as transient if Ex[ηx] < ∞ whilst, if we have that Ex[ηx] = ∞, then that state will be termed recurrent. In the case of irreducible Markov chains, transience and recurrence are properties of the chain itself, rather than its individual states: if any state is transient (or recurrent) then all states have that property. Indeed, for an irreducible Markov chain either all states are recurrent or all are transient.
We will be particularly concerned in this course with Markov kernels which admit an invariant distribution.

Definition 3.6 (Invariant Distribution). A distribution µ is said to be invariant or stationary for a Markov kernel, K, if µK = µ.

If a Markov chain has any single time marginal distribution which corresponds to its stationary distribution, ξt ∼ µ, then all of its future time marginals are the same: ξ_{t+s} ∼ µK^s = µ. A Markov chain is said to be in its stationary regime once this has occurred. Note that this tells us nothing about the correlation between the states or their joint distribution. One can also think of the invariant distribution µ of a Markov kernel K as a left eigenvector of K with unit eigenvalue.
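For a finite state space this eigenvector characterisation can be exploited directly. The sketch below (ours, not from the notes) computes the invariant distribution of the triangle kernel from example 3.3 numerically, again with the assumed values p_+ = 0.7 and p_- = 0.3, and checks that µK = µ.

```python
import numpy as np

def invariant_distribution(K):
    """Return the invariant distribution of a finite transition matrix K,
    i.e. the normalised left eigenvector of K with eigenvalue 1."""
    eigvals, eigvecs = np.linalg.eig(K.T)    # left eigenvectors of K = right eigenvectors of K'
    i = np.argmin(np.abs(eigvals - 1.0))     # locate the unit eigenvalue
    mu = np.real(eigvecs[:, i])
    return mu / mu.sum()                     # normalise to a probability distribution

p_plus, p_minus = 0.7, 0.3                   # assumed values, as before
K = np.array([[0.0,     p_plus,  p_minus],
              [p_minus, 0.0,     p_plus],
              [p_plus,  p_minus, 0.0]])

mu = invariant_distribution(K)
print(mu)                                    # (1/3, 1/3, 1/3) for this kernel
print(np.allclose(mu @ K, mu))               # check that mu K = mu
```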
Definition 3.7 (Reversibility). A stationary stochastic process is said to be reversible if the statistics of the time-reversed version of the process match those of the process in the forward direction, so that reversing time makes no discernible difference to the sequence of distributions which are obtained: the distribution of any collection of future states given any past history must match the conditional distribution of the past conditional upon the future being the reversal of that history.

Reversibility is a condition which, if met, simplifies the analysis of Markov chains. It is normally verified by checking the detailed balance condition, (3.3). If this condition holds for a distribution, then it also tells us that this distribution is the stationary distribution of the chain, another property which we will be interested in.

Proposition 3.2. If a Markov kernel satisfies the detailed balance condition for some distribution µ,

\forall x, y \in E : \mu_x K_{xy} = \mu_y K_{yx},   (3.3)

then:
1. µ is the invariant distribution of the chain.
2. The chain is reversible with respect to µ.

Proof. To demonstrate that K is µ-invariant, consider summing both sides of the detailed balance equation over x:

\sum_{x \in E} \mu_x K_{xy} = \sum_{x \in E} \mu_y K_{yx}
(\mu K)_y = \mu_y,

and as this holds for all y, we have µK = µ. In order to verify that the chain is reversible we proceed directly:

P(\xi_t = x \mid \xi_{t+1} = y) = \frac{P(\xi_t = x, \xi_{t+1} = y)}{P(\xi_{t+1} = y)} = \frac{P(\xi_t = x) K_{xy}}{P(\xi_{t+1} = y)} = \frac{\mu_x K_{xy}}{\mu_y} = \frac{\mu_y K_{yx}}{\mu_y} = K_{yx} = P(\xi_t = x \mid \xi_{t-1} = y),

and in the case of a Markov chain it is clear that if the transitions are time-reversible then the process must be time-reversible.
3.3 General State Space Markov Chains

3.3.1 Basic Concepts

The study of general state space Markov chains is a complex and intricate business which requires a degree of technical sophistication lying somewhat outside the scope of this course to treat rigorously. Here, we will content ourselves with explaining how the concepts introduced in the context of discrete state spaces in the previous section might be extended to continuous domains via the use of probability densities. We will not consider more complex cases – such as mixed continuous and discrete spaces, or distributions over uncountable spaces which may not be described by a density. Nor will we provide proofs of results for this case, but will provide suitable references for the interested reader.

Although the guiding principles are the same, the study of Markov chains with continuous state spaces requires considerably more subtlety as it is necessary to introduce concepts which correspond to those which we introduced in the discrete case, describe the same properties and are motivated by the same intuition, but which remain meaningful when we are dealing with densities rather than probabilities. As always, the principal complication is that the probability of any random variable distributed according to a non-degenerate density on a continuous state space taking any particular value is formally zero.

We will begin by considering how to emulate the decomposition we used to define a Markov chain on a discrete state space, (3.2), when E is a continuous state space. In this case, what we essentially require is that the probability of any range of possible values, given the entire history of the process, depends only upon its most recent value in the sense that, for any measurable A_t ⊂ E:

P(\xi_t \in A_t \mid \xi_1 = x_1, \ldots, \xi_{t-1} = x_{t-1}) = P(\xi_t \in A_t \mid \xi_{t-1} = x_{t-1}).

In the case which we are considering, it is convenient to describe the distribution of a random variable over E in terms of some probability density, µ : E → R, which has the property that, if integrated over any measurable set, it tells us the probability that the random variable in question lies within that set, i.e. if X ∼ µ, we have for any measurable set A that:

P(X \in A) = \int_A \mu(x) \, dx.

We will consider only the homogeneous case here, although the generalisation to inhomogeneous Markov chains follows in the continuous setting in precisely the same manner as the discrete one. In this context, we may describe the conditional probabilities of interest as a function K : E × E → R which has the property that for all measurable sets A ⊂ E and all points x ∈ E:

P(\xi_t \in A \mid \xi_{t-1} = x) = \int_A K(x, y) \, dy.

We note that, as in the discrete case, the law of a Markov chain evaluated at any finite number of points may be completely specified by the initial distribution, call it µ, and a transition kernel, K. We have, for any suitable collection of sets A_1, . . . , A_t, that the following holds:

P(\xi_1 \in A_1, \ldots, \xi_t \in A_t) = \int_{A_1 \times \cdots \times A_t} \mu(x_1) \prod_{k=2}^{t} K(x_{k-1}, x_k) \, dx_1 \ldots dx_t.

And, again, it is useful to be able to consider the s-step ahead conditional distributions,

P(\xi_{t+s} \in A \mid \xi_t = x_t) = \int_{E^{s-1} \times A} \prod_{k=t+1}^{t+s} K(x_{k-1}, x_k) \, dx_{t+1} \ldots dx_{t+s},
and it is useful to define an s-step ahead transition kernel in the same manner as in the discrete case; here matrix multiplication is replaced by a convolution operation but the intuition remains the same. Defining

K^s(x_t, x_{t+s}) := \int_{E^{s-1}} \prod_{k=t+1}^{t+s} K(x_{k-1}, x_k) \, dx_{t+1} \ldots dx_{t+s-1},

we are able to write

P(\xi_{t+s} \in A \mid \xi_t = x_t) = \int_A K^s(x_t, x_{t+s}) \, dx_{t+s}.
3.3.2 Important Properties

In this section we will introduce properties which fulfill the same rôle in the context of continuous state spaces as those introduced in section 3.2.2 do in the discrete setting. Whilst it is possible to define concepts similar to communication and accessibility in a continuous state space context, this isn't especially productive. We are more interested in the property of irreducibility: we want some way of determining what class of states are reachable from one another and hence what part of E might be explored, with positive probability, starting from a point within such a class. We will proceed directly to a continuous state space definition of this concept.

Definition 3.8 (Irreducibility). Given a distribution, µ, over E, a Markov chain is said to be µ-irreducible if for all points x ∈ E and all measurable sets A such that µ(A) > 0 there exists some t such that:

\int_A K^t(x, y) \, dy > 0.

If this condition holds with t = 1, then the chain is said to be strongly µ-irreducible.

This definition has the same character as that employed in the discrete case, previously, but is well defined for more general state spaces. It still tells us whether a chain is likely to be satisfactory if we are interested in approximation of some property of a measure µ by using a sample of the evolution of that chain: if it is not µ-irreducible then there are some points in the space from which we cannot reach all of the support of µ, and this is likely to be a problem. In the sequel we will be interested more or less exclusively in Markov chains which are irreducible with respect to some measure of interest.

We need a little more subtlety in extending some of the concepts introduced in the case of discrete Markov chains to the present context. In order to do this, it will be necessary to introduce the concept of the small set; these function as a replacement for the individual states of a discrete space Markov chain, as we will see shortly. A first attempt might be to consider sets which have the property that the distribution taken by the Markov chain at time t + 1 is the same if it starts at any point in this set – so the conditional distribution function is constant over this set.

Definition 3.9 (Atoms). A Markov chain with transition kernel K is said to have an atom, α ⊂ E, if there is some probability distribution, ν, such that:

\forall x \in \alpha, \; A \subset E : \; \int_A K(x, y) \, dy = \int_A \nu(y) \, dy.
If the Markov chain in question is ν-irreducible, then α is termed an accessible atom.
Whilst the concept of atoms starts to allow us to introduce some sort of structure similar to that seen in discrete chains – it provides us with a set of positive probability which, if the chain ever enters it, we know the distribution of the subsequent state² – most interesting continuous state spaces do not possess atoms. The condition that the distribution in the next state is precisely the same, wherever the current state is, is rather strong. Another approach would be to require only that the conditional distribution has a common component, and that is the intuition behind a much more useful concept which underlies much of the analysis of general state space Markov chains.

Definition 3.10 (Small Sets). A set, C ⊂ E, is termed small for a given Markov chain (or, when one is being precise, (ν, s, ϵ)-small) if there exists some positive integer s, some ϵ > 0 and some non-trivial probability distribution, ν, such that:

\forall x \in C, \; A \subset E : \; \int_A K^s(x, y) \, dy \geq \epsilon \int_A \nu(y) \, dy.
This tells us that the distribution s steps after the chain enters the small set has a component of size at least ϵ of the distribution ν, wherever it was within that set. In this sense, small sets are not "too big": there is potentially some commonality of all paths emerging from them. Although we have not proved that such sets exist for any particular class of Markov chains it is, in fact, the case that they do for many interesting Markov chain classes, and their existence allows a number of sophisticated analytic techniques to be applied.

In order to define cycles (and hence the notion of periodicity) in the general case, we require the existence of a small set. We need some group of "sufficiently similar" points in the state space which have a finite probability of being reached. We then treat this collection of points in the same manner as an individual state in the discrete case, leading to the following definitions.

Definition 3.11 (Cycles). A µ-irreducible Markov chain has a cycle of length d if there exists a small set C, an associated integer M and some probability distribution ν_M which has positive mass on C (i.e. \int_C \nu_M(x) \, dx > 0) such that:

d = \gcd \{ s \geq 1 : C \text{ is small for some } \nu_s \geq \delta_s \nu_M \text{ with } \delta_s > 0 \}.
This provides a reasonable concept of periodicity within a general state space Markov chain as it gives us a way of characterising the existence of regions of the space with the property that, wherever you start within that region, you have positive probability of returning to that set after any multiple of d steps, and this does not hold for any number of steps which is not a multiple of d. We are able to define periodicity and aperiodicity in the same manner as for discrete chains, but using this definition of a cycle. As in the discrete space, all states within the support of µ in a µ-irreducible chain must have the same period (see proposition 3.1) although we will not prove this here.

Considering periodicity from a different viewpoint, we are able to characterise it in a manner which is rather easier to interpret but somewhat difficult to verify in practice. The following definition of period is equivalent to that given above (Nummelin, 1984): a Markov chain has a period d if there exists some partition of the state space, E_1, . . . , E_d, with the properties that:
– ∀i ≠ j : E_i ∩ E_j = ∅
– \bigcup_{i=1}^{d} E_i = E
– ∀i, j, t, s : P(X_{t+s} \in E_j \mid X_t \in E_i) = \begin{cases} 1 & j = i + s \bmod d \\ 0 & \text{otherwise.} \end{cases}

What this actually tells us is that a Markov chain with a period of d has associated with it a disjoint partition of the state space, E_1, . . . , E_d, and that we know that the chain moves with probability 1 from set E_1 to E_2, E_2 to E_3, . . . , E_{d-1} to E_d and E_d to E_1 (assuming that d ≥ 3, of course). Hence the chain will visit a particular element of the partition with a period of d.

² Note that this is much stronger than knowledge of the transition kernel, K, as in general all points in the space have zero probability.
We also require some way of characterising how often a continuous state space Markov chain visits any particular region of the state space in order to obtain concepts analogous to those of transience and recurrence in the discrete setting. In order to do this we define a collection of random variables η_A for any subset A of E, which correspond to the number of times the set A is visited, i.e. \eta_A := \sum_{k=1}^{\infty} \mathbb{I}_A(\xi_k), and, once again, we use E_x to denote the expectation under the law of the Markov chain with initial state x. We note that if a chain is not µ-irreducible for some distribution µ, then there is no guarantee that it is either transient or recurrent; however, the following definitions do hold:

Definition 3.12 (Transience and Recurrence). We begin by defining uniform transience and recurrence for sets A ⊂ E for µ-irreducible general state space Markov chains. Such a set is recurrent if:

\forall x \in A : E_x[\eta_A] = \infty.

A set is uniformly transient if there exists some M < ∞ such that:

\forall x \in A : E_x[\eta_A] \leq M.

The weaker concept of transience of a set may then be introduced. A set, A ⊂ E, is transient if it may be expressed as a countable union of uniformly transient sets, i.e.:

\exists \{B_i \subset E\}_{i=1}^{\infty} : \quad A \subset \bigcup_{i=1}^{\infty} B_i, \qquad \forall i \in \mathbb{N} : \forall x \in B_i : E_x[\eta_{B_i}] \leq M_i < \infty.

A general state space Markov chain is recurrent if the following two conditions are satisfied:
– The chain is µ-irreducible for some distribution µ.
– For every measurable set A ⊂ E such that \int_A \mu(y) \, dy > 0, E_x[\eta_A] = \infty for every x ∈ A;

whilst it is transient if it is µ-irreducible for some distribution µ and the entire space is transient. As in the discrete setting, in the case of irreducible chains, transience and recurrence are properties of the chain rather than individual states: all states within the support of the irreducibility distribution are either transient or recurrent. It is useful to note that any µ-irreducible Markov chain which has stationary distribution µ is positive recurrent (Tierney, 1994).

A slightly stronger form of recurrence is widely employed in the proof of many theoretical results which underlie many applications of Markov chains to statistical problems; this form of recurrence is known as Harris recurrence and may be defined as follows:

Definition 3.13 (Harris Recurrence). A set A ⊂ E is Harris recurrent if P_x(\eta_A = \infty) = 1 for every x ∈ A. A Markov chain is Harris recurrent if there exists some distribution µ with respect to which it is irreducible and every set A such that \int_A \mu(x) \, dx > 0 is Harris recurrent.
The concepts of invariant distribution, reversibility and detailed balance are essentially unchanged from the discrete setting. It’s necessary to consider integrals with respect to densities rather than sums over probability distributions, but no fundamental differences arise here.
3.4 Selected Theoretical Results

The probabilistic study of Markov chains dates back more than fifty years and comprises an enormous literature, much of it rather technically sophisticated. We don't intend to summarise that literature here, nor to provide proofs of the results which we present. This section serves only to motivate the material presented in the subsequent chapters.

The theorems below fill the rôle which the law of large numbers and the central limit theorem for independent, identically distributed random variables fill in the case of simple Monte Carlo methods. They tell us, roughly speaking, that if we take the sample averages of a function at the points of a Markov chain which satisfies suitable regularity conditions and possesses the correct invariant distribution, then we have convergence of those averages to the integral of the function of interest under the invariant distribution and, furthermore, under stronger regularity conditions we can obtain a rate of convergence. There are two levels of strength of law of large numbers which it is useful to be aware of. The first tells us that for most starting points of the chain a law of large numbers will hold. Under slightly stronger conditions (which it may be difficult to verify in practice) it is possible to show that the same result holds for all starting points.

Theorem 3.1 (A Simple Ergodic Theorem). If (ξ_i)_{i∈N} is a µ-irreducible, recurrent R^d-valued Markov chain which admits µ as a stationary distribution, then the following strong law of large numbers holds (convergence is with probability 1) for any integrable function f : E → R:

\lim_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} f(\xi_i) = \int f(x) \mu(x) \, dx

for almost every starting value x. That is, for any x except perhaps for some set N which has the property that \int_N \mu(x) \, dx = 0. An outline of the proof of this theorem is provided by (Roberts and Rosenthal, 2004, Fact 5.).
Theorem 3.2 (A Stronger Ergodic Theorem). If (ξi )i∈N is a µ-invariant, Harris recurrent Markov chain, then the following strong law of large numbers holds (convergence is with probability 1) for any integrable function f : E → R:
\lim_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} f(\xi_i) = \int f(x) \mu(x) \, dx.
A proof of this result is beyond the scope of the course. This is a particular case of (Robert and Casella, 2004, p. 241, Theorem 6.63), and a proof of the general theorem is given there. The same theorem is also presented with proof in (Meyn and Tweedie, 1993, p. 433, Theorem 17.3.2). Theorem 3.3 (A Central Limit Theorem). Under technical regularity conditions (see (Jones, 2004) for a summary of various combinations of conditions) it is possible to obtain a central limit theorem for the ergodic averages of a Harris recurrent, µ-invariant Markov chain, and a function f : E → R which
has at least two finite moments (depending upon the combination of regularity conditions assumed, it may be necessary to have a finite moment of order 2 + δ).
\lim_{t \to \infty} \sqrt{t} \left[ \frac{1}{t} \sum_{i=1}^{t} f(\xi_i) - \int f(x) \mu(x) \, dx \right] \xrightarrow{d} N(0, \sigma^2(f)),

\sigma^2(f) = \mathbb{E}\left[ (f(\xi_1) - \bar{f})^2 \right] + 2 \sum_{k=2}^{\infty} \mathbb{E}\left[ (f(\xi_1) - \bar{f})(f(\xi_k) - \bar{f}) \right],

where \bar{f} = \int f(x) \mu(x) \, dx.
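To illustrate how these results are used in practice, the following sketch (ours, not part of the notes) runs a Gaussian AR(1) chain, which has invariant distribution N(0, 1), computes the ergodic average of f(x) = x², and forms a crude truncated-autocovariance estimate of σ²(f); the choice of chain, the value ρ = 0.8, the run length and the truncation lag are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 200_000

# Gaussian AR(1) chain with invariant distribution N(0, 1)
xi = np.empty(T)
xi[0] = rng.standard_normal()
for t in range(T - 1):
    xi[t + 1] = rho * xi[t] + np.sqrt(1 - rho**2) * rng.standard_normal()

f = xi**2                                   # f(x) = x^2, so the integral of f under mu equals 1
ergodic_average = f.mean()

# crude estimate of sigma^2(f): variance plus twice a truncated sum of autocovariances
fc = f - f.mean()
max_lag = 200
autocov = [np.dot(fc[:-k], fc[k:]) / T for k in range(1, max_lag)]
sigma2_f = fc.var() + 2 * np.sum(autocov)

print(ergodic_average)                      # should be close to 1
print(sigma2_f)                             # asymptotic variance appearing in the CLT
```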
3.5 Further Reading

We conclude this chapter by noting that innumerable tutorials on the subject of Markov chains have been written, particularly with reference to their use in the field of Monte Carlo simulation. Some which might be of interest include the following:
– (Roberts, 1996) provides an elementary introduction to some Markov chain concepts required to understand their use in Monte Carlo algorithms.
– In the same volume, (Tierney, 1996) provides a more technical look at the same concepts; a more in-depth, but similar approach is taken by the earlier paper Tierney (1994).
– An alternative, elementary formulation of some of the material presented here together with some additional background material, aimed at an engineering audience, can be found in Johansen (2008).
– (Robert and Casella, 2004, chapter 6). This is a reasonably theoretical treatment intended for those interested in Markov chain Monte Carlo; it is reasonably technical in content, without dwelling on proofs. Those familiar with measure theoretic probability might find this a reasonably convenient place to start.
– Those of you interested in technical details might like to consult (Meyn and Tweedie, 1993). This is the definitive reference work on stability, convergence and theoretical analysis of Markov chains and it is now possible to download it, free of charge, from the website of one of the authors.
– A less detailed, but more general and equally rigorous, look at Markov chains is provided by the seminal work of (Nummelin, 1984). This covers some material outside of the field of probability, but remains a concise work and presents only a few of the simpler results. It is perhaps a less intimidating starting point than (Meyn and Tweedie, 1993), although opinions on this vary.
– A recent survey of theoretical results relevant to Monte Carlo is provided by (Roberts and Rosenthal, 2004). Again, this is necessarily somewhat technical.
4. The Gibbs Sampler
4.1 Introduction In section 2.3 we have seen that, using importance sampling, we can approximate an expectation Ef (h(X)) without having to sample directly from f . However, finding an instrumental distribution which allows us to efficiently estimate Ef (h(X)) can be difficult, especially in large dimensions. In this chapter and the following chapters we will use a somewhat different approach. We will discuss methods that allow obtaining an approximate sample from f without having to sample from f directly. More mathematically speaking, we will discuss methods which generate a Markov chain whose stationary distribution is the distribution of interest f . Such methods are often referred to as Markov Chain Monte Carlo (MCMC) methods. Example 4.1 (Poisson change point model). Assume the following Poisson model of two regimes for n random variables Y1 , . . . , Yn .1 Yi ∼ Poi(λ1 )
for i = 1, . . . , M, and Y_i ∼ Poi(λ_2) for i = M + 1, . . . , n.
A suitable (conjugate) prior distribution for λ_j is the Gamma(α_j, β_j) distribution with density

f(\lambda_j) = \frac{\beta_j^{\alpha_j}}{\Gamma(\alpha_j)} \lambda_j^{\alpha_j - 1} \exp(-\beta_j \lambda_j).
The joint distribution of Y_1, . . . , Y_n, λ_1, λ_2, and M is

f(y_1, \ldots, y_n, \lambda_1, \lambda_2, M) = \left( \prod_{i=1}^{M} \frac{\exp(-\lambda_1) \lambda_1^{y_i}}{y_i!} \right) \cdot \left( \prod_{i=M+1}^{n} \frac{\exp(-\lambda_2) \lambda_2^{y_i}}{y_i!} \right) \cdot \frac{\beta_1^{\alpha_1}}{\Gamma(\alpha_1)} \lambda_1^{\alpha_1 - 1} \exp(-\beta_1 \lambda_1) \cdot \frac{\beta_2^{\alpha_2}}{\Gamma(\alpha_2)} \lambda_2^{\alpha_2 - 1} \exp(-\beta_2 \lambda_2).
If M is known, the posterior distribution of λ_1 has the density

f(\lambda_1 \mid Y_1, \ldots, Y_n, M) \propto \lambda_1^{\alpha_1 - 1 + \sum_{i=1}^{M} y_i} \exp(-(\beta_1 + M) \lambda_1),

so

\lambda_1 \mid Y_1, \ldots, Y_n, M \sim \text{Gamma}\left( \alpha_1 + \sum_{i=1}^{M} y_i, \; \beta_1 + M \right)   (4.1)

\lambda_2 \mid Y_1, \ldots, Y_n, M \sim \text{Gamma}\left( \alpha_2 + \sum_{i=M+1}^{n} y_i, \; \beta_2 + n - M \right).   (4.2)

¹ The probability distribution function of the Poi(λ) distribution is p(y) = \frac{\exp(-\lambda) \lambda^y}{y!}.
Now assume that we do not know the change point M and that we assume a uniform prior on the set {1, . . . , n − 1}. It is easy to compute the distribution of M given the observations Y_1, . . . , Y_n, and λ_1 and λ_2. It is a discrete distribution with probability density function proportional to

p(M) \propto \lambda_1^{\sum_{i=1}^{M} y_i} \cdot \lambda_2^{\sum_{i=M+1}^{n} y_i} \cdot \exp((\lambda_2 - \lambda_1) \cdot M).   (4.3)
The conditional distributions in (4.1) to (4.3) are all easy to sample from. It is however rather difficult to sample from the joint posterior of (λ1 , λ2 , M ).
⊳
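A minimal Python sketch of the Gibbs sampler suggested by the full conditionals (4.1)–(4.3) might look as follows; it is our illustration rather than code from the notes, the simulated data and hyperparameter values are arbitrary assumptions, and the discrete conditional (4.3) is evaluated on the log scale for numerical stability.

```python
import numpy as np

def changepoint_gibbs(y, alpha1, beta1, alpha2, beta2, n_iter=10_000, rng=None):
    """Gibbs sampler for the Poisson change point model, alternating draws
    from the full conditionals (4.1), (4.2) and (4.3)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    lam1, lam2, M = y.mean(), y.mean(), n // 2         # crude starting values
    cumsum = np.cumsum(y)                              # cumsum[m-1] = y_1 + ... + y_m
    samples = np.empty((n_iter, 3))
    for t in range(n_iter):
        # (4.1), (4.2): Gamma full conditionals (rate beta corresponds to scale 1/beta)
        lam1 = rng.gamma(alpha1 + cumsum[M - 1], 1.0 / (beta1 + M))
        lam2 = rng.gamma(alpha2 + cumsum[-1] - cumsum[M - 1], 1.0 / (beta2 + n - M))
        # (4.3): discrete full conditional for the change point M on {1, ..., n-1}
        m = np.arange(1, n)
        logp = (cumsum[m - 1] * np.log(lam1)
                + (cumsum[-1] - cumsum[m - 1]) * np.log(lam2)
                + (lam2 - lam1) * m)
        p = np.exp(logp - logp.max())
        M = rng.choice(m, p=p / p.sum())
        samples[t] = (lam1, lam2, M)
    return samples

# illustration on simulated data (all values below are assumptions)
rng = np.random.default_rng(1)
y = np.concatenate([rng.poisson(2.0, 40), rng.poisson(5.0, 60)])
out = changepoint_gibbs(y, 1.0, 1.0, 1.0, 1.0, rng=rng)
print(out[5000:].mean(axis=0))   # approximate posterior means of (lambda_1, lambda_2, M)
```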
The example above suggests the strategy of alternately sampling from the (full) conditional distributions ((4.1) to (4.3) in the example). This tentative strategy however raises some questions.
– Is the joint distribution uniquely specified by the conditional distributions?
– Sampling alternately from the conditional distributions yields a Markov chain: the newly proposed values only depend on the present values, not the past values. Will this approach yield a Markov chain with the correct invariant distribution? Will the Markov chain converge to the invariant distribution?
As we will see in sections 4.3 and 4.4, the answer to both questions is — under certain conditions — yes. The next section will however first of all state the Gibbs sampling algorithm.
4.2 Algorithm

The Gibbs sampler was first proposed by Geman and Geman (1984) and further developed by Gelfand and Smith (1990). Denote with x_{-i} := (x_1, . . . , x_{i-1}, x_{i+1}, . . . , x_p).

Algorithm 4.1 ((Systematic sweep) Gibbs sampler). Starting with (X_1^{(0)}, . . . , X_p^{(0)}) iterate for t = 1, 2, . . .
1. Draw X_1^{(t)} ∼ f_{X_1|X_{-1}}(· | X_2^{(t-1)}, . . . , X_p^{(t-1)}).
...
j. Draw X_j^{(t)} ∼ f_{X_j|X_{-j}}(· | X_1^{(t)}, . . . , X_{j-1}^{(t)}, X_{j+1}^{(t-1)}, . . . , X_p^{(t-1)}).
...
p. Draw X_p^{(t)} ∼ f_{X_p|X_{-p}}(· | X_1^{(t)}, . . . , X_{p-1}^{(t)}).

Figure 4.1 illustrates the Gibbs sampler. The conditional distributions as used in the Gibbs sampler are often referred to as full conditionals. Note that the Gibbs sampler is not reversible. Liu et al. (1995) proposed the following algorithm that yields a reversible chain.

Algorithm 4.2 (Random sweep Gibbs sampler). Starting with (X_1^{(0)}, . . . , X_p^{(0)}) iterate for t = 1, 2, . . .
1. Draw an index j from a distribution on {1, . . . , p} (e.g. uniform).
2. Draw X_j^{(t)} ∼ f_{X_j|X_{-j}}(· | X_1^{(t-1)}, . . . , X_{j-1}^{(t-1)}, X_{j+1}^{(t-1)}, . . . , X_p^{(t-1)}), and set X_ι^{(t)} := X_ι^{(t-1)} for all ι ≠ j.
Fig. 4.1. Illustration of the Gibbs sampler for a two-dimensional distribution
4.3 The Hammersley-Clifford Theorem

An interesting property of the full conditionals, which the Gibbs sampler is based on, is that they fully specify the joint distribution, as Hammersley and Clifford proved in 1970². Note that the set of marginal distributions does not have this property.

Definition 4.1 (Positivity condition). A distribution with density f(x_1, . . . , x_p) and marginal densities f_{X_i}(x_i) is said to satisfy the positivity condition if f(x_1, . . . , x_p) > 0 for all x_1, . . . , x_p with f_{X_i}(x_i) > 0.

The positivity condition thus implies that the support of the joint density f is the Cartesian product of the supports of the marginals f_{X_i}.

Theorem 4.1 (Hammersley-Clifford). Let (X_1, . . . , X_p) satisfy the positivity condition and have joint density f(x_1, . . . , x_p). Then for all (ξ_1, . . . , ξ_p) ∈ supp(f)

f(x_1, \ldots, x_p) \propto \prod_{j=1}^{p} \frac{f_{X_j|X_{-j}}(x_j \mid x_1, \ldots, x_{j-1}, \xi_{j+1}, \ldots, \xi_p)}{f_{X_j|X_{-j}}(\xi_j \mid x_1, \ldots, x_{j-1}, \xi_{j+1}, \ldots, \xi_p)}.
Proof. We have

f(x_1, \ldots, x_{p-1}, x_p) = f_{X_p|X_{-p}}(x_p \mid x_1, \ldots, x_{p-1}) f(x_1, \ldots, x_{p-1})   (4.4)

and by complete analogy

f(x_1, \ldots, x_{p-1}, \xi_p) = f_{X_p|X_{-p}}(\xi_p \mid x_1, \ldots, x_{p-1}) f(x_1, \ldots, x_{p-1}),   (4.5)

thus
f(x_1, \ldots, x_p) \overset{(4.4)}{=} \underbrace{f(x_1, \ldots, x_{p-1})}_{\overset{(4.5)}{=} f(x_1, \ldots, x_{p-1}, \xi_p)/f_{X_p|X_{-p}}(\xi_p | x_1, \ldots, x_{p-1})} f_{X_p|X_{-p}}(x_p \mid x_1, \ldots, x_{p-1})

= f(x_1, \ldots, x_{p-1}, \xi_p) \frac{f_{X_p|X_{-p}}(x_p \mid x_1, \ldots, x_{p-1})}{f_{X_p|X_{-p}}(\xi_p \mid x_1, \ldots, x_{p-1})}

= \cdots

= f(\xi_1, \ldots, \xi_p) \frac{f_{X_1|X_{-1}}(x_1 \mid \xi_2, \ldots, \xi_p)}{f_{X_1|X_{-1}}(\xi_1 \mid \xi_2, \ldots, \xi_p)} \cdots \frac{f_{X_p|X_{-p}}(x_p \mid x_1, \ldots, x_{p-1})}{f_{X_p|X_{-p}}(\xi_p \mid x_1, \ldots, x_{p-1})}.

The positivity condition guarantees that the conditional densities are non-zero.

² Hammersley and Clifford actually never published this result, as they could not extend the theorem to the case of non-positivity.
Note that the Hammersley-Clifford theorem does not guarantee the existence of a joint probability distribution for every choice of conditionals, as the following example shows. In Bayesian modeling such problems mostly arise when using improper prior distributions.

Example 4.2. Consider the following "model"

X_1 \mid X_2 \sim \text{Expo}(\lambda X_2)
X_2 \mid X_1 \sim \text{Expo}(\lambda X_1),

for which it would be easy to design a Gibbs sampler. Trying to apply the Hammersley-Clifford theorem, we obtain

f(x_1, x_2) \propto \frac{f_{X_1|X_2}(x_1 \mid \xi_2) \cdot f_{X_2|X_1}(x_2 \mid x_1)}{f_{X_1|X_2}(\xi_1 \mid \xi_2) \cdot f_{X_2|X_1}(\xi_2 \mid x_1)} = \frac{\lambda \xi_2 \exp(-\lambda x_1 \xi_2) \cdot \lambda x_1 \exp(-\lambda x_1 x_2)}{\lambda \xi_2 \exp(-\lambda \xi_1 \xi_2) \cdot \lambda x_1 \exp(-\lambda x_1 \xi_2)} \propto \exp(-\lambda x_1 x_2).

The integral \iint \exp(-\lambda x_1 x_2) \, dx_1 \, dx_2 however is not finite, thus there is no two-dimensional probability distribution with f(x_1, x_2) as its density.
⊳
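A small numerical experiment (ours; λ = 1, the starting values and the run length are arbitrary assumptions) makes the consequence of this visible: running the Gibbs sampler for this "model" produces a chain with no stationary regime, since log X_1 performs a zero-mean random walk and the chain simply wanders.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n_iter = 1.0, 10_000          # lambda = 1 is an arbitrary assumption

x1, x2 = 1.0, 1.0
trace = np.empty((n_iter, 2))
for t in range(n_iter):
    # the conditionals of example 4.2: X1|X2 ~ Expo(lam*x2), X2|X1 ~ Expo(lam*x1)
    x1 = rng.exponential(1.0 / (lam * x2))
    x2 = rng.exponential(1.0 / (lam * x1))
    trace[t] = (x1, x2)

# no joint density exists: log(X1) performs a zero-mean random walk,
# so the chain wanders rather than settling into a stationary regime
print(np.log(trace[::1000, 0]).round(2))
```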
4.4 Convergence of the Gibbs sampler

First of all we have to analyse whether the joint distribution f(x_1, . . . , x_p) is indeed the stationary distribution of the Markov chain generated by the Gibbs sampler³. For this we first have to determine the transition kernel corresponding to the Gibbs sampler.

Lemma 4.1. The transition kernel of the Gibbs sampler is

K(x^{(t-1)}, x^{(t)}) = f_{X_1|X_{-1}}(x_1^{(t)} \mid x_2^{(t-1)}, \ldots, x_p^{(t-1)}) \cdot f_{X_2|X_{-2}}(x_2^{(t)} \mid x_1^{(t)}, x_3^{(t-1)}, \ldots, x_p^{(t-1)}) \cdots f_{X_p|X_{-p}}(x_p^{(t)} \mid x_1^{(t)}, \ldots, x_{p-1}^{(t)}).

Proof. We have

P(X^{(t)} \in \mathcal{X} \mid X^{(t-1)} = x^{(t-1)}) = \int_{\mathcal{X}} f_{(X^{(t)}|X^{(t-1)})}(x^{(t)} \mid x^{(t-1)}) \, dx^{(t)}
= \int_{\mathcal{X}} \underbrace{f_{X_1|X_{-1}}(x_1^{(t)} \mid x_2^{(t-1)}, \ldots, x_p^{(t-1)})}_{\text{corresponds to step 1. of the algorithm}} \cdot \underbrace{f_{X_2|X_{-2}}(x_2^{(t)} \mid x_1^{(t)}, x_3^{(t-1)}, \ldots, x_p^{(t-1)})}_{\text{corresponds to step 2. of the algorithm}} \cdots \underbrace{f_{X_p|X_{-p}}(x_p^{(t)} \mid x_1^{(t)}, \ldots, x_{p-1}^{(t)})}_{\text{corresponds to step p. of the algorithm}} \, dx^{(t)}.

Proposition 4.1. The joint distribution f(x_1, . . . , x_p) is indeed the invariant distribution of the Markov chain (X^{(0)}, X^{(1)}, . . .) generated by the Gibbs sampler.

Proof. Assume that X^{(t-1)} ∼ f, then

P(X^{(t)} \in \mathcal{X}) = \int \int_{\mathcal{X}} f(x^{(t-1)}) K(x^{(t-1)}, x^{(t)}) \, dx^{(t)} \, dx^{(t-1)}
= \int_{\mathcal{X}} \int \cdots \int f(x_1^{(t-1)}, \ldots, x_p^{(t-1)}) f_{X_1|X_{-1}}(x_1^{(t)} \mid x_2^{(t-1)}, \ldots, x_p^{(t-1)}) \cdots f_{X_p|X_{-p}}(x_p^{(t)} \mid x_1^{(t)}, \ldots, x_{p-1}^{(t)}) \, dx_1^{(t-1)} \ldots dx_p^{(t-1)} \, dx^{(t)}.

Integrating out the components of x^{(t-1)} one at a time, each integration yields a marginal of f which combines with the next full conditional to give a joint density under f again: integrating over x_1^{(t-1)} gives f(x_2^{(t-1)}, \ldots, x_p^{(t-1)}), which multiplied by the first full conditional gives f(x_1^{(t)}, x_2^{(t-1)}, \ldots, x_p^{(t-1)}); integrating over x_2^{(t-1)} and proceeding in the same way for the remaining components finally gives f(x_1^{(t)}, \ldots, x_p^{(t)}). Hence

P(X^{(t)} \in \mathcal{X}) = \int_{\mathcal{X}} f(x_1^{(t)}, \ldots, x_p^{(t)}) \, dx^{(t)}.

Thus f is the density of X^{(t)} (if X^{(t-1)} ∼ f).

³ All the results in this section will be derived for the systematic scan Gibbs sampler (algorithm 4.1). Very similar results hold for the random scan Gibbs sampler (algorithm 4.2).
So far we have established that f is indeed the invariant distribution of the Gibbs sampler. Next, we have to analyse under which conditions the Markov chain generated by the Gibbs sampler will converge to f. First of all we have to study under which conditions the resulting Markov chain is irreducible⁴. The following example shows that this does not need to be the case.

Example 4.3 (Reducible Gibbs sampler). Consider Gibbs sampling from the uniform distribution on C_1 ∪ C_2 with C_1 := {(x_1, x_2) : \|(x_1, x_2) - (1, 1)\| \leq 1} and C_2 := {(x_1, x_2) : \|(x_1, x_2) - (-1, -1)\| \leq 1}, i.e.

f(x_1, x_2) = \frac{1}{2\pi} \mathbb{I}_{C_1 \cup C_2}(x_1, x_2).

Figure 4.2 shows the density as well as the first few samples obtained by starting a Gibbs sampler with X_1^{(0)} < 0 and X_2^{(0)} < 0. It is easy to see that when the Gibbs sampler is started in C_2 it will stay there and never reach C_1. The reason for this is that the conditional distribution X_2|X_1 (X_1|X_2) is for X_1 < 0 (X_2 < 0) entirely concentrated on C_2.

Fig. 4.2. Illustration of a Gibbs sampler failing to sample from a distribution with unconnected support (uniform distribution on {(x_1, x_2) : \|(x_1, x_2) - (1, 1)\| \leq 1} ∪ {(x_1, x_2) : \|(x_1, x_2) - (-1, -1)\| \leq 1}).
⊳
The following proposition gives a sufficient condition for irreducibility (and thus the recurrence) of the Markov chain generated by the Gibbs sampler. There are less strict conditions for the irreducibility and aperiodicity of the Markov chain generated by the Gibbs sampler (see e.g. Robert and Casella, 2004, Lemma 10.11).

Proposition 4.2. If the joint distribution f(x_1, . . . , x_p) satisfies the positivity condition, the Gibbs sampler yields an irreducible, recurrent Markov chain.

Proof. Let \mathcal{X} ⊂ supp(f) be a set with \int_{\mathcal{X}} f(x_1^{(t)}, \ldots, x_p^{(t)}) \, d(x_1^{(t)}, \ldots, x_p^{(t)}) > 0. Then

\int_{\mathcal{X}} K(x^{(t-1)}, x^{(t)}) \, dx^{(t)} = \int_{\mathcal{X}} \underbrace{f_{X_1|X_{-1}}(x_1^{(t)} \mid x_2^{(t-1)}, \ldots, x_p^{(t-1)}) \cdots f_{X_p|X_{-p}}(x_p^{(t)} \mid x_1^{(t)}, \ldots, x_{p-1}^{(t)})}_{> 0 \text{ (on a set of non-zero measure)}} \, dx^{(t)} > 0.

Thus the Markov chain (X^{(t)})_t is strongly f-irreducible. As f is the invariant distribution of the Markov chain, it is as well recurrent (see the remark on page 36).

⁴ Here and in the following we understand by "irreducibility" irreducibility with respect to the target distribution f.
If the transition kernel is absolutely continuous with respect to the dominating measure, then recurrence even implies Harris recurrence (see e.g. Robert and Casella, 2004, Lemma 10.9).

Now we have established all the necessary ingredients to state an ergodic theorem for the Gibbs sampler, which is a direct consequence of theorems 3.1 and 3.2.

Theorem 4.2. If the Markov chain generated by the Gibbs sampler is irreducible and recurrent (which is e.g. the case when the positivity condition holds), then for any integrable function h : E → R

\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} h(X^{(t)}) = \mathbb{E}_f(h(X))

for almost every starting value X^{(0)}. If the chain is Harris recurrent, then the above result holds for every starting value X^{(0)}.

Theorem 4.2 guarantees that we can approximate expectations E_f(h(X)) by their empirical counterparts using a single Markov chain.

Example 4.4. Assume that we want to use a Gibbs sampler to estimate P(X_1 ≥ 0, X_2 ≥ 0) for a

N_2\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right)

distribution.⁵ The marginal distributions are X_1 ∼ N(\mu_1, \sigma_1^2) and X_2 ∼ N(\mu_2, \sigma_2^2).
f (x1 , x2 )
5
! !!′ !−1 x1 µ1 σ12 σ12 1 − ∝ exp − 2 x2 µ2 σ12 σ22 2 (x1 − (µ1 + σ12 /σ22 (x2 − µ2 )))2 ∝ exp − , 2(σ12 − (σ12 )2 /σ22 )
x1 x2
!
−
µ1 µ2
!!
A Gibbs sampler is of course not the optimal way to sample from a Np (µ, Σ) distribution. A more efficient way i.i.d.
6
is: draw Z1 , . . . , Zp ∼ N (0, 1) and set (X1 , . . . , Xp )′ = Σ 1/2 (Z1 , . . . , Zp )′ + µ We make use of ! !!′ !−1 ! !! x1 µ1 σ12 σ12 x1 µ1 − − x2 µ2 σ12 σ22 x2 µ2 ! !!′ ! ! x1 µ1 σ22 −σ12 x1 1 = − − σ12 σ22 − (σ12 )2 x2 µ2 −σ12 σ12 x2 1 σ22 (x1 − µ1 )2 − 2σ12 (x1 − µ1 )(x2 − µ2 ) + const = σ12 σ22 − (σ12 )2 1 σ22 x21 − 2σ22 x1 µ1 − 2σ12 x1 (x2 − µ2 ) + const = σ12 σ22 − (σ12 )2 1 x21 − 2x1 (µ1 + σ12 /σ22 (x2 − µ2 )) + const = σ12 − (σ12 )2 /σ22 2 1 = x1 − (µ1 + σ12 /σ22 (x2 − µ2 ) + const σ12 − (σ12 )2 /σ22
µ1 µ2
!!
i.e. X_1|X_2 = x_2 ∼ N(\mu_1 + \sigma_{12}/\sigma_2^2 (x_2 - \mu_2), \; \sigma_1^2 - (\sigma_{12})^2/\sigma_2^2). Thus the Gibbs sampler for this problem consists of iterating for t = 1, 2, . . .

1. Draw X_1^{(t)} ∼ N(\mu_1 + \sigma_{12}/\sigma_2^2 (X_2^{(t-1)} - \mu_2), \; \sigma_1^2 - (\sigma_{12})^2/\sigma_2^2)
2. Draw X_2^{(t)} ∼ N(\mu_2 + \sigma_{12}/\sigma_1^2 (X_1^{(t)} - \mu_1), \; \sigma_2^2 - (\sigma_{12})^2/\sigma_1^2).

Now consider the special case µ_1 = µ_2 = 0, σ_1² = σ_2² = 1 and σ_12 = 0.3. Figure 4.4 shows the sample paths of this Gibbs sampler. Using theorem 4.2 we can estimate P(X_1 ≥ 0, X_2 ≥ 0) by the proportion of samples (X_1^{(t)}, X_2^{(t)}) with X_1^{(t)} ≥ 0 and X_2^{(t)} ≥ 0. Figure 4.3 shows this estimate.
⊳

Fig. 4.3. Estimate of P(X_1 ≥ 0, X_2 ≥ 0) obtained using a Gibbs sampler, plotted against the number of iterations t. The area shaded in grey corresponds to the range of 100 replications.
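A sketch of this Gibbs sampler in Python (ours; the parameter values µ_1 = µ_2 = 0, σ_1² = σ_2² = 1, σ_12 = 0.3 follow the example, while the chain length is an arbitrary choice) estimates P(X_1 ≥ 0, X_2 ≥ 0) by the proportion of samples in the positive quadrant:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1 = mu2 = 0.0
s1sq = s2sq = 1.0
s12 = 0.3
n_iter = 10_000

x1, x2 = 0.0, 0.0
count = 0
for t in range(n_iter):
    # full conditionals of the bivariate normal (example 4.4)
    x1 = rng.normal(mu1 + s12 / s2sq * (x2 - mu2), np.sqrt(s1sq - s12**2 / s2sq))
    x2 = rng.normal(mu2 + s12 / s1sq * (x1 - mu1), np.sqrt(s2sq - s12**2 / s1sq))
    count += (x1 >= 0) and (x2 >= 0)

print(count / n_iter)   # estimate of P(X1 >= 0, X2 >= 0), roughly 0.30 for this correlation
```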
Note that the realisations (X^{(0)}, X^{(1)}, . . .) form a Markov chain, and are thus not independent, but typically positively correlated. The correlation between the X^{(t)} is larger if the Markov chain moves only slowly (the chain is then said to be slowly mixing). For the Gibbs sampler this is typically the case if the variables X_j are strongly (positively or negatively) correlated, as the following example shows.

Example 4.5 (Sampling from a highly correlated bivariate Gaussian). Figure 4.5 shows the results obtained when sampling from a bivariate Normal distribution as in example 4.4, however with σ_12 = 0.99. This yields a correlation of ρ(X_1, X_2) = 0.99. This Gibbs sampler is a lot slower mixing than the one considered in example 4.4 (and displayed in figure 4.4): due to the strong correlation the Gibbs sampler can only perform very small movements. This makes subsequent samples X_j^{(t-1)} and X_j^{(t)} highly correlated and thus leads to slower convergence, as the plots of the estimated densities show (panels (b) and (c) of figures 4.4 and 4.5).
⊳
Fig. 4.4. Gibbs sampler for a bivariate standard normal distribution (correlation ρ(X_1, X_2) = 0.3): (a) first 50 iterations of (X_1^{(t)}, X_2^{(t)}); (b) path of X_1^{(t)} and estimated density of X_1 after 1,000 iterations; (c) path of X_2^{(t)} and estimated density of X_2 after 1,000 iterations.

Fig. 4.5. Gibbs sampler for a bivariate normal distribution with correlation ρ(X_1, X_2) = 0.99: (a) first 50 iterations of (X_1^{(t)}, X_2^{(t)}); (b) path of X_1^{(t)} and estimated density of X_1 after 1,000 iterations; (c) path of X_2^{(t)} and estimated density of X_2 after 1,000 iterations.
4.5 Data Augmentation

Gibbs sampling is only feasible when we can sample easily from the full conditionals. For some models this may not be possible. A technique that can help achieve full conditionals that are easy to sample from is demarginalisation: we introduce a set of auxiliary random variables Z_1, . . . , Z_r such that f is the marginal density of (X_1, . . . , X_p, Z_1, . . . , Z_r), i.e.

f(x_1, \ldots, x_p) = \int f(x_1, \ldots, x_p, z_1, \ldots, z_r) \, d(z_1, \ldots, z_r).

In many cases there is a "natural choice" of the completion (Z_1, . . . , Z_r), as the following example shows.

Example 4.6 (Mixture of Gaussians). Consider data Y_1, . . . , Y_n, each of which might stem from one of K populations. The distribution of Y_i in the k-th population is N(µ_k, 1/τ). The probability that an observation is from the k-th population is π_k. If we cannot observe which population the i-th observation is from, it is from a mixture distribution:

f(y_i) = \sum_{k=1}^{K} \pi_k \phi_{(\mu_k, 1/\tau)}(y_i).   (4.6)

In a Bayesian framework a suitable prior distribution for the mean parameters µ_k is the N(µ_0, 1/τ_0) distribution. A suitable prior distribution for (π_1, . . . , π_K) is the Dirichlet distribution⁷ with parameters α_1, . . . , α_K > 0 with density

f_{(\alpha_1, \ldots, \alpha_K)}(\pi_1, \ldots, \pi_K) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}

for π_k ≥ 0 and \sum_{k=1}^{K} \pi_k = 1. For the sake of simplicity we assume that the dispersion τ is known⁸, as well as the number of populations K.⁹ It is however difficult to sample from the posterior distribution of µ and π given data Y_1, . . . , Y_n using a Gibbs sampler. This is due to the mixture nature of (4.6). This suggests introducing auxiliary variables Z_1, . . . , Z_n which indicate which population the i-th individual is from, i.e.

P(Z_i = k) = \pi_k \quad \text{and} \quad Y_i \mid Z_i = k \sim N(\mu_k, 1/\tau).

It is easy to see that the marginal distribution of Y_i is given by (4.6), i.e. the Z_i are indeed a completion. Now we have that

f(y_1, \ldots, y_n, z_1, \ldots, z_n, \mu_1, \ldots, \mu_K, \pi_1, \ldots, \pi_K) \propto \left( \prod_{i=1}^{n} \pi_{z_i} \exp\left( -\tau (y_i - \mu_{z_i})^2 / 2 \right) \right) \cdot \left( \prod_{k=1}^{K} \exp\left( -\tau_0 (\mu_k - \mu_0)^2 / 2 \right) \right) \cdot \left( \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \right).   (4.7)

Thus the full conditional distributions given Y_1, . . . , Y_n are

P(Z_i = k \mid Y_1, \ldots, Y_n, \mu_1, \ldots, \mu_K, \pi_1, \ldots, \pi_K) = \frac{\pi_k \phi_{(\mu_k, 1/\tau)}(y_i)}{\sum_{\iota=1}^{K} \pi_\iota \phi_{(\mu_\iota, 1/\tau)}(y_i)}   (4.8)

\mu_k \mid Y_1, \ldots, Y_n, Z_1, \ldots, Z_n, \pi_1, \ldots, \pi_K \sim N\left( \frac{\tau \sum_{i: Z_i = k} Y_i + \tau_0 \mu_0}{|\{i : Z_i = k\}| \tau + \tau_0}, \; \frac{1}{|\{i : Z_i = k\}| \tau + \tau_0} \right)

\pi_1, \ldots, \pi_K \mid Y_1, \ldots, Y_n, Z_1, \ldots, Z_n, \mu_1, \ldots, \mu_K \sim \text{Dirichlet}\left( \alpha_1 + |\{i : Z_i = 1\}|, \ldots, \alpha_K + |\{i : Z_i = K\}| \right).

⁷ The Dirichlet distribution is a multivariate generalisation of the Beta distribution.
⁸ Otherwise, a Gamma distribution would be a suitable choice.
⁹ For a model where the number of components is variable, see section 6.
To derive the full conditional of µ_k we have used that the joint density (4.7) is proportional to

\prod_{k=1}^{K} \exp\left( -\frac{|\{i : Z_i = k\}| \tau + \tau_0}{2} \left( \mu_k - \frac{\tau \sum_{i: Z_i = k} Y_i + \tau_0 \mu_0}{|\{i : Z_i = k\}| \tau + \tau_0} \right)^2 \right),

as

\tau \sum_{i: Z_i = k} (Y_i - \mu_k)^2 + \tau_0 (\mu_k - \mu_0)^2 = (|\{i : Z_i = k\}| \tau + \tau_0) \mu_k^2 - 2 \mu_k \left( \tau \sum_{i: Z_i = k} Y_i + \tau_0 \mu_0 \right) + \text{const}
= (|\{i : Z_i = k\}| \tau + \tau_0) \left( \mu_k - \frac{\tau \sum_{i: Z_i = k} Y_i + \tau_0 \mu_0}{|\{i : Z_i = k\}| \tau + \tau_0} \right)^2 + \text{const}.

Thus we can obtain a sample from the posterior distribution of µ_1, . . . , µ_K and π_1, . . . , π_K given observations Y_1, . . . , Y_n using the following auxiliary variable Gibbs sampler: Starting with initial values µ_1^{(0)}, . . . , µ_K^{(0)}, π_1^{(0)}, . . . , π_K^{(0)} iterate the following steps for t = 1, 2, . . .

1. For i = 1, . . . , n: Draw Z_i^{(t)} from the discrete distribution on {1, . . . , K} specified by (4.8).
2. For k = 1, . . . , K: Draw
\mu_k^{(t)} \sim N\left( \frac{\tau \sum_{i: Z_i^{(t)} = k} Y_i + \tau_0 \mu_0}{|\{i : Z_i^{(t)} = k\}| \tau + \tau_0}, \; \frac{1}{|\{i : Z_i^{(t)} = k\}| \tau + \tau_0} \right).
3. Draw (\pi_1^{(t)}, \ldots, \pi_K^{(t)}) \sim \text{Dirichlet}\left( \alpha_1 + |\{i : Z_i^{(t)} = 1\}|, \ldots, \alpha_K + |\{i : Z_i^{(t)} = K\}| \right).
⊳
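A compact Python sketch of this auxiliary variable Gibbs sampler (our illustration; the simulated data, K = 2, τ and the hyperparameters are all assumed values, and the well-known label-switching issue of mixture posteriors is ignored) could look as follows.

```python
import numpy as np

def mixture_gibbs(y, K, tau, mu0, tau0, alpha, n_iter=5_000, rng=None):
    """Auxiliary variable Gibbs sampler for a K-component Gaussian mixture
    with known precision tau (steps 1.-3. of the algorithm above)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    mu = np.linspace(y.min(), y.max(), K)          # crude starting values
    pi = np.full(K, 1.0 / K)
    mus, pis = np.empty((n_iter, K)), np.empty((n_iter, K))
    for t in range(n_iter):
        # 1. allocations Z_i from the discrete full conditional (4.8)
        dens = pi * np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (y[:, None] - mu)**2)
        probs = dens / dens.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(K, p=probs[i]) for i in range(n)])
        # 2. component means from their normal full conditionals
        for k in range(K):
            nk = np.sum(Z == k)
            prec = nk * tau + tau0
            mean = (tau * y[Z == k].sum() + tau0 * mu0) / prec
            mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # 3. weights from the Dirichlet full conditional
        pi = rng.dirichlet(alpha + np.bincount(Z, minlength=K))
        mus[t], pis[t] = mu, pi
    return mus, pis

# illustration on simulated data (all settings below are assumptions)
rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
mus, pis = mixture_gibbs(y, K=2, tau=1.0, mu0=0.0, tau0=0.1, alpha=np.ones(2), rng=rng)
print(mus[2500:].mean(axis=0), pis[2500:].mean(axis=0))   # posterior means after burn-in
```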
5. The Metropolis-Hastings Algorithm
5.1 Algorithm

In the previous chapter we have studied the Gibbs sampler, a special case of a Markov Chain Monte Carlo (MCMC) method: the target distribution is the invariant distribution of the Markov chain generated by the algorithm, to which it (hopefully) converges. This chapter will introduce another MCMC method: the Metropolis-Hastings algorithm, which goes back to Metropolis et al. (1953) and Hastings (1970).

Like the rejection sampling algorithm 2.1, the Metropolis-Hastings algorithm is based on proposing values sampled from an instrumental distribution, which are then accepted with a certain probability that reflects how likely it is that they are from the target distribution f. The main drawback of the rejection sampling algorithm 2.1 is that it is often very difficult to come up with a suitable proposal distribution that leads to an efficient algorithm. One way around this problem is to allow for "local updates", i.e. let the proposed value depend on the last accepted value. This makes it easier to come up with a suitable (conditional) proposal, however at the price of yielding a Markov chain instead of a sequence of independent realisations.
Algorithm 5.1 (Metropolis-Hastings). Starting with X^{(0)} := (X_1^{(0)}, . . . , X_p^{(0)}) iterate for t = 1, 2, . . .
1. Draw X ∼ q(·|X^{(t-1)}).
2. Compute
\alpha(X \mid X^{(t-1)}) = \min\left\{ 1, \frac{f(X) \cdot q(X^{(t-1)} \mid X)}{f(X^{(t-1)}) \cdot q(X \mid X^{(t-1)})} \right\}.   (5.1)
3. With probability α(X|X^{(t-1)}) set X^{(t)} = X, otherwise set X^{(t)} = X^{(t-1)}.

Figure 5.1 illustrates the Metropolis-Hastings algorithm. Note that if the algorithm rejects the newly proposed value (open disks joined by dotted lines in figure 5.1) it stays at its current value X^{(t-1)}. The probability that the Metropolis-Hastings algorithm accepts the newly proposed state X given that it currently is in state X^{(t-1)} is

a(x^{(t-1)}) = \int \alpha(x \mid x^{(t-1)}) q(x \mid x^{(t-1)}) \, dx.   (5.2)
Just like the Gibbs sampler, the Metropolis-Hastings algorithm generates a Markov chain, whose properties will be discussed in the next section.
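As an illustration of algorithm 5.1, the following generic Python sketch (ours, not part of the notes) takes the target only through its log density, in line with remark 5.1 below, and a proposal through a sampler and its log density; the usage example with a standard normal target and a N(x, 1) random walk proposal is an arbitrary choice.

```python
import numpy as np

def metropolis_hastings(log_f, q_sample, log_q, x0, n_iter, rng=None):
    """Generic Metropolis-Hastings sampler following algorithm 5.1.
    log_f(x):            log of the (possibly unnormalised) target density
    q_sample(x):         draws a proposal from q(.|x)
    log_q(x_new, x_old): log q(x_new | x_old)"""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x_prop = q_sample(x)
        # acceptance probability (5.1), evaluated on the log scale
        log_alpha = min(0.0, log_f(x_prop) + log_q(x, x_prop) - log_f(x) - log_q(x_prop, x))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        chain[t] = x
    return chain

# usage example: standard normal target with a N(x, 1) random walk proposal
rng = np.random.default_rng(0)
log_f = lambda x: -0.5 * x**2
q_sample = lambda x: x + rng.standard_normal()
log_q = lambda x_new, x_old: -0.5 * (x_new - x_old)**2   # symmetric, so it cancels in (5.1)
chain = metropolis_hastings(log_f, q_sample, log_q, x0=0.0, n_iter=20_000, rng=rng)
print(chain.mean(), chain.var())                          # roughly 0 and 1
```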
Fig. 5.1. Illustration of the Metropolis-Hastings algorithm. Filled dots denote accepted states, open circles rejected values.
Remark 5.1. The probability of acceptance (5.1) does not depend on the normalisation constant, i.e. if f(x) = C · π(x), then

\frac{f(x) \cdot q(x^{(t-1)} \mid x)}{f(x^{(t-1)}) \cdot q(x \mid x^{(t-1)})} = \frac{C\pi(x) \cdot q(x^{(t-1)} \mid x)}{C\pi(x^{(t-1)}) \cdot q(x \mid x^{(t-1)})} = \frac{\pi(x) \cdot q(x^{(t-1)} \mid x)}{\pi(x^{(t-1)}) \cdot q(x \mid x^{(t-1)})}.

Thus f only needs to be known up to a normalisation constant.¹
5.2 Convergence results

Lemma 5.1. The transition kernel of the Metropolis-Hastings algorithm is

K(x^{(t-1)}, x^{(t)}) = \alpha(x^{(t)} \mid x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}) + (1 - a(x^{(t-1)})) \delta_{x^{(t-1)}}(x^{(t)}),   (5.3)

where δ_{x^{(t-1)}}(·) denotes Dirac-mass on {x^{(t-1)}}. Note that the transition kernel (5.3) is not continuous with respect to the Lebesgue measure.

Proof. We have

P(X^{(t)} \in \mathcal{X} \mid X^{(t-1)} = x^{(t-1)}) = P(X^{(t)} \in \mathcal{X}, \text{new value accepted} \mid X^{(t-1)} = x^{(t-1)}) + P(X^{(t)} \in \mathcal{X}, \text{new value rejected} \mid X^{(t-1)} = x^{(t-1)})
= \int_{\mathcal{X}} \alpha(x^{(t)} \mid x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}) \, dx^{(t)} + \underbrace{\mathbb{I}_{\mathcal{X}}(x^{(t-1)})}_{= \int_{\mathcal{X}} \delta_{x^{(t-1)}}(dx^{(t)})} \underbrace{P(\text{new value rejected} \mid X^{(t-1)} = x^{(t-1)})}_{= 1 - a(x^{(t-1)})}
= \int_{\mathcal{X}} \alpha(x^{(t)} \mid x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}) \, dx^{(t)} + \int_{\mathcal{X}} (1 - a(x^{(t-1)})) \delta_{x^{(t-1)}}(dx^{(t)}).

¹ On a similar note, it is enough to know q(x^{(t-1)}|x) up to a multiplicative constant independent of x^{(t-1)} and x.
Proposition 5.1. The Metropolis-Hastings kernel (5.3) satisfies the detailed balance condition

K(x^{(t-1)}, x^{(t)}) f(x^{(t-1)}) = K(x^{(t)}, x^{(t-1)}) f(x^{(t)})

and thus f(x) is the invariant distribution of the Markov chain (X^{(0)}, X^{(1)}, . . .) generated by the Metropolis-Hastings sampler. Furthermore the Markov chain is reversible.

Proof. We have that

\alpha(x^{(t)} \mid x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}) f(x^{(t-1)}) = \min\left\{ 1, \frac{f(x^{(t)}) q(x^{(t-1)} \mid x^{(t)})}{f(x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)})} \right\} q(x^{(t)} \mid x^{(t-1)}) f(x^{(t-1)})
= \min\left\{ f(x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}), \; f(x^{(t)}) q(x^{(t-1)} \mid x^{(t)}) \right\}
= \min\left\{ \frac{f(x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)})}{f(x^{(t)}) q(x^{(t-1)} \mid x^{(t)})}, \; 1 \right\} q(x^{(t-1)} \mid x^{(t)}) f(x^{(t)})
= \alpha(x^{(t-1)} \mid x^{(t)}) q(x^{(t-1)} \mid x^{(t)}) f(x^{(t)}),

and thus

K(x^{(t-1)}, x^{(t)}) f(x^{(t-1)}) = \underbrace{\alpha(x^{(t)} \mid x^{(t-1)}) q(x^{(t)} \mid x^{(t-1)}) f(x^{(t-1)})}_{= \alpha(x^{(t-1)} | x^{(t)}) q(x^{(t-1)} | x^{(t)}) f(x^{(t)})} + \underbrace{(1 - a(x^{(t-1)})) \delta_{x^{(t-1)}}(x^{(t)}) f(x^{(t-1)})}_{= (1 - a(x^{(t)})) \delta_{x^{(t)}}(x^{(t-1)}) f(x^{(t)}) \; (= 0 \text{ if } x^{(t)} \neq x^{(t-1)})}
= K(x^{(t)}, x^{(t-1)}) f(x^{(t)}).
Next we need to examine whether the Metropolis-Hastings algorithm yields an irreducible chain. As with the Gibbs sampler, this does not need to be the case, as the following example shows. Example 5.1 (Reducible Metropolis-Hastings). Consider using a Metropolis-Hastings algorithm for sampling from a uniform distribution on [0, 1] ∪ [2, 3] and a U(x(t−1) − δ, x(t−1) + δ) distribution as proposal
distribution q(·|x(t−1) ). Figure 5.2 illustrates this example. It is easy to see that the resulting Markov chain is not irreducible if δ ≤ 1: in this case the chain either stays in [0, 1] or [2, 3].
⊳
Under mild assumptions on the proposal q(·|x(t−1) ) one can however establish the irreducibility of the resulting Markov chain: – If q(x(t) |x(t−1) ) is positive for all x(t−1) , x(t) ∈ supp(f ), then it is easy to see that we can reach any
set of non-zero probability under f within a single step. The resulting Markov chain is thus strongly irreducible. Even though this condition seems rather restrictive, many popular choices of q(·|x(t−1) ) like multivariate Gaussians or t-distributions fulfil this condition.
Fig. 5.2. Illustration of example 5.1
– Roberts and Tweedie (1996) give a more general condition for the irreducibility of the resulting Markov chain: they only require that

\forall \epsilon \; \exists \delta : \; q(x^{(t)} \mid x^{(t-1)}) > \epsilon \text{ if } \|x^{(t-1)} - x^{(t)}\| < \delta,

together with the boundedness of f on any compact subset of its support.

The Markov chain (X^{(0)}, X^{(1)}, . . .) is further aperiodic if there is positive probability that the chain remains in the current state, i.e. P(X^{(t)} = X^{(t-1)}) > 0, which is the case if

P\left( f(X^{(t-1)}) q(X \mid X^{(t-1)}) > f(X) q(X^{(t-1)} \mid X) \right) > 0.
Note that this condition is not met if we use a “perfect” proposal which has f as invariant distribution: in this case we accept every proposed value with probability 1 (see e.g. remark 5.2). Proposition 5.2. The Markov chain generated by the Metropolis-Hastings algorithm is Harris-recurrent if it is irreducible. Proof. Recurrence follows (using the result stated on page 36) from the irreducibility and the fact that f is the invariant distribution. For a proof of Harris recurrence see (Tierney, 1994).
As we have now established (Harris-)recurrence, we are now ready to state an ergodic theorem (using theorems 3.1 and 3.2).

Theorem 5.1. If the Markov chain generated by the Metropolis-Hastings algorithm is irreducible, then for any integrable function h : E → R

\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} h(X^{(t)}) = \mathbb{E}_f(h(X))
for every starting value X(0) . As with the Gibbs sampler the above ergodic theorem allows for inference using a single Markov chain.
5.3 The random walk Metropolis algorithm

In this section we will focus on an important special case of the Metropolis-Hastings algorithm: the random walk Metropolis-Hastings algorithm. Assume that we generate the newly proposed state X not using the fairly general

X \sim q(\cdot \mid X^{(t-1)})   (5.4)

from algorithm 5.1, but rather

X = X^{(t-1)} + \varepsilon, \qquad \varepsilon \sim g,   (5.5)
with g being a symmetric distribution. It is easy to see that (5.5) is a special case of (5.4) using q(x|x^{(t-1)}) = g(x - x^{(t-1)}). When using (5.5) the probability of acceptance simplifies to

\min\left\{ 1, \frac{f(X) \cdot q(X^{(t-1)} \mid X)}{f(X^{(t-1)}) \cdot q(X \mid X^{(t-1)})} \right\} = \min\left\{ 1, \frac{f(X)}{f(X^{(t-1)})} \right\},

as q(X|X^{(t-1)}) = g(X - X^{(t-1)}) = g(X^{(t-1)} - X) = q(X^{(t-1)}|X) using the symmetry of g. This yields
the following algorithm, which is a special case of algorithm 5.1 and is actually the original algorithm proposed by Metropolis et al. (1953).

Algorithm 5.2 (Random walk Metropolis). Starting with X^{(0)} := (X_1^{(0)}, . . . , X_p^{(0)}) and using a symmetric distribution g, iterate for t = 1, 2, . . .
1. Draw ε ∼ g and set X = X^{(t-1)} + ε.
2. Compute
\alpha(X \mid X^{(t-1)}) = \min\left\{ 1, \frac{f(X)}{f(X^{(t-1)})} \right\}.   (5.6)
3. With probability α(X|X^{(t-1)}) set X^{(t)} = X, otherwise set X^{(t)} = X^{(t-1)}.

Example 5.2 (Bayesian probit model). In a medical study on infections resulting from birth by Cesarean section (taken from Fahrmeir and Tutz, 2001) three influence factors have been studied: an indicator whether the Cesarean was planned or not (z_{i1}), an indicator of whether additional risk factors were present at the time of birth (z_{i2}), and an indicator of whether antibiotics were given as a prophylaxis (z_{i3}). The response Y_i is the number of infections that were observed amongst n_i patients having the same influence factors (covariates). The data is given in table 5.1.

    number of births
    with infection   total   planned   risk factors   antibiotics
    y_i              n_i     z_{i1}    z_{i2}         z_{i3}
    11               98      1         1              1
     1               18      0         1              1
     0                2      0         0              1
    23               26      1         1              0
    28               58      0         1              0
     0                9      1         0              0
     8               40      0         0              0

Table 5.1. Data used in example 5.2
The data can be modeled by assuming that

Y_i \sim \text{Bin}(n_i, \pi_i), \qquad \pi_i = \Phi(z_i' \beta),

where z_i = (1, z_{i1}, z_{i2}, z_{i3}) and Φ(·) is the CDF of the N(0, 1) distribution. Note that Φ(t) ∈ [0, 1] for all t ∈ R. A suitable prior distribution for the parameter of interest β is β ∼ N(0, I/λ). The posterior density of β is

f(\beta \mid y_1, \ldots, y_n) \propto \left( \prod_{i=1}^{N} \Phi(z_i' \beta)^{y_i} \cdot (1 - \Phi(z_i' \beta))^{n_i - y_i} \right) \cdot \exp\left( -\frac{\lambda}{2} \sum_{j=0}^{3} \beta_j^2 \right).
We can sample from the above posterior distribution using the following random walk Metropolis algorithm. Starting with any β^{(0)} iterate for t = 1, 2, . . .:
1. Draw ε ∼ N(0, Σ) and set β = β^{(t-1)} + ε.
2. Compute
\alpha(\beta \mid \beta^{(t-1)}) = \min\left\{ 1, \frac{f(\beta \mid Y_1, \ldots, Y_n)}{f(\beta^{(t-1)} \mid Y_1, \ldots, Y_n)} \right\}.
3. With probability α(β|β^{(t-1)}) set β^{(t)} = β, otherwise set β^{(t)} = β^{(t-1)}.
To keep things simple, we choose the covariance Σ of the proposal to be 0.08 · I.
β0 β1
-1.0952 0.6201
95% credible interval -1.4646 0.2029
-0.7333 1.0413
risk factors β2 1.2000 0.7783 1.6296 -1.8993 -2.3636 -1.471 antibiotics β3 Table 5.2. Parameter estimates obtained for the Bayesian probit model from example 5.2
(t)
βj
is to a distribution, whereas the cumulative averages
Pt
τ =1
(τ )
βj /t converge, as the ergodic theorem
implies, to a value. For figure 5.3 and table 5.2 the first 10,000 samples have been discarded (“burn-in”). ⊳
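A sketch of the random walk Metropolis sampler of example 5.2 in Python (ours; it uses the data of table 5.1 and Σ = 0.08 · I as in the example, while the prior precision λ = 10 is an assumption since the value used in the notes is not stated here):

```python
import numpy as np
from scipy.stats import norm

# data from table 5.1: columns y_i, n_i, z_i1, z_i2, z_i3
data = np.array([[11, 98, 1, 1, 1],
                 [ 1, 18, 0, 1, 1],
                 [ 0,  2, 0, 0, 1],
                 [23, 26, 1, 1, 0],
                 [28, 58, 0, 1, 0],
                 [ 0,  9, 1, 0, 0],
                 [ 8, 40, 0, 0, 0]], dtype=float)
y, n = data[:, 0], data[:, 1]
Z = np.column_stack([np.ones(len(y)), data[:, 2:]])    # design matrix with intercept
lam = 10.0                                             # assumed prior precision

def log_posterior(beta):
    p = np.clip(norm.cdf(Z @ beta), 1e-12, 1 - 1e-12)  # numerical safeguard
    return np.sum(y * np.log(p) + (n - y) * np.log(1 - p)) - 0.5 * lam * np.sum(beta**2)

rng = np.random.default_rng(0)
n_iter, Sigma = 50_000, 0.08 * np.eye(4)
beta = np.zeros(4)
chain = np.empty((n_iter, 4))
accepted = 0
for t in range(n_iter):
    prop = beta + rng.multivariate_normal(np.zeros(4), Sigma)
    if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(beta):   # acceptance (5.6)
        beta = prop
        accepted += 1
    chain[t] = beta

burn_in = 10_000
print("acceptance rate:", accepted / n_iter)
print("posterior means:", chain[burn_in:].mean(axis=0))
```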
5.4 Choosing the proposal distribution

The efficiency of a Metropolis-Hastings sampler depends on the choice of the proposal distribution q(·|x^{(t-1)}). An ideal choice of proposal would lead to a small correlation of subsequent realisations X^{(t-1)} and X^{(t)}. This correlation has two sources:
– the correlation between the current state X^{(t-1)} and the newly proposed value X ∼ q(·|X^{(t-1)}), and
– the correlation introduced by retaining a value X^{(t)} = X^{(t-1)} because the newly generated value X has been rejected.

Thus we would ideally want a proposal distribution that both allows for fast changes in the X^{(t)} and yields a high probability of acceptance. Unfortunately these are two competing goals. If we choose a proposal distribution with a small variance, the probability of acceptance will be high, however the resulting Markov chain will be highly correlated, as the X^{(t)} change only very slowly. If, on the other hand, we choose a proposal distribution with a large variance, the X^{(t)} can potentially move very fast, however the probability of acceptance will be rather low.

Example 5.3. Assume we want to sample from a N(0, 1) distribution using a random walk Metropolis-Hastings algorithm with ε ∼ N(0, σ²). At first sight, we might think that setting σ² = 1 is the optimal choice; this is however not the case. In this example we examine the choices σ² = 0.1², σ² = 1, σ² = 2.38², and σ² = 10². Figure 5.4 shows the sample paths of a single run of the corresponding random walk Metropolis-Hastings algorithm. Rejected values are drawn as grey open circles. Table 5.3 shows the average correlation ρ(X^{(t-1)}, X^{(t)}) as well as the average probability of acceptance α(X|X^{(t-1)}), averaged over 100 runs of the algorithm. Choosing σ² too small yields a very high probability of acceptance, however at
You might want to consider a longer chain in practise.
5.4 Choosing the proposal distribution
(t)
(t)
β1
β1
0.0
0.5
(t)
1.0
1.5
-2.0 -1.5 -1.0 -0.5 0.0
β0
(t)
β0
0
10000
20000
30000
40000
50000
0
10000
20000
(t)
30000
40000
50000
30000
40000
50000
40000
50000
40000
50000
(t)
β3
β3
0.5
-3.0
-0.5
-2.0
(t)
(t)
1.5
-1.0
β2
β2
57
0
10000
20000
30000
40000
50000
0
10000
20000
(t)
(a) Sample paths of the βj Pt
τ =1
10000
20000
30000
40000
50000
0
10000
sample
30000
sample Pt
(τ )
β2 /t
(τ )
τ =1
-1.0
β3 /t
-2.0
0.0
Pt
Pt
τ =1
0.5
(τ )
(τ )
β3 /t
1.0
-0.5
τ =1
20000
-1.5
Pt β2 /t
β1 /t
Pt
0.2 0.3 0.4 0.5 0.6 0.7
(τ )
β1 /t
-0.2 -0.6 -1.0
0
τ =1
(τ )
τ =1
(τ )
β0 /t
β0 /t
Pt
τ =1
Pt
(τ )
τ =1
0
10000
20000
30000
40000
50000
0
sample
Pt
τ =1
20000
30000
sample (τ )
50000
(b) Cumulative averages
10000
βj /t β1
-2.0
-1.5
-1.0
1.0 0.5 0.0
density
0.0 0.5 1.0 1.5
density
1.5
β0
-0.5
0.0
0.5
1.5
β3
1.0 0.5 0.0
0.0
0.5
density
1.0
1.5
1.5
β2
density
1.0
0.5
1.0
1.5
2.0
-3.0
-2.5
-2.0
(c) Posterior distributions of the βj Fig. 5.3. Results obtained for the Bayesian probit model from example 5.2
-1.5
-1.0
58
5. The Metropolis-Hastings Algorithm
the price of a chain that is hardly moving. Choosing σ 2 too large allows the chain to make large jumps, however most of the proposed values are rejected, so the chain remains for a long time at each accepted value. The results suggest that σ 2 = 2.382 is the optimal choice. This corresponds to the theoretical results of Gelman et al. (1995).
⊳ Autocorrelation ρ(X (t−1) , X (t) ) Mean 95% CI
Probability of acceptance α(X, X (t−1) ) Mean 95% CI
σ 2 = 0.12 σ2 = 1
0.9901 0.7733
(0.9891,0.9910) (0.7676,0.7791)
0.9694 0.7038
(0.9677,0.9710) (0.7014,0.7061)
σ 2 = 2.382 σ 2 = 102
0.6225 0.8360
(0.6162,0.6289) (0.8303,0.8418)
0.4426 0.1255
(0.4401,0.4452) (0.1237,0.1274)
Table 5.3. Average correlation ρ(X (t−1) , X (t) ) and average probability of acceptance α(X|X (t−1) ) found in example 5.3 for different choices of the proposal variance σ 2 .
Finding the ideal proposal distribution q(·|x(t−1) ) is an art.3 This is the price we have to pay for the generality of the Metropolis-Hastings algorithm. Popular choices for random walk proposals are multivariate Gaussians or t-distributions. The latter have heavier tails, making them a safer choice. The covariance structure of the proposal distribution should ideally reflect the expected covariance of the (X1 , . . . , Xp ). Gelman et al. (1997) propose to adjust the proposal such that the acceptance rate is around 1/2 for oneor two dimensional target distributions, and around 1/4 for larger dimensions, which is in line with the results we obtained in the above simple example and the guidelines which motivate them. Note however that these are just rough guidelines. Example 5.4 (Bayesian probit model (continued)). In the Bayesian probit model we studied in example 5.2 we drew ε ∼ N(0, Σ) with Σ = 0.08 · I, i.e. we modeled the components of ε to be independent. The proportion of accepted
values we obtained in example 5.2 was 13.9%. Table 5.4 (a) shows the corresponding autocorrelation. The resulting Markov chain can be made faster mixing by using a proposal distribution that represents the covariance structure of the posterior distribution of β. This can be done by resorting to the frequentist theory of generalised linear models (GLM): it suggests ˆ is (Z′ DZ)−1 , where Z is the matrix that the asymptotic covariance of the maximum likelihood estimate β of the covariates, and D is a suitable diagonal matrix. When using Σ = 2 · (Z′ DZ)−1 in the algorithm
presented in section 5.2 we can obtain better mixing performance: the autocorrelation is reduced (see table 5.4 (b)), and the proportion of accepted values obtained increases to 20.0%. Note that the determinant of both choices of Σ was chosen to be the same, so the improvement of the mixing behaviour is entirely due
to a difference in the structure of the the covariance.
⊳
5.5 Composing kernels: Mixtures and Cycles It can be advantageous, especially in the case of more complex distributions, to combine different Metropolis-Hastings updates into a single algorithm. Each of the different Metropolis-Hastings updates 3
The optimal proposal would be sampling directly from the target distribution. The very reason for using a Metropolis-Hastings algorithm is however that we cannot sample directly from the target!
59
2 0 -2 0 -2 2 0 -2 2 0 -2 -6
-4
σ 2 = 102
4
6
-6
-4
σ 2 = 2.382
4
6
-6
-4
σ2 = 1
2
4
6
-6
-4
σ 2 = 0.12
4
6
5.5 Composing kernels: Mixtures and Cycles
0
200
400
600
800
1000
Fig. 5.4. Sample paths for example 5.3 for different choices of the proposal variance σ 2 . Open grey discs represent rejected values.
60
5. The Metropolis-Hastings Algorithm (a) Σ = 0.08 · I (t−1)
Autocorrelation ρ(βj
(t)
, βj )
β0
β1
β2
β3
0.9496
0.9503
0.9562
0.9532
(b) Σ = 2 · (Z′ DZ)−1 (t−1)
Autocorrelation ρ(βj (t−1)
Table 5.4. Autocorrelation ρ(βj
(t)
, βj )
β0
β1
β2
β3
0.8726
0.8765
0.8741
0.8792
(t)
, βj ) between subsequent samples for the two choices of the covariance Σ.
corresponds to a transition kernel K (j) . As with the substeps of Gibbs sampler there are two ways of combining the transition kernels K (1) , . . . , K (r) : – As in the systematic scan Gibbs sampler, we can cycle through the kernels in a deterministic order, i.e. first carry out the Metropolis-Hastings update corresponding the the kernel K (1) , then carry out the one corresponding to K (2) , etc. until we start again with K (1) . The transition kernel of this composite chain is K ◦ (x(t−1) , x(t) ) =
Z
···
Z
K (1) (x(t−1) , ξ (1) )K (2) (ξ (1) , ξ (2) ) · · · K (r) (ξ (r−1) , x(t) ) dξ (r−1) · · · dξ (1)
If each of the transition kernels K (j) has the invariant distribution f (i.e.
R
f (x(t−1) )K(x(t−1) , x(t) ) dx(t−1) =
f (x(t) )), then K ◦ has f as invariant distribution, too, as Z f (x(t−1) )K ◦ (x(t−1) , x(t) ) dx(t−1) Z Z Z = ··· K (1) (x(t−1) , ξ (1) )f (x(t−1) )dx(t−1) K (2) (ξ (1) , ξ (2) )dξ (1) · · · dξ (r−2) K (r) (ξ (r−1) , x(t) )dξ (r−1) {z } | |
|
= f (x(t) )
=f (ξ(1) )
{z
}
f (ξ(2) )
{z
=f (ξ(r−1) )
}
– Alternatively, we can, as in the random scan Gibbs sampler, choose each time at random which of Pr the kernels should be used, i.e. use the kernel K (j) with probability wj > 0 ( ι=1 wι = 1). The corresponding kernel of the composite chain is the mixture K + (x(t−1) , x(t) ) =
r X
wι K (ι) (x(t−1) , x(t) )
ι=1
Once again, if each of the transition kernels K (j) has the invariant distribution f , then K + has f as invariant distribution: Z Z r X f (x(t−1) )K + (x(t−1) , x(t) ) dx(t−1) = wι f (x(t−1) )K (ι) (x(t−1) , x(t) ) dx(t−1) = f (x(t) ). ι=1 {z } | =f (x(t) )
Example 5.5 (One-at-a-time Metropolis-Hastings). One example of a method using composite kernels is the so-called one-at-a-time Metropolis-Hastings algorithm. Consider the case of a p-dimensional random variable X = (X1 , . . . , Xp ). The Metropolis-Hastings algorithms 5.1 and 5.2 update all components at a time. It can, however, be difficult to come up with a suitable proposal distribution q(·|x(t−1) ) (or g) for all variables. Alternatively, we could, as in the Gibbs sampler, update each component separately. For this
5.5 Composing kernels: Mixtures and Cycles
61
we need p proposal distributions q1 , . . . , qp for updating each of the Xj . The j-th proposal qj (and thus the j-th kernel K (j) ) corresponds to updating the Xj . As mentioned above we can cycle deterministically through the kernels (corresponding to the kernel K ◦ ), (0)
(0)
yielding the following algorithm. Starting with X(0) = (X1 , . . . , Xp ) iterate (t−1)
(t−1)
1. i. Draw X1 ∼ q1 (·|X2 , . . . , Xp ii. Compute α1 = min 1,
iii. With probability
(t−1)
f (X1 (t) α1 set X1
=
).
(t−1)
f (X1 ,X2
(t−1)
,...,Xp(t−1) )·q1 (X1
(t−1)
(t−1)
|X1 ,X2
(t−1)
(t−1)
,...,Xp(t−1) )
(t−1)
(t−1)
)·q1 (X1 |X1 ,X2 ,...,Xp (t) (t−1) X1 , otherwise set X1 = X1 .
,X2
,...,Xp
)
.
... (t)
(t)
(t−1)
(t−1)
j. i. Draw Xj ∼ qj (·|X1 , . . . , Xj−1 , Xj , . . . , Xp ). (t) (t) (t−1) (t−1) (t) (t) (t−1) f (X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp(t−1) )·qj (Xj |X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp(t−1) ) . ii. Compute αj = min 1, (t) (t) (t−1) (t−1) (t−1) (t) (t) (t−1) (t−1) (t−1) iii. With probability
f (X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp (t) (t) αj set Xj = Xj , otherwise set Xj
=
)·qj (Xj |X1 ,...,Xj−1 ,Xj (t−1) Xj .
,Xj+1 ,...,Xp
)
... (t)
(t)
(t−1)
p. i. Draw Xp ∼ qp (·|X1 , . . . , Xp−1 , Xp ). (t) (t) (t) (t) f (X1 ,...,Xp−1 ,Xp )·qp (Xp(t−1) |X1 ,...,Xp−1 ,Xp ) ii. Compute αp = min 1, . (t) (t) (t−1) (t) (t) (t−1) iii. With probability
f (X1 ,...,Xp−1 ,Xp )·qp (Xp |X1 ,...,Xp−1 ,Xp (t) (t) (t−1) αp set Xp = Xp , otherwise set Xp = Xp .
)
(0)
(0)
The corresponding random sweep algorithm (corresponding to K + ) is: Starting with X(0) = (X1 , . . . , Xp ) iterate 1. Draw an index j from a distribution on {1, . . . , p} (e.g. uniform) (t−1)
(t−1)
2. Draw Xj ∼ qj (·|X1 , . . . , Xp
3. Compute αj = min 1,
(t)
4. With probability αj set Xj 5. Set
(t) Xι
:=
(t−1) Xι
).
(t−1) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) f (X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp(t−1) )·qj (Xj |X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp(t−1) ) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) (t−1) f (X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp )·qj (Xj |X1 ,...,Xj−1 ,Xj ,Xj+1 ,...,Xp )
(t)
= Xj , otherwise set Xj
(t−1)
= Xj
.
for all ι 6= j.
Note the similarity to the Gibbs sampler. Indeed, the Gibbs sampler is a special case of a one-at-a-time Metropolis-Hastings algorithm as the following remark shows.
⊳
Remark 5.2. The Gibbs sampler for a p-dimensional distribution is a special case of a one-at-a-time Metropolis-Hasting algorithm: the (systematic scan) Gibbs sampler (algorithm 4.1) is a cycle of p kernels, whereas the random scan Gibbs sampler (algorithm 4.2) is a mixture of these kernels. The proposal qj (t)
corresponding to the j-th kernel consists of drawing Xj acceptance is uniformly equal to 1.
∼ fXj |X−j . The corresponding probability of
Proof. The update of the j-th component of the Gibbs sampler consists of sampling from Xj |X−j , i.e. it has the proposal
(t)
(t)
(t−1)
qj (xj |x(t−1) ) = fXj |X−j (xj |x1 , . . . , xj−1 , xj+1 , . . . , x(t−1) ). p We obtain for the j-th kernel that
.
62
5. The Metropolis-Hastings Algorithm (t)
(t)
(t−1)
(t−1)
f (x1 , . . . , xj−1 , xj , xj+1 , . . . , xp
(t−1)
)qj (xj
(t)
(t)
(t−1)
(t−1)
|x1 , . . . , xj−1 , xj , xj+1 , . . . , xp
)
(t) (t) (t−1) (t−1) (t−1) (t) (t) (t−1) (t−1) (t−1) f (x1 , . . . , xj−1 , xj , xj+1 , . . . , xp )qj (xj |x1 , . . . , xj−1 , xj , xj+1 , . . . , xp ) (t)
=
(t)
(t−1)
(t−1)
f (x1 , . . . , xj−1 , xj , xj+1 , . . . , xp
(t)
(t)
(t−1)
(t−1)
|x1 , . . . , xj−1 , xj+1 , . . . , xp
)
(t) (t) (t−1) (t−1) (t−1) (t) (t) (t−1) (t−1) f (x1 , . . . , xj−1 , xj , xj+1 , . . . , xp )fXj |X−j (xj |x1 , . . . , xj−1 , xj+1 , . . . , xp ) (t)
(t)
(t)
(t)
(t−1)
(t)
=
(t−1)
f (x1 , . . . , xj−1 , xj
(t−1)
(t) (t−1) (t−1) ,xj+1 ,...,x(t−1) ) p (t) (t) (t−1) (t−1) f (x1 ,...,xj−1 ,xj+1 ,...,xp )
(t−1) f (x1 ,...,xj−1 ,xj
f (x1 , . . . , xj−1 , xj , xj+1 , . . . , xp
=
(t−1)
)fXj |X−j (xj
)
(t)
(t)
(t−1)
(t−1)
(t−1) f (x1 ,...,xj−1 ,xj ,xj+1 ,...,xp
, xj+1 , . . . , xp
)
)
(t) (t) (t−1) (t−1) f (x1 ,...,xj−1 ,j+1 ,...,xp )
1,
thus αj ≡ 1.
As explained above, the composite kernels K + and K ◦ have the invariant distribution f , if all kernels K (j) have f as invariant distribution. Similarly, it is sufficient for the irreducibility of the kernels K + and K ◦ that all kernels K (j) are irreducible. This is however not a very useful condition, nor is it a necessary condition. Often, some of the kernels K (j) focus on certain subspaces, and thus cannot be irreducible for the entire space. The kernels K (j) corresponding to the Gibbs sampler are not irreducible themselves: the j-th Gibbs kernel K (j) only updates Xj , not the other Xι (ι 6= j).
6. The Reversible Jump Algorithm
6.1 Bayesian multi-model inference Examples 4.1, 4.6, and 5.2 illustrated how MCMC techniques can be used in Bayesian modeling. In both examples we have only considered a single model. In many real word situations however there is (a priori) more than one plausible model. Assume that we consider a countable set of models {M1 , M2 , . . .}. Each model is characterised by a
density fk and the associated parameter space Θ, i.e. Mk := {fk (·|θ), θ ∈ Θk }, where fk is the density and Θk the parameter space of the k-th model.
Using a hierarchical Bayesian setup we first place a prior distribution of the set of models, i.e. P(Mk ) = pk with
P
k
pk = 1. The prior distribution on the model space can for example be used to express our prior
belief in simple models. Further we need to place a prior on each parameter space Θk , i.e. θ|Mk ∼ fkprior (θ). Assume now that we have observed data y1 , . . . , yn . When considering model Mk the likelihood is lk (y1 , . . . , yn |θ) :=
n Y
i=1
fk (yi |θ),
and the posterior density of θ is fkpost (θ) = R
fkprior (θ)lk (y1 , . . . , yn |θ) Θk
fkprior (ϑ)lk (y1 , . . . , yn |ϑ)dϑ
.
Now we can use Bayes formula to compute the posterior probability that the data was generated by model Mk
R pk Θk fkprior (θ)lk (y1 , . . . , yn |θ) dθ P(Mk |y1 , . . . , yn ) = P R prior (θ)lκ (y1 , . . . , yn |θ) dθ κ pκ Θκ fκ
The comparison between two models Mk1 and Mk2 can be summarised by the posterior odds R f prior (θ)lk1 (y1 , . . . , yn |θ) dθ pk 1 P(Mk1 |y1 , . . . , yn ) Θk1 k1 = ·R . P(Mk2 |y1 , . . . , yn ) pk 2 f prior (θ)lk2 (y1 , . . . , yn |θ) dθ Θk2 k2 | {z } “Bayes factor”
64
6. The Reversible Jump Algorithm
Having computed the posterior distribution on the models we can now either consider the model with the highest posterior probability P(Mk |y1 , . . . , yn ) or perform model averaging using P(Mk |y1 , . . . , yn ) as weights.
In order to compute the above probabilities we could run a separate MCMC algorithm for each model (“within model simulation”). Alternatively we could construct a single algorithm that can jump between the different models (“transdimensional simulation”). In order to do this we have to sample from the joint posterior f post (k, θ) = P
defined on
pk fkprior (θ)lk (y1 , . . . , yn |θ) R prior (ϑ)lκ (y1 , . . . , yn |ϑ) dϑ κ pκ Θκ fκ Θ :=
[ k
({k} × Θk ) .
Unfortunately, we cannot use the techniques presented in chapters 4 and 5 to sample from Θ, as Θ is not as well-behaved as the Θk : Θ is a union of spaces of different dimensions, to which — due to measure-theoretic subtleties — the theory of chapters 4 and 5 fails to apply. Bayesian multi-model inference is an example of a variable dimension model. A variable dimension model is a model “where one of the things you do not know is the number of things you do not know” (Green, 2003). In the following two sections we will try to extend the Metropolis-Hastings method to this more general setting.
6.2 Another look at the Metropolis-Hastings algorithm Recall the random walk Metropolis-Hastings algorithm (algorithm 5.2) where we set X := X(t−1) + ε with ε ∼ g. In this section we will generalise this to X = τ (X(t−1) , U(t−1) ), with U(t−1) ∼ g1→2 . For our further developments it will be necessary that the transformation is a bijective map, which requires that the image and the domain of the transformation have the same dimension. Thus we consider T1→2 : (X(t−1) , U(t−1) ) 7→ (X, U), such that X = τ (X(t−1) , U(t−1) ). Furthermore we shall assume that T1→2 is a diffeomorphism1 with inverse T2→1 = T−1 1→2 . If we generate a newly proposed value X as mentioned above, how do we have to choose the probability of acceptance α(X|X(t−1) ) such that the resulting MCMC algorithm fulfils the detailed balance condition? If we set the probability of acceptance to2 f (x(t) )g2→1 (u(t) ) ∂T1→2 (x(t−1) , u(t−1) ) (t) (t−1) , α(x |x ) = min 1, f (x(t−1) )g1→2 (u(t−1) ) ∂(x(t−1) , u(t−1) )
then we can establish that detailed balance holds, as we will see below. Assume that for the corresponding backward move we draw U(t) ∼ g2→1 and set (X(t−1) , U(t−1) ) = T2→1 (X(t) , U(t) ). Then the probability of accepting the backward move is α(x We then obtain that 1 2
(t−1)
f (x(t−1) )g1→2 (u(t−1) ) ∂T2→1 (x(t) , u(t) ) , |x ) = min 1, f (x(t) )g2→1 (u(t) ) ∂(x(t) , u(t) ) (t)
i.e. T1→2 has as inverse T2→1 , and both T1→2 and T−1 1→2 are differentiable. In this lecture course we use the convention |A| = | det(A)|.
min 1,
α(x(t) |x(t−1) )g1→2 (u(t−1) )f (x(t−1) ) du(t−1) dx(t−1)
{u(t) : x(t−1) ∈A}
Z
x(t) ∈B
f (x(t) )g2→1 (u(t) ) ∂T1→2 (x(t−1) , u(t−1) ) (t−1) )f (x(t−1) ) du(t−1) dx(t−1) g1→2 (u (t−1) )g (t−1) ) (t−1) , u(t−1) ) f (x (u ∂(x (t−1) (t−1) (t) 1→2 x ∈A {u : x ∈B} Z Z (t−1) , u(t−1) ) (t−1) (t−1) (t) (t) ∂T1→2 (x min f (x )g1→2 (u ), f (x )g2→1 (u ) du(t−1) dx(t−1) ∂(x(t−1) , u(t−1) ) x(t−1) ∈A {u(t−1) : x(t) ∈B} Z Z (t−1) , u(t−1) ) ∂T2→1 (x(t) , u(t) ) (t−1) (t−1) (t) (t) ∂T1→2 (x min f (x )g1→2 (u ), f (x )g2→1 (u ) dx(t) du(t) ∂(x(t−1) , u(t−1) ) ∂(x(t) , u(t) ) {u(t) : x(t−1) ∈A} x(t) ∈B Z Z f (x(t−1) )g1→2 (u(t−1) ) ∂T2→1 (x(t) , u(t) ) , 1 g2→1 (u(t) )f (x(t) ) dx(t) du(t) min f (x(t) )g2→1 (u(t) ) ∂(x(t) , u(t) ) {u(t) : x(t−1) ∈A} x(t) ∈B Z Z α(x(t−1) |x(t) )g2→1 (u(t) )f (x(t) ) dx(t) du(t)
Z
{u(t−1) : x(t) ∈B}
Z
3
Z x(t) ∈B
π(dx(t−1) )K(x(t−1) , dx(t) ) =
Z x(t−1) ∈A
x(t−1) ∈A
Z
x(t−1) ∈A
Z
x∈A∩B
Z
π(dx(t) )K(x(t) , dx(t−1) )
Z
α(x(t−1) |x(t) )g2→1 (u(t) )f (x(t) ) dx(t) du(t)
IA (x(t) )(1 − a(x(t) )) π(dx(t) )
x(t) ∈B
x(t) ∈B
{u(t) : x(t−1) ∈A}
Z
α(x(t) |x(t−1) )g1→2 (u(t−1) ) du(t−1) + IB (x(t−1) )(1 − a(x(t−1) )).
x(t) ∈B
(1 − a(x)) π(dx) =
α(x(t) |x(t−1) )g1→2 (u(t−1) )f (x(t−1) ) du(t−1) dx(t−1) =
IB (x(t−1) )(1 − a(x(t−1) )) π(dx(t−1) ) =
{u(t−1) : x(t) ∈B}
Z
which is what we have shown in (6.1). (On the left hand side x(t) := x(t) (x(t−1) , u(t−1) ) is defined implicitly such that (x(t) , u(t) ) = T1→2 (x(t−1) , u(t−1) ). On the right hand side x(t−1) := x(t−1) (x(t) , u(t) ) is defined implicitly such that (x(t−1) , u(t−1) ) = T2→1 (x(t) , u(t) ).)
{u(t−1) : x(t) ∈B}
detailed balance is equivalent to Z Z
As
x(t) ∈B
for all Borel sets A, B, where K(x(t−1) , B) = P(X(t) ∈ B|X(t−1) = x(t−1) ). Now we have that Z Z K(x(t−1) , dx(t) ) = K(x(t−1) , B) = P(X(t) ∈ B|X(t−1) = x(t−1) ) =
x(t−1) ∈A
In a general state space detailed balance holds if Z
by analogy with proposition 5.1 detailed balance3 , i.e. the Markov chain generated by the above method has indeed f as invariant distribution.
1→2 (x(t−1) ,u(t−1) ) ∂T2→1 (x(t) ,u(t) ) The fourth row is obtained from the third row by using the change of variable formula. Note that ∂T∂(x · ∂(x(t) ,u(t) ) = 1. Equation 6.1 implies (t−1) ,u(t−1) )
=
=
=
=
=
x(t−1) ∈A
Z
6.2 Another look at the Metropolis-Hastings algorithm 65
66
6. The Reversible Jump Algorithm
Example 6.1 (Random walk Metropolis-Hastings). In order to clarify what we have just derived we will state the random walk Metropolis Hastings algorithm in terms of this new approach. In the random walk Metropolis-Hastings algorithm with a symmetric proposal g1→2 we considered X = X(t−1) + ε,
ε ∼ g1→2 ,
which corresponds to using (X, U) = T1→2 (X(t−1) , U(t−1) ) = (X(t−1) + U(t−1) , U(t−1) ),
U(t−1) ∼ g1→2 .
For the backward move we generate U ∼ g1→2 as well, i.e. we have g1→2 = g2→1 . Further T2→1 (X(t) , U(t) ) = (X(t) − U(t) , U(t) ).4
We accept the newly proposed X with probability ∂T1→2 (X(t−1) , U(t−1) ) f (X) f (X)g2→1 (U) (t−1) , = min 1, α(X|X ) = min 1, f (X(t−1) )g1→2 (U(t−1) ) ∂(X(t−1) , U(t−1) ) f (X(t−1) )
as g1→2 = g2→1 , U = U(t−1) , and
(t−1) (t−1) ∂T1→2 (X ,U ) ∂(X(t−1) , U(t−1) ) =
1 ··· .. . . . .
0 .. .
1 .. .
··· .. .
0 .. .
0 ···
1
0
1
0 ··· .. . . . .
0 .. .
1 .. .
···
··· .. .
0 .. .
0 ···
0
0
···
1
= 1.
⊳
Note that the above holds even if x(t−1) and x(t) have different dimension, as long as the joint vectors (x(t−1) , u(t−1) ) and (x(t) , u(t) ) have the same dimension. Thus we can use the above approach for sampling from variable dimension models, as the next section shows.
6.3 The Reversible Jump Algorithm Coming back to the developments of section 6.1 we need to draw an MCMC sample from the joint posterior distribution
defined on
pk fkprior (θ)lk (y1 , . . . , yn |θ) R prior (ϑ)lκ (y1 , . . . , yn |ϑ) dϑ κ pκ Θκ lκ
f post (k, θ) = P
Θ :=
[ k
(6.1)
({k} × Θk )
A slight modification of the approach discussed in 6.2 allows us to draw samples from f post (k, θ) by jumping between the models. This leads to the reversible jump algorithm proposed by Green (1995):
4
Due to the symmetry of g1→2 this is equivalent to setting X(t) + U(t) , and the forward move (based on T1→2 ) and backward move (based on T2→1 ) are identical.
6.3 The Reversible Jump Algorithm
67
Algorithm 6.1 (Reversible jump). Starting with k (0) and θ (0) iterate for t = 1, 2, . . . 1. Select new model Mk with probability ρk(t−1) →k .5
(With probability ρk(t−1) →k(t−1) update the parameters of Mk(t−1) and skip the remaining steps.)
2. Generate u(t−1) ∼ gk(t−1) →k
3. Set (θ, u) := Tk(t−1) →k (θ (k−1) , u(k−1) ). 4. Compute ) ∂T (t−1) (t−1) (θ , u ) (t−1) k →k α := min 1, (t−1) (t−1) post (t−1) (t−1) (t−1) f (k ∂(θ ,u ,θ )ρk(t−1) →k gk(t−1) →k (u ) ) (
f post (k, θ)ρk→k(t−1) gk→k(t−1) (u)
5. With probability α set k (t) = k and θ (k) = θ, otherwise keep k (t) = k (t−1) and θ (k) = θ (k−1) .
Note that as in section 6.2 we need that Tk→l is a diffeomorphism with Tl→k = T−1 k→l . Note that this implies that (θ, u) has the same dimension as (θ (t−1) , u(t−1) ) (“dimension matching”). It is possible (and a rather popular choice) that u or u(t−1) is zero-dimensional, as long as the dimensions of (θ, u) and (θ (t−1) , u(t−1) ) match. Often ρk→l is only positive if the models Mk and Ml are close in some sense. Note
however that ρk→l = 0 implies ρl→k = 0. In general, the transdimensional moves should be designed such that they yield a high probability of acceptance. Remark 6.1. The probability of acceptance of the reversible jump algorithm does not depend on the normalisation constant of the joint posterior f (post) (k, θ) (i.e. the denominator of (6.1)). Proposition 6.1. The joint posterior f post (k, θ) is under the above conditions the invariant distribution of the reversible jump algorithm.
5
P
k
ρk(t−1) →k = 1.
θ (t−1) ∈Ak(t−1)
Z
θ (t−1) ∈Ak(t−1)
Z
k(t) ∈B
(
f post (k (t) , θ (t) )ρk(t) →k(t−1) gk(t) →k(t−1) (u(t) )
α((k (t) , θ (t) )|(k (t−1) , θ (t−1) ))ρk(t−1) →k(t) gk(t−1) →k(t) (u(t−1) )f post (k (t−1) , θ (t−1) ) du(t−1) dθ (t−1)
) ∂T (t−1) (t−1) (θ , u ) (t−1) (t) k →k min 1, (t−1) (t−1) post (t−1) (t−1) (t−1) (t) f (k ,θ )ρk(t−1) →k(t) gk(t−1) →k(t) (u ) ∂(θ ,u ) {u(t−1) : θ ∈Bk(t) }
X Z
k(t) ∈B
{u(t−1) : θ (t) ∈Bk(t) }
X Z
k(t−1) ∈A
X
k(t) ∈B
θ (t) ∈Bk(t)
{u(t−1) : θ (t) ∈Bk(t) }
) ∂T (t−1) (t−1) (θ , u ) (t−1) (t) k →k )ρk(t) →k(t−1) gk(t) →k(t−1) (u(t) ) du(t−1) dθ (t−1) (t−1) (t−1) ∂(θ ,u ) Z n X min f post (k (t−1) , θ (t−1) )ρk(t−1) →k(t) gk(t−1) →k(t) (u(t−1) ) ,
k(t) ∈B
{u(t) : θ (t−1) ∈Ak(t−1) }
Z
(t)
θ (t−1) ∈Ak(t−1)
f post (k (t) , θ
k(t−1) ∈A
ρk(t−1) →k(t) gk(t−1) →k(t) (u(t−1) )f post (k (t−1) , θ (t−1) ) du(t−1) dθ (t−1) n X Z X Z min f post (k (t−1) , θ (t−1) )ρk(t−1) →k(t) gk(t−1) →k(t) (u(t−1) ) ,
k(t−1) ∈A
X
k(t−1) ∈A
X
S
=
(t)
(t)
k
(t)
k
post
∈B
k∈A {k}
× Ak ⊂ Θ, B =
S
α((k (t−1) , θ (t−1) )|(k (t) , θ (1) ))ρk(t) →k(t−1) gk(t) →k(t−1) (u(t) )f post (k (t) , θ (t) ) dθ (t) du(t)
, θ (t) ) dθ (t) du(t)
k
(t)
θ (t) ∈Bk(t)
(t)
× Ak ⊂ Θ
k(t) ∈B
k∈B {k}
{u(t) : θ (t−1) ∈Ak(t−1) }
ρk(t) →k(t−1) gk(t) →k(t−1) (u )f (k X Z X Z
∈A
k(t−1) ∈A
k
post
) ∂T (t−1) , u(t−1) ) ∂Tk(t) →k(t−1) (θ (t) , u(t) ) k(t−1) →k(t) (θ f (k , θ )ρk(t) →k(t−1) gk(t) →k(t−1) (u ) dθ (t) du(t) ∂(θ (t−1) , u(t−1) ) ∂(θ (t) , u(t) ) ( Z Z (t) (t) X X (t−1) post (t−1) (t−1) ∂Tk(t) →k(t−1) (θ , u ) ) = min f (k ,θ )ρk(t−1) →k(t) gk(t−1) →k(t) (u , (t) (t) ) ∂(θ , u {u(t) : θ (t−1) ∈Ak(t−1) } (t) θ (t) ∈Bk(t) (t−1) k ∈A k ∈B o f post (k (t) , θ (t) )ρk(t) →k(t−1) gk(t) →k(t−1) (u(t) ) dθ (t) du(t) ) ( X Z X Z f post (k (t−1) , θ (t−1) )ρk(t−1) →k(t) gk(t−1) →k(t) (u(t−1) ) ∂Tk(t) →k(t−1) (θ (t) , u(t) ) = min ,1 f post (k (t) , θ (t) )ρk(t) →k(t−1) gk(t) →k(t−1) (u(t) ) ∂(θ (t) , u(t) ) {u(t) : θ (t−1) ∈A (t−1) } (t) θ (t) ∈B (t) (t−1)
=
=
=
for all A =
holds, as
Proof. From the footnote on page 65 we have using x := (k, θ) and the fact that k is discrete (i.e. an integral with respect to k is a sum) that detailed balance
68 6. The Reversible Jump Algorithm
6.3 The Reversible Jump Algorithm
69
Example 6.2. Consider a problem with two possible models M1 and M2 . The model M1 has a single parameter θ1 ∈ [0, 1]. The model M2 has two parameters θ1 , θ2 ∈ D with triangular domain D = {(θ1 , θ2 ) : 0 ≤ θ2 ≤ θ1 ≤ 1}. The joint posterior of (k, θ) is
f post (k, θ) ∝ pk fkprior (θ)lk (y1 , . . . , yn |θ) We need to propose two moves T1→2 and T2→1 such that T1→2 = T−1 2→1 . Assume that we want to get from model M2 to model M1 by dropping θ2 , i.e. T2→1 (θ1 , θ2 ) = (θ1 , ⋆) A move that is compatible6 with T2→1 is T1→2 (θ, u) = (θ, uθ). When setting ⋆ to θ2 /θ1 we have that T1→2 = T−1 2→1 . If we draw U ∼ U[0, 1] we have that T1→2 (θ, U ) ∈ D. The Jacobian is
∂T1→2 (θ, u) 1 ∂(θ, u) = u
0 = |θ| = θ. θ
Using the formula for the derivative of the inverse we obtain that !−1 ∂T ∂T2→1 (θ1 , θ2 ) ∂T1→2 (θ, u) (θ, u) 1→2 = = 1/ = 1/θ1 ∂(θ1 , θ2 ) ∂(θ, u) ∂(θ, u) (θ,u)=T2→1 (θ1 ,θ2 ) (θ,u)=T2→1 (θ1 ,θ2 )
The moves between the models M1 and M2 (and vice versa) keep θ1 constant. An algorithm based only
on these two moves will not yield an irreducible chain. Thus we need to include fixed-dimensional moves. In this simple example it is enough to include a single fixed-dimensional move:7 If we are in model M1
then with probability 1/2 we carry out a Metropolis update (e.g. using an independent proposal from U[0, 1]). This setup corresponds to ρ1→1 = 1/2, ρ1→2 = 1/2, ρ2→1 = 1, ρ2→2 = 0. The reversible jump algorithm specified above consists of iterating for t = 1, 2, 3, . . . – If the current model is M1 (i.e. k (t−1) = 1):
∗ With probability 1/2 perform an update of θ(t−1) within model M1 , i.e. 1. Generate θ1 ∼ U[0, 1].
2. Compute the probability of acceptance ( α = min 1,
f1prior (θ1 )l1 (y1 , . . . , yn |θ1 ) (t−1)
f1prior (θ1
(t−1)
)l1 (y1 , . . . , yn |θ1
)
)
3. With probability α set θ(t) = θ, otherwise keep θ(t) = θ(t−1) . ∗ Otherwise attempt a jump to model M2 , i.e. 1. Generate u(t−1) ∼ U[0, 1]
2. Set (θ1 , θ2 ) := T1→2 (θ(t−1) , u(t−1) ) = (θ(t−1) , u(t−1) θ(t−1) ). 3. Compute
(
α = min 1, 6 7
(−1)
p2 · f2prior (θ1 , θ2 )l2 (y1 , . . . , yn |θ1 , θ2 ) · 1 (t−1)
p1 · f1prior (θ1
(t−1)
)l1 (y1 , . . . , yn |θ1
) · 1/2 · 1
·
(t−1) θ1
)
in the sense that T1→2 = T2→1 In order to obtain an irreducible and fast mixing chain in more complex models, it is typically necessary to allow for fixed-dimensional moves in all models.
70
6. The Reversible Jump Algorithm
4. With probability α set k (t) = 2 and θ (t) = (θ1 , θ2 ), otherwise keep k = 1 and θ(t) = θ(t−1) . – Otherwise, if the current model is M2 (i.e. k (t−1) = 2) attempt a jump to M1 : (t−1)
1. Set (θ, u) := T2→1 (θ1
(t−1)
, θ2
(t−1)
) = (θ1
(t−1)
, θ2
(t−1)
/θ1
).
2. Compute (
α = min 1,
p1 · f1prior (θ1 )l1 (y1 , . . . , yn |θ1 ) · 1/2 · 1 (t−1)
p2 · f2prior (θ1
(t−1)
, θ2
(t−1)
)l2 (y1 , . . . , yn |θ1
(t−1)
, θ2
1
· (t−1) ) · 1 θ1 (t−1)
3. With probability α set k (t) = 1 and θ(t) = θ, otherwise keep k = 2 and θ (t) = (θ1
) (t−1)
, θ2
).
⊳
Example 6.3 (Mixture of Gaussians with a variable number of components). Consider again the Gaussian mixture model from example 4.6, in which we assumed that the density of yi is from a mixture of Gaussians f (yi |π1 , . . . , πk , µ1 , . . . , µk , τ1 , . . . , τk ) =
k X
πκ φ(µκ ,1/τκ ) (yi ).
κ=1
Suitable prior distributions are a Dirichlet distribution for (π1 , . . . , πk ), a Gaussian for µκ and a Gamma distribution for τκ .8 In example 4.6 we assumed that the number of components k is known. In this example we assume that we want to estimate the number of components k as well. Note that the dimension of the parameter vector θ = (π1 , . . . , πk , µ1 , . . . , µk , τ1 , . . . , τk ) depends on k, so we need to use the reversible jump algorithm to move between models with different numbers of components. Denote with pk the prior distribution of the number of components. The easiest way of moving between models, is to allow for two simple transdimensional moves: adding one new component (“birth move”, k → k + 1) and dropping one component (“death move”, k + 1 → k).
Consider the birth move first. We draw the mean and precision parameters of the new component,
which we will call µk+1 and τk+1 for convenience, from the corresponding prior distributions. Furthermore we draw the prior probability of the new component πk+1 ∼ Beta(1, k). As we need that the sum of the Pk+1 (t−1) prior probabilities κ=1 πκ = 1, we have to rescale the other prior probabilities to πκ = πκ (1 − πk+1 ) (κ = 1, . . . , k). Putting this into the notation of the reversible jump algorithm, we draw (t−1)
u1
(t−1)
∼ g1 ,
u2
(t−1)
∼ g2 ,
u3
∼ g3 ,
and set (t−1)
πk+1 = u1
,
(t−1)
µk+1 = u2
(t−1)
,
τk+1 = u3
with g1 being the density of the Beta(1, k) distribution, g2 being the density of the prior distribution on the µκ , and g3 being the density of the prior distribution on τκ . The corresponding transformation Tk→k+1 is
π1 .. .
π k πk+1 . .. µk+1 . .. τk+1
= Tk→k+1
(t−1)
π1
.. . (t−1)
πk
(t−1)
u1
(t−1)
u2
(t−1)
u3
The determinant of the Jacobian of Tk→k+1 is 8
.. .
(t−1)
π1
(t−1)
(1 − u1 .. .
)
(t−1) (t−1) πk (1 − u1 ) (t−1) u1 = .. . (t−1) u2 .. . (t−1) u3
In order to ensure identifiability we assume the µκ are ordered, i.e. µ1 < . . . < µk .
6.3 The Reversible Jump Algorithm
71
(t−1) k
(1 − u1
)
Next we consider the death move, which is the move in the opposite direction. Assume that we drop the κ-th component. To keep the notation simple we assume κ = k + 1. In order to maintain the constraint Pk (t−1) ι=1 πι = 1 we need to rescale the prior probabilities to πι = πι /(1 − πk+1 ) (ι = 1, . . . , k). The
corresponding transformation is
π1 . . . πk . . = Tk+1→k . u1 u2 u3
(t−1)
π1
.. . (t−1)
πk+1 .. .
(t−1)
µk+1 .. .
(t−1)
τk+1
=
(t−1) /(1 − πk+1 ) .. . (t−1) (t−1) πk /(1 − πk+1 ) .. . (t−1) πk+1 (t−1) µk+1 (t−1) τk+1 (t−1)
π1
−1 It is easy to see that Tk+1→k = Tk→k+1 , and thus the modulus of the Jacobian of Tk+1→k is
1 (t−1)
(1 − πk+1 )k Now that we have specified both the birth move and the complementary death move, we can state the probability of accepting a birth move from a model with k components to a model with k + 1 components. It is (
prior pk+1 fk+1 (θ)l(y1 , . . . , yn |θ)
ρk+1→k /(k + 1) (k + 1)! (t−1) k · min 1, · · (1 − u1 ) prior (t−1) (t−1) (t−1) (t−1) (t−1) k! pk fk (θ )l(y1 , . . . , yn |θ ) ρk→k+1 g1 (u1 )g2 (u2 )g3 (u3 ) The factors (k +1)! and k! are required to account for the fact that the model is not uniquely parametrised, and any permutation of the indexes of the components yields the same model. 1/(k + 1) in the probability of picking one of the k + 1 components in the death step. The probability of accepting the death step is the reciprocal of the above probability of acceptance of the birth step. There are other (and more efficient) possibilities of moving between models of different orders. A very efficient pair of moves corresponds to splitting and (in the opposite direction) merging components. For a more detailed review of this model see (Richardson and Green, 1997).
⊳
)
72
6. The Reversible Jump Algorithm
7. Diagnosing convergence
7.1 Practical considerations The theory of Markov chains we have seen in chapter 3 guarantees that a Markov chain that is irreducible and has invariant distribution f converges to the invariant distribution. The ergodic theorems 4.2 and 5.1 allow for approximating expectations Ef (h(X)) by their the corresponding means T 1X h(X(t) ) −→ Ef (h(X)) T t=1
using the entire chain. In practice, however, often only a subset of the chain (X(t) )t is used: Burn-in Depending on how X(0) is chosen, the distribution of (X(t) )t for small t might still be far from the stationary distribution f . Thus it might be beneficial to discard the first iterations X(t) , t = 1, . . . , T0 . This early stage of the sampling process is often referred to as burn-in period. How large T0 has to be chosen depends on how fast mixing the Markov chain (X(t) )t is. Figure 7.1 illustrates the idea of a burn-in period.
burn-in period (discarded)
Fig. 7.1. Illustration of the idea of a burn-in period.
Thinning Markov chain Monte Carlo methods typically yield a Markov chain with positive autocorrela(t)
(t+τ )
tion, i.e. ρ(Xk , Xk
) is positive for small τ . This suggests building a subchain by only keeping
every m-th value (m > 1), i.e. we consider a Markov chain (Y(t) )t with Y(t) = X(m·t) instead of (X(t) )t . If the correlation ρ(X(t) , X(t+τ ) ) decreases monotonically in τ , then (t)
(t+τ )
ρ(Yk , Yk
(t)
(t+m·τ )
) = ρ(Xk , Xk
(t)
(t+τ )
) < ρ(Xk , Xk
),
74
7. Diagnosing convergence
i.e. the thinned chain (Y(t) )t exhibits less autocorrelation than the original chain (X(t) )t . Thus thinning can be seen as a technique for reducing the autocorrelation, however at the price of yielding a chain (Y(t) )t=1,...⌊T /m⌋ , whose length is reduced to (1/m)-th of the length of the original chain (X(t) )t=1,...,T . Even though thinning is very popular, it cannot be justified when the objective is estimating Ef (h(X)), as the following lemma shows. Lemma 7.1. Let (X(t) )t=1,...,T be a sequence of random variables (e.g. from a Markov chain) with X(t) ∼ f and (Y(t) )t=1,...,⌊T /m⌋ a second sequence defined by Y(t) := X(m·t) . If Varf (h(X(t) )) < +∞,
then
Var
! ⌊T /m⌋ T X 1 1X h(X(t) ) ≤ Var h(Y(t) ) . T t=1 ⌊T /m⌋ t=1
Proof. To simplify the proof we assume that T is divisible by m, i.e. T /m ∈ N. Using T X
h(X(t) ) =
t=1
and
Var
t=1
h(X(t·m+τ1 ) ) = Var
τ =0 t=1
= m · Var
T /m
X t=1
h(X(t·m) ) +
T /m
Thus Var
≤ m2 · Var
X t=1
T /m
for τ1 , τ2 ∈ {0, . . . , m − 1}, we obtain that ! /m T m−1 X X TX Var h(X(t) ) = Var h(X(t·m+τ ) ) t=1
h(X(t·m+τ ) )
τ =0 t=1
T /m
X
/m m−1 X TX
m−1 X
X t=1
η6=τ =0
h(X(t·m+τ2 ) )
T /m
Cov
|
h(X(t·m) ) = m2 · Var
X
h(X(t·m+η) ),
t=1
T /m
X t=1
T /m
≤Var
P
{z
T /m t=1
X t=1
h(X(t·m+τ ) )
h(X(t·m) )
}
h(Y(t) ) .
! ! T /m T /m T T 2 X X X 1 m 1 1X h(X(t) ) = 2 Var h(X(t) ) ≤ 2 Var h(Y(t) ) = Var h(Y(t) ) . T t=1 T T T /m t=1 t=1 t=1
The concept of thinning can be useful for other reasons. If the computer’s memory cannot hold the entire chain (X(t) )t , thinning is a good choice. Further, it can be easier to assess the convergence of the thinned chain (Y(t) )t as opposed to entire chain (X(t) )t .
7.2 Tools for monitoring convergence Although the theory presented in the preceding chapters guarantees the convergence of the Markov chains to the required distributions, this does not imply that a finite sample from such a chain yields a good approximation to the target distribution. As with all approximating methods this must be confirmed in practise.
7.2 Tools for monitoring convergence
75
This section tries to give a brief overview over various approaches to diagnosing convergence. A more detailed review with many practical examples can be diagnofound in (Guihennec-Jouyaux et al., 1998) or (Robert and Casella, 2004, chapter 12). There is an R package (CODA) that provides a vast selection of tools for diagnosing convergence. Diagnosing convergence is an art. The techniques presented in the following are nothing other than exploratory tools that help you judging whether the chain has reached its stationary regime. This section contains several cautionary examples where the different tools for diagnosing convergence fail. Broadly speaking, convergence assessment can be split into the following three tasks of diagnosing different aspects of convergence: Convergence to the target distribution. The first, and most important, question is whether (X(t) )t yields a sample from the target distribution? In order to answer this question we need to assess . . . – whether (X(t) )t has reached a stationary regime, and – whether (X(t) )t covers the entire support of the target distribution. PT (t) Convergence of the averages. Does t=1 h(X )/T provide a good approximation to the expectation Ef (h(X)) under the target distribution?
Comparison to i.i.d. sampling. How much information is contained in the sample from the Markov chain compared to i.i.d. sampling? 7.2.1 Basic plots The most basic approach to diagnosing the output of a Markov Chain Monte Carlo algorithm is to plot the sample path (X(t) )t as in figures 4.4 (b) (c), 4.5 (b) (c), 5.3 (a), and 5.4. Note that the convergence of (X(t) )t is in distribution, i.e. the sample path is not supposed to converge to a single value. Ideally, the plot should be oscillating very fast and show very little structure or trend (like for example figure 4.4). The smoother the plot seems (like for example figure 4.5), the slower mixing the resulting chain is. Note however that this plot suffers from the “you’ve only seen where you’ve been” problem. It is impossible to see from a plot of the sample path whether the chain has explored the entire support of the distribution. Example 7.1 (A simple mixture of two Gaussians). In this example we sample from a mixture of two wellseparated Gaussians f (x) = 0.4 · φ(−1,0.22 ) (x) + 0.6 · φ(2,0.32 ) (x) (see figure 7.2 (a) for a plot of the density) using a random walk Metropolis algorithm with proposed value X = X (t−1) + ε with ε ∼ N(0, Var(ε)). If we choose the proposal variance Var(ε) too small, we
only sample from one population instead of both. Figure 7.2 shows the sample paths of for two choices of Var(ε): Var(ε) = 0.42 and Var(ε) = 1.22 . The first choice of Var(ε) is too small: the chain is very likely to remain in one of the two modes of the distribution. Note that it is impossible to tell from figure 7.2 (b) alone that the chain has not explored the entire support of the target.
⊳
In order to diagnose the convergence of the averages, one can look at a plot of the cumulative averages Pt ( τ =1 h(X (τ ) )/t)t . Note that the convergence of the cumulative averages is — as the ergodic theorems
suggest — to a value (Ef (h(X)). Figures 4.3, and 5.3 (b) show plots of the cumulative averages. An ¯ − Pt h(X (τ ) )/t)t alternative to plotting the cumulative means is using the so-called CUSUMs (h(X) j τ =1 ¯ j = PT h(X (τ ) )/T , which is nothing other than the difference between the cumulative averages with X j τ =1 and the estimate of the limit Ef (h(X)).
1 0
sample path X (t)
2
3 2 1 0
sample path X (t)
0.6 0.4
-1
0.0
-1
0.2
density f (x)
3
7. Diagnosing convergence
0.8
76
-2
-1
1
0
2
3
2000
0
x
4000
6000
8000
10000
2000
0
4000
sample
(a) Density f (x)
6000
8000
10000
sample
(b) Sample path of a random walk (c) Sample path of a random walk Metropolis algorithm with proposal Metropolis algorithm with proposal variance Var(ε) = 0.42 variance Var(ε) = 1.22
Fig. 7.2. Density of the mixture distribution with two random walk Metropolis samples using two different variances Var(ε) of the proposal.
Example 7.2 (A pathological generator for the Beta distribution). The following MCMC algorithm (for details, see Robert and Casella, 2004, problem 7.5) yields a sample from the Beta(α, 1) distribution. Starting with any X (0) iterate for t = 1, 2, . . . 1. With probability 1 − X (t−1) , set X (t) = X (t−1) . 2. Otherwise draw X (t) ∼ Beta(α + 1, 1).
This algorithm yields a very slowly converging Markov chain, to which no central limit theorem applies.
X (τ ) /t τ =1
0.8
1.0 0.8
0.6
0.0
0.2
0.4
cumulative average
Pt
0.6 0.4 0.2
sample path X (t)
⊳
1.0
This slow convergence can be seen in a plot of the cumulative means (figure 7.3 (b)).
0
2000
4000
6000
8000
10000
sample
(a) Sample path X (t)
0
2000
4000
6000
8000
10000
sample
(b) Cumulative means
Pt
τ =1
X (τ ) /t
Fig. 7.3. Sample paths and cumulative means obtained for the pathological Beta generator.
Note that it is impossible to tell from a plot of the cumulative means whether the Markov chain has explored the entire support of the target distribution. 7.2.2 Non-parametric tests of stationarity This section presents the Kolmogorov-Smirnov test, which is an example of how nonparametric tests can be used as a tool for diagnosing whether a Markov chain has already converged.
7.2 Tools for monitoring convergence
77
In its simplest version, it is based on splitting the chain into three parts: (X(t) )t=1,...,⌊T /3⌋ , (X(t) )t=⌊T /3⌋+1,...,2⌊T /3⌋ , and (X(t) )t=2⌊T /3⌋+1,...,T . The first block is considered to be the burn-in period. If the Markov chain has reached its stationary regime after ⌊T /3⌋ iterations, the second and third
block should be from the same distribution. Thus we should be able to tell whether the chain has converged by comparing the distribution of (X(t) )t=⌊T /3⌋+1,...,2⌊T /3⌋ to the one of (X(t) )t=2⌊T /3⌋+1,...,T using suitable nonparametric two-sample tests. One such test is the Kolmogorov-Smirnov test. As the Kolmogorov-Smirnov test is designed for i.i.d. samples, we do not apply it to the (X(t) )t directly, but to a thinned chain (Y(t) )t with Y(t) = X(m·t) : the thinned chain is less correlated and thus closer to being an i.i.d. sample. We can now compare the distribution of (Y(t) )t=⌊T /(3m)⌋+1,...,2⌊T /(3m)⌋ to the one of (Y(t) )t=2⌊T /(3m)⌋+1,...,⌊T /m⌋ using the Kolmogorov-Smirnov statistic 1 K = sup Fˆ(Y(t) )t=⌊T /(3m)⌋+1,...,2⌊T /(3m)⌋ (x) − Fˆ(Y(t) )t=2⌊T /(3m)⌋+1,...,⌊T /m⌋ (x) . x∈R
As the thinned chain is not an i.i.d. sample, we cannot use the Kolmogorov-Smirnov test as a formal statistical test (besides we would run into problems of multiple testing). However, we can use it as an √ informal tool by monitoring the standardised statistic tKt as a function of t.2 As long as a significant proportion of the values of the standardised statistic are above the corresponding quantile of the asymptotic distribution, it is safe to assume that the chain has not yet reached its stationary regime. Example 7.3 (Gibbs sampling from a bivariate Gaussian (continued)). In this example we consider sampling from a bivariate Gaussian distribution, once with ρ(X1 , X2 ) = 0.3 (as in example 4.4) and once with ρ(X1 , X2 ) = 0.99 (as in example 4.5). The former leads a fast mixing chain, the latter a very slowly mixing chain. Figure 7.4 shows the plots of the standardised Kolmogorov-Smirnov statistic. It suggests that the sample size of 10,000 is large enough for the low-correlation setting, but not large enough for the high-correlation setting.
⊳
Note that the Kolmogorov-Smirnov test suffers from the “you’ve only seen where you’ve been” problem, as it is based on comparing (Y(t) )t=⌊T /(3m)⌋+1,...,2⌊T /(3m)⌋ and (Y(t) )t=2⌊T /(3m)⌋+1,...,⌊T /m⌋ . A plot of the Kolmogorov-Smirnov statistic for the chain with Var(ε) = 0.4 from example 7.1 would not reveal anything unusual. 7.2.3 Riemann sums and control variates A simple tool for diagnosing convergence of a one-dimensional Markov chain can be based on the fact that 1
The two-sample Kolmogorov-Smirnov test for comparing two i.i.d. samples Z1,1 , . . . , Z1,n and Z2,1 , . . . , Z2,n is based on comparing their empirical CDFs n 1X I(−∞,z] (Zk,i ). Fˆk (z) = n i=1
The Kolmogorov-Smirnov test statistic is the maximum difference between the two empirical CDFs: K = sup |Fˆ1 (z) − Fˆ2 (z)|. z∈R
For n → ∞ the CDF of
√
n · K converges to the CDF R(k) = 1 −
2
+∞ X
(−1)i−1 exp(−2i2 k2 ).
i=1
Kt is hereby the Kolmogorov-Smirnov statistic obtained from the sample consisting of the first t observations only.
7. Diagnosing convergence
7 6 4 3 1
2
Standardised KS statistic
5
√
tKt
1.6 1.4 1.2 1.0 0.8 0.6
Standardised KS statistic
√
tKt
1.8
78
0
2000
4000
6000
8000
10000
0
sample
2000
4000
6000
8000
10000
sample
(a) ρ(X1 , X2 ) = 0.3
(b) ρ(X1 , X2 ) = 0.99 (5·t)
Fig. 7.4. Standardised Kolmogorov-Smirnov statistic for X1 for two different correlations.
Z
from the Gibbs sampler from the bivariate Gaussian
f (x) dx = 1.
E
We can estimate this integral by the Riemann sum T X t=2
(X [t] − X [t−1] )f (X [t] ),
(7.1)
where X [1] ≤ . . . ≤ X [T ] is the ordered sample from the Markov chain. If the Markov chain has explored all
the support of f , then (7.1) should be around 1. Note that this method, often referred to as Riemann sums (Philippe and Robert, 2001), requires that the density f is known inclusive of normalisation constants. Example 7.4 (A simple mixture of two Gaussians (continued)). In example 7.1 we considered two randomwalk Metropolis algorithms: one (Var(ε) = 0.42 ) failed to explore the entire support of the target distribution, whereas the other one (Var(ε) = 1.22 ) managed to. The corresponding Riemann sums are 0.598 and 1.001, clearly indicating that the first algorithm does not explore the entire support.
⊳
Riemann sums can be seen as a special case of a technique called control variates. The idea of control variates is comparing several ways of estimating the same quantity. As long as the different estimates disagree, the chain has not yet converged. Note that the technique of control variates is only useful if the different estimators converge about as fast as the quantity of interest — otherwise we would obtain an overly optimistic, or an overly conservative estimate of whether the chain has converged. In the special case of the Riemann sum we compare two quantities: the constant 1 and the Riemann sum (7.1). 7.2.4 Comparing multiple chains A family of convergence diagnostics (see e.g. Gelman and Rubin, 1992; Brooks and Gelman, 1998) is based on running L > 1 chains — which we will denote by (X(1,t) )t , . . . , (X(L,t) )t — with overdispersed3 starting values X(1,0) , . . . , X(L,0) , covering at least the support of the target distribution. All L chains should converge to the same distribution, so comparing the plots from section 7.2.1 for the L different chains should not reveal any difference. A more formal approach to diagnosing whether the L chains are all from the same distribution can be based on comparing the inter-quantile distances. 3
i.e. the variance of the starting values should be larger than the variance of the target distribution.
7.2 Tools for monitoring convergence
79
We can estimate the inter-quantile distances in two ways. The first consists of estimating the interPL (L,·) quantile distance for each of the L chain and averaging over these results, i.e. our estimate is l=1 δα /L, (L,·)
where δα
(l,t)
is the distance between the α and (1 − α) quantile of the l-th chain(Xk
)t . Alternatively, we
can pool the data first, and then compute the distance between the α and (1 − α) quantile of the pooled
data. If all chains are a sample from the same distribution, both estimates should be roughly the same, so their ratio Sˆαinterval =
PL
(l) l=1 δα /L (·) δα
can be used as a tool to diagnose whether all chains sampled from the same distribution, in which case the ratio should be around 1. Alternatively, one could compare the variances within the L chains to the pooled estimate of the variance (see Brooks and Gelman, 1998, for more details). Example 7.5 (A simple mixture of two Gaussians (continued)). In the example of the mixture of two Gaussians we will consider L = 8 chains initialised from a N(0, 102 ) distribution. Figure 7.5 shows the sample paths of the 8 chains for both choices of Var(ε). The corresponding values of Sˆinterval are: 0.05
Var(ε) = 0.42
:
Var(ε) = 1.22
:
0.9789992 interval = 0.2696962 Sˆ0.05 = 3.630008 3.634382 interval = 0.996687. Sˆ0.05 = 3.646463
3 2 1 -2
-1
0
sample paths X (l,t)
1 0 -2
-1
sample paths X (l,t)
2
3
⊳
0
2000
4000
6000
8000
10000
0
2000
sample
4000
6000
8000
10000
sample
(a) Var(ε) = 0.42
(b) Var(ε) = 1.22
Fig. 7.5. Comparison of the sample paths for L = 8 chains for the mixture of two Gaussians.
Note that this method depends crucially on the choice of initial values X(1,0) , . . . , X(L,0) , and thus can easily fail, as the following example shows. Example 7.6 (Witch’s hat distribution). Consider a distribution with the following density: f (x1 , x2 ) ∝
(
(1 − δ)φ(µ,σ2 ·I) (x1 , x2 ) + δ 0
if x1 , x2 ∈ (0, 1) else,
80
7. Diagnosing convergence
which is a mixture of a Gaussian and a uniform distribution, both truncated to [0, 1] × [0, 1]. Figure 7.6
illustrates the density. For very small σ 2 , the Gaussian component is concentrated in a very small area around µ. The conditional distribution of X1 |X2 is ( (1 − δx2 )φ(µ,σ2 ·I) (x1 , x2 ) + δx2 f (x1 |x2 ) = 0
for x1 ∈ (0, 1)
else.
δ . δ + (1 − δ)φ(µ2 ,σ2 ) (x2 ) Assume we want to estimate P(0.49 < X1 , X2 ≤ 0.51) for δ = 10−3 , µ = (0.5, 0.5)′ , and σ = 10−5
with δx2 =
using a Gibbs sampler. Note that 99.9% of the mass of the distribution is concentrated in a very small area around (0.5, 0.5), i.e. P(0.49 < X1 , X2 ≤ 0.51) = 0.999.
Nonetheless, it is very unlikely that the Gibbs sampler visits this part of the distribution. This is due
to the fact that unless x2 (or x1 ) is very close to µ2 (or µ1 ), δx2 (or δx1 ) is almost 1, i.e. the Gibbs sampler only samples from the uniform component of the distribution. Figure 7.6 shows the samples obtained from 15 runs of the Gibbs sampler (first 100 iterations only) all using different initialisations. On average only 0.04% of the sampled values lie in (0.49, 0.51) × (0.49, 0.51) yielding an estimate of ˆ P(0.49 < X1 , X2 ≤ 0.51) = 0.0004 (as opposed to P(0.49 < X1 , X2 ≤ 0.51) = 0.999).
It is however close to impossible to detect this problem with any technique based on multiple initial-
isations. The Gibbs sampler shows this behaviour for practically all starting values. In figure 7.6 all 15 starting values yield a Gibbs sampler that is stuck in the “brim” of the witch’s hat and thus misses 99.9% ⊳
0.8
1.0
of the probability mass of the target distribution.
0.0
x1
0.2
0.4
x2
0.6
, x2 ) density f (x1
x2
0.0
0.2
0.4
0.6
0.8
1.0
x1
(a) Density for δ = 0.2, µ = (0.5, 0.5)′ , and σ = 0.05
(b) First 100 values from 15 samples using different starting values. (δ = 10−3 , µ = (0.5, 0.5)′ , and σ = 10−5 )
Fig. 7.6. Density and sample from the witch’s hat distribution.
7.2.5 Comparison to i.i.d. sampling and the effective sample size MCMC algorithms typically yield a positively correlated sample (X(t) )t=1,...,T , which contains less information than an i.i.d. sample of size T . If the (X(t) )t=1,...,T are positively correlated, then the variance of the average
7.2 Tools for monitoring convergence
! T 1X h(X(t) ) T t=1
Var
81
(7.2)
is larger than the variance we would obtain from an i.i.d. sample, which is Var(h(X(t) ))/T . The effective sample size (ESS) allows to quantify this loss of information caused by the positive correlation. The effective sample size is the size an i.i.d. would have to have in order to obtain the same variance (7.2) as the estimate from the Markov chain (X(t) )t=1,...,T . In order to compute the variance (7.2) we make the simplifying assumption that (h(X(t) ))t=1,...,T is from a second-order stationary time series, i.e. Var(h(X(t) )) = σ 2 , and ρ(h(X(t) ), h(X(t+τ ) )) = ρ(τ ). Then ! T T X 1 X 1X h(X(t) ) Var(h(X(t) )) +2 = Cov(h(X(s) ), h(X(t) )) Var 2 {z } {z } | | T t=1 T t=1 1≤s
=σ 2
=
If
P+∞
τ =1
σ2 T2
T +2
T −1 X τ =1
!
(T − τ )ρ(τ )
σ2 = T
=σ 2 ·ρ(t−s)
1+2
T −1 X τ =1
! τ ρ(τ ) . 1− T
|ρ(τ )| < +∞, then we can obtain from the dominated convergence theorem4 that T · Var
! T 1X h(X(t) ) −→ σ 2 T t=1
1+2
+∞ X
τ =1
!
ρ(τ )
as T → ∞. Note that the variance would be σ 2 /TESS if we were to use an i.i.d. sample of size TESS . We can now obtain the effective sample size TESS by equating these two variances and solving for TESS , yielding TESS =
1+2
1 P+∞
τ =1
ρ(τ )
· T.
If we assume that (h(X(t) ))t=1,...,T is a first-order autoregressive time series (AR(1)), i.e. ρ(τ ) = P+∞ ρ(h(X(t) ), h(X(t+τ ) )) = ρ|τ | , then we obtain using 1 + 2 τ =1 ρτ = (1 + ρ)/(1 − ρ) that 1−ρ · T. 1+ρ
TESS =
Example 7.7 (Gibbs sampling from a bivariate Gaussian (continued)). In examples 4.4) and 4.5 we ob(t−1)
tained for the low-correlation setting that ρ(X1 TESS =
(t)
, X1 ) = 0.078, thus the effective sample size is
1 − 0.078 · 10000 = 8547. 1 + 0.078 (t−1)
For the high-correlation setting we obtained ρ(X1 considerably smaller: TESS =
4
(t)
, X1 ) = 0.979, thus the effective sample size is
1 − 0.979 · 10000 = 105. 1 + 0.979
see e.g. Brockwell and Davis (1991, theorem 7.1.1) for details.
⊳
82
7. Diagnosing convergence
8. Simulated Annealing
8.1 A Monte-Carlo method for finding the mode of a distribution So far we have studied various methods that allow for appoximating expectations E(h(X)) by ergodic averPT (t) ages T1 t=1 h(Xi ). This section presents an algorithm for finding the (global) mode(s) of a distribution1 . In section 8.2 we will extend this idea to finding global extrema of arbitrary functions.
We could estimate the mode of a distribution by the X(t) with maximal density f (X(t) ), this is however a not very efficient strategy. A sample from a Markov chain with f (·) samples from the whole distribution and not only from the mode(s). This suggests modifying the distribution such that it is more concentrated around the mode(s). One way of achieving this is to consider β
f(β) (x) ∝ (f (x)) for very large values of β.
Example 8.1 (Normal distribution). Consider the N(µ, σ 2 ) distribution with density (x − µ)2 (x − µ)2 1 ∝ exp − . exp − f(µ,σ2 ) (x) = √ 2σ 2 2σ 2 2πσ 2 It is easy to see that the mode of the N(µ, σ 2 ) distribution is µ. We have that β β (x − µ)2 (x − µ)2 ∝ f(µ,σ2 /β) (x). = exp − f(µ,σ2 ) (x) ∝ exp − 2σ 2 2σ 2 /β
In other words, the larger β is chosen, the more concentrated the distribution will be around the mode µ. Figure 8.1 illustrates this idea.
⊳
The result we have obtained of the Gaussian distribution in the above example actually holds in general. For β → ∞ the distribution defined by the density f(β) (x) converges to a distribution that has
all mass on the mode(s) of f (see figure 8.2 for an example). It is instructive to see informally why this is the case when considering a discrete random variable with probability density function p(·) and finite support E. Denote with E ∗ the set of modes of p, i.e. p(ξ) ≥ p(x) for all ξ ∈ E ∗ and x ∈ E, and with
m := p(ξ) with ξ ∈ E ∗ . Then
(p(x)/m)β (p(x))β P P =P p(β) (x) = P β β β x∈E ∗ (p(x)) + x∈E\E ∗ (p(x)) x∈E ∗ 1 + x∈E\E ∗ (p(x)/m) 1
β→+∞
−→
(
1/|E ∗ | 0
if x ∈ E ∗ ifx 6∈ E ∗
In this chapter we define the mode(s) of a distribution to be the set of global maxima of the density, i.e. {ξ : f (ξ) ≥ f (x) ∀x}
84
8. Simulated Annealing
100 φ(0,1) (x) ∝ φ(0,1/100) (x)
10 φ(0,1) (x) ∝ φ(0,1/10) (x)
1000 φ(0,1) (x) ∝ φ(0,1/1000) (x)
6 0
2
4
density
8
10
12
φ(0,1) (x)
-2
-1
1
0
2 -2
-1
x
1
0
2 -2
-1
x
1
0
2 -2
-1
x
1
0
2
x
Fig. 8.1. Density of the N(0, 1) raised to increasing powers. The areas shaded in grey represent 90% of the probability mass.
(f (x))3
f (x)
-2
-1
0 x
1
2
-2
-1
0 x
(f (x))27
(f (x))9
1
2
-2
-1
0 x
1
2
-2
-1
0
1
2
x
Fig. 8.2. An arbitrary multimodal density raised to increasing powers. The areas shaded in grey reach from the 5% to the 95% quantiles.
8.1 A Monte-Carlo method for finding the mode of a distribution
85
In the continuous case the distribution is not uniform on the nodes (see Hwang, 1980, for details). We can use a random-walk Metropolis algorithm to sample from f(β) (·). The probability of accepting a move from X(t−1) to X would be ( β ) f(β) (X) f (X) = min 1, . min 1, f(β) (X(t−1) ) f (X(t−1) ) Note that this probability does not depend on the (generally unknown) normalisation constant of f(β) (·). It is however difficult to directly sample from f(β) for large values of β: for β → ∞ the probability of accepting a newly proposed X becomes 1 if f (X) > f (X (t−1) ) and 0 otherwise. Thus X (t) converges to a
local extrema of the density f , however not necessarily a mode of f (i.e. a global extremum of the density). Whether X (t) gets caught in a local extremum or not, depends on whether we can reach the mode from the local extrema of the density within one step. The following example illustrates this problem. Example 8.2. Consider the following simple optimisation problem of finding the mode of the distribution defined on {1, 2, . . . , 5} by
0.4 p(x) = 0.3 0.1
for x = 2 for x = 4 for x = 1, 3, 5.
Figure 8.3 visualises this distribution. Clearly, the (global) mode of p(x) is at x = 2. Assume we want to sample from p(β) (x) ∝ p(x)β using a random walk Metropolis algorithm with proposed value X =
X (t−1) + ε with P(ε = ±1) = 0.5 for X (t−1) ∈ {2, 3, 4}, P(ε = +1) = 1 for X (t−1) = 1, and P(ε = −1) = 1
for X (t−1) = 5. In other words, we can either move one to the left, stay in the current value (when the proposed value is rejected), or move one to the right. Note that for β → +∞ the probability for accepting
a move from 4 to 3 converges to 0, as p(4) > p(3). As the Markov of chain can only move from 4 to 2 only via 3, it cannot escape the local extremum at 4 for β → +∞.
⊳
p(x) 0.4 0.3
0.1 x 1
2
3
4
5
Fig. 8.3. Illustration of example 8.2
For large β the distribution f(β) (·) is concentrated around the modes, however at the price of being difficult to sample from: the resulting Markov chain has very poor mixing properties: for large β the algorithm can hardly move away from a local extremum surrounded by areas of low probability2 .
2
The density of such a distribution would have many local extrema separated by areas where the density is effectively 0.
86
8. Simulated Annealing
The key idea of simulated annealing3 (Kirkpatrick et al., 1983) is to sample from a target distribution that changes over time: f(βt ) (·) with βt → +∞. Before we consider different strategies for choosing the
sequence (βt ), we generalise the framework developed so far to finding the global extrema of arbitrary functions.
8.2 Minimising an arbitrary function Consider that we want to find the global minimum of a function h : E → R. Finding the global minimum of H(x) is equivalent to finding the mode of a distribution
f (x) ∝ exp(−H(x)) for x ∈ E, if such a distribution exists.4 . As in the previous section we can raise f to large powers to obtain a distribution βt
f(βt ) (x) = (f (x))
∝ exp(−βt · H(x)) for x ∈ E.
We hope to find the (global) minimum of H(x), which is the (global) mode of the distribution defined by fβt (x), by sampling from a Metropolis-Hastings algorithm. As suggested above we let βt → +∞. This yields the following algorithm:
(0)
(0)
Algorithm 8.1 (Simulated Annealing). Starting with X(0) := (X1 , . . . , Xp ) and β (0) > 0 iterate for t = 1, 2, . . . 1. Increase β (t−1) to β (t) (see below for different annealing schedules) 2. Draw X ∼ q(·|X(t−1) ). 3. Compute
α(X|X
(t−1)
(
) = min 1, exp −βt (H(X) − H(X
(t−1)
q(X(t−1) |X) )) · q(X|X(t−1) )
)
.
4. With probability α(X|X(t−1) ) set X(t) = X, otherwise set X(t) = X(t−1) . If a random walk Metropolis update is used (i.e. X = X(t−1) + ε with ε ∼ g(·) for a symmetric g),
then the probability of acceptance becomes
n o α(X|X(t−1) ) = min 1, exp −βt (H(X) − H(X(t−1) )) .
Using the same arguments as in the previous section, it is easy to see that the simulated annealing algorithm converges to a local minimum of H(·). Whether it will be able to find the global minimum depends on how slowly we let the inverse temperature β go to infinity. Logarithmic tempering When choosing βt =
log(1+t) , β0
the inverse temperature increases slow enough that
global convergence results can be established for certain special cases. Hajek (1988) established global convergence when H(·) is optimised over a finite set and logarithmic tempering with a suitably large β0 is used. Assume we choose β0 = ∆H with ∆H := maxx,x′ ∈E |H(x) − H(x′ )|. Then the probability of reaching state x in the t-th step is
3
4
The term annealing comes from metallurgy and refers to the technique of letting metal cool down slowly in order to produce a tougher metal. Taking up this analogy, 1/β is typically referred to as temperature, β as inverse temperature. In this framework, finding the mode of a density f corresponds to finding the minimum of − log(f (x))
8.3 Using annealing strategies for imporving the convergence of MCMC algorithms
P(X (t) = x) =
X ξ
87
P(X (t) = x|X (t−1) = ξ) P(X (t−1) = ξ) ≥ exp(−βt−1 ∆H)/|E| {z } | ≥exp(−βt−1 ∆H)/|E|
Using the logarithmic tempering schedule we obtain P(X (t) = x) ≥ t/|E| and thus the expected number of visits to state x is
∞ X t=0
P(X (t) = x) ≥
∞ X
t/|E| = +∞.
t=0
Thus every state is recurrent. As β increases we however spend an ever increasing amount of time in the global minima of x. On the one hand visiting very state x infinitely often implies that we can escape from local minima. On the other hand, this implies as well that we visit every state x (regardless of how large H(x) is) infinitely often. In other words, the reason why simulated annealing with logarithmic tempering works, is that it still behaves very much like an exhaustive search. However the only reason why we consider simulated annealing is that exhaustive search would be too slow! For this reason, logarithmic tempering has little practical relevance. Geometric tempering A popular choice is βt = αt · β0 for some α > 1. Example 8.3. Assume we want to find the maximum of the function ( 2 |x| mod 2 2 2 H(x) = (x − 1) − 1 + 3 · s(11.56 · x ), with s(x) = 2 − |x| mod 2
for 2k ≤ |x| ≤ 2k + 1
for 2k + 1 ≤ |x| ≤ 2(k + 1)
for k ∈ N0 . Figure 8.4 (a) shows H(x) for x ∈ [−1, 3]. The global minimum of H(x) is at x = 0.
We simulated annealing with a geometric tempering with β0 = 1 and βt = 1.001βt−1 and a random √ walk Metropolis algorithm with ε ∼ Cauchy(0, 0.1). Figure 8.4 (b) shows the first 1,000 iterations of the
Markov chain yielded by the simulated annealing algorithm. Note that when using a Gaussian distribution
with small enough a variance the simulated annealing algorithm is very likely to remain in the local minimum at x ≈ 1.8.
⊳
Note that there is no guarantee that the simulated annealing algorithm converges to the global minimum of H(x) in finite time. In practise, it would unrealistic to expect simulated annealing to converge to a global minimum, however in most cases it will find a “good” local minimum.
8.3 Using annealing strategies for imporving the convergence of MCMC algorithms As we have seen in the previous chapter, Markov chains sampling form distributions whose components are separated by areas of low probability typically have very poor mixing properties. We can use strategies very much like the ones used in the simulated annealing algorithm to bridge these “barriers” of low probability. The basic idea is to consider distributions f(β) (x) = (f (x))
(β)
for small β (0 < β < 1), as opposed to large
β as with simulated annealing. Choosing β < 1 makes the distribution more “spread out” and thus makes it easier to move from one part of the domain to another. Whilst it might be easier to sample from f(β) (·) for β < 1, we are actually not interested in this sample, but in a sample from f (·) itself (i.e. β = 1). One way to accommodate this is to consider an ensemble of distributions (f(β1 ) (·), . . . , f(βr ) (·)) with β1 ≤ . . . ≤ βr = 1, and draw Markov chains from each member
of the ensemble. The key idea is to let these chains “interact” by swapping values between neighbouring
8. Simulated Annealing
0
2
4
H(x)
6
8
10
88
-1
1
0
2
3
x
1.0 1.0 0.0
0.5
H(X (t) )
1.5
2.0
0.0
0.5
X (t)
1.5
2.0
2.5
(a) Objective function
0
200
400
600
800
1000
t (b) Resulting Markov chain (X (t) ) and sequence h(X (t) )
Fig. 8.4. Objective function H(x) from example 8.3 and first 1,000 iterations of the Markov chain yielded by simulated annealing.
8.3 Using annealing strategies for imporving the convergence of MCMC algorithms
89
members, which can be formalised as a Metropolis Hastings algorithm on the augmented distribution Qr f (x1 , . . . , xr ) = ρ=1 f(βρ ) (xρ ). The distribution of interest is the marginal distribution of Xr , which has f (·) as distribution. This setups allows members with small β to “help” members with large β to cross
barriers of low probability, and thus can considerably improve the mixing properties. Chapter 10 of (Liu, 2001) gives a more detailed overview over such approaches.
90
8. Simulated Annealing
9. Hidden Markov Models
Hidden Markov Models (HMMs) are a broad class of models which have found wide applicability in fields as diverse as bioinformatics and engineering. We introduce them briefly here as a broad class of Monte Carlo algorithms have been developed to deal with inference in models of this sort. HMMs will be covered in detail in the Graphical Models lecture course; this introduction is far from complete and is intended only to provide the background essential to understand those Monte Carlo techniques. An enormous amount of research has gone into inference in models of this sort, and is ongoing today. Indeed, an extensive monograph summarising the field was published recently (Capp´e et al., 2005). We shall take HMMs to be a class of models in which an underlying Markov process, (Xt ) (usually termed either the state or signal process) forms the object of inference, whilst a related process, (Yn ) (usually known as the observation process), is observed and provides us with information about the process of interest. In order for such a pair of processes to form an HMM, they must have the conditional independence properties illustrated in figure 9.1. That is, the state process is a Markov chain so the distribution of Xn is, conditional upon Xn−1 independent of all previous values of X and the distribution of each element of the observation process depends only upon the value of the state process at that time (more formally, P(Yn ∈ A|X1:n , Y1:n−1 ) ≡ P(Yn ∈ A|Xn )).
x1
x2
x3
x4
x5
x6
y1
y2
y3
y4
y5
y6
Fig. 9.1. The conditional independence structure of the first few states and observations in a Hidden Markov Model.
The term State-Space Model (SSM) is more popular in some application areas. Sometimes a distinction is made between the two, with HMMs being confined to the class of models in which the hidden states take values in a discrete set and SSMs including the case in which the hidden process takes values in
92
9. Hidden Markov Models
a continuous space (that is, whether the underlying Markov chain lies in a discrete or continuous state space); we will not make this distinction here. It seems clear that a great many different problems in statistics, signal processing and related areas can be cast into a form based upon the HMM. Essentially, whenever one has a problem in which one wishes to perform inference as observations arrive (typically in real time) this is the first class of models which one considers using. It is a tribute to the flexibility and descriptive power of such models that it is relatively unusual for a more complex model to be required in order to capture the key properties of imperfectly-observed discrete time processes. Before considering the particular inferential problems which occur most frequently when dealing with such models, it is worth noticing the particular feature of these models which distinguishes estimation and inference tasks from those considered more generally in the statistics literature, and which makes the application of standard techniques such as MCMC inappropriate much of the time. That feature is the particular temporal character of the models, and those things which are typically inferred in them. The model is designed to capture the essence of systems which move from one state to another, generating an observation after each move. Typically, one receives the observations generated by these systems in real time and wishes to estimate the state or other quantities in real time: and this imposes the particular requirement that the computational cost of the estimates remains fixed and is not a function of the size of the data set (otherwise increasing computational power will be required every time an observation is received). It is for this reason that the HMM is considered here, and a collection of popular Monte Carlo techniques for performing inference in systems of this type will be introduced in chapter 10. In order to illustrate the broad applicability of this framework, and to provide some intuition into the nature of the model, it is interesting to consider some examples.
9.1 Examples 9.1.1 Tracking Arguably the canonical example of the HMM is tracking. The states in the HMM comprise a vector of coordinates (which may include information such as velocity and acceleration as well as physical position, or additional information such as which type of object it is) for some object. The transitions of the Markov chain correspond to a dynamic model for that object, and the distribution of the observations conditional upon the state vector correspond to the measurement model encompassing systematic and random errors. As a definite example, consider something known as the the two dimensional (almost) constant velocity model. The state vector contains the position and velocity of an object in two dimension:
sx
sy Xt = ux uy
Xt+1
1
0 = 0 0
0
∆t
1
0
0
1
0
0
0
∆t Xt + Wt 0 1
where ∆t is the measurement interval and Wt is an independent random variable which allows for variation in velocity. An additive error model is typically assumed in this type of application, so, Yt = Xt + Vt where Vt is an independent random variable with a distribution appropriate to describe measurement noise.
9.2 State Estimation: Optimal Filtering, Prediction and Smoothing
93
9.1.2 Statistical Signal Processing Much of modern signal processing is model based, and estimates of signal parameters are obtained using statistical methods rather than more traditional signal processing techniques. One example might be attempting to infer an audio signal given a model for the time sequence Xt corresponding to audio amplitude and a measurement model which allows for a number of forms of noise. There are innumerable occurrences of models of this type.
9.2 State Estimation: Optimal Filtering, Prediction and Smoothing Perhaps the most common inference problem which occurs with HMMs is the estimation of the current state value (or the sequence of states up to the present time) based upon the sequence of observations observed so far. Bayesian approaches are typically employed as they provide a natural and flexible approach to the problem. In such an approach, one attempts to obtain the conditional distribution of the state variables given the observations. It is convenient to assume that both the state transition (i.e. the Markov kernel) and the conditional distributions of the measurements (the likelihood) admit densities with respect to a suitable version of Lebesgue measure. Writing the density of Xt+1 conditional upon knowledge that Xt = xt as ft+1 (·|xt ) and that of the likelihood as gt+1 (yt+1 |·), and interpreting p(x1:1 |y1:0 ) as some prior
distribution p(x1 ), we may write the density of the distribution of interest via Bayes rule in the following form:
p(x1:t |y1:t ) ∝ p(x1:t |y1:t−1 )gt (yt |xt ) = p(x1:t−1 |y1:t−1 )ft (xt |xt−1 )gt (yt |xt ) This problem is known as filtering and p(x1:t |y1:t ) is known as the smoothing distribution. In some
literature this distribution is known as the filtering distribution, here that term is reserved for the final time marginal p(xt |y1:t ). It is usual to decompose this recursive solution into two parts. The first is termed prediction and corresponds to the estimation of the distribution of the first n states given only n − 1
observations. The second is usually termed the update step (or sometimes, the data update step) and it involves correcting the predicted distribution to take into account the next observation: p(x1:t |y1:t−1 ) = p(x1:t−1 |y1:t−1 )ft (xt |xt−1 ) p(x1:t |y1:t ) = R
p(x1:t |y1:t−1 )gt (yt |xt ) p(x1:t |y1:t−1 )gt (yt |xt )dx1:t
Prediction Update.
Certain other distributions are also of interest. These can be divided into smoothing distributions, in which one wishes to estimate the distribution of some sequence of states conditional on knowledge of the observation sequence up to some stage in the future, and prediction distributions, in which one wishes to estimate the distribution of some group of future states conditional on knowledge of the observations up to the present. Formally, smoothing can be considered to be the estimation of p(xl:k |y1:t ) when l ≤ k ≤ n and such
estimates can all be obtained from the principle smoothing distribution: Z p(xl:k |y1:t ) = p(x1:t |y1:t )dx1:l−1 dxk+1:t .
94
9. Hidden Markov Models
Similarly, prediction can be viewed as the estimation of p(xl:k |y1:t ) when n ≤ k and l ≤ k. Noticing that when k ≥ n,
p(x1:k |y1:t ) = p(x1:t |y1:t )
k Y
j=t+1
fj (xj |xj−1 )
it is analytically straightforward to obtain any prediction density by marginalisation (that is integration over the variables which are not of interest) of these densities. Whilst this appears to be a solution to the problem of estimating the distributions of the state variables, and hence of estimating those parameters using either the posterior mean or the maximum a posteriori estimator, in practice the problem is far from being solved. The integrals which appear in the update step in particular is generally intractable. In the linear Gaussian case (in which the distribution of Xt |Xt−1 and Yt |Xt can both be treated as
Gaussian distributions centred at a point obtained by a linear transformation of the conditioning variable) it is possible to perform the integral analytically and hence to obtain closed form expressions for the distribution of interest. This leads to the widely used Kalman filter which is applicable to an important but restricted collection of problems. Various approaches to obtain approximate solutions by extensions of this approach in various ways (propagating mixtures of Gaussians or using local linearisation techniques, for example) are common in the engineering literature. In small discrete state spaces the integral is replaced with a summation and it becomes formally possible to exactly evaluate the distributions of interest. Indeed, a number of algorithms for performing inference efficiently in discrete HMMs do exist, but these tend to become impractical when the state space is large. The most common approach to dealing with this estimation problem in general models is via a Monte Carlo technique which is often termed the particle filter but which is referred to in this course by the more general term of sequential Monte Carlo. These techniques will be introduced in the next chapter. The approach has a simple motivation. We wish to calculate, recursively in time, a sequence of distributions each of which can be obtained from the previous distribution if it is possible to carry out particular integrals with respect to that distribution. In order to approximate the recursion, it is possible to approximate each distribution with a collection of weighted samples and, using this collection to calculate the necessary integrals, to propagate this collection of samples through time in such a way that it approximates each of the distributions of interest one after another.
9.3 Static Parameter Estimation Thus far, it has been assumed that the system dynamics and the measurement system are both exactly characterised. That is, it has been assumed that given xt , the distributions of Xt+1 and Yt are known. In practice, in many situations this will not be the case. As in any inference problem, it is necessary to have some idea of the structure of the system in which inference is being performed, and the usual approach to incorporating some degree of uncertainty is the introduction of some unknown parameters. If the system dynamics or measurement distributions depend upon some collection of unknown parameters θ, then the problem of state estimation, conditional upon a particular value of those parameters is precisely that introduced in section 9.2. If θ is not known then Bayesian inference is centred around the approximation of the sequence of distributions: p(x1:t , θ|y1:t ) ∝ p(θ)p(x1:t |θ)p(y1:t |x1:t , θ).
9.3 Static Parameter Estimation
95
In some instances, the latent state sequence is not of statistical interest and one is really interested in approximation of the marginal distributions: p(θ|y1:t ) =
Z
p(x1:t , θ|y1:t )dx1:t ,
however, this integral is generally intractable and it is typically necessary to obtain Monte Carlo estimates of p(x1:t , θ|y1:t ) and to use the approximation to p(θ|y1:t ) provided by these samples. Estimation of static parameters appears a relatively simple problem, and one which does not much increase the dimensionality of the space of interest. Indeed, in scenarios in which one is presented with a fixed quantity of data MCMC algorithms are able to provide approximation to the distribution of interest this is the case – numerous algorithms for estimating parameters in latent variable models by various mechanisms exist and can be shown to work well. The difficulty is that inference problems of practical interest generally involve sequential estimation of the value of the parameters based upon the collection of observations available at that time. In order to do this, it is necessary to develop an efficient mechanism for updating the estimate of p(θ, x1:t−1 |y1:t ) to provide an estimate of p(θ, x1:t |y1:t ).
96
9. Hidden Markov Models
10. Sequential Monte Carlo
This chapter is concerned with techniques for iteratively obtaining samples from sequences of distributions by employing importance sampling, and resampling, techniques. The principle application of these techniques is the approximate solution of the filtering, prediction and smoothing problems in HMMs (see chapter 9).
10.1 Importance Sampling Revisited Recall from section 2.3, that importance sampling is a technique for approximating integrals under one probability distribution using a collection of samples from another, instrumental distribution. This was presented using the importance sampling identity that, given a distribution of interest π, and some sampling distribution µ, over some space E, and any integrable function h : E → R Z Z Z π(x) h(x) dx = µ(x)w(x)h(x) dx = Eµ (w(X) · h(X)), (10.1) Eπ (h(X)) = π(x)h(x) dx = µ(x) µ(x) | {z } =:w(x)
if µ(x) > 0 for (almost) all x with π(x) · h(x) 6= 0.
The strong law of large numbers was then employed to verify that, given a collection of iid samples,
{Xi }N i=1 , distributed according to µ, the empirical average of w · h evaluated at the sampled values Xi converges, with probability 1, to the integral of h under the target distribution, π. 10.1.1 Empirical Distributions It is convenient to introduce the concept of random distributions in order to consider some of these things in further detail and to motivate much of the material which follows. If we have a collection of points {xi }N i=1 , in E, then we may associated the following distribution over E with those points: η N (x) =
N 1 X δx (x), N i=1 i
where, for any point x ∈ E, δx is the singular distribution which places all of its mass at that point. A rigorous treatment requires measure-theoretic arguments, but is is possible – albeit not entirely rigorous – to define this distribution in terms of its integrals: for any bounded measurable function:, h : E → R, Z h(y)δx (y) = h(x).
98
10. Sequential Monte Carlo
Given a collection of points and associated positive, real-valued weights, {xi , wi }N i=1 , we can also define
another empirical distribution:
η˜N (x) =
N P
wi δxi (x)
i=1
N P
, wi
i=1
where the normalising term may be omitted if the collection of weights is, itself, properly normalised. When those collections of points are actually random samples, and the weights are themselves random variables (typically deterministic functions of the points with which they are associated), one obtains a random distribution (or random measure). These are complex structures, a full technical analysis of which falls somewhat outside the scope of this course, but we will make some limited use of them here. Within this framework, one could consider, for example, the importance sampling identity which was introduced above in terms of the empirical measure µN (x) associated with the collection of samples from µ; we can approximate integrals with respect to µ by integrals with respect to the associated empirical measure:
Z
h(x)µ(x)dx ≈
Z
h(x)µN (x)dx.
This approximation is justified by asymptotic results which hold as the number of samples tends to infinity, particularly the law of large numbers and related results, including variants of the GlivenkoCantelli theorem. If we consider propagating this approximation through (10.1), then we find that: Z Z Z π(x) h(x) dx = µ(x)w(x)h(x) dx ≈ EµN (w(X) · h(X)) Eπ (h(X)) = π(x)h(x) dx = µ(x) µ(x) Z Z Z π(x) h(x) dx ≈ µN (x)w(x)h(x) dx = EµN (w(X) · h(X)) Eπ (h(X)) = π(x)h(x) dx = µ(x) µ(x) Z Z Z π(x) h(x) dx = µN (x)w(x)h(x) dx = EµN (w(X) · h(X)). Eπ (h(X)) = π(x)h(x) dx ≈ µN (x) µ(x) {z } | =:π N (x)
Perhaps unsurprisingly, the approximation amounts, precisely, to approximating the target distribution, π, with the random distribution π N (x) =
N X
Wi δXi (x),
i=1
where Wi := w(Xi ) = π(Xi )/µ(Xi ) are importance weights. Note that we have assumed that the importance weights are properly normalised here; the self-normalised version of importance sampling can be represented in essentially the same way. This illustrates a particularly useful property of the empirical distribution approach: it is often possible to make precise the nature of the approximation which is employed at one step of an algorithm and then to simply employ the empirical distribution in place of the true distribution which it approximates and in so doing to obtain the corresponding approximation at the next step of the algorithm. This property will become clearer later, when some examples have been seen. 10.1.2 HMMs The remainder of this chapter is concerned with inference in HMMs, carried out by propagating forward an approximation to the distributions of interest using techniques based upon importance sampling. As with many Monte Carlo methods, it is often simpler to interpret the methods presented here as techniques
10.2 Sequential Importance Sampling
99
for approximating distributions of interest, rather than particular integrals. That is, we shall view these approaches as algorithms which approximate the filtering distributions p(xn |y1:n ) directly, rather than R approaches to approximate h(xn )p(xn |y1:n )dxn . These algorithms will have a common character: they
will each involve propagating a weighted empirical distribution forward through time according to some sampling scheme in such a way that at each time it provides an approximation to the distributions of interest. Although many of the techniques which follow are traditionally presented as being related but distinct methods, it should be understood that it is typically straightforward to combine elements of the various different approaches to obtain a single cohesive algorithm which works well in a particular setting. Indeed, it is only in relatively simple settings that any one of the basic techniques presented below work well, and it is the combination of elements of each of these approaches which is essential to obtain algorithms which perform well in interesting problems. It will be convenient to refer to the smoothing and filtering distributions of interest using the notation: πt (x1:t ) := p(x1:t |y1:t−1 ) π ˆt (x1:t ) := p(x1:t |y1:t ) πt (xt ) := p(xt |y1:t−1 ) = π ˆt (xt ) := p(xt |y1:t )
=
Z
Z
πt (x1:t )dx1:t−1 π ˆt (x1:t )dx1:t−1 .
10.2 Sequential Importance Sampling The basis of sequential importance sampling techniques, is the following idea which allows importance samples to be obtained from a sequence of distributions defined on increasing spaces to be obtained iteratively using a single collection of samples, which are termed particles. Given a sequence of spaces, E1 , E2 , . . . and a sequence of distributions π1 on E1 , π2 on E1 ×E2 , . . . each of which is to be approximated
via importance sampling, it is useful to make use of the following. Consider only the first two distributions in the first instance. Given a function h2 : E1 ×E2 → R, beginning with the standard importance sampling identity:
Z
Z
π2 (x1 , x2 ) µ(x1 , x2 )dx1 dx2 µ(x1 , x2 ) Z π2 (x1 , x2 ) µ(x1 )µ(x2 |x1 )dx1 dx2 = h(x1 , x2 ) µ(x2 |x1 )µ(x1 ) Z π2 (x1 , x2 ) π1 (x1 ) µ(x1 )µ(x2 |x1 )dx1 dx2 = h(x1 , x2 ) µ(x2 |x1 )π1 (x1 ) µ(x1 )
h(x1 , x2 )π2 (x1 , x2 )dx1 dx2 =
h(x1 , x2 )
The last equality illustrates the key point: by decomposing the importance weight into two components in this manner, we are able to perform our calculations sequentially. We first draw a collection of samples (i)
(i)
{X1 }N i=1 according to a distribution µ and then set their importance weights W1
(i)
(i)
∝ π1 (X1 )/µ(X1 )
and thus obtain a weighted collection of particles which target the distribution π1 . In order to obtain a weighted collection of particles which target π2 , we simply extend each of these particles according to the conditional distribution implied by µ: (i)
(i)
X2 ∼ µ(·|X1 ),
100
10. Sequential Monte Carlo
and then set the importance weights taking advantage of the above decomposition: (i) (i) f (i) W2 ∝ W1 W 2
f (i) = W 2
(i)
(i)
π2 ((X1 , X2 )) (i)
(i)
(i)
π1 (X1 )µ(X2 |X1 )
.
f (i) as incremental weights in this context. Thus we obtain a weighted sample It is usual to refer to the W 2 (i)
which targets π1 , {X1 , W1 }N i=1 } and subsequently one which targets π2 , with the minimum amount of ad(i)
(i)
(i)
ditional sampling, {(X1 , X2 ), W2 }N i=1 in which π1 (X1 ) × µ(X2 |X1 ) is essentially used as an importance distribution for approximating π2 .
This procedure can be applied iteratively, leading to a sequence of weighted random samples which can be used to approximate a sequence of distributions on increasing state spaces. This scenario is precisely that faced in the smoothing scenario and is closely related to the filtering problem. In what follows, techniques for sequentially approximating the filtering and smoothing distributions will be presented. 10.2.1 The Natural SIS Filter The first approach which one might consider is to use only samples from the prior distribution and the state transition densities together with importance sampling techniques in order to approximate the filtering, smoothing and one-step-ahead prediction distributions of interest. This approach is particularly natural and easy to understand, although we shall subsequently see that it has a number of shortcomings and more sophisticated techniques are generally required in practice. As with all sequential importance sampling algorithms, the principle underlying this filter is that the empirical measure associated with a collection of weights and samples which approximates the filtering distribution at time t may be propagated forward in a particularly simple manner in order to obtain approximations to the predictive and filtering distributions at time t + 1. Given a collection of samples and associated weights, which we shall term particles at time t, (i) ˆ (i) {X1:t , W t }
the associated empirical measure, N π ˆN SIS,t (x1:t )
=
N X
ˆ i δ (i) (x1:t ), W X
i=1
1:t
provides an approximation to π ˆt , an approximation to πt+1 can be obtained. We know that, πt+1 (x1:t+1 ) = π ˆt (x1:t )ft+1 (xt+1 |xt ), and so one would like to use the approximation N π ˆN SIS,t (x1:t )ft+1 (xt+1 |xt ) =
N X i=1
δX (i) (x1:t )ft+1 (xt+1 |xt ) . 1:t
However, this would present us with a distribution which it is difficult to propagate forward further, and with respect to which it is difficult to calculate integrals. Instead, one might consider sampling from this distribution. As we will see later, this is the approach which is employed in some algorithms. In the present algorithm, a little more subtlety is employed: we have a mixture representation and so could use stratified sampling techniques to reduce the variance. Taking this to its logical conclusion, the simplest approach is to retain the mixture weights and sample a single value from each mixture component: (i) (i) Xt+1 ∼ ft+1 ·|Xt , leading to the empirical measure,
10.2 Sequential Importance Sampling
N πN SIS,t+1 (x1:t+1 )
N X
=
101
(i)
Wt δX (i) (x1:t+1 ). 1:t+1
i=1
N We can obtain an approximation to π ˆt+1 by substituting πN SIS,t+1 directly into the recursion:
π ˆt+1 (x1:t+1 ) = R
N ⇒π ˆN SIS,t+1 (x1:t+1 ) = R
=
πt+1 (x1:t+1 )gt+1 (yt+1 |xt+1 ) πt+1 (x′1:t+1 )gt+1 (yt+1 |x′t+1 )dx′1:t+1 N πN SIS,t+1 (xt+1 )gt+1 (yt+1 |xt+1 )
N ′ ′ ′ πN SIS,t+1 (xt+1 )gt+1 (yt+1 |xt+1 )dxt+1
N P
i=1
(i)
t+1
N P
i=1
=
N X
(i)
Wt gt+1 (yt+1 |Xt+1 )δX (i) (x) (i)
(i)
Wt gt+1 (yt+1 |Xt+1 )
(i)
(i)
Wt+1 δXt+1 (x),
i=1
where the updated importance weights are: (i)
(i)
Wt+1 =
(i)
Wt gt+1 (yt+1 |Xt+1 ) . N P (j) (j) Wt gt+1 (yt+1 |Xt+1 )
j=1
ˆ (i) = Wt(i) gt+1 (yt+1 |X (i) ) Note, that in practice these weights can be updated efficiently by setting W t+1 t+1 P (i) ˆ (i) / N W ˆ (j) . and then setting Wt+1 = W t+1 t+1 j=1
If one can obtain a collection of weighted particles which target π1 – which can typically be achieved
by sampling from π1 and setting all of the weights equal to 1/N – then this filter provides, recursively, weighted collections of particles from the smoothing distribution. As a byproduct, the filtering distribution, π ˆt (Xt ), and the one-step-ahead predictive distribution, πt (Xt ) can also be approximated using the same particle set: N πN SIS,t (xt ) =
N X
Wt−1 δX (i) (xt ) t
i=1
N π ˆN SIS,t (xt ) =
N X
Wt δX (i) (xt ). t
i=1
Algorithm 1 provides a formal algorithm specification. Algorithm 1 A Natural SIS Filter 1: Set t = 1. (i) 2: For i = 1 : N , sample X1 ∼ π1 (·). (i)
3: For i = 1 : N , set W1
(i)
∝ g1 (y1 |X1 ). Normalise such that
4: t ← t + 1 (i) (i) 5: For i = 1 : N , sample Xt ∼ ft (·|Xt=1 ). (i)
6: For i = 1 : N , set Wt
(i)
N P
i=1
(i)
(i)
W1
∝ Wt−1 gt (yt |Xt ). Normalise such that
N P
i=1
= 1.
(i)
Wt
= 1.
7: The smoothing, filtering and predictive distributions at time t may be approximated with N π ˆN SIS,t =
N X i=1
8: Go to step 4.
Wt δX (i) , 1:t
N π ˆN SIS,t =
N X i=1
Wt δX (i) t
and
N πN SIS,t =
N X i=1
Wt−1 δX (i) , t
respectively.
102
10. Sequential Monte Carlo
By way of illustration, we consider applying the filter to a particularly simple HMM, in which Xt ∈ R,
and:
X1 ∼ N(0, 1) Xt |Xt−1 = xt−1 ∼ N(0.9xt−1 + 0.7, 0.25) Yt |Xt = xt ∼ N(xt , 1). This system is so simple that it can be solved analytically (in the sense that the distributions of interest are all Gaussian and a recursive expression for the parameters of those Gaussians may be derived straightforwardly), however, it provides a simple illustration of a number of important phenomena when the filter is applied to it. 24 states and their associated observations were generated by sampling directly from the model. Figure 10.1 shows the true states, the observations and the mean an 90% confidence interval for the state estimate obtained by using the filtering distribution obtained at each time in order to estimate the state at that time. While figure 10.2 shows the true states, the observations and the mean and 90% confidence interval associated with each state using the approximate smoothing distribution at time t = 24. These two figures appear to illustrate the behaviour that would be expect, and to show good performance. In both instances, these graphs suggest that the empirical measures appear to provide a reasonable description of the distribution of the state sequence. It is also apparent that, by incorporating information from future observations as well as the past, the smoother is able to reduce the influence of occasional outlying observations.
States and Observations 8
7
6
5
4
3
2
1
True states Observations Estimates 90% Confidence Bounds
0 5
10
15
20
Time Fig. 10.1. Toy example: the true state and observation sequence, together with natural SIS filtering estimates.
10.2 Sequential Importance Sampling
103
States and Observations 8
7
6
5
4
3
2
1
True states Observations Estimates 90% Confidence Bounds
0 5
10
15
20
Time Fig. 10.2. Toy example: the true state and observation sequence, together with natural SIS smoothing estimates.
However, it is always important to exercise caution when considering simulation output. In order to illustrate how the filter is actually behaving, it is useful to look at a graphical representation of the approximate filtering and prediction distributions produced by the algorithm. Figure 10.3 shows the location and weights of the particles at for values of t. Note that this type of plot, which is commonly used to illustrate the empirical measure associated with a weighted collection of samples, does not have quite the same interpretation as a standard density plot. One must consider the number of particles in a region as well as their weights in order to associate a probability with that region. The first few plots show the behaviour that would be expected: the system is initialised with a collection of unweighted samples which are then weighted to produce an updated, weighted collection. For the first few time-steps everything continues as it should. The bottom half of the figure illustrates the empirical measures obtained during the twelfth and twenty-fourth iterations of the algorithm, and here there is some cause for concern. It appears that, as time passes, the number of particles whose weight is significantly above zero is falling with each iteration, by the twenty-fourth iteration there are relatively few samples contributing any significant mass to the empirical measure. In order to investigate these phenomenon, it is useful to consider a very similar model in which the observations are more informative, and there is a greater degree of variability in the system trajectory: X1 ∼ N(0, 1) Xt |Xt−1 = xt−1 ∼ N(0.9xt−1 + 0.7, 1) Yt |Xt = xt ∼ N(xt , 0.1).
104
10. Sequential Monte Carlo
Predicted empirical measure at time 1
Updated empirical measure at time 1
0.0202
0.06
0.02015 0.05 0.0201 0.04 0.02005
0.02
0.03
0.01995 0.02 0.0199 0.01 0.01985
0.0198 -2.5
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
2.5
0 -2.5
-2
-1.5
-1
Predicted empirical measure at time 2
-0.5
0
0.5
1
1.5
2
2.5
Updated empirical measure at time 2
0.06
0.14
0.12
0.05
0.1 0.04 0.08 0.03 0.06 0.02 0.04
0.01
0.02
0 -1.5
-1
-0.5
0
0.5
1
1.5
2
2.5
3
0 -1.5
-1
-0.5
0
Predicted empirical measure at time 12
0.5
1
1.5
2
2.5
3
Updated empirical measure at time 12
0.14
0.16
0.14
0.12
0.12 0.1 0.1 0.08 0.08 0.06 0.06 0.04 0.04 0.02
0.02
0
0 3
3.5
4
4.5
5
5.5
6
6.5
7
3
3.5
4
Predicted empirical measure at time 24
4.5
5
5.5
6
6.5
7
Updated empirical measure at time 24
0.25
0.3
0.25 0.2
0.2 0.15 0.15 0.1 0.1
0.05 0.05
0
0 5
5.5
6
6.5
7
7.5
5
5.5
6
Fig. 10.3. Some predictive and filtering measures obtained via the natural SIS filter.
6.5
7
7.5
10.2 Sequential Importance Sampling
105
States and Observations 8
7
6
5
4
3
2
1
True states Observations Estimates 90% Confidence Bounds
0 5
10
15
20
Time Fig. 10.4. Toy example: the true state and observation sequence, together with natural SIS filtering estimates.
States and Observations 8
7
6
5
4
3
2
1
True states Observations Estimates 90% Confidence Bounds
0 5
10
15 20 Time Fig. 10.5. Toy example: the true state and observation sequence, together with natural SIS smoothing estimates.
106
10. Sequential Monte Carlo
Figure 10.4 shows the filtering estimate of the trajectory of this system, while figure 10.5 illustrates the smoothed estimate. A selection of the associated empirical distributions are shown in figure 10.6. In the case of this example, it is clear that there is a problem. 10.2.2 Weight Degeneracy The problems which are observed with the natural SIS filter are an example of a very general phenomenon known as weight degeneracy. Direct importance sampling on a very large space is rarely efficient as the importance weights exhibit very high variance. This is precisely the phenomenon which was observed in the case of the SIS filter presented in the previous section. In order to understand this phenomenon, it is useful to introduce the following technical lemma which relates the variance of a random variable to its conditional variance and expectation (for any conditioning variable). Throughout this section, any expectation or variance with a subscript corresponding to a random variable should be interpreted as the expectation or variance with respect to that random variable. Lemma 10.1 (Law of Total Variance). Given two random variables, A and B, on the same probability space, such that Var (A) < ∞, then the following decomposition exists: Var (A) = EB [VarA (A|B)] + VarB (EA [A|B]) . Proof. By definition, and the law of total probability, we have: 2 Var (A) = E A2 − E [A] 2 = EB EA A2 |B − EB [EA [A|B]] .
Considering the definition of conditional variance, and then variance, it is clear that: h i 2 2 Var (A) = EB VarA (A|B) + EA [A|B] − EB [EA [A|B]] h i 2 2 = EB [VarA (A|B)] + EB EA [A|B] − EB [EA [A|B]] = EB [VarA (A|B)] + VarB (EA [A|B]) .
For simplicity, consider only the properly normalised importance sampling approach in which the normalising constant is known and used, rather than using the sum of the weights to renormalise the empirical measure. Consider a target distribution on E = E1 × E2 , and some sampling distribution µ on
the same space. We may decompose the importance weight function as: π(x1 , x2 ) µ(x1 , x2 ) π(x1 ) π(x2 |x1 ) = µ(x1 ) µ(x2 |x1 ) | {z } | {z }
W2 (x1 , x2 ) =
=:W1 (x1 ) =:W f2 (x2 |x1 )
We are now in a position to show that the variance of the importance weights over E is greater than that which would be obtained by considering only the marginal distribution over E1 . The following variance decomposition, together with lemma 10.1 provide the following equality: f2 (X2 |X1 ) Var (W2 ) = Var W1 (X1 )W h i h i f2 (X2 |X1 )|X1 + EX VarX W1 (X1 )W f2 (X2 )|X1 . = VarX1 EX2 W1 (X1 )W 1 2
10.2 Sequential Importance Sampling
Predicted empirical measure at time 1
107
Updated empirical measure at time 1
0.0202
0.25
0.02015 0.2 0.0201
0.02005
0.15
0.02 0.1
0.01995
0.0199 0.05 0.01985
0.0198
0 -3
-2
-1
0
1
2
3
-3
-2
-1
Predicted empirical measure at time 2
0
1
2
3
3
4
Updated empirical measure at time 2
0.25
0.9
0.8 0.2
0.7
0.6 0.15 0.5
0.4 0.1 0.3
0.2
0.05
0.1
0
0 -2
-1
0
1
2
3
4
-2
-1
0
Predicted empirical measure at time 12
1
2
Updated empirical measure at time 12
1
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0 -4
-2
0
2
4
6
8
10
-4
-2
0
Predicted empirical measure at time 24
2
4
6
8
10
10
12
14
Updated empirical measure at time 24
1
0.7
0.9 0.6 0.8 0.5
0.7 0.6
0.4
0.5 0.3
0.4 0.3
0.2
0.2 0.1 0.1 0
0 0
2
4
6
8
10
12
14
0
2
4
6
Fig. 10.6. Some predictive and filtering measures obtained via the natural SIS filter.
8
108
10. Sequential Monte Carlo
Noting that W1 (X1 ) is a deterministic function of X1 , we obtain: h i h i 2 f f Var (W2 ) = VarX1 W1 (X1 ) EX2 W2 (X2 |X1 )|X1 + EX1 W1 (X1 ) VarX2 W2 (X2 )|X1 {z } | 1
h
i f2 (X2 )|X1 ≥ Var (W1 (X1 )) . = VarX1 (W1 (X1 )) + EX1 W1 (X1 )2 VarX2 W
f2 (X2 )|X1 non-negative functions. Indeed, equalThe final inequality holds as W1 (X1 )2 and VarX2 W
ity is achieved only if the weight function is constant over the support of X2 , a situation which is somewhat uncommon.
This tells us, then, that there is a problem with the sequential importance sampling approach proposed above: the variance of the importance weights will increase with every iteration1 and consequently, the quality of the estimators which the empirical measure provides will decrease as time progresses. 10.2.3 A More General Approach The first approach to solving the problem of weight degeneracy or, at least, reducing its severity is to directly reduce the variance of the incremental importance weights. In order to attempt to do this, it is first necessary to consider a more general formulation of sequential importance sampling than the “natural” approach considered thus far. Whilst there is a pleasing simplicity in iteratively approximating first the predictive and then the filtering/smoothing distributions, there is no need to do this in practice. It is often sufficient to obtain a sequence of measures which approximate the filtering/smoothing distributions – indeed, from these one can also form an estimate of the predictive distribution if one is required. With this perspective, the (i)
(i)
sequential importance sampling approach is this: given a weighted collection of samples {Wt , X1:t }N i=1 , which target the distribution π ˆt , how can we obtain a collection of particles which target π ˆt+1 ?
The answer to this question becomes apparent very quickly if the problem is looked at from the right (i)
(i)
angle. We have a collection of samples {X1:t }N i=1 from some distribution µt , and a collection of importance (i)
weights, Wt
(i)
=π ˆt (X1:t )/µt (X1:t ). If we extend these samples, by sampling Xt+1 ∼ qt+1 (·|Xt ), and wish (i)
to weight the collection {X1:t+1 }N ˆt+1 , we have: i=1 such that it targets π (i)
π ˆt+1 (X1:t+1 )
(i)
Wt+1 =
(i)
(i)
(i)
µt (X1:t )qt+1 (Xt+1 |Xt )
,
and this leads immediately to the iterative decomposition (i)
(i)
(i)
Wt+1 =
π ˆt+1 (X1:t+1 )
π ˆt (X1:t ) (i)
(i)
(i)
(i)
µt (X1:t ) π ˆt (X1:t )qt+1 (Xt+1 |Xt ) (i)
(i)
=Wt
π ˆt+1 (X1:t+1 ) (i)
(i) ˆt+1 (Xt ) (i) π (i) π ˆt (Xt )
=Wt
1
(i)
(i)
π ˆt (X1:t )qt+1 (Xt+1 |Xt ) (i)
(i)
(i)
ft+1 (Xt+1 |Xt )gt+1 (yt+1 |Xt+1 ) (i)
(i)
qt+1 (Xt+1 |Xt )
Actually, this is not quite true. We have not considered the effect of renormalising importance weights here. It is evident that, at least asymptotically (in the number of particles) such normalisation has no effect on the outcome. A detailed analysis lends little additional intuition.
10.3 Sequential Importance Resampling
109
It is immediately clear that using a proposal distribution qt+1 (xt+1 |xt−1 ) ∝ ft+1 (xt+1 |xt )gt+1 (yt+1 |xt )
will minimise the variance of the importance weights. Thus, this is clearly the proposal distribution which minimises the variance of the incremental importance weights and hence, in that sense, is optimal. All that this result tells us is that better performance will be expected if we take observations into account via the proposal distribution rather than solely via importance weighting. This is consistent with what we know about standard importance sampling: the best results are obtained when the proposal distribution matches the target distribution as closely as possible. Initially, this result may seem rather formal: in general, it will not be possible to sample from this optimal proposal distribution. However, knowledge of its form may be used to guide the design of tractable distributions from which samples can easily be obtained and it is often possible to obtain good approximations of this distribution in quite complex realistic problems. Furthermore, this more general method has another substantial advantage over the natural approach considered above: it is only necessary to be able to evaluate the density of the transition probabilities pointwise up to a normalising constant whilst in the natural case it was necessary to be able to sample from that density. When using an approximation to the optimal importance variance, of course, the variance of the incremental importance weights is non-zero and the variance of the weights still increases over time. Unsurprisingly, using a better importance distribution reduces the variance but does not eliminate it. In the context of filtering, this means that using a close to optimal proposal distribution increases the time-scale over which the distributions of interest remain well characterised: eventually, degeneracy still occurs. Algorithm 2 A General SIS Filter 1: Set t = 1. (i)
2: For i = 1 : N , sample X1 ∼ q1 (·). (i)
3: For i = 1 : N , set W1 4: t ← t + 1
(i)
(i)
5: For i = 1 : N , sample Xt (i)
6: For i = 1 : N , set Wt
(i)
(i)
∝ π1 (X1 )g1 (y1 |X1 )/q1 (X1 ). Normalise such that
N P
i=1
N X i=1
= 1.
(i)
∼ qt (·|Xt−1 ). (i)
(i)
(i)
(i)
(i)
(i)
∝ Wt−1 ft (Xt |Xt−1 )gt (yt |Xt )/qt (Xt |Xt−1 ). Normalise such that
7: The smoothing and filtering distributions at time t may be approximated with N π ˆSIS,t =
(i)
W1
Wt δX (i) , 1:t
N and π ˆSIS,t =
N X i=1
Wt δX (i) ,
N P
i=1
(i)
Wt
= 1.
respectively.
t
8: Go to step 4.
10.3 Sequential Importance Resampling Having established that there is a fundamental problem with SIS, the question of how to deal with that problem arises. The difficulty is that the variance of the importance weights increases over time; eventually, the associated particle system and empirical measure provide an inadequate description of the distributions of interest. The variance in the importance weights is something which accumulates over a number of iterations: this suggests that that what is required is a mechanism for resetting the importance weights regularly to prevent this accumulation of variance. It is not immediately apparent that there is a mechanism by which this can be accomplished given that we are unable to sample directly from the distributions of interest
110
10. Sequential Monte Carlo
in any degree of generality. However, when concerned with approximating only the filtering distributions, (Gordon et al., 1993) developed the following approach. Under suitable regularity conditions, the law of large numbers tells us that for importance sampling: Z Z N lim ϕ(xt )ˆ πSIS,t (xt )dxt → ϕ(xt )ˆ πt (xt )dxt , N →∞
for any suitably regular test function ϕ and so, any consistent estimator of the left hand side will provide a consistent estimate of the integral of interest. N We wish to obtain an unweighted collection of particles which approximate the distribution πˆSIS,t (xt )
in some sense: this is exactly what we do in simple Monte Carlo. In order to approximate the integral on the left hand side, we can apply a crude Monte Carlo approximation, sampling, for i = 1 to N ′ , N′ ˜ t(i) ∼ π ˆSIS,t and then using the simple Monte Carlo estimate of the integral. It is then straightforward X to employ the law of large numbers to verify that:
Z N 1 X ˜ (i) N ϕ(Xt ) → ϕ(xt )ˆ πSIS,t (xt )dxt lim N ′ →∞ N ′ i=1 ′
For simplicity, we assume that we sample N times from the empirical distribution in order to obtain our new distribution: N ′ = N – this simplifies the computational implementation as well as analysis. Note that it is not possible to simply increase the number of particle samples at every iteration, although it initially seems advisable to do so to compensate for the additional Monte Carlo error introduced by the additional sampling step. Whilst it would, indeed, reduce the variance it would also lead to an exponentially increasing computational cost with the number of particles used rapidly exceeding the number that available computing facilities are able to deal with. Instead, it is generally recommended that one should use as many particles as computational resources permit from the outset. Algorithm 3 A General SIR Filter 1: Set t = 1. ˜ (i) ∼ q1 (·). 2: For i = 1 : N , sample X 1,1 (i)
3: For i = 1 : N , set W1
N (i) ˜ (i) )/q1 (X ˜ (i) ). Normalise such that P W (i) = 1. ∝ π1 (X1,1 )g1 (y1 |X 1,1 1 1 i=1
4: Resample: for i = 1 : N , sample
(i) X1,1,
∼
N P
(i)
i=1
W1 δX˜ (i)
1,1
N P
j=1
.
(j)
W1
5: t ← t + 1 (i) ˜ (i) ˜ (i) ˜ (i) 6: For i = 1 : N , set X t,1:t−1 = Xt−1,1:t−1 and sample Xt,t ∼ qt (·|Xt,t−1 ). (i)
7: For i = 1 : N , set Wt
N P (i) ˜ (i) (i) (i) ˜ (i) (i) ˜ t,t ˜ t,t ˜ t,t ∝ ft (X |Xt,t−1 )gt (yt |X )/qt (X |Xt,t−1 ). Normalise such that Wt = 1. i=1
8: Resample: for i = 1 : N , sample
(i) Xt,1:t,
∼
N P
(i)
i=1
W1 δX˜ (i)
t,1:t
N P
j=1
.
(j) W1
9: The smoothing and filtering distributions at time t may be approximated with N π ˆSIR,t (x1:t ) =
10: Go to step 5.
N 1 X δ (i) (x1:t ), N i=1 Xt,1:t
N and π ˆSIR,t (xt ) =
N 1 X δ (i) (xt ), N i=1 Xt,t
respectively.
10.3 Sequential Importance Resampling
111
As algorithm 3 demonstrates, this technique can formally be generalised (in the obvious way) to provide an approximation to the smoothing distributions; however, as will be made clear shortly, there are difficulties with this approach and there is a good reason for concentrating on the filtering distribution. The generalisation is simply to resample the entire path followed by each particle through time, rather than only its terminal state. Note that doing this makes it necessary to introduce an additional subscript to account for the fact that the location of the j th coordinate of the ith particle changes through time (as (i)
we resample the full trajectory associated with each particle). Hence Xt,j is used to refer to this quantity (i)
at time t, and Xt,1:t refers to the full trajectory of the ith particle at time t. 10.3.1 Sample Impoverishment The motivation behind the resampling step introduced in the bootstrap filter and its generalisation was that it provides a method for eliminating particles with very low weights whilst replicating those with large weights. Doing this allows all of the particles to contribute significantly to the approximation of the distributions of interest. When considering the filtering distributions it is clear, loosely speaking, that resampling at the previous time step does lead to a better level of sample diversity as those particles which it was already extremely improbable at time t − 1 are likely to have been eliminated and those which remain have a better chance of representing the situation at time t accurately.
There is, however, a difficulty. The resampling step does homogenize the particle weights but it can only do this by replicating particle values: we typically obtain several particles with particular values after each resampling step. This is not of itself a problem, providing that we retain a reasonable number of distinct particles and, in a reasonably well designed algorithm, this can be true for the filtering distribution. Considering the smoothing distribution makes it clear that there is a problem: the algorithm only ever reduces the number of distinct values taken at any time in the past. As the sampling mechanism only appends new values to the particle trajectories and the resampling mechanism eliminates some trajectories with every iteration, we ultimately end up with a degeneracy problem. The beginning of every particle trajectory will ultimately become the same. The reduction of the number of distinct samples within a collection by this mechanism, particularly when the reduction is pronounced is termed sample impoverishment. Resampling is a more than cosmetic activity and it does more than eliminating direct evidence of sample degeneracy. However, it works by maintaining a good degree of sample diversity at the end of the particle trajectories at any given time: it does nothing to improve diversity in the past and the accumulation of degeneracy consequently remains a problem if one is interested in the full history. Resampling at any given time simply increases the Monte Carlo variance of the estimator at that time. However, it can reduce the variance of estimators at later times and it is for that reason that it is widely used. For simplicity, in the algorithms presented in this course the estimator at time t is often given immediately after the resampling step. However, lower variance estimates would be obtained by employing the weighted sample available immediately before resampling and this is what should be done in practice except in situations in which it is essential to have an unweighted sample. 10.3.2 Effective Sample Sizes It is clear that resampling must increase the immediate Monte Carlo variance (it is used because it will hopefully reduce the variance in the future and hence the variance cost of resampling may be offset by the reduction in variance obtained by having a more diverse collection of samples available). Consequently, it would be preferable not to resample during every iteration of the algorithm.
112
10. Sequential Monte Carlo
There competing requirements which must somehow be balanced: we must resample often enough to prevent sample degeneracy but we wish to do so sufficiently infrequently that sample impoverishment does not become a serious problem, at least for the filtering estimator. It would be useful to have a method for quantifying the quality of the current sample or, perhaps, to determine whether the current collection of weights is acceptable. It is, of course, difficult to quantify the quality of a weighted sample which targets a distribution which is not known analytically. This is the problem which must be addressed when dealing with iterative algorithms of the sort described within this chapter. One figure of merit which could be used if it were available would be the ratio of the variance of an estimator obtained from this sample to that of an estimator obtained using crude Monte Carlo to sample from the target distribution itself (were that possible). Noting that the variance of the crude Monte Carlo estimator is proportional to the inverse of the sample size, N times this quantity corresponds to the effective sample size in the sense that this is the number of iid samples from the target distribution which would be required to obtain an estimator with the same variance. It is useful to consider an abstract scenario. Let π denote a target distribution and µ an instrumental distribution. Assume that N samples from each are available, {Xi }N i=1 are iid samples from µ, whilst
{Yi }N i=1 are distributed according to π. Letting W (x) ∝ π(x)/µ(x), and allowing ϕ to be a real-valued function whose integral we wish to estimate, we have two natural estimators: N X ˆ CM C = 1 ϕ(Yi ) h N i=1
ˆ IS = h
N P
W (Xi )ϕ(Xi )
i=1 N P
. W (Xi )
j=1
The effective sample size mentioned above may be written as: ˆ CM C Var h . NESS = N ˆ IS Var h
Whilst this quantity might provide a reasonable measure of the quality of the weighted sample, it remains impossible to calculate in any degree of generality, and in order to make further progress it is necessary to make use of some approximations. It is shown in (Kong et al., 1994) that the quantity of interest may be approximated by
\[
\frac{\mathrm{Var}\,\hat{h}_{IS}}{\mathrm{Var}\,\hat{h}_{CMC}} \approx 1 + \mathrm{Var}_\mu \bar{W},
\]
where $\bar{W}$ is the normalised version of the importance weights, $\bar{W}(x) = \pi(x)/\mu(x) \equiv W(x)/\int W(y)\mu(y)\,dy$. This approximation may be obtained via the delta method (which may be viewed as a combination of a suitable central limit theorem and a Taylor expansion). Considering this approximation, and noting that $\mathbb{E}_\mu \bar{W} = 1$:
\[
1 + \mathrm{Var}_\mu \bar{W} = 1 + \mathbb{E}_\mu \bar{W}^2 - \left(\mathbb{E}_\mu \bar{W}\right)^2 = 1 + \mathbb{E}_\mu \bar{W}^2 - 1 = \mathbb{E}_\mu \bar{W}^2,
\]
which allows us to write $N_{ESS} \approx N / \mathbb{E}_\mu \bar{W}^2$. It isn't typically possible to evaluate even this approximation, but looking at the expression we see that we need only evaluate an integral under the sampling distribution. This is precisely what is normally done in Monte Carlo simulation, and the usual approach is to use the sample approximation of the expectation in place of the truth; so, we set:
\[
\hat{N}_{ESS} := N \left[ \frac{1}{N}\sum_{i=1}^N W(X_i)^2 \left( \frac{1}{N}\sum_{j=1}^N W(X_j) \right)^{-2} \right]^{-1}
= \frac{\left( \sum_{j=1}^N W(X_j) \right)^2}{\sum_{i=1}^N W(X_i)^2}.
\]
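In the style of the computer practicals at the end of these notes, a minimal R sketch of this estimator (not part of the original notes; the function name is of course arbitrary) is:

ess <- function(w) sum(w)^2 / sum(w^2)   # w: vector of unnormalised importance weights
ess(rep(1, 100))                         # perfectly uniform weights: ESS equals the sample size
ess(c(1, rep(1e-6, 99)))                 # one dominant weight: ESS close to 1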
This quantity can readily be calculated to provide an estimate of the quality of the sample at any iteration of a particle filtering algorithm. It is common practice to choose a threshold value for the effective sample size and to resample after those iterations in which the effective sample size falls below that threshold. Doing this provides a heuristic method for minimising the frequency of resampling subject to the constraint that a sample in which weight degeneracy is not too severe must be retained.

Whilst good results are obtained by using this approach when the proposal distribution is adequate, caution must be exercised in interpreting this approximation as an effective sample size. The use of the collection of samples itself to evaluate the integral which is used to assess its quality poses an additional problem: if the sample is particularly poor then it may provide a poor estimate of that integral and, consequently, suggest that the sample is rather better than it actually is. In order to understand the most common pathology it is useful to consider a simple importance sampling scenario which generalises directly to the sequential cases considered within this chapter. Consider a situation in which the proposal distribution has little mass in a mode of the target distribution and in which both distributions are reasonably flat in the mode of the proposal. It is quite possible that all of the importance weights are similar and the estimated ESS is large but that the sample is, in fact, an extremely poor representation of the distribution of interest. (In this setting, such a pathology will be revealed by using a large enough sample size; unfortunately, one does not know a priori how large that sample must be, and it is typically impossible to employ such a strategy in a sequential setting.)

For definiteness, let $\pi = 0.9\,\mathrm{N}(0,1) + 0.1\,\mathrm{N}(4, 0.1^2)$ and let $\mu = \mathrm{N}(0,1)$. The proposal and target densities are illustrated in figure 10.7. Whilst the supports of these two distributions are the same and the tails of the proposal are as heavy as those of the target, it is immediately apparent that the two distributions are poorly matched: there is only a small probability of proposing samples with values greater than 3, and yet around 10% of the mass of the target (essentially all of the second mixture component) lies in this region. Consequently, the importance sampling variance is finite but potentially extremely large. Drawing collections of samples of various sizes from $\mu$ and calculating their ESS once they have been weighted to target $\pi$ reveals a problem. Table 10.1 shows the effective sample sizes estimated from samples of various sizes.

N            100      500      1,000     10,000
N_ESS        100      500      999.68    1.8969
N_ESS / N    1.000    1.000    0.9997    0.0002

Table 10.1. Effective sample sizes for the ESS example for various sample sizes.
It is clear that when the estimator is a good one, $N_{ESS}/N$ should converge to a constant. What we observe is that this ratio is close to unity for the smaller values of N which are considered (and notice that, for a one-dimensional problem, these sample sizes do not seem that small). However, when a larger sample is used the estimated effective sample size drops suddenly to a very low value: one which clearly indicates that the sample is of no more use than one or two iid samples from the target distribution (in the sense quantified by the ESS, of course). On consideration it is clear that the problem is that, unless the sample is fairly large, no particles will be sampled in the region in which the importance weights are very large; outside this region those weights are approximately constant.
Fig. 10.7. Proposal and target densities for the ESS example scenario (densities plotted on a logarithmic scale).
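The following short R sketch (not part of the original notes) reproduces the qualitative behaviour of Table 10.1; the exact values obviously depend on the pseudo-random number seed.

set.seed(1)
dtarget <- function(x) 0.9*dnorm(x) + 0.1*dnorm(x, mean=4, sd=0.1)   # target pi
ess <- function(w) sum(w)^2 / sum(w^2)                               # ESS estimate
for (n in c(100, 500, 1000, 10000)) {
  x <- rnorm(n)                      # sample from the proposal mu = N(0,1)
  w <- dtarget(x) / dnorm(x)         # importance weights
  cat("N =", n, " ESS =", round(ess(w), 2), " ESS/N =", round(ess(w)/n, 4), "\n")
}

For small N no sample typically lands near the second mode, the weights are nearly constant and the estimated ESS is deceptively close to N; once a sample does land there, its weight dominates and the estimated ESS collapses.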
In the example here it is clear that a different proposal distribution would be required: an enormous number of particles would be needed to reduce the estimator variance to reasonable levels with the present choice. Whilst this example is somewhat contrived, it demonstrates a phenomenon which is seen in practice in scenarios in which it is much more difficult to diagnose: in general, it is not possible to plot the target distribution at all, and the true normalising constant associated with the importance weights is unknown. In summary, the standard approximation to the ESS used in importance sampling provides a useful descriptive statistic, and one which can be used to guide the design of resampling strategies, but it is not infallible and care should be taken before drawing any conclusions from its value. As always, there is no substitute for careful consideration of the problem at hand.

10.3.3 Approaches to Resampling

Whilst the presentation above suggests that only one approach to resampling exists, a more sophisticated analysis reveals that this isn't the case. Indeed, there are approaches which are clearly superior in terms of the Monte Carlo variance associated with them. Resampling has been justified as another sampling step: approximating the weighted empirical distribution with an unweighted collection of samples from that distribution. Whilst this is intuitively appealing, an alternative view of resampling makes it clear that there are lower variance approaches which introduce no additional bias. We require only that, for all bounded integrable test functions, the expected value of the integral of a test function under the empirical distribution associated with the resampled particles matches that under the weighted empirical measure before resampling. Assume that we have a collection of samples
$\{W_i, X_i\}_{i=1}^N$ and allow $\{\tilde{X}_i\}$ to denote the collection of particles after resampling. We want to ensure that:
\[
\mathbb{E}\left[\left.\frac{1}{N}\sum_{i=1}^N \varphi(\tilde{X}_i)\,\right|\, X_{1:N}\right] = \sum_{i=1}^N W_i\,\varphi(X_i).
\]
It is useful to note that the $\tilde{X}_i$ must all be equal to one of the $X_i$, as we are drawing from a discrete distribution with mass on only those values. Consequently, we could alternatively consider the number of replicates of each of the original particles which are present in the resampled set. Let $M_i = |\{j : \tilde{X}_j = X_i\}|$ be the number of replicates of $X_i$ which are present in the resampled set. Then we have:
\[
\mathbb{E}\left[\left.\frac{1}{N}\sum_{i=1}^N \varphi(\tilde{X}_i)\,\right|\, X_{1:N}\right]
= \mathbb{E}\left[\left.\frac{1}{N}\sum_{i=1}^N M_i\,\varphi(X_i)\,\right|\, X_{1:N}\right]
= \sum_{i=1}^N \mathbb{E}\left[\left.\frac{M_i}{N}\,\right|\, X_{1:N}\right]\varphi(X_i).
\]
It is clear that any scheme in which $\mathbb{E}\left[\left.\frac{M_i}{N}\right|X_{1:N}\right] = W_i$ will introduce no additional bias. This is unsurprising: any scheme in which the expected proportion of replicates of a particle after resampling is precisely the weight associated with that particle will be unbiased. Clearly, many such schemes will have extremely large variance – for example, selecting a single index $i$ with probability proportional to $W_i$ and setting all of the resampled particles equal to $X_i$. However, there are a number of techniques by which such unbiased resampling can be accomplished with small variance. A brief review of possible strategies is provided by (Robert and Casella, 2004, Section 14.3.5), but three of the most commonly used approaches are summarised here.

Multinomial Resampling. The approach which has been described previously, in which each $\tilde{X}_i$ is drawn independently from the empirical distribution associated with the collection $\{W_i, X_i\}_{i=1}^N$, is equivalent to drawing the vector of replicate counts $M$ from a multinomial distribution with $N$ trials and parameter vector $W_{1:N}$. That is, $M \sim \mathcal{M}(\cdot|N, W)$, where the multinomial distribution is a generalisation of the binomial distribution to trials which have more than two possible outcomes. It takes mass on the points $\{M \in \mathbb{N}_0^N : \sum_{i=1}^N M_i = N\}$ and has the probability mass function:
\[
\mathcal{M}(M|N,W) = \begin{cases}
\dfrac{N!}{\prod_{i=1}^N M_i!}\,\displaystyle\prod_{i=1}^N W_i^{M_i} & \text{if } \sum_{i=1}^N M_i = N, \\
0 & \text{otherwise.}
\end{cases}
\]
For this reason the simple scheme described above is usually termed multinomial resampling.
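As an illustration (not part of the original notes), multinomial resampling is a one-liner in R:

multinomial.resample <- function(x, W) {                      # x: particle values, W: normalised weights
  x[sample.int(length(W), size=length(W), replace=TRUE, prob=W)]
}

Resampling whole trajectories amounts to applying the sampled indices to the rows of a matrix of particle histories rather than to a vector of values.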
Residual Resampling. The reason that resampling is not carried out deterministically is that in general $NW_i$ is not an integer, and so it is not possible to replicate particles deterministically in a manner which is unbiased. However, it is possible to assign the integer part of $NW_i$ replicates to each particle deterministically and then to allocate the remainder of the mass by multinomial resampling. This is the basis of a scheme known as residual resampling. The approach is as follows (a small R sketch is given below):
1. Set $\tilde{M}_i = \lfloor NW_i \rfloor$, where $\lfloor x \rfloor$ denotes the integer part of $x$. Set $\tilde{W}_i \propto W_i - \lfloor NW_i \rfloor / N$, normalised to sum to one.
2. Sample $M' \sim \mathcal{M}\!\left(\cdot\,\middle|\,N - \sum_{i=1}^N \lfloor NW_i \rfloor,\ \tilde{W}\right)$.
3. Set $M_i = \tilde{M}_i + M'_i$.
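A minimal R sketch of this scheme (not part of the original notes; it returns particle indices) is:

residual.resample <- function(W) {                 # W: normalised weights
  N <- length(W)
  det.counts <- floor(N * W)                       # deterministic replicates
  R <- N - sum(det.counts)                         # residual number of particles to draw
  rand.counts <- rep(0, N)
  if (R > 0) rand.counts <- as.vector(rmultinom(1, R, N * W - det.counts))  # multinomial on the residual mass
  rep(seq_len(N), det.counts + rand.counts)        # turn replicate counts into indices
}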
The intuition is that, by deterministically replicating those particles of which we expect at least one copy in the resampled set and introducing randomness only to deal with the non-integer (residual) components of $NW_i$, we retain the unbiasedness of multinomial resampling whilst substantially reducing the amount of randomness introduced by the resampling procedure. It is straightforward to verify that the variance of this approach is, indeed, less than that of multinomial resampling.

Stratified and Systematic Resampling. The motivation for residual resampling was that deterministically replicating particles, and hence reducing the variability of the resampled particle set whilst retaining the lack of bias, must necessarily provide an improvement. Another approach is motivated by the stratified sampling technique. In order to sample from a mixture distribution with known mixture weights, the variance is reduced if one draws a deterministic number of samples from each component, with the number proportional to the weight of that component (again, a little additional randomness may be required to prevent bias if the product of the number of samples required and the various mixture weights is not an integer). Taking this to its logical conclusion, one notices that one can partition any one-dimensional distribution in a suitable manner by taking its cumulative distribution function, dividing it into a number of segments each with equal associated probability mass, and then drawing one sample from each segment (which may be done, at least in the case of discrete distributions, by drawing a uniform random variable from the range of each of the CDF segments and applying inversion sampling).
Fig. 10.8. The cumulative distribution associated with a collection of ten particles divided into ten strata, together with the distribution function associated with the seventh stratum.
Figure 10.8 illustrates this technique for a simple scenario in which ten particles are present. As usual, the cumulative distribution function associated with a discrete distribution (here the empirical distribution associated with the weighted particles) consists of a number of points of discontinuity connected by regions in which the CDF is constant. These discontinuities have magnitudes corresponding to the particle weights and locations corresponding to the values of those particles. If the sample values are not one-dimensional it is easier to consider sampling the particle indices rather than the values themselves and then using those indices to deduce the values of the resampled particles; this adds little
complexity. Having obtained the CDF, one divides it vertically into N segments, each enclosing equal probability mass. Each segment of the CDF may then be rescaled to produce a new CDF describing the distribution conditional upon the sample being drawn from that stratum – the figure illustrates the case of the seventh stratum from the bottom. Inversion sampling once from each stratum then provides a new collection of N samples. This approach has some features in common with residual resampling: a particle with a large weight will be replicated with probability one (in the case of stratified resampling, if a particle has weight $W_i$ such that $NW_i \geq n+1$ then it is guaranteed to produce at least $n$ replicates in the resampled population). For example, in figure 10.8, the particle with value $x_3$ has a sufficiently large weight that it will certainly be selected from strata 3 and 4, and it has positive probability of being drawn in the two adjacent strata as well.

In fact, it is possible to reduce the randomness further by drawing a single uniform random number and using it to select the sampled value from every segment of the CDF. This technique is widely used in practice although it is relatively difficult to analyse. In order to implement this particular version of resampling, typically termed systematic resampling, one simply draws a random number, $U$, uniformly in the interval $[0, 1/N]$ and then, allowing $F_N$ to denote the CDF of the empirical distribution of the particle set before resampling, sets, for each $i$ in $1, \ldots, N$:
\[
\tilde{X}_i := F_N^{-1}(i/N - U).
\]
In terms of figure 10.8, this can be understood as drawing a random variable, $U$, uniformly from an interval describing the height of a single stratum and then taking as the new sample those values obtained from the CDF at a point $U$ below the top of each stratum. It is straightforward to establish that this technique is unbiased, but the correlated samples which it produces complicate more subtle analysis. The systematic sampling approach ensures that the expected number of replicates of a particle with weight $W_i$ is $NW_i$ and, furthermore, that the actual number is within 1 of $NW_i$.
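A minimal R sketch of systematic resampling (not part of the original notes; it returns particle indices rather than values) is:

systematic.resample <- function(W) {            # W: normalised weights
  N <- length(W)
  u <- runif(1) / N                             # a single uniform draw on [0, 1/N]
  points <- (1:N) / N - u                       # one point in each stratum, as in F_N^{-1}(i/N - U)
  findInterval(points, cumsum(W)) + 1           # invert the empirical CDF at those points
}
W <- c(0.5, 0.3, 0.1, 0.05, 0.05)               # the first particle receives either 2 or 3 replicates
table(systematic.resample(W))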
10.4 Resample-Move Algorithms

An attempt at reducing sample impoverishment was proposed by (Gilks and Berzuini, 2001a). Their approach is based upon the following premise, which (Robert and Casella, 2004) describe as generalised importance sampling. Given a target distribution $\pi$, an instrumental distribution $\mu$ and a $\pi$-invariant Markov kernel, $K$, the following generalisation of the importance sampling identity is trivially true:
\[
\int\!\!\int \mu(x) K(x,y)\,\frac{\pi(x)}{\mu(x)}\, h(y)\, dx\, dy
= \int\!\!\int \pi(x) K(x,y)\, h(y)\, dx\, dy
= \int \pi(y)\, h(y)\, dy.
\]
In conjunction with the law of large numbers this tells us, essentially, that given a weighted collection of particles $\{W^{(i)}, X^{(i)}\}_{i=1}^N$ from $\mu$ weighted to target $\pi$, if we sample $Y^{(i)} \sim K(X^{(i)}, \cdot)$ then the weighted collection of particles $\{W^{(i)}, Y^{(i)}\}_{i=1}^N$ also targets $\pi$.

Perhaps the simplest way to verify that this is true is simply to interpret the approach as importance sampling on an enlarged space, using $\mu(x)K(x,y)$ as the proposal distribution for the target $\pi(x)K(x,y)$ and estimating the function $h'(x,y) = h(y)$: it is then clear that this is precisely standard importance sampling. As with many areas of mathematics the terminology can be a little misleading: correctly interpreted, “generalised importance sampling” is simply a particular case of importance sampling.

Whilst this result might seem rather uninteresting, it has a particularly useful property. Whenever we have a weighted sample which targets $\pi$, we are free to move the samples by applying any $\pi$-invariant
Markov kernel in the manner described above and we retain a weighted sample targeting the distribution of interest. (Gilks and Berzuini, 2001a) proposed to take advantage of this to help solve the sample impoverishment problem. They proposed an approach to filtering which they termed resample-move, in which the particle set is moved according to a suitable Markov kernel after the resampling step. In the original paper, and the associated tutorial (Gilks and Berzuini, 2001b), these moves are proposed in the context of a standard SIR algorithm but, in fact, it is straightforward to incorporate them into any iterative algorithm which employs a collection of weighted samples from a particular distribution at each time step.

Algorithm 4 A Resample-Move Filter
1: Set t = 1.
2: For $i = 1:N$, sample $\tilde{X}_{1,1}^{(i)} \sim q_1(\cdot)$.
3: For $i = 1:N$, set $W_1^{(i)} \propto \pi_1(\tilde{X}_{1,1}^{(i)})\, g_1(y_1|\tilde{X}_{1,1}^{(i)}) / q_1(\tilde{X}_{1,1}^{(i)})$. Normalise such that $\sum_{i=1}^N W_1^{(i)} = 1$.
4: Resample: for $i = 1:N$, sample
\[
\hat{X}_{1,1}^{(i)} \sim \frac{\sum_{i=1}^N W_1^{(i)}\,\delta_{\tilde{X}_{1,1}^{(i)}}}{\sum_{j=1}^N W_1^{(j)}}.
\]
5: Move: for $i = 1:N$, sample $X_{1,1}^{(i)} \sim K_1(\hat{X}_{1,1}^{(i)}, \cdot)$, where $K_1$ is $\pi_1$-invariant.
6: $t \leftarrow t + 1$
7: For $i = 1:N$, set $\tilde{X}_{t,1:t-1}^{(i)} = X_{t-1,1:t-1}^{(i)}$ and sample $\tilde{X}_{t,t}^{(i)} \sim q_t(\cdot|\tilde{X}_{t,t-1}^{(i)})$.
8: For $i = 1:N$, set $W_t^{(i)} \propto W_{t-1}^{(i)}\, f_t(\tilde{X}_{t,t}^{(i)}|\tilde{X}_{t,t-1}^{(i)})\, g_t(y_t|\tilde{X}_{t,t}^{(i)}) / q_t(\tilde{X}_{t,t}^{(i)}|\tilde{X}_{t,t-1}^{(i)})$. Normalise such that $\sum_{i=1}^N W_t^{(i)} = 1$.
9: Resample: for $i = 1:N$, sample
\[
\hat{X}_{t,1:t}^{(i)} \sim \frac{\sum_{i=1}^N W_t^{(i)}\,\delta_{\tilde{X}_{t,1:t}^{(i)}}}{\sum_{j=1}^N W_t^{(j)}}.
\]
10: Move: for $i = 1:N$, sample $X_{t,1:t}^{(i)} \sim K_t(\hat{X}_{t,1:t}^{(i)}, \cdot)$, where $K_t$ is $\pi_t$-invariant.
11: The smoothing and filtering distributions at time t may be approximated with
\[
\hat{\pi}_{SIS,t}^N(x_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{t,1:t}^{(i)}}(x_{1:t}), \quad \text{and} \quad \hat{\pi}_{SIS,t}^N(x_t) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{t,t}^{(i)}}(x_t),
\]
respectively.
12: Go to step 6.
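To illustrate the move step in isolation (this sketch is not part of the original notes and uses a toy standard normal target in place of a filtering distribution), one resampling-plus-Metropolis update can be written in R as:

set.seed(2)
p <- function(x) dnorm(x)                      # toy target density
N <- 1000
x <- rnorm(N, sd=2)                            # particles from an instrumental N(0, 4)
w <- p(x) / dnorm(x, sd=2)                     # importance weights
x <- sample(x, size=N, replace=TRUE, prob=w)   # resample: weights become uniform
prop <- x + rnorm(N, sd=0.5)                   # random-walk proposals
accept <- runif(N) < p(prop) / p(x)            # Metropolis acceptance (a p-invariant kernel)
x[accept] <- prop[accept]                      # moved particles still target p
length(unique(x))                              # diversity lost in resampling is partly restored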
Thus some progress has been made: a mechanism for injecting additional diversity into the sample has been developed. However, there is still a problem in the smoothing scenario. If we were to apply Markov kernels whose invariant distribution is the full path-space smoothing distribution at each time, then the space on which those kernels are defined would grow with every iteration. This would have two effects: the most direct is that applying the move would take an increasing length of time at every iteration; the second is that it is extremely difficult to design fast-mixing Markov kernels on spaces of high dimension. If an attempt is made to do so, it will generally be found that the degree of movement provided by an application of the Markov kernel decreases, in some sense, as the size of the space increases. For example, if a sequence of Gibbs sampler moves is used, then each coordinate will move only very slightly, whilst a Metropolis-Hastings approach will either propose only very small perturbations or will have a very high rate of rejection. In practice, one tends to employ Markov kernels which only alter the terminal value of the particles (although there is no difficulty with updating the last few values by this mechanism, this has an increased computational cost and does not directly alter the remainder of the trajectories). The insertion of additional MCMC
moves (as the application of a Markov kernel with the correct invariant distribution is often termed) in a filtering context is therefore a convenient way to increase the diversity of the terminal values of the sample paths.

It is not a coincidence that the various techniques for improving sample diversity all provide improved approximations of the filtering distribution but do little to improve the approximation of the path-space distributions involved in smoothing. In fact, theoretical analysis of the various particle filters reveals that their stability arises from that of the system which they are approximating. Under reasonable regularity conditions, the filtering equations can be shown to forget their past: a badly initialised filter will eventually converge to the correct distribution. As the dynamics of the particle system mirror those of the exact system which it is approximating, this property is inherited by the particle system. It prevents the accumulation of errors and allows convergence results to be obtained which are time uniform, in the sense that the same quality of convergence (in a variety of senses) can be obtained at all times with a constant number of particles, provided one considers only the filtering distributions; similar results have not been obtained in the more general smoothing case, and there are good technical reasons to suppose that they will not be forthcoming. Technical results are provided by (Del Moral, 2004, Chapter 7) but these lie far outside the scope of this course. It is, however, important to be aware of this phenomenon when implementing these algorithms in practice.
10.5 Auxiliary Particle Filters

The auxiliary particle filter (APF) is a technique which was originally proposed by (Pitt and Shephard, 1999, 2001), in the form presented in section 10.5.2, based upon auxiliary variables. It has more recently been recognised that it can, in fact, be interpreted as no more than an SIR algorithm which targets a slightly different sequence of distributions, and then uses importance sampling to correct for the discrepancy; this intuitive explanation, due to (Johansen and Doucet, 2007), is presented in section 10.5.3.

10.5.1 Motivations

Before becoming involved in the details, it is necessary to look at one particular problem from which SMC methods can suffer, and at what one would like to do in order to assuage this problem. The difficulty is that one resamples the particle set at the conclusion of one iteration of the algorithm, before moving the particles and weighting them in the light of the most recent observation. In practice, that observation can provide significant information about the likely state at the previous time (whilst $x_t$ is independent of $y_{t+1}$ conditional upon knowledge of $x_{t+1}$, we do not, in fact, know $x_{t+1}$). Deferring resampling isn't really an option, as it would simply mean leaving resampling until the following iteration and leaving the same problem for that iteration, but it would be nice if one could pre-weight the particles prior to resampling to reflect their compatibility with the next observation. This is essentially the idea behind the APF.

The relationship, in the filtering and smoothing context, between $x_n$ and $y_{n+1}$, assuming that $x_{n+1}$ is unknown (we wish to establish how well $x_n$ is able to account for the next observation before sampling the next state), is:
\[
p(x_n, y_{n+1} \mid y_{1:n}) = p(x_n \mid y_{1:n}) \int f_{n+1}(x_{n+1}|x_n)\, g_{n+1}(y_{n+1}|x_{n+1})\, dx_{n+1},
\]
which is simply the integral of the joint distribution of $x_{n:n+1}$ and $y_{n+1}$ given $y_{1:n}$. Defining $g_{n+1}(y_{n+1}|x_n) = \int f_{n+1}(x_{n+1}|x_n)\, g_{n+1}(y_{n+1}|x_{n+1})\, dx_{n+1}$, it would be desirable to use a term of this sort to determine how
well a particle matches the next observation in advance of resampling (and then to correct, by importance weighting, for the discrepancy that this introduces into the distribution after resampling).

10.5.2 Standard Formulation

The usual view of the APF is that it is a technique which employs some approximation $\hat{g}_{n+1}(y_{n+1}|x_n)$ to the predictive likelihood, $g_{n+1}(y_{n+1}|x_n)$, in an additional pre-weighting step, and employs an auxiliary variable technique to make use of these weights. The original proposal for the APF had a number of steps for each iteration:
1. Apply an auxiliary weighting proportional to $\hat{g}_{n+1}(y_{n+1}|x_n)$.
2. Using the auxiliary weights, propose moves from a mixture distribution.
3. Weight moves to account for the auxiliary weighting introduced previously, the observation and the proposal distribution.
4. Resample according to the standard weights.
Although the first three of these steps differ from a standard SIR algorithm, the third differs only in the nature of the importance weighting, and the first two are essentially the key innovation associated with the auxiliary particle filter. The idea is that, given a sample of particles which targets $\hat{\pi}_t$, and knowledge of the next observation, $y_{t+1}$, we can make use of an approximation of the conditional likelihood to attach a weight, $\lambda^{(i)}$, to each particle. This weight has the interpretation that it describes how consistent each element of the sample is with the next observation. Having determined these weightings, one proposes the values of each particle at time $t+1$ (in the filtering case which was considered originally) independently from the mixture distribution
\[
\sum_{j=1}^N \frac{\lambda^{(j)}}{\sum_{k=1}^N \lambda^{(k)}}\, q_{t+1}(\cdot\,|\,X_t^{(j)}).
\]
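A small R sketch of this proposal step (not part of the original notes; the Gaussian random-walk transition and all variable names are purely illustrative stand-ins) is:

set.seed(3)
N <- 100
x <- rnorm(N)                                     # stand-in particle values at time t
lambda <- runif(N)                                # stand-in (unnormalised) auxiliary weights
tau <- 0.5                                        # sd of a hypothetical Gaussian proposal q
j <- sample.int(N, size=N, replace=TRUE, prob=lambda/sum(lambda))   # pick mixture components
x.new <- rnorm(N, mean=x[j], sd=tau)              # one proposal per particle from the mixture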
The samples are then weighted such that they target $\hat{\pi}_{t+1}$, and resampling is carried out as usual. This procedure is carried out iteratively as in any other particle filter, with the detailed algorithmic description being provided by algorithm 5.

10.5.3 Interpretation as SIR

Over time it has become apparent that there is actually little point in resampling immediately prior to the auxiliary weighting step. Indeed, the use of an auxiliary variable in the manner described in the previous section is exactly equivalent to first resampling the particles (via a multinomial scheme) according to the auxiliary weights and then making standard proposals, with the proposal for each particle being made from the previous value of that particle. Thus resampling, weighting and resampling again introduces additional Monte Carlo variance for which there is no benefit. In fact, it makes more sense to resample after the auxiliary weighting step and not at the end of the algorithm iteration; the propagation of weights from the previous iterations means that resampling is then carried out on the basis of both the importance weights and the auxiliary weights. Having noticed this, it should be clear that there is actually little more to the APF than there is to SIR algorithms and, indeed, it is possible to interpret the auxiliary particle filter as no more than an SIR algorithm which targets a slightly different sequence of distributions to those of interest and then to use
Algorithm 5 Traditional View of the APF
1: Set t = 1.
2: For $i = 1:N$, sample $\tilde{X}_{1,1}^{(i)} \sim q_1(\cdot)$.
3: For $i = 1:N$, set $W_1^{(i)} \propto \pi_1(\tilde{X}_{1,1}^{(i)})\, g_1(y_1|\tilde{X}_{1,1}^{(i)}) / q_1(\tilde{X}_{1,1}^{(i)})$. Normalise such that $\sum_{i=1}^N W_1^{(i)} = 1$.
4: Resample: for $i = 1:N$, sample
\[
X_{1,1}^{(i)} \sim \frac{\sum_{i=1}^N W_1^{(i)}\,\delta_{\tilde{X}_{1,1}^{(i)}}}{\sum_{j=1}^N W_1^{(j)}}.
\]
5: $t \leftarrow t + 1$
6: Calculate auxiliary weights: set $\lambda_t^{(i)} \propto \hat{g}_t(y_t|X_{t-1,t-1}^{(i)})$ and normalise such that $\sum_{i=1}^N \lambda_t^{(i)} = 1$.
7: Sample $\alpha_t^{(i)}$ such that $\mathbb{P}\big(\alpha_t^{(i)} = j\big) = \lambda_t^{(j)}$ (i.e. sample from the discrete distribution with parameter $\lambda_t$).
8: For $i = 1:N$, set $\tilde{X}_{t,1:t-1}^{(i)} = X_{t-1,1:t-1}^{(\alpha_t^{(i)})}$ and sample $\tilde{X}_{t,t}^{(i)} \sim q_t(\cdot|\tilde{X}_{t,t-1}^{(i)})$.
9: For $i = 1:N$, set
\[
W_t^{(i)} \propto \frac{f_t(\tilde{X}_{t,t}^{(i)}|\tilde{X}_{t,t-1}^{(i)})\, g_t(y_t|\tilde{X}_{t,t}^{(i)})}{\lambda_t^{(\alpha_t^{(i)})}\, q_t(\tilde{X}_{t,t}^{(i)}|\tilde{X}_{t,t-1}^{(i)})}.
\]
Normalise such that $\sum_{i=1}^N W_t^{(i)} = 1$.
10: Resample: for $i = 1:N$, sample
\[
X_{t,1:t}^{(i)} \sim \frac{\sum_{i=1}^N W_t^{(i)}\,\delta_{\tilde{X}_{t,1:t}^{(i)}}}{\sum_{j=1}^N W_t^{(j)}}.
\]
11: The smoothing and filtering distributions at time t may be approximated with
\[
\hat{\pi}_{SIS,t}^N(x_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{t,1:t}^{(i)}}(x_{1:t}), \quad \text{and} \quad \hat{\pi}_{SIS,t}^N(x_t) = \frac{1}{N}\sum_{i=1}^N \delta_{X_{t,t}^{(i)}}(x_t),
\]
respectively.
12: Go to step 5.

Algorithm 6 The APF as SIR
1: Set t = 1.
2: For $i = 1:N$, sample $X_{1,1}^{(i)} \sim q_1(\cdot)$.
3: For $i = 1:N$, set $W_1^{(i)} \propto \pi_1(X_{1,1}^{(i)})\, g_1(y_1|X_{1,1}^{(i)}) / q_1(X_{1,1}^{(i)})$.
4: $t \leftarrow t + 1$
5: Set $\tilde{W}_t^{(i)} \propto W_{t-1}^{(i)}\,\lambda_t^{(i)}$, where $\lambda_t^{(i)} \propto \hat{g}_t(y_t|X_{t-1,t-1}^{(i)})$, and normalise such that $\sum_{i=1}^N \tilde{W}_t^{(i)} = 1$.
6: Resample: for $i = 1:N$, sample
\[
X_{t,1:t-1}^{(i)} \sim \frac{\sum_{i=1}^N \tilde{W}_t^{(i)}\,\delta_{X_{t-1,1:t-1}^{(i)}}}{\sum_{j=1}^N \tilde{W}_t^{(j)}}.
\]
7: For $i = 1:N$, sample $X_{t,t}^{(i)} \sim q_t(\cdot|X_{t,t-1}^{(i)})$.
8: For $i = 1:N$, set $W_t^{(i)} \propto f_t(X_{t,t}^{(i)}|X_{t,t-1}^{(i)})\, g_t(y_t|X_{t,t}^{(i)}) \,/\, \big[ q_t(X_{t,t}^{(i)}|X_{t,t-1}^{(i)})\, \hat{g}_t(y_t|X_{t,t-1}^{(i)}) \big]$. Normalise such that $\sum_{i=1}^N W_t^{(i)} = 1$.
9: The smoothing and filtering distributions at time t may be approximated with
\[
\hat{\pi}_{SIS,t}^N(x_{1:t}) = \sum_{i=1}^N W_t^{(i)}\,\delta_{X_{t,1:t}^{(i)}}(x_{1:t}), \quad \text{and} \quad \hat{\pi}_{SIS,t}^N(x_t) = \sum_{i=1}^N W_t^{(i)}\,\delta_{X_{t,t}^{(i)}}(x_t),
\]
respectively.
10: Go to step 5.
these as an importance sampling proposal distribution in order to estimate the quantities of interest (Johansen and Doucet, 2007). Although the precise formulation shown in algorithm 6 appears to differ in some minor details from algorithm 5, it corresponds exactly to the formulation of the APF which is most widely used in practice and which can be found throughout the literature. The significant point is that this type of filtering algorithm has an interpretation as a simple SIR algorithm with an importance sampling step and can be analysed within the same framework as the other algorithms presented within this chapter.

10.5.4 A Note on Asymptotic Variances

It is often useful, in estimation scenarios based upon sampling, to compare the asymptotic variances of the estimators which are available. In the context of sequential Monte Carlo methods a substantial amount of calculation is required to obtain even asymptotic variance expressions, and these can be difficult to interpret. In the case of the standard SIS algorithm one is performing standard importance sampling, and so the asymptotic variance can be obtained directly by standard methods; see (Geweke, 1989), for example. In the case of SIR and the APF a little more subtlety is required. Relatively convenient variance decompositions for these algorithms may be found in (Johansen and Doucet, 2007), which makes use of expressions applicable to the SIR case obtained by (Del Moral, 2004; Chopin, 2004).
10.6 Static Parameter Estimation

The difficulty with static parameter estimation becomes clear when considered in light of what has been learned in studying the filtering and smoothing problems: although we are able to obtain a good characterisation of the filtering distribution, the smoothing distribution is much more difficult to deal with. Typically, degeneracy occurs towards the beginning of the trajectory, and the estimated distribution of the earlier coordinates of the trajectory is extremely poor. In order to estimate the distribution of a static parameter within a HMM, it is generally necessary also to estimate the distribution of the latent state sequence. The posterior distribution of the static parameter depends heavily upon the distribution of the full sequence of latent variables, and the aforementioned degeneracy problems lead to severe difficulties when attempting to perform online parameter estimation within this framework.

A heuristic approach which has been advocated in some parts of the literature is to modify the model in such a way that the static parameter is replaced with a slowly time-varying parameter. This introduces a degree of forgetting, and it is possible to produce a model in which the parameter's dynamics allow it to forget the past on a shorter time-scale than that over which degeneracy becomes a problem. However, this is clearly rather unsatisfactory, and it is far from clear how inference based upon this model relates to that of interest.

A more principled approach was suggested by (Chopin, 2002). His approach was based upon something which he termed artificial dynamics: essentially, the static parameter is updated according to a Markov kernel of the correct invariant distribution after each iteration of the algorithm, allowing it to change. This overcomes one problem, namely that otherwise no new values of the parameter would ever be introduced into the algorithm, but it does not eliminate all of the problems which arise from the ultimate degeneracy of the path-space particle distribution. At the time of writing, no entirely satisfactory solution to the problem of online estimation of static parameters in a HMM via their posterior mode has been proposed, although a number of techniques which provide some improvement have been developed in recent years.
10.7 Extensions, Recent Developments and Further Reading

Much recent work has focused upon the problem of sampling from arbitrary sequences of distributions using the techniques developed in the SMC setting presented here. This work is interesting, but cannot be included in the present course due to time constraints; the interested reader is referred to (Del Moral et al., 2006a,b), which present a sophisticated method that includes other recently proposed approaches (Neal, 1998) as particular cases. An introductory, application-oriented text on sequential Monte Carlo methods is provided by the collection (Doucet et al., 2001). Short but clear introductions are provided by (Doucet et al., 2000) and (Robert and Casella, 2004, chapter 14). A simple, self-contained and reasonably direct theoretical analysis providing basic convergence results for standard SMC methods of the sort presented in this chapter can be found in (Crisan and Doucet, 2002). Direct proofs of central limit theorems are given by (Chopin, 2004; Künsch, 2005). A comprehensive mathematical treatment of a class of equations which can be used to analyse sequential Monte Carlo methods with some degree of sophistication is provided by (Del Moral, 2004) – this is necessarily somewhat technical and requires a considerably higher level of technical sophistication than this course; nonetheless, the treatment provided is lucid, self-contained and comprehensive, incorporating convergence results, central limit theorems and numerous stronger results.
Bibliography
Badger, L. (1994) Lazzarini’s lucky approximation of π. Mathematics Magazine, 67, 83–91. Brockwell, P. J. and Davis, R. A. (1991) Time series: theory and methods. New York: Springer, 2 edn. Brooks, S. and Gelman, A. (1998) General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455. Buffon, G. (1733) Editor’s note concerning a lecture given 1733 to the Royal Academy of Sciences in Paris. Histoire de l’Acad´emie Royale des Sciences, 43–45. — (1777) Essai d’arithm´etique morale. Histoire naturelle, g´en´erale et particuli`ere, Suppl´ ement 4, 46–123. Capp´e, O., Moulines, E. and Ryd´en, T. (2005) Inference in Hidden Markov Models. Springer Series in Statistics. Springer. Chopin, N. (2002) A sequential particle filter method for static models. Biometrika, 89, 539–551. — (2004) Central limit theorem for sequential Monte Carlo methods and its applications to Bayesian inference. Annals of Statistics, 32, 2385–2411. Crisan, D. and Doucet, A. (2002) A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50, 736–746. Del Moral, P. (2004) Feynman-Kac formulae: genealogical and interacting particle systems with applications. Probability and Its Applications. New York: Springer. Del Moral, P., Doucet, A. and Jasra, A. (2006a) Sequential Monte Carlo methods for Bayesian Computation. In Bayesian Statistics 8. Oxford University Press. — (2006b) Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B, 63, 411–436. Doucet, A., de Freitas, N. and Gordon, N. (eds.) (2001) Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer. Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential simulation-based methods for Bayesian filtering. Statistics and Computing, 10, 197–208. Doucet, A., Godsill, S. J. and Robert, C. P. (2002) Marginal maximum a posteriori estimation using Markov chain Monte Carlo. Statistics and Computing, 12, 77–84. Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalised Linear Models. New York: Springer, 2 edn. Gelfand, A. E. and Smith, A. F. M. (1990) Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Gelman, A., Gilks, W. R. and Roberts, G. O. (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of. Applied Probability, 7, 110–120.
Gelman, A., Roberts, G. O. and Gilks, W. R. (1995) Efficient Metropolis jumping rules. In Bayesian Statistics (eds. J. M. Bernado, J. Berger, A. Dawid and A. Smith), vol. 5. Oxford: Oxford University Press. Gelman, A. and Rubin, B. D. (1992) Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472. Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. Geweke, J. (1989) Bayesian inference in econometrics models using Monte Carlo integration. Econometrica, 57, 1317–1339. Gikhman, I. I. and Skorokhod, A. V. (1996) Introduction to the Theory of Random Processes. 31 East 2nd Street, Mineola, NY, USA: Dover. Gilks, W. R. and Berzuini, C. (2001a) Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society B, 63, 127–146. — (2001b) RESAMPLE-MOVE filtering with Cross-Model jumps. In Doucet et al. (2001), 117–138. Gilks, W. R., Richardson, S. and Spieghalter, D. J. (eds.) (1996) Markov Chain Monte Carlo In Practice. Chapman and Hall, first edn. Gordon, N. J., Salmond, S. J. and Smith, A. F. M. (1993) Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140, 107–113. Green, P. J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732. — (2003) Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems (eds. P. J. Green, N. L. Hjort and S. Richardson), Oxford Statistical Science Series, chap. 6, 179–206. Oxford University Press. Guihennec-Jouyaux, C., Mengersen, K. L. and Robert, C. P. (1998) Mcmc convergence diagnostics: A ”reviewww”. Tech. Rep. 9816, Institut National de la Statistique et des Etudes Economiques. Hajek, B. (1988) Cooling schedules for optimal annealing. Mathematics of Operations Research, 13, 311– 329. Halton, J. H. (1970) A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1–63. Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. Hwang, C.-R. (1980) Laplace’s method revisited: Weak convergence of probability measures. The Annals of Probability, 8, 1177–1182. Johansen, A. M. (2008) Markov chains. In Encyclopaedia of Computer Science and Engineering. Wiley and Sons. In preparation. Johansen, A. M. and Doucet, A. (2007) Auxiliary variable sequential Monte Carlo methods. Tech. Rep. 07:09, University of Bristol, Department of Mathematics – Statistics Group, University Walk, Bristol, BS8 1TW, UK. Jones, G. L. (2004) On the Markov chain central limit theorem. Probability Surveys, 1, 299–320. Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983) Optimization by simulated annealing. Science, 220, 4598, 671–680. Knuth, D. (1997) The Art of Computer Programming, vol. 1. Reading, MA: Addison-Wesley Professional. Kong, A., Liu, J. S. and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89, 278–288.
K¨ unsch, H. R. (2005) Recursive Monte Carlo filters: Algorithms and theoretical analysis. Annals of Statistics, 33, 1983–2021. Laplace, P. S. (1812) Th´eorie Analytique des Probabilit´es. Paris: Courcier. Lazzarini, M. (1901) Un’ applicazione del calcolo della probabilit`a alla ricerca esperimentale di un valore approssimato di π. Periodico di Matematica, 4. Liu, J. S. (2001) Monte Carlo Strategies in Scientific Computing. New York: Springer. Liu, J. S., Wong, W. H. and Kong, A. (1995) Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society B, 57, 157–169. Marsaglia, G. (1968) Random numbers fall mainly in the planes. Proceedings of the National Academy of Sciences of the United States of America, 61, 25–28. Marsaglia, G. and Zaman, A. (1991) A new class of random number generators. The Annals of Applied Probability, 1, 462–480. Matsumoto, M. and Nishimura, T. (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul., 8, 3–30. Metropolis, N. (1987) The beginning of the Monte Carlo method. Los Alamos Science, 15, 122–143. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. B., Teller, A. H. and Teller, E. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087–1092. Metropolis, N. and Ulam, S. (1949) The Monte Carlo method. Journal of the American Statistical Association, 44, 335–341. Meyn, S. P. and Tweedie, R. L. (1993) Markov Chains and Stochastic Stability. Springer, New York, Inc. URL http://probability.ca/MT/. Neal, R. M. (1998) Annealed importance sampling. Technical Report 9805, University of Toronto, Department of Statistics. URL ftp://ftp.cs.toronto.edu/pub/radford/ais.ps. Nummelin, E. (1984) General Irreducible Markov Chains and Non-Negative Operators. No. 83 in Cambridge Tracts in Mathematics. Cambridge University Press, 1st paperback edn. Philippe, A. and Robert, C. P. (2001) Riemann sums for mcmc estimation and convergence monitoring. Statistics and Computing, 11, 103–115. Pitt, M. K. and Shephard, N. (1999) Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94, 590–599. — (2001) Auxiliary variable based particle filters. In Doucet et al. (2001), chap. 13, 273–293. Richardson, S. and Green, P. J. (1997) On the bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59, 731–792. Ripley, B. D. (1987) Stochastic simulation. New York: Wiley. Robert, C. P. and Casella, G. (2004) Monte Carlo Statistical Methods. Secaucus, NJ, USA: Springer, New York, Inc., 2 edn. Roberts, G. and Tweedie, R. (1996) Geometric convergence and central limit theorems for multivariate Hastings and Metropolis algorithms. Biometrika, 83, 95–110. Roberts, G. O. (1996) Markov Chain concepts related to sampling algorithms. In Gilks et al. (1996), chap. 3, 45–54. Roberts, G. O. and Rosenthal, J. S. (2004) General state space Markov chains and MCMC algorithms. Probability Surveys, 1, 20–71. Tierney, L. (1994) Markov Chains for exploring posterior distributions. The Annals of Statistics, 22, 1701–1762. — (1996) Introduction to general state space Markov Chain theory. In Gilks et al. (1996), chap. 4, 59–74.
Ulam, S. (1983) Adventures of a Mathematician. New York: Charles Scribner’s Sons.
Monte Carlo Methods Computer Practical 1

You can download a file with the code from the course website (section Teaching material available for download): http://www.maths.bris.ac.uk/~manpw/teaching/mcm.
This practical focuses on sampling from the exponential distribution and from the gamma distribution. The density of the exponential distribution is, for $x > 0$ ($\lambda > 0$), $f_{(\lambda)}(x) = \lambda \exp(-\lambda x)$; the corresponding CDF is $F_{(\lambda)}(x) = 1 - \exp(-\lambda x)$. The density of the gamma distribution is, for $x > 0$ ($\alpha, \lambda > 0$):
\[
f_{(\alpha,\lambda)}(x) = \frac{1}{\Gamma(\alpha)}\, x^{\alpha-1} \lambda^{\alpha} \exp(-\lambda x).
\]
The exponential distribution is a special case of the gamma distribution with $\alpha = 1$. Further, if $X_i \sim \mathrm{Expo}(\lambda)$ i.i.d. and $Y = X_1 + \ldots + X_k$, then $Y \sim \mathrm{Gamma}(k, \lambda)$. For many distributions, the density (d...), CDF (p...), and inverse CDF (q...), as well as a random number generator (r...), are built into R. Examples are the normal distribution (dnorm, pnorm, qnorm, rnorm), or the gamma distribution (dgamma, pgamma, qgamma, rgamma). (If you want to get help on a specific function, say runif, just type ?runif.) In this practical we will only use runif, which generates pseudo-random numbers from a U[0, 1] distribution, not the other r... functions built into R.

First of all, we create a random sample from an exponential distribution. In the lectures we have seen that if $U \sim \mathrm{U}[0,1]$, then $-\log(U)/\lambda \sim \mathrm{Expo}(\lambda)$. The function runif(n) creates n pseudo-random numbers from U[0, 1]. So we can generate one realisation from an $\mathrm{Expo}(\lambda)$ distribution by:

lambda <- 1             # set the rate
u <- runif(1)           # create a pseudo-random number
x <- -log(u)/lambda     # transform to an exponential
x                       # output x
To generate more than one realisation, we could use a loop, as in

n <- 20                 # set the sample size
lambda <- 1             # set the rate
x <- numeric(n)         # create a vector of size n to hold the result
for (i in 1:n) {        # repeat n times ...
  u <- runif(1)         # create a pseudo-random number
  x[i] <- -log(u)/lambda  # transform to an exponential
}
x                       # output x
This is, however, not necessary (and inefficient), as R can work with vectors. So we can create n samples by

n <- 20                 # set the sample size
lambda <- 1             # set the rate
u <- runif(n)           # create a vector of n pseudo-random numbers
x <- -log(u)/lambda     # transform all elements to an exponential
x                       # output x
If we want to reuse the above code to generate exponential random numbers, it is best to wrap the code in a function. This can be done as follows:

rExpo <- function(n,lambda) {   # define the function rExpo
  u <- runif(n)                 # create a vector of n pseudo-random numbers
  x <- -log(u)/lambda           # transform all elements to an exponential
  x                             # return x
}
rExpo(n=20,lambda=1)            # test the function
To check informally that our code samples from the right distribution we can compare a histogram with a plot of the density.

x <- rExpo(n=1000,lambda=0.5)     # create a rather big sample
hist(x,freq=FALSE)                # plot a histogram
t <- 0:150/10                     # create a sequence 0,0.1,...,15
lines(t,dexp(t,rate=0.5),lwd=2)   # add the corresponding density
Alternatively, we could use a Kolmogorov-Smirnov test to test the null hypothesis that our sample is from the Expo(0.5) distribution.

ks.test(x,pexp,rate=0.5)          # carry out a K-S test
Note that the test does not “prove” that the sample is from the Expo(0.5) distribution; we just cannot prove the opposite, which might simply be due to the fact that the Kolmogorov-Smirnov test has very little power. As mentioned above, we have that $Y \sim \mathrm{Gamma}(k, \lambda)$ with $k \in \mathbb{N}$ if $Y = X_1 + \ldots + X_k$ with $X_j \sim \mathrm{Expo}(\lambda)$ i.i.d. Thus we can sample from $Y \sim \mathrm{Gamma}(k, \lambda)$ by summing up k numbers drawn from an $\mathrm{Expo}(\lambda)$ distribution:

k <- 4                            # set the parameter k
lambda <- 1                       # set the rate
x <- rExpo(n=k,lambda=lambda)     # draw k exponential random variables
y <- sum(x)                       # sum up the x
y                                 # output y
If we want to draw n random numbers from the $\mathrm{Gamma}(k, \lambda)$ distribution we can first create a matrix with n rows and k columns, consisting of $n \cdot k$ numbers drawn from $\mathrm{Expo}(\lambda)$, and then compute the sum of each row:

n <- 20                           # set the sample size
k <- 4                            # set the parameter k
lambda <- 1                       # set the rate
x <- matrix(rExpo(n=n*k,lambda=lambda),ncol=k)  # draw an n by k matrix of exponentials
y <- apply(x,1,sum)               # sum up each row
y                                 # output y
Once again we can wrap our code in a function:

rGammaInt <- function(n,k,lambda) {               # define the function rGammaInt
  x <- matrix(rExpo(n=n*k,lambda=lambda),ncol=k)  # draw an n by k matrix of exponentials
  apply(x,1,sum)                                  # return the row-wise sums
}
y <- rGammaInt(20,4,1)                            # try out the function
1. Modify the code above to create a sample of size 1,000 from the Gamma(6, 0.5) distribution.
2. Plot a histogram of the data and compare it to the density of the Gamma(6, 0.5) distribution. (The density can be evaluated using dgamma(x,6,0.5).)
3. Perform a Kolmogorov-Smirnov test to check your results. (The function for the CDF of a gamma distribution is pgamma.)

Our method of drawing from the gamma distribution clearly only works for integer k. If we want to sample from $\mathrm{Gamma}(\alpha, \lambda)$ with $\alpha \geq 1$ and $\lambda > 1$ we could use rejection sampling with a $\mathrm{Gamma}(k, \lambda-1)$ distribution, $k = \lfloor \alpha \rfloor$, as instrumental distribution. Then we have for the ratio
\[
\frac{f_{(\alpha,\lambda)}(x)}{f_{(k,\lambda-1)}(x)}
= \frac{x^{\alpha-1}\lambda^{\alpha}\exp(-\lambda x)/\Gamma(\alpha)}{x^{k-1}(\lambda-1)^{k}\exp(-(\lambda-1)x)/\Gamma(k)}
= \frac{\Gamma(k)\lambda^{\alpha}}{\Gamma(\alpha)(\lambda-1)^{k}}\, x^{\alpha-k}\exp(-x).
\]
The ratio attains its maximum of
\[
M := \frac{\Gamma(k)\lambda^{\alpha}(\alpha-k)^{\alpha-k}}{\Gamma(\alpha)(\lambda-1)^{k}}\,\exp(k-\alpha)
\]
at $x = \alpha - k$. It is easy to see that $M = f_{(\alpha,\lambda)}(\alpha-k) / f_{(k,\lambda-1)}(\alpha-k)$.
Let’s check our result by plotting the target density f(α,λ) (·) and the scaled proposal M f(k,λ−1) (·). The two densities should touch at α − k. 47 48 49 50
alpha <- 2.4 lambda <- 3 k <- floor(alpha) M <- dgamma(alpha-k,alpha,lambda) /
53 54 55 56
# set the parameter lambda # compute k
dgamma(alpha-k,k,lambda-1) # compute M
51 52
# set the parameter alpha
t <- 0:1000/250 # set t to 0,0.004,...,4 plot(t,M*dgamma(t,k,lambda-1),type="l") # plot scaled instrumental density lines(t,dgamma(t,alpha,lambda),lty=2) # add the target density abline(v=alpha-k,lty=3,col="grey") # add a vertical line at alpha-k legend("topright",lty=2:1,c("f(t)","M*g(t)")) # add a legend
57
Rejection sampling from the Gamma(α, λ) distribution using the Gamma(k, λ − 1) distribution as instrumental distribution can be carried out as follows: 58 59 60 61 62
n <- 20 alpha <- 2.4 lambda <- 3 k <- floor(alpha) M <- dgamma(alpha-k,alpha,lambda) /
65 66 67 68 69 70 71 72 73 74 75
# set the parameter lambda # compute k
dgamma(alpha-k,k,lambda-1) # compute M
63 64
# set the sample size # set the parameter alpha
y <- numeric(n) # create a vector to hold the result accepted.samples <- 0 # counter of accepted samples while (accepted.samples
Of course we can use importance sampling, too. Assume we want to estimate E(X) and E(X 2 ) for X ∼ Gamma(α, λ) using importance sampling with the same instrumental distribution as above. 76 77 78 79 80 81 82 83 84
n <- 100 # set the sample size alpha <- 2.4 # set the parameter alpha lambda <- 3 # set the parameter lambda k <- floor(alpha) # compute k x <- numeric(n) # create a vector to hold the sampled values w <- numeric(n) # create a vector to hold the weights for (i in 1:n) { # repeat n times ... x[i] <- rGammaInt(1,k,lambda-1) # draw from the instrumental distribution w[i] <- dgamma(x[i],alpha,lambda)/dgamma(x[i],k,lambda-1) # compute the weights
85 86 87 88
} estimate.of.e.x <- sum(x*w)/n estimate.of.e.x.2 <- sum(x^2*w)/n
# importance sampling estimate of E(X) # importance sampling estimate of E(Xˆ2)
In order to improve the efficiency of our code, we could get rid of the loop and use: 89 90 91 92 93 94
n <- 1000 # set the sample size alpha <- 2.4 # set the parameter alpha lambda <- 3 # set the parameter lambda k <- floor(alpha) # compute k x <- rGammaInt(n,k,lambda-1) # draw n times from the instrumental distribution w <- dgamma(x,alpha,lambda)/dgamma(x,k,lambda-1)
95 96 97
estimate.of.e.x <- sum(x*w)/n estimate.of.e.x.2 <- sum(x^2*w)/n
# compute the weights # importance sampling estimate of E(X) # importance sampling estimate of E(Xˆ2)
We can visualise the weighted sample as follows: 98 99 100
t <- 0:500/100 # set t=0,0.01,...,5 plot(t, dgamma(t,alpha,lambda)/dgamma(t,k,lambda-1), type="l",col="grey",xlab="sample",ylab="weights")
101 102 103 104 105
points(x,w,pch=15) for (i in 1:n) { lines(c(x[i],x[i]),c(0,w[i])) }
# plot weights as a function of x # add the sampled values # make it pretty by adding vertical lines
To see how the importance sampling estimate of E(X) evolves over time we can look at a plot similar to figure 2.5 in the notes: 106 107 108
estimate.over.time <- cumsum(w*x)/1:n # compute the cumulative sum and divide by its length plot(estimate.over.time,ylim=c(0.75,0.85),type="l") abline(h=2.4/3,lty=3) # add a horizontal line at the true value 6. Modify the above code such that the self-normalised estimate µ ˆ (instead of µ ˜) is computed. Of course, a random number generator for the gamma distribution is built into R, so we could have just used
109
y <- rgamma(20,2.4,3)
# using the built-in method
Monte Carlo Methods Computer Practical 1 (Answers)

1. We can use the function rGammaInt.
y <- rGammaInt(1000,6,0.5)
# generate the sample
2. Using the y generated in 1. 2 3 4
hist(y,freq=FALSE) t <- 0:350/10 lines(t,dgamma(t,6,0.5),lwd=2)
# plot a histogram # create a sequence 0,0.1,...,35 # add the corresponding density
3. The Kolmogorov-Smirnov test can be obtained using 5
ks.test(y,pgamma,6,0.5)
# carry out a K-S test
4. First of all, we need to generate the sample of size 1,000: 6 7 8 9 10
n <- 1000 alpha <- 2.4 lambda <- 3 k <- floor(alpha) M <- dgamma(alpha-k,alpha,lambda) /
13 14 15 16 17 18 19 20 21 22 23
# set the parameter alpha # set the parameter lambda # compute k
dgamma(alpha-k,k,lambda-1) # compute M
11 12
# set the sample size
y <- numeric(n) # create a vector to hold the result accepted.samples <- 0 # counter of accepted samples while (accepted.samples
24 25 26
hist(y,freq=FALSE) t <- 0:100/20 lines(t,dgamma(t,2.4,3),lwd=2)
# plot a histogram # create a sequence 0,0.05,...,5 # add the corresponding density
For a Kolmogorov-Smirnov test: 27
ks.test(y,pgamma,2.4,3)
# carry out a K-S test
5. The only thing we have to do is to introduce a variable that counts the number of attempts. 28 29 30 31 32 33 34 35 36 37 38
n <- 1000 alpha <- 2.4 lambda <- 3 k <- floor(alpha) M <- dgamma(alpha-k,alpha,lambda) /
# # # #
set the sample size set the parameter alpha set the parameter lambda compute k
dgamma(alpha-k,k,lambda-1) # compute M
y <- numeric(n) # create a vector to hold the result accepted.samples <- 0 # counter of accepted samples number.of.attempts <- 0 # counter of attempts while (accepted.samples
# increase counter
39 40 41 42 43 44 45 46 47 48 49 50
x <- rGammaInt(1,k,lambda-1) # draw proposed value f.at.x <- dgamma(x,alpha,lambda) # evaluate target density g.at.x <- dgamma(x,k,lambda-1) # evaluate instrumental density accept <- f.at.x / (M*g.at.x) # compute probability of accepting x if (runif(1)
# what the theory suggests # what we have obtained
6. We only have to change the lines where we compute the estimates. 51 52 53 54 55 56 57 58 59
n <- 1000 # set the sample size alpha <- 2.4 # set the parameter alpha lambda <- 3 # set the parameter lambda k <- floor(alpha) # compute k x <- rGammaInt(n,k,lambda-1) # draw n times from the instrumental distribution w <- dgamma(x,alpha,lambda)/dgamma(x,k,lambda-1) # compute the weights
sn.estimate.of.e.x <- sum(x*w)/sum(w) # self-normalised estimates sn.estimate.of.e.x.2 <- sum(x^2*w)/sum(w)
Monte Carlo Methods Computer Practical 2

You can download a file with the code from the course website (section Teaching material available for download): http://www.maths.bris.ac.uk/~manpw/teaching/mcm.
This practical focuses on sampling from the bivariate Gaussian distribution and from mixtures of bivariate Gaussians. The density of an $\mathrm{N}(\mu, \Sigma)$ distribution with $\mu = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}$ and $\Sigma = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}$ is
\[
f_{(\mu,\Sigma)}(x_1, x_2) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)'\,\Sigma^{-1}(x-\mu)\right),
\]
where $x = (x_1, x_2)$. The conditional distribution of $X_1$ given $X_2$ is
\[
X_1 \mid X_2 = x_2 \;\sim\; \mathrm{N}\!\left(\mu_1 + \sigma_{12}/\sigma_2^2\,(x_2 - \mu_2),\ \sigma_1^2 - (\sigma_{12})^2/\sigma_2^2\right).
\]
First of all we need to define the density of a bivariate normal distribution:
mvdnorm <- function(x, mu, sigma) { # a simple, but not very efficient implementation of the density x.minus.mu <- x - mu # subtract mu from x exp.arg <- -0.5*sum(x.minus.mu*solve(sigma,x.minus.mu)) # evaluate what’s inside the exp(...)
4
1 / (2*pi * sqrt(det(sigma))) * exp(exp.arg)
5
# return the density
6 7
} A more efficient implementation that can evaluate the density at more than one point is:
8 9 10 11 12 13 14
mvdnorm <- function(x, mu, sigma) { # a more complex, but more efficient implementation of the density if (is.vector(x)) x <- t(x) # if x is a vector, coerce it into a matrix x.minus.mu <- t(sweep(x,2,mu,’-’)) # subtract mu from x sigma.chol <- chol(sigma) # compute the Choleski decomposition of sigma sqrt.det <- prod(diag(sigma.chol)) # compute sqrt(det(sigma)) exp.arg <- -0.5 * colSums(x.minus.mu * backsolve(sigma.chol,forwardsolve(sigma.chol, x.minus.mu,upper.tri=TRUE,transpose=TRUE))) # evaluate what’s inside the exp(...)
15
drop(1 / ((2*pi)^(ncol(x)/2) * sqrt.det) * exp(exp.arg))
16
# return the density
17 18
} We start with using random walk Metropolis a algorithm to sample from the bivariate Gaussian distri0 4 1 bution with µ = and Σ = . 0 1 4
mu <- c(0,0)                          # set mean parameter
sigma <- matrix(c(4,1,1,4),nrow=2)    # set the covariance parameter
sd.proposal <- 2.5                    # set the standard deviation of the proposal
n <- 1000                             # set the desired sample size
x <- matrix(nrow=n,ncol=2)            # create matrix to hold the result
cur.x <- c(0,0)                       # set the starting value
cur.f <- mvdnorm(cur.x,mu,sigma)      # evaluate density at cur.x
n.accepted <- 0                       # counter of the accepted values
for (i in 1:n) {                      # repeat n times ...
  new.x <- cur.x + sd.proposal * rnorm(2)       # create proposed value
  new.f <- mvdnorm(new.x,mu,sigma)              # evaluate density at proposed value
  if (runif(1) < new.f/cur.f) {                 # if we accept the proposed value ...
    n.accepted <- n.accepted + 1                # ... increment the counter of accepted values
    cur.x <- new.x                              # ... accept the new value
    cur.f <- new.f                              # ... retain the density at the accepted value
  }
  x[i,] <- cur.x                                # store current value
}
First of all we plot the sample we obtained:
plot(x)                               # plot the sample
We can see more if we superimpose the density:
t <- -75:75/10                        # define one side of the grid
mat <- outer(t,t,function(x1,x2,...) mvdnorm(cbind(x1,x2),...),mu,sigma)
                                      # evaluate the density over a grid defined by t
image(t,t,mat)                        # draw the true density
points(x)                             # add points for the sample
In order to see how well the chain is mixing, we can look at the same plot as above, but with subsequent values linked with a line. In order not to get too chaotic a plot, we only look at the last 100 values.
plot(x[901:1000,],type="b")           # plot last 100 observations
A look at the sample paths of both variables:
par(mfrow=c(2,1))                     # split screen
plot(x[,1],type="l")                  # plot sample path of x1
plot(x[,2],type="l")                  # plot sample path of x2
Next we obtain the proportion of accepted values ...
n.accepted/n                          # proportion of accepted values
... as well as the correlations $\rho(X_j^{(t-1)}, X_j^{(t)})$.
cor(x[-1,1],x[-nrow(x),1])            # autocorrelation of x1
cor(x[-1,2],x[-nrow(x),2])            # autocorrelation of x2
1. Change the standard deviation of the proposal once to 0.1 and once to 10. Compare the diagnostic plots, the proportion of accepted values, and the autocorrelation.
Instead of using a random walk Metropolis algorithm we might as well use a Gibbs sampler:
n <- 1000                             # set the desired sample size
x <- matrix(nrow=n,ncol=2)            # create matrix to hold the result
cur.x <- c(0,0)                       # set the starting value
for (i in 1:n) {                      # repeat n times ...
  cur.x[1] <- rnorm(1,mean=mu[1]+sigma[1,2]/sigma[2,2]*(cur.x[2]-mu[2]),
                    sd=sqrt(sigma[1,1]-sigma[1,2]^2/sigma[2,2]))
                                      # update x1 by sampling from x1 given x2
  cur.x[2] <- rnorm(1,mean=mu[2]+sigma[1,2]/sigma[1,1]*(cur.x[1]-mu[1]),
                    sd=sqrt(sigma[2,2]-sigma[1,2]^2/sigma[1,1]))
                                      # update x2 by sampling from x2 given x1
  x[i,] <- cur.x                      # store the current value
}
2. Analyse the output of the Gibbs sampler the same way we have analysed the output of the random walk Metropolis algorithm. Compare the two methods.
3. Change the covariance matrix to $\Sigma = \begin{pmatrix} 4 & 2.8 \\ 2.8 & 2 \end{pmatrix}$ and run both algorithms again. What can you observe from the diagnostic plots?
Of course, we can sample from a bivariate Gaussian distribution without having to resort to Markov chain Monte Carlo methods. If $Z \sim \mathrm{N}(0, I)$, then $\Sigma^{1/2} Z + \mu \sim \mathrm{N}(\mu, \Sigma)$. Thus we can sample
sigma.chol <- chol(sigma)             # compute the Choleski root
z <- rnorm(2)                         # generate a pair of N(0,1)s
y <- mu + t(sigma.chol)%*%z           # compute y
A function for sampling from a bivariate Gaussian is available in the package MASS in R (another one is available in the package mvtnorm).
library(MASS)                         # load the required library
mvrnorm(n=1,mu=mu,Sigma=sigma)        # sample from a multivariate Gaussian
In the second half of the practical we look at sampling from a mixture of two bivariate Gaussians, i.e.
$$f(x_1, x_2) = p\,\varphi_{(\mu_1, \Sigma_1)}(x) + (1-p)\,\varphi_{(\mu_2, \Sigma_2)}(x)$$
with $p = 0.4$,
$$\mu_1 = \begin{pmatrix} \mu_{1,1} \\ \mu_{1,2} \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} \sigma_{1,1}^2 & \sigma_{1,12} \\ \sigma_{1,12} & \sigma_{1,2}^2 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} \mu_{2,1} \\ \mu_{2,2} \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} \sigma_{2,1}^2 & \sigma_{2,12} \\ \sigma_{2,12} & \sigma_{2,2}^2 \end{pmatrix}.$$
The conditional distribution of $X_1 \mid X_2$ is a univariate mixture of two Gaussians:
$$f(x_1|x_2) = p_{x_2}\,\varphi_{(\mu_{1,1}+\sigma_{1,12}/\sigma_{1,2}^2(x_2-\mu_{1,2}),\ \sigma_{1,1}^2-\sigma_{1,12}^2/\sigma_{1,2}^2)}(x_1) + (1-p_{x_2})\,\varphi_{(\mu_{2,1}+\sigma_{2,12}/\sigma_{2,2}^2(x_2-\mu_{2,2}),\ \sigma_{2,1}^2-\sigma_{2,12}^2/\sigma_{2,2}^2)}(x_1)$$
with
$$p_{x_2} = \frac{p\,\varphi_{(\mu_{1,2},\sigma_{1,2}^2)}(x_2)}{p\,\varphi_{(\mu_{1,2},\sigma_{1,2}^2)}(x_2) + (1-p)\,\varphi_{(\mu_{2,2},\sigma_{2,2}^2)}(x_2)}.$$
First of all we have to define the joint density:
dmix <- function(x,p,mu1,sigma1,mu2,sigma2) {
  # computes the density of a mixture of two Gaussians
  p*mvdnorm(x,mu1,sigma1) + (1-p)*mvdnorm(x,mu2,sigma2)
}
Consider the following parameters:
$$\mu_1 = \begin{pmatrix} -3 \\ 2 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ -2 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}.$$
Let's first look at what the density looks like:
p <- 0.4                              # prior probability of population 1
mu1 <- c(-3,2)                        # set mean of population 1
sigma1 <- matrix(c(1,0.5,0.5,1),nrow=2)     # set the covariance of population 1
mu2 <- c(3,-2)                        # set mean of population 2
sigma2 <- matrix(c(1,-0.5,-0.5,1),nrow=2)   # set the covariance of population 2
t <- -60:60/10                        # define one side of the grid
mat <- outer(t,t,function(x1,x2,...) dmix(cbind(x1,x2),...),p,mu1,sigma1,mu2,sigma2)
                                      # evaluate the density over a grid defined by t
image(t,t,mat)                        # draw the true density
Let's first implement a random walk Metropolis algorithm.
sd.proposal <- 3                      # set the standard deviation of the proposal
n <- 1000                             # set the desired sample size
x <- matrix(nrow=n,ncol=2)            # create matrix to hold the result
cur.x <- c(0,0)                       # set the starting value
cur.f <- dmix(cur.x,p,mu1,sigma1,mu2,sigma2)    # evaluate density at cur.x
n.accepted <- 0                       # counter of the accepted values
for (i in 1:n) {                      # repeat n times ...
  new.x <- cur.x + sd.proposal * rnorm(2)       # create proposed value
  new.f <- dmix(new.x,p,mu1,sigma1,mu2,sigma2)  # evaluate density at proposed value
  if (runif(1) < new.f/cur.f) {                 # if we accept the proposed value ...
    n.accepted <- n.accepted + 1                # ... increment the counter of accepted values
    cur.x <- new.x                              # ... accept the new value
    cur.f <- new.f                              # ... retain the density at the accepted value
  }
  x[i,] <- cur.x                                # store current value
}
Let's add the sample points to the plot of the density in order to be sure that we explore the entire support:
image(t,t,mat)                        # draw the true density
points(x)                             # add points for the sample
4. Analyse the chain using the code from above (diagnostic plots, autocorrelation, ...).
5. The proportion of accepted values is very low. This might be due to sd.proposal being relatively large. What happens if you change sd.proposal to 1?
Let's finally look at a Gibbs sampler for the mixture of Gaussians. First of all, we create a function for sampling from the full conditionals:
rcondmix <- function(x2,p,mu1,sigma1,mu2,sigma2) {
  # samples from the conditional distribution of x1 given x2
  w1 <- p * dnorm(x2,mu1[2],sqrt(sigma1[2,2]))
  w2 <- (1-p) * dnorm(x2,mu2[2],sqrt(sigma2[2,2]))
  w1 <- w1 / (w1+w2)                  # probability that x2 is from the first population
  if (runif(1)<w1) {                  # with this probability ...
    mu <- mu1                         # ... set the mean to the one of the first population
    sigma <- sigma1                   # ... set the covariance to the one of the first population
  } else {                            # ... else ...
    mu <- mu2                         # ... set the mean to the one of the second population
    sigma <- sigma2                   # ... set the covariance to the one of the second population
  }
  rnorm(1,mean=mu[1]+sigma[1,2]/sigma[2,2]*(x2-mu[2]),
        sd=sqrt(sigma[1,1]-sigma[1,2]^2/sigma[2,2]))
                                      # return a sample from that distribution
}
Now we are ready to run the Gibbs sampler:
x <- matrix(nrow=n,ncol=2)            # create matrix to hold the result
cur.x <- c(0,0)                       # set the starting value
for (i in 1:n) {                      # repeat n times
  cur.x[1] <- rcondmix(cur.x[2],p,mu1,sigma1,mu2,sigma2)
                                      # sample from x1 given x2
  cur.x[2] <- rcondmix(cur.x[1],p,rev(mu1),matrix(rev(sigma1),nrow=2),
                       rev(mu2),matrix(rev(sigma2),nrow=2))
                                      # sample from x2 given x1
  x[i,] <- cur.x                      # store current value
}
6. Analyse the output of your MCMC sampler using the code from above.
7. Change the mean parameters to $\mu_1 = \begin{pmatrix} -5 \\ 4 \end{pmatrix}$, $\mu_2 = \begin{pmatrix} 5 \\ -4 \end{pmatrix}$, and run the Gibbs sampler. Carefully diagnose the output.
Of course, we can sample from a mixture of Gaussians directly without having to resort to MCMC methods.
pop <- sample(2,1,prob=c(p,1-p))      # choose a population at random
if (pop==1) {                         # if we pick the first one ...
  x <- mvrnorm(1,mu=mu1,Sigma=sigma1) # ... we sample from it ...
} else {                              # ... otherwise ...
  x <- mvrnorm(1,mu=mu2,Sigma=sigma2) # ... we sample from the other population
}
Monte Carlo Methods Problem sheet T1 During weeks 3-7 you will get one problem sheet per week, i.e. there will be five problem sheets: three problem sheets (T1,T2, and T3) will focus on theoretical issues, two (P1 and P2) will be based on the computer practicals. The best four problem sheets will contribute 20% to your final mark. (The final exam accounts for 80% of your mark.)
Please hand in your answers to questions 1 and 2 of this problem sheet before Friday October 23rd, 11am. Answering questions 3 and 4 will earn you some bonus marks.
⋆ 1. Assume that the instrumental distribution $g$ in importance sampling¹ is chosen such that $f(x) < M \cdot g(x)$ for all $x$ and a suitable $M \in \mathbb{R}$, where $f$ is the density of the target distribution.
(a) Show that $\mathrm{Var}_g(w(X)) < M - 1$.
(b) Show that $\mathrm{Var}_g(w(X)\cdot h(X))$ is finite if $\mathrm{Var}_f(h(X))$ is finite.
⋆ 2. Assume you want to sample from a $\mathrm{N}(0,1)$ distribution using a $\mathrm{N}(1,\sigma^2)$ distribution as instrumental distribution.
(a) Show that the rejection sampling algorithm yields samples from the target $\mathrm{N}(0,1)$ distribution iff $\sigma^2 > 1$.
(b) Show that the variance of the weights using importance sampling is finite iff $\sigma^2 > 1/2$.
(c) Which choice of $\mu$ and $\sigma^2$ would minimise the variance of the weights if you could use a $\mathrm{N}(\mu,\sigma^2)$ distribution (instead of the $\mathrm{N}(1,\sigma^2)$ distribution from above) as instrumental distribution? What is the minimal variance of the weights that you can achieve this way?
3. Show that the expected number of times step 1 of algorithm 2.1 has to be carried out until a proposed value is accepted is $M$ if the density $f(x)$ is used, and $M \cdot C$ if the density is only known up to a multiplicative constant $C$, i.e. $f(x) = C \cdot \pi(x)$, and $\pi(x)$ is used instead of $f(x)$.
4. Let $F(\cdot)$ be a CDF, and let $F^-(\cdot)$ be its generalised inverse.
(a) Let $U_1, U_2 \sim \mathrm{U}(0,1)$. Show that
$$2\cdot\mathrm{Cov}\left(F^-(U_1), F^-(1-U_1)\right) = \mathrm{E}\left[\left(F^-(U_1) - F^-(U_2)\right)\left(F^-(1-U_1) - F^-(1-U_2)\right)\right].$$
(b) Show that for all $u_1, u_2 \in \mathbb{R}$: $\left(F^-(u_1) - F^-(u_2)\right)\left(F^-(1-u_1) - F^-(1-u_2)\right) \le 0$.
(c) Deduce from what you have obtained so far that $\mathrm{Cov}(F^-(U_1), F^-(1-U_1)) \le 0$.
(d) Show that
$$\mathrm{Var}\left(\frac{F^-(U_1) + F^-(1-U_1)}{2}\right) \le \mathrm{Var}\left(\frac{F^-(U_1) + F^-(U_2)}{2}\right).$$
Interpret this result.
¹ On this problem sheet "importance sampling" refers to importance sampling not using self-normalised weights.
Monte Carlo Methods Solutions T1
1. (a) From the lectures we have that $\mathrm{E}_g(w(X)) = \int \frac{f(x)}{g(x)} g(x)\, dx = 1$. Further,
$$\mathrm{E}_g\left(w(X)^2\right) = \int \frac{f(x)^2}{g(x)^2} g(x)\, dx = \int \underbrace{\frac{f(x)}{g(x)}}_{<M} f(x)\, dx < M \underbrace{\int f(x)\, dx}_{=1} = M.$$
Thus $\mathrm{Var}_g(w(X)) = \mathrm{E}_g\left(w(X)^2\right) - \left(\mathrm{E}_g(w(X))\right)^2 < M - 1$.
(b) The importance sampling estimate has finite variance iff $\mathrm{E}_g\left(h(X)^2 w(X)^2\right) < +\infty$. We have that
$$\mathrm{E}_g\left(w(X)^2 h(X)^2\right) = \int h(x)^2 \frac{f(x)^2}{g(x)^2} g(x)\, dx = \int h(x)^2 \underbrace{\frac{f(x)}{g(x)}}_{<M} f(x)\, dx < M \cdot \mathrm{E}_f\left(h(X)^2\right) < +\infty,$$
as $\mathrm{Var}_f(h(X)) < +\infty$.
2. (a) In order to be able to carry out rejection sampling we need that there is a finite $M > 0$ such that $\varphi_{(0,1)}(x) < M\,\varphi_{(1,\sigma^2)}(x)$. This is the case iff $\varphi_{(0,1)}(x)/\varphi_{(1,\sigma^2)}(x)$ is bounded.
$$\frac{\varphi_{(0,1)}(x)}{\varphi_{(1,\sigma^2)}(x)} = \frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)}{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-1)^2}{2\sigma^2}\right)} = \sqrt{\sigma^2}\exp\left(-\frac{x^2}{2} + \frac{(x-1)^2}{2\sigma^2}\right) = \sqrt{\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left((\sigma^2-1)x^2 + 2x - 1\right)\right).$$
It is easy to see that the right-hand side is bounded above iff $\sigma^2 > 1$ (the "overall sign" of the $x^2$ has to be negative, otherwise the exponential is unbounded).
(b) $\mathrm{Var}_g(W(X))$ is finite iff $\mathrm{E}_g\left(W(X)^2\right)$ is finite.
$$\mathrm{E}_g\left(W(X)^2\right) = \int \left(\frac{\varphi_{(0,1)}(x)}{\varphi_{(1,\sigma^2)}(x)}\right)^2 \varphi_{(1,\sigma^2)}(x)\, dx = \int \frac{\varphi_{(0,1)}(x)^2}{\varphi_{(1,\sigma^2)}(x)}\, dx = \int \frac{\frac{1}{2\pi}\exp\left(-x^2\right)}{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-1)^2}{2\sigma^2}\right)}\, dx$$
$$= \sqrt{\frac{\sigma^2}{2\pi}} \int \exp\left(-x^2 + \frac{(x-1)^2}{2\sigma^2}\right) dx = \sqrt{\frac{\sigma^2}{2\pi}} \int \exp\left(-\frac{1}{2\sigma^2}\left((2\sigma^2-1)x^2 + 2x - 1\right)\right) dx.$$
It is easy to see that the right-hand side is finite iff $\sigma^2 > 1/2$ (the "overall sign" of the $x^2$ has to be negative, otherwise the integral is $+\infty$).
(c) The variance of the weights is zero when we set $\mu = 0$ and $\sigma^2 = 1$, as the instrumental distribution is then identical to the target distribution.
3. Consider first the case where the normalised target density $f(x)$ is available. Denote the acceptance probability by $a(x) := \frac{f(x)}{M g(x)}$, and denote by $N$ the number of attempts required until the first acceptance occurs. Then, for integer $k \ge 1$, we have the following:
$$\mathrm{P}(N = k) = \mathrm{E}\left[\mathrm{P}(N = k \mid X_{1:\infty})\right] = \mathrm{E}\left[a(X_k)\prod_{j=1}^{k-1}(1 - a(X_j))\right] = \mathrm{E}\left[a(X_k)\right]\prod_{j=1}^{k-1}\mathrm{E}\left[1 - a(X_j)\right] = \frac{1}{M}\left(1 - \frac{1}{M}\right)^{k-1},$$
where the penultimate and final equalities hold as $X_1, X_2, \ldots$ are i.i.d. according to $g(x)$. The random variable $N$ is marginally geometrically distributed with parameter $1/M$ and therefore $\mathrm{E}[N] = M$. The case where the unnormalised target density is available can be analysed in the same manner.
4. (a) Using the fact that $U_1$ and $U_2$ have the same distribution and are independent we obtain
$$\begin{aligned}
2\cdot\mathrm{Cov}\left(F^-(U_1), F^-(1-U_1)\right) &= 2\cdot\mathrm{E}\left(F^-(U_1)F^-(1-U_1)\right) - 2\cdot\mathrm{E}\left(F^-(U_1)\right)\mathrm{E}\left(F^-(1-U_2)\right)\\
&= 2\cdot\mathrm{E}\left(F^-(U_1)F^-(1-U_1)\right) - 2\cdot\mathrm{E}\left(F^-(U_1)F^-(1-U_2)\right)\\
&= \mathrm{E}\left(F^-(U_1)F^-(1-U_1)\right) - \mathrm{E}\left(F^-(U_1)F^-(1-U_2)\right) - \mathrm{E}\left(F^-(U_2)F^-(1-U_1)\right) + \mathrm{E}\left(F^-(U_2)F^-(1-U_2)\right)\\
&= \mathrm{E}\left[\left(F^-(U_1) - F^-(U_2)\right)\left(F^-(1-U_1) - F^-(1-U_2)\right)\right].
\end{aligned}$$
(b) Without loss of generality let $u_1 \le u_2$ (otherwise relabel $u_1$ and $u_2$), thus $1 - u_2 \le 1 - u_1$. As $F^-(\cdot)$ is non-decreasing we have that $F^-(u_1) \le F^-(u_2)$ and $F^-(1-u_2) \le F^-(1-u_1)$. Thus
$$\underbrace{\left(F^-(u_1) - F^-(u_2)\right)}_{\le 0}\underbrace{\left(F^-(1-u_1) - F^-(1-u_2)\right)}_{\ge 0} \le 0.$$
(c) Combining the two previous results,
$$2\cdot\mathrm{Cov}\left(F^-(U_1), F^-(1-U_1)\right) \stackrel{(a)}{=} \mathrm{E}\left[\left(F^-(U_1) - F^-(U_2)\right)\left(F^-(1-U_1) - F^-(1-U_2)\right)\right] \stackrel{(b)}{\le} 0.$$
(d) We have that
$$\begin{aligned}
\mathrm{Var}\left(\frac{F^-(U_1) + F^-(1-U_1)}{2}\right) &= \frac{\mathrm{Var}(F^-(U_1)) + 2\cdot\mathrm{Cov}(F^-(U_1), F^-(1-U_1)) + \mathrm{Var}(F^-(1-U_1))}{4}\\
&= \frac{\mathrm{Var}(F^-(U_1))}{2} + \underbrace{\frac{\mathrm{Cov}(F^-(U_1), F^-(1-U_1))}{2}}_{\le 0}
\le \frac{\mathrm{Var}(F^-(U_1))}{2} = \mathrm{Var}\left(\frac{F^-(U_1) + F^-(U_2)}{2}\right).
\end{aligned}$$
Monte Carlo Methods Problem sheet T2 Please hand in your answers to questions 1 and 2 of this problem sheet before Friday November 6th, 11am. Answering questions 3, 4, and 5 will earn you some bonus marks.
⋆ 1. Let f (·) be the density of random variable with support supp(f ) = [a, b] with a < b ∈ R. Consider the following algorithm for t = 1, 2, . . . 1. Draw X ∼ U[a, b].
2. With probability min 1,
f (X) f (X (t−1) )
set X (t) = X, otherwise X (t) = X (t−1) .
(a) Write down the Markov kernel corresponding to one iteration of the above algorithm. (b) Verify that the above algorithm satisfies the detailed balance condition. (c) Verify that the Markov chain generated by the above algorithm is f -irreducible. (d) The Markov chain generated by the above algorithm is also aperiodic. What can we conclude? ⋆ 2. Consider a vector (X1 , . . . , XK ) of length K > 2. (X1 , . . . , XK ) is said to be from a Dirichlet distribution with parameter vector (α1 , . . . , αK ) (αk > 0) if its density is PK K Γ( αk ) Y αk −1 f (x1 , . . . , xK ) = QK k=1 xk k=1 Γ(αk ) k=1
PK for xk ≥ 0 and k=1 xk = 1, and 0 otherwise. As we have seen in section 4.5 of the lecture notes, the Dirichlet distribution can be used as a prior distribution over the space of probability distributions on a finite set {1, . . . , K}. Define for i < j X−ij := (X1 , . . . , Xi−1 , Xi+1 , . . . , Xj−1 , Xj+1 , . . . , XK ). (a) Show that for i < j the conditional density of Xi , Xj |X−ij is α −1
i −1 fXi ,Xj |X−ij (xi , xj |x−ij ) ∝ xα xj j i
for xi , xj ≥ 0 such that xi + xj = 1 − (b) Define ˜ i := Xi X
PK
k=1, i6=k6=j
xk .
K K . . X X ˜ 1− Xk and Xj := Xj 1− Xk . k=1 i6=k6=j
k=1 i6=k6=j
˜i, X ˜ j ≥ 0 and X ˜i + X ˜ j = 1, and that Show that X i −1 fX˜ i ,X˜ j |X−ij (˜ xi , x ˜j |x−ij ) ∝ x ˜α (1 − x ˜i )αj −1 . i
˜ i |X−ij ∼ Beta(αi , αj ), a distribution which is easy to sample from.1 Deduce from this that X (c) Using your results from part (b), propose a way of sampling from the distribution of Xi , Xj |X−ij . (d) Use your results from above to propose a random scan Gibbs sampler for sampling from the Dirichlet distribution.2 PK Note that because of the constraint k=1 Xk = 1, each step of the Gibbs sampler has to update a pair (Xi , Xj ). Γ(a+b)
density of a Beta distribution is f(a,b) (z) = Γ(a)Γ(b) z a−1 (1 − z)b−1 for z ∈ [0, 1]. 2 As a side note, this is not the most efficient way of sampling from a Dirichlet distribution. It is more efficient to use P the fact that if Yk ∼ Gamma(αk , 1), then (X1 , . . . , Xk ) ∼ Dirichlet(α1 , . . . , αK ) if Xk := Yk / K κ=1 Yκ . 1 The
3. Let (X1 , X2 ) be a two-dimensional random variable with joint density f (x1 , x2 ). Denote by fX1 |X2 (x1 |x2 ) and fX2 |X1 (x2 |x1 ) the corresponding conditional densities. Show that f (x1 , x2 ) = Z
fX1 |X2 (x1 |x2 ) . fX1 |X2 (x1 |x2 ) dx1 fX2 |X1 (x2 |x1 )
4. Consider the following algorithm to sample from a distribution with density f (x). Starting with an initial value X (0) ∈ supp(f ), iterate for t = 1, 2, . . .. 1. Draw U (t) ∼ U[0, f (X (t−1) )].
2. Draw X (t) ∼ U{x : f (x) ≥ U (t) }. The figure below illustrates the above algorithm:
1.0 0.0
0.5
U (t)
1.5
2.0
Example: Sampling from a Beta(3, 5) distribution
0.0
0.4
0.2
0.6
1.0
0.8
X (t)
(a) Show that the above algorithm is a Gibbs sampler that samples from the uniform distribution on the set {(x, u) : u ≤ f (x)}.
(b) Deduce from this that the invariant distribution of X (t) has the density f (·).
5. (a) Survival analysis models the time between the beginning of the observation period (e.g. the enrollment in a clinical study) until a certain event (e.g. death, occurrence of metastases, complete remission) occurs. The time Ti that lapses until the event occurs to the i-th individual can be modeled using a log-normal distribution. It has the density √ τ τ 2 fTi |µ,τ (t) = √ exp − (log(t) − µ) for t > 0, 2 t 2π i.e. log(Ti ) ∼ N (µ, 1/τ ). For the sake of simplicity we will assume that τ is known.3 The prior distribution of µ is a normal distribution with mean µ0 and variance 1/τ0 , i.e. µ ∼ N(µ0 , 1/τ0 ). Show that given observations T1 , . . . , Tn the posterior distributions of µ is µ|T1 , . . . , Tn ∼ N
1 τ S n + τ0 µ , nτ + τ0 nτ + τ0
i.e. the above prior is conjugate. 3 The
conjugate prior for τ would be a Gamma distribution.
,
where Sn =
n X i=1
log(Ti ),
(b) In practice, it is however impossible to follow each individual until the event occurs. For some individuals it might occur after the end of the study, for others it might not occur at all. This is referred to as censoring. Censoring can be modeled by introducing a second random variable, the censoring time Ui , which indicates when an individual drops out of the study. Of course, we cannot observe both Ti and Ui . If the event occurs before the individual i drops out of the study (i.e. Ti < Ui ), we can only observe Ti , but not Ui . If no event has occurred before the individual drops out of the study (i.e. Ti > Ui ), then we can only observe Ui , but not Ti . Thus the data we can observe is (Si , Ci ), where Si = min{Ti , Ui } and Ci is an indicator whether the i-th individual dropped out of the study, i.e. Ci = I{Ui <Ti } . The figure below illustrates the censoring process. Crosses indicate censoring (i.e. the individual drops out of the study), dots indicate observed event times. The lines correspond to the periods during which the individuals can be observed. T1
U1
×
U2 T2
×
T3 U3
×
U4
×
T4
(no censoring)
T1 < U1
T1 can be observed
T2 > U2
T2 cannot be observed (censored)
T3 < U3
T3 can be observed
T4 > U4
T4 cannot be observed (censored)
(no censoring)
(S1 , C1 ) = (T1 , 0) (S2 , C2 ) = (U2 , 1) (S3 , C3 ) = (T3 , 0) (S4 , C4 ) = (U4 , 1)
time Note that if Ti cannot be observed, we know that Ti > Ui , where Ui is the observed censoring time. Assume that the censoring time Ui is independent of Ti . Propose a Gibbs sampling strategy to obtain samples from the posterior distribution of µ given the data ((S1 , C1 ), . . . , (Sn , Cn )). Hint: Consider introducing the unobserved event times Ti (or their logarithms) as auxiliary variables.
Monte Carlo Methods Problem sheet P1 Please hand in your answers (commented R code and suitably summarised output) to questions 1(a), 2(b), and 3 of this problem sheet before Friday October 30th, 11am. Answering questions 1(b), 2(a), and 4 will earn you some bonus marks.
1. This question is about the Box-Muller method that can be used to generate a pair of Gaussian random variables (see example 2.2 in the lecture notes). ?
(a) Implement the Box-Muller method in R. Check your code by generating 10,000 Gaussian random numbers1 and check whether they seem to come from the desired N(0, 1) distribution using suitable plots or statistical tests. (b) The Polar Marsaglia method is a modification of the Box-Muller method that avoids having to compute the sine and cosine, however at the price of introducing a rejection sampling step: i.i.d.
1. Generate U1 , U2 ∼ U [−1, 1]. 2. Compute R2 := U12 + U22 . 3. If R2 > 1 reject the pair (U1 , U2 ), and go back to step 1. r r −2 log(R2 ) −2 log(R2 ) and X := U 4. Set X1 := U1 2 2 R2 R2 i.i.d.
The above algorithm generates a pair X1 , X2 ∼ N(0, 1). Implement it in R. Perform the same checks as in part (a). 2. Consider as in question 2 of the first problem sheet (T1) sampling from the N(0, 1) distribution using the N(1, σ 2 ) distribution as instrumental distribution. (a)
i. Draw a sample of size 10,000 from the N(0, 1) distribution using rejection sampling with the N(1, σ 2 ) distribution with σ 2 = 2 as instrumental distribution.2 ii. Verify using suitable plots or statistical tests that the sample you generated in part i. is indeed from the N(0, 1) distribution.
?
(b)
i. Draw a weighted sample of size 10,000 from the N(0, 1) distribution using importance sampling with the N(1, σ 2 ) distribution with σ 2 = 2 as instrumental distribution. ii. Check your results by computing the weighted mean and the weighted variance; they should be 0 and 1. iii. Empirically find the value of σ 2 yielding a minimal variance of the weights.
Use the functions dnorm to evaluate φ(µ,σ2 ) and rnorm to draw from the N(0, 1) distribution. ? 3. A random variable is said to be from a left-truncated normal distribution if its density is 0 for x ≤ τ f(µ,σ2 ,τ ) (x) = φ(µ,σ2 ) (x)/(1 − Φ(µ,σ2 ) (τ )) for x > τ , where φ(µ,σ2 ) (·) is the density of an N(µ, σ 2 ) random variable, and Φ(µ,σ2 ) (·) the corresponding CDF. (a) One way of sampling from the left-truncated normal distribution is rejection sampling with the N (µ, σ 2 ) distribution as instrumental distribution. Implement this algorithm. (Note that in this example, the probability of accepting a proposed X = x is either 0 or 1). 1 Please do not hand in a printout of the values you generated as part of your answer to this question, or the other questions on this sheet. √ 2 You can use without proof that the optimal M = σ 2 exp(1/(2σ 2 − 2)).
(b) Use the code from part (a) to draw 10 realisations from a left-truncated normal distribution with parameters µ = 0, σ 2 = 1, and τ = 4. What is the proportion of rejected values you observed? (c) Clearly, the method proposed in part (a) is very inefficient. Propose and implement a more efficient instrumental distribution, i.e. one that yields less rejected values. Attempt to obtain a proportion of rejected values of at most 90%. Hint: all the mass of the left-truncated normal distribution is in (τ, +∞). 4. This question is about estimating E(X(1 − X)) for X ∼ Beta(α, β) using importance sampling with the uniform distribution U[0, 1] used as instrumental distribution. (a) Based on a sample of size 10 compute an importance sampling estimate of E(X(1−X)) for α = 2 and β = 3, once using the self-normalised estimate µ ˆ and once using the “standard” estimate µ ˜. (b) Estimate the bias, variance, and the mean-squared error of both methods based on 100,000 replications of your computations in part (a). For computing the bias, use E(X(1 − X)) = 1/5. Based on your results, would you prefer the self-normalised estimate?
Monte Carlo Methods: Lecture 1 : Introduction Nick Whiteley 2009
5.10.2009
Course material originally by Adam Johansen and Ludger Evers 2007
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Timetable - tbc
3 Hours each week: either 3 lectures or 2 lectures + 1 computer practical See the course website http://www.maths.bris.ac.uk/~manpw/teaching/mcm for teaching material to download, etc.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Unit assessment Overall assessment 20% Coursework 80% Standard 1 1/2 hour examination
Assessment of the course work 5 problem sheets in total (2 mandatory questions each + optional ones) 3 on theory: T1 (week 2), T2 (week 4), and T3 (week 6) 2 on computer practicals: P1 (week 3), P2 (week 5) Coursework mark based on the best four problem sheets.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
1.1 & 1.3 Introduction
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
What is Monte Carlo?
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
What are Monte Carlo Methods? One of many definitions A Monte Carlo method consists of “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.” (Halton, 1970)
Sometimes referred to as stochastic simulation.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Examples of applications of Monte Carlo methods (1) Numerical Integration Objective is to estimate an integral Z f (x) dx, X
which is analytically intractable.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Examples of applications of Monte Carlo methods (2a)
Bayesian statistics Data y1 , . . . , yn and model f (yi |θ) where θ is some parameter of interest. n Y Likelihood l(y1 , . . . , yn |θ) = f (yi |θ) i=1
Frequentist estimate of θ is the maximiser of l(y1 , . . . , yn ) (“maximum likelihood estimate”). In the frequentist framework θ is a parameter, not a random variable.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Examples of applications of Monte Carlo methods (2b) Bayesian statistics (continued) In the Bayesian framework θ is a random variable with prior distribution f prior (θ). After observing y1 , . . . , yn the posterior density of f is f post (θ) = f (θ|y1 , . . . , yn ) f prior (θ)l(y1 , . . . , yn |θ) = R prior (ϑ)l(y1 , . . . , yn |ϑ) dϑ Θf ∝ f prior (θ)l(y1 , . . . , yn |θ)
For many complex models the integral in the denominator is hard to compute use of a Monte Carlo approximation
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
What you will learn in this lecture course
Basic concepts: transformation, rejection, and reweighting. A brief reminder of important properties of Markov chains. Markov Chain Monte Carlo (MCMC) methods: Gibbs sampling, Metropolis-Hastings, and Reversible Jump. Sequential Monte Carlo (SMC).
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
History of Monte Carlo methods 1733 Buffon’s needle problem. 1812 Laplace suggests using Buffon’s needle experiment to estimate π. 1946 ENIAC (Electronic Numerical Integrator And Computer) built. 1947 John von Neuman and Stanislaw Ulam propose a computer simulation to solve the problem of neutron diffusion in fissionable material. 1949 Metropolis and Ulam publish their results in the Journal of the American Statistical Association. 1984 Geman & Geman publish their paper on the Gibbs sampler From then onwards: continuously growing interest of statisticians in Monte Carlo methods. Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
1.2 Introductory examples
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Example 1.1: Raindrop experiment for computing π (1)
Consider "uniform rain" on the square $[-1,1] \times [-1,1]$, i.e. the two coordinates $X, Y \overset{\text{i.i.d.}}{\sim} \mathrm{U}[-1,1]$. The probability that a raindrop falls into the dark circle is
$$\mathrm{P}(\text{drop within circle}) = \frac{\text{area of the unit circle}}{\text{area of the square}} = \frac{\iint_{\{x^2+y^2 \le 1\}} 1\, dx\, dy}{\iint_{\{-1 \le x, y \le 1\}} 1\, dx\, dy} = \frac{\pi}{2\cdot 2} = \frac{\pi}{4}.$$
Example 1.1: Raindrop experiment for computing π (2) π . 4 Consider n independent raindrops, then the number of rain drops Z falling in the dark circle is a binomial random variable: If we know π, we can compute P(drop within circle) =
Z ∼ B(n, p),
with p = P(drop within circle).
We can estimate p by pˆ =
Z . n
Thus we can estimate π by π ˆ = 4ˆ p=4·
Lecture 1: Introduction Nick Whiteley 2009
Z . n
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Example 1.1: Raindrop experiment for computing π (3) Result obtained for n = 100 raindrops: 77 points inside the dark circle.
1
Resulting estimate of π is π ˆ=
4 · Zn 4 · 77 = = 3.08, n 100
(rather poor estimate) However: the law or large numbers guarantees that n π ˆn = 4·Z n → π almost surely for n → ∞. Lecture 1: Introduction Nick Whiteley 2009
−1 −1
1
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Example 1.1: Raindrop experiment for computing π (4)
[Figure: the estimate of π plotted against the sample size (0 to 2000); the vertical axis runs from 2.0 to 4.0.]
Example 1.1: Raindrop experiment for computing π (5)
How fast does $\hat{\pi}$ converge to π? The central limit theorem gives the answer. A $(1-2\alpha)$ confidence interval for $p$ (with $\hat{p}_n = Z_n/n$) is
$$\left[\hat{p}_n - z_{1-\alpha}\sqrt{\frac{\hat{p}_n(1-\hat{p}_n)}{n}},\ \hat{p}_n + z_{1-\alpha}\sqrt{\frac{\hat{p}_n(1-\hat{p}_n)}{n}}\right],$$
and a $(1-2\alpha)$ confidence interval for π (with $\hat{\pi}_n = 4\hat{p}_n$) is
$$\left[\hat{\pi}_n - z_{1-\alpha}\sqrt{\frac{\hat{\pi}_n(4-\hat{\pi}_n)}{n}},\ \hat{\pi}_n + z_{1-\alpha}\sqrt{\frac{\hat{\pi}_n(4-\hat{\pi}_n)}{n}}\right].$$
The width of the interval is $O(n^{-1/2})$, thus the speed of convergence is $O_P(n^{-1/2})$.
Example 1.1: Raindrop experiment for computing π (6)
Recall the two core steps used in the example:
1. We have written the quantity of interest (in our case π) as an expectation: $\pi = 4\,\mathrm{P}(\text{drop within circle}) = \mathrm{E}\left(4\cdot\mathbb{I}_{\{\text{drop within circle}\}}\right)$.
2. We have replaced this algebraic representation of the quantity of interest by a sample approximation to it. The law of large numbers guarantees that the sample approximation converges to the algebraic representation; the central limit theorem gives information about the speed of convergence.
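As an illustration, here is a minimal R sketch of the raindrop experiment (the variable names are ours, not from the notes):
n <- 2000                          # number of raindrops
x <- runif(n,-1,1)                 # x-coordinates of the drops
y <- runif(n,-1,1)                 # y-coordinates of the drops
z <- sum(x^2 + y^2 <= 1)           # number of drops falling inside the unit circle
pi.hat <- 4*z/n                    # Monte Carlo estimate of pi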
Generalisation to Monte Carlo Integration (cf. example 1.2)
For $f : [0,1] \to [0,1]$,
$$\int_0^1 f(x)\, dx = \int_0^1 \int_0^{f(x)} 1\, dt\, dx = \iint_{\{(x,t)\,:\,t \le f(x)\}} 1\, dt\, dx = \frac{\iint_{\{(x,t)\,:\,t \le f(x)\}} 1\, dt\, dx}{\iint_{\{0 \le x, t \le 1\}} 1\, dt\, dx}.$$
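A small R sketch of the resulting hit-or-miss estimator (the integrand below is an arbitrary illustrative choice, not one from the notes):
f <- function(x) x^2               # an example f: [0,1] -> [0,1]
n <- 10000                         # number of uniform points in the unit square
x <- runif(n)                      # x-coordinates
t <- runif(n)                      # t-coordinates
mean(t <= f(x))                    # proportion of points under the curve; estimates the integral (1/3 here)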
Comparison of the speed of convergence
The speed of convergence of Monte Carlo integration is $O_P(n^{-1/2})$, whereas the speed of convergence of numerical integration of a one-dimensional function by Riemann sums is $O(n^{-1})$, so Monte Carlo does not compare favourably for one-dimensional problems. However, the order of convergence of Monte Carlo integration is independent of the dimension, while the order of convergence of numerical integration techniques like Riemann sums deteriorates as the dimension increases.
Monte Carlo methods can be a good choice for high-dimensional integrals.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
1.4 Pseudo-random numbers
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
First thoughts
Philosophical paradox: We need to reproduce randomness by a computer algorithm. A computer algorithm is deterministic in nature.
“pseudo-random numbers” Pseudo-random number from U[0, 1] will be our only “source of randomness”. Other distributions can be derived from U[0, 1] pseudo-random numbers using deterministic algorithms.
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Characterisation of a pseudo-random number generator
A pseudo-random number generator (RNG) should produce output for which the U[0, 1] distribution is a suitable model. The pseudo-random numbers X1 , X2 , . . . should thus have the same relevant statistical properties as independent realisations of a U[0, 1] random variable. They should reproduce independence (“lack of predictability”): X1 , . . . , Xn should not contain any discernible information on the next value Xn+1 .This property is often referred to as the lack of predictability. The numbers generated should be spread out evenly across [0, 1].
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
A simple example
Algorithm 1.1: Congruential pseudo-random number generator 1. Choose a, M ∈ N ,c ∈ N0 , and the initial value (“seed”) Z0 ∈ {1, . . . M − 1}. 2. For i = 1, 2, . . . Set Zi = (aZi−1 + c) mod M , and Xi = Zi /M .
Zi ∈ {0, 1, . . . , M − 1}, thus Xi ∈ [0, 1).
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Example 1.4
Consider the choice of a = 81, c = 35, M = 256, and seed Z0 = 4. Z1 = (81 · 4 + 35) mod 256 = 359 mod 256 = 103
Z2 = (81 · 103 + 35) mod 256 = 8378 mod 256 = 186
Z3 = (81 · 186 + 35) mod 256 = 15101 mod 256 = 253 ...
The corresponding Xi are X1 = 103/256 = 0.40234375, X2 = 186/256 = 0.72656250, X3 = 253/256 = 0.98828125.
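A small R sketch of this congruential generator (the function name lcg is ours):
lcg <- function(n, a=81, c=35, M=256, seed=4) {
  z <- numeric(n)                              # states Z_1, ..., Z_n
  z[1] <- (a*seed + c) %% M                    # first state computed from the seed Z_0
  if (n > 1) for (i in 2:n) z[i] <- (a*z[i-1] + c) %% M   # congruential recursion
  z/M                                          # return X_i = Z_i / M
}
lcg(3)                                         # reproduces X_1, X_2, X_3 from the example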
RANDU: A very poor choice of RNG Very popular in the 1970s (e.g. System/360, PDP-11). Linear congruential generator with a = 216 + 3, c = 0, and M = 231 . The numbers generated by RANDU lie on only 15 hyperplanes in the 3-dimensional unit cube! According to a salesperson at the time: “We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.” Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
The flaw of the linear congruential generator: the "crystalline" nature is a problem for every linear congruential generator. The sequence of generated values X1, X2, ..., viewed as points in an n-dimensional cube, lies on a finite, and often very small, number of parallel hyperplanes. Marsaglia (1968): "the points [generated by a congruential generator] are about as randomly spaced in the unit n-cube as the atoms in a perfect crystal at absolute zero." The number of hyperplanes depends on the choice of a, c, and M. For these reasons do not use the linear congruential generator! Use more powerful generators (like e.g. the Mersenne twister, available in GNU R).
Lecture 1: Introduction Nick Whiteley 2009
1.1 & 1.3 Introduction 1.2 Introductory examples 1.4 Pseudo-random numbers
Another cautionary example
Linear congruential generator with $a = 1229$, $c = 1$, and $M = 2^{11}$.
[Figure: left panel, pairs of generated values $(X_{2k-1}, X_{2k})$; right panel, the same pairs transformed by the Box-Muller method, i.e. $\sqrt{-2\log(X_{2k-1})}\cos(2\pi X_{2k})$ plotted against $\sqrt{-2\log(X_{2k-1})}\sin(2\pi X_{2k})$.]
Monte Carlo Methods: Lecture 2 : Transformation and Rejection Nick Whiteley
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Overview of this lecture
What we have seen . . . How to generate uniform U[0, 1] pseudo-random numbers.
This lecture will cover . . . Generating random numbers from any distribution using transformations (CDF inverse, Box-Muller method). rejection sampling.
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
2.1 Transformation Methods
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Transformation methods: Idea
We can generate U ∼ U[0, 1]. Can we find a transformation T such that T (U ) ∼ F for a distribution of interest with CDF F ? One answer to this question: inversion method.
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
The CDF and its generalised inverse Cumulative distribution function (CDF) F (x) = P(X ≤ x)
Generalised inverse of the CDF F − (u) := inf{x : F (x) ≥ u} 1 F (x) u
F − (u) Lecture 2: Transformation and Rejection Nick Whiteley
x 2.1 Transformation methods 2.2 Rejection sampling
CDF inversion method
Theorem 2.1: Inversion method Let U ∼ U[0, 1] and F be a CDF. Then F − (U ) has the CDF F .
So we have a simple algorithm for drawing X ∼ F : 1
Draw U ∼ U[0, 1].
2
Set X = F − (U ).
(requires that F − (·) can be evaluated efficiently)
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Example 2.1: Exponential distribution
The exponential distribution with rate $\lambda > 0$ has the CDF $F_\lambda(x) = 1 - \exp(-\lambda x)$ for $x \ge 0$, so $F_\lambda^-(u) = F_\lambda^{-1}(u) = -\log(1-u)/\lambda$. So we have a simple algorithm for drawing $\mathrm{Expo}(\lambda)$:
1. Draw $U \sim \mathrm{U}[0,1]$.
2. Set $X = -\frac{\log(1-U)}{\lambda}$, or equivalently $X = -\frac{\log(U)}{\lambda}$.
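In R this inversion can be sketched as follows (the rate chosen is just an example; R's rexp provides the same functionality directly):
lambda <- 2                        # an example rate
u <- runif(10000)                  # uniform random numbers
x <- -log(u)/lambda                # exponential samples obtained by inversion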
Example 2.2: Box-Muller method for generating Gaussians Consider a bivariate real-valued random variable (X1 , X2 ) and its polar coordinates (R, θ), i.e. X1 = R · cos(θ),
X2 = R · sin(θ)
(1)
Then the following equivalence holds: i.i.d. X1 , X2 ∼ N(0, 1) ⇐⇒ θ ∼ U[0, 2π] and R2 ∼ Expo(1/2) indep. Suggests following algorithm for generating two Gaussians i.i.d. X1 , X2 ∼ N(0, 1): 1 2
Draw angle θ ∼ U[0, 2π] and squared radius R2 ∼ Expo(1/2). Convert to Cartesian coordinates as in (1) i.i.d.
From U1 , U2 ∼ U[0, 1] we can generate R and θ by p R = −2 log(U1 ), θ = 2πU2 , giving X1 =
p −2 log(U1 )·cos(2πU2 ),
Lecture 2: Transformation and Rejection Nick Whiteley
X2 =
p
−2 log(U1 )·sin(2πU2 ) 2.1 Transformation methods 2.2 Rejection sampling
Example 2.2: Box-Muller method for generating Gaussians
Box-Muller method
1. Draw $U_1, U_2 \overset{\text{i.i.d.}}{\sim} \mathrm{U}[0,1]$.
2. Set $X_1 = \sqrt{-2\log(U_1)}\cos(2\pi U_2)$, $X_2 = \sqrt{-2\log(U_1)}\sin(2\pi U_2)$.
Then $X_1, X_2 \overset{\text{i.i.d.}}{\sim} \mathrm{N}(0,1)$.
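A minimal R sketch of the Box-Muller method:
u1 <- runif(5000)                           # first batch of uniforms
u2 <- runif(5000)                           # second batch of uniforms
x1 <- sqrt(-2*log(u1)) * cos(2*pi*u2)       # first batch of N(0,1) samples
x2 <- sqrt(-2*log(u1)) * sin(2*pi*u2)       # second batch of N(0,1) samples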
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
2.2 Rejection sampling
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Basic idea of rejection sampling
Assume we cannot directly draw from density f . Tentative idea: 1
2
Draw X from another density g (similar to f , easy to sample from). Only keep some of the X depending on how likely they are under f .
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Basic idea of rejection sampling Consider the identity Z f (x) = 0
f (x)
Z 1 du =
10
f (x) can be interpreted as the marginal density of a uniform distribution on the area under the density f (x): {(x, u) : 0 ≤ u ≤ f (x)}. Sample from f by sampling from the area under the density. u
x Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Example 2.3: Sampling from a Beta(3, 5) distribution (1) How can we draw points from the area under the density? 1
2
Draw (X, U ) from the grey rectangle, i.e. X ∼ U(0, 1) and U ∼ U(0, 2.4). Accept X as a sample from f if (X, U ) lies under the density (dark grey area).
u 2.4
x 0
1
Step 2 equivalent to: Accept X if U < f (X), i.e. accept X with probability P(U < f (X)|X = x) = f (X)/2.4. Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Example 2.3: Sampling from a Beta(3, 5) distribution (2)
Resulting algorithm:
1. Draw $X \sim \mathrm{U}(0,1)$.
2. Accept $X$ as a sample from $\mathrm{Beta}(3,5)$ with probability $\frac{f(X)}{2.4}$.
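A minimal R sketch of this rejection sampler (the bound 2.4 on the Beta(3,5) density is the one used on the slide):
n <- 1000                          # desired sample size
y <- numeric(n)                    # vector to hold the accepted samples
accepted <- 0                      # counter of accepted samples
while (accepted < n) {
  x <- runif(1)                    # propose X ~ U(0,1)
  if (runif(1) < dbeta(x,3,5)/2.4) {    # accept with probability f(X)/2.4
    accepted <- accepted + 1
    y[accepted] <- x
  }
}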
Not every density can be bounded by a box. How can we generalise the idea? Bounding f by M times another density g. M · g(x) f (x)
−6 −5 −4 −3 −2 −1 Lecture 2: Transformation and Rejection Nick Whiteley
1
2
3
4
5
6
2.1 Transformation methods 2.2 Rejection sampling
The rejection sampling algorithm (1)
Algorithm 2.1: Rejection sampling
Given two densities $f, g$ with $f(x) < M \cdot g(x)$ for all $x$, we can generate a sample from $f$ as follows:
1. Draw $X \sim g$.
2. Accept $X$ as a sample from $f$ with probability $\frac{f(X)}{M \cdot g(X)}$, otherwise go back to step 1.
Note: $f(x) < M \cdot g(x)$ implies that $f$ cannot have heavier tails than $g$.
2.1 Transformation methods 2.2 Rejection sampling
The rejection sampling algorithm (2) Remark 2.1 If we know f only up to a multiplicative constant, i.e. if we only know π(x), where f (x) = C · π(x), we can carry out rejection sampling using π(X) M · g(X) as probability of rejecting X, provided π(x) < M · g(x) for all x.
Can be useful in Bayesian statistics: f post (θ) = R
f prior (θ)l(y1 , . . . , yn |θ) = C·f prior (θ)l(y1 , . . . , yn |θ) prior (ϑ)l(y , . . . , y |ϑ) dϑ f 1 n Θ
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Example 2.4: Rejection sampling from the N(0, 1) distribution using a Cauchy proposal (1) Recall the following densities: 1 x2 N(0, 1) f (x) = √ exp − 2 2π 1 Cauchy g(x) = π(1 + x2 ) √ For M = 2π · exp(−1/2) we have that f (x) ≤ M g(x). We can use rejection sampling to sample from f using g as proposal. M · g(x) f (x)
−6 −5 −4 −3 −2 −1 Lecture 2: Transformation and Rejection Nick Whiteley
1
2
3
4
5
6
2.1 Transformation methods 2.2 Rejection sampling
Example 2.4: Rejection sampling from the N(0, 1) distribution using a Cauchy proposal (2)
We cannot sample from a Cauchy distribution ($g$) using a Gaussian ($f$) as instrumental distribution. The Cauchy distribution has heavier tails than the Gaussian distribution: there is no $M \in \mathbb{R}$ such that
$$\frac{1}{\pi(1+x^2)} < M \cdot \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right).$$
Lecture 2: Transformation and Rejection Nick Whiteley
2.1 Transformation methods 2.2 Rejection sampling
Monte Carlo Methods: Lecture 3 : Importance Sampling Nick Whiteley
12.10.2009
Lecture 3: Importance Sampling
2.3 Importance sampling
Overview of this lecture What we have seen . . . Rejection sampling.
This lecture will cover . . . Importance sampling. Basic importance sampling Importance sampling using self-normalised weights Finite variance estimates Optimal proposals Example
Lecture 3: Importance Sampling
2.3 Importance sampling
Recall rejection sampling Algorithm 2.1: Rejection sampling Given two densities f, g with f (x) < M · g(x) for all x, we can generate a sample from f by 1. Draw X ∼ g.
2. Accept X as a sample from f with probability f (X) , M · g(X) otherwise go back to step 1. Drawbacks: We need that f (x) < M · g(x)
On average we need to repeat the first step M times before we can accept a value proposed by g. Lecture 3: Importance Sampling
2.3 Importance sampling
2.3 Importance sampling
Lecture 3: Importance Sampling
2.3 Importance sampling
The fundamental identities behind importance sampling (1) Assume that g(x) > 0 for (almost) all x with f (x) > 0. Then for a measurable set A: f (x) P(X ∈ A) = dx = f (x) dx = g(x) g(x) A A | {z } Z
Z
Z g(x)w(x) dx A
=:w(x)
For some integrable test function h, assume that g(x) > 0 for (almost) all x with f (x) · h(x) 6= 0 Ef (h(X)) =
Z
f (x)h(x) dx =
Z g(x)
f (x) h(x) dx g(x) | {z }
=:w(x)
=
Z
g(x)w(x)h(x) dx = Eg (w(X) · h(X)),
Lecture 3: Importance Sampling
2.3 Importance sampling
The fundamental identities behind importance sampling (2) How can we make use of Ef (h(X)) = Eg (w(X) · h(X))?
Consider X1 , . . . , Xn ∼ g and Eg |w(X) · h(X)| < +∞. Then n
a.s.
1X n→∞ w(Xi )h(Xi ) −→ Eg (w(X) · h(X)) n i=1
(law of large numbers), which implies n
a.s.
1X n→∞ w(Xi )h(Xi ) −→ Ef (h(X)). n i=1
Thus we can estimate µ := Ef (h(X)) by 1 2
Sample P X1 , . . . , Xn ∼ g n µ ˜ := n1 i=1 w(Xi )h(Xi )
Lecture 3: Importance Sampling
2.3 Importance sampling
The importance sampling algorithm
Algorithm 2.1a: Importance Sampling
Choose $g$ such that $\mathrm{supp}(g) \supset \mathrm{supp}(f \cdot h)$.
1. For $i = 1, \ldots, n$: i. Generate $X_i \sim g$. ii. Set $w(X_i) = \frac{f(X_i)}{g(X_i)}$.
2. Return $\tilde{\mu} = \frac{\sum_{i=1}^n w(X_i)h(X_i)}{n}$ as an estimate of $\mathrm{E}_f(h(X))$.
Contrary to rejection sampling, importance sampling does not yield realisations from $f$, but a weighted sample $(X_i, W_i)$. The weighted sample can be used for estimating expectations $\mathrm{E}_f(h(X))$ (and thus probabilities, etc.).
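A short R sketch of Algorithm 2.1a; the target f, instrumental g, and test function h below are illustrative choices of ours:
n <- 10000                                 # sample size
x <- rnorm(n, mean=1, sd=2)                # draw X_i from the instrumental g = N(1,4)
w <- dnorm(x) / dnorm(x, mean=1, sd=2)     # weights w(X_i) = f(X_i)/g(X_i), target f = N(0,1)
h <- function(x) x^2                       # test function h
mu.tilde <- mean(w * h(x))                 # importance sampling estimate of E_f(h(X)) (= 1 here)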
Basic properties of the importance sampling estimate We have already seen that µ ˜ is consistent if supp(g) ⊃ supp(f · h) and Eg |w(X) · h(X)| < +∞, as n
a.s.
1X n→∞ µ ˜ := w(Xi )h(Xi ) −→ Ef (h(X)) n i=1
The expected value of the weights is Eg (w(X)) = 1. µ ˜ is unbiased (see theorem below)
Theorem 2.2: Bias and Variance of Importance Sampling Eg (˜ µ) = µ Varg (w(X) · h(X)) Varg (˜ µ) = n Lecture 3: Importance Sampling
2.3 Importance sampling
What if f is known only up to a multiplicative constant?
Assume $f(x) = C\pi(x)$. Then
$$\tilde{\mu} = \frac{\sum_{i=1}^n w(X_i)h(X_i)}{n} = \frac{1}{n}\sum_{i=1}^n \frac{C\pi(X_i)}{g(X_i)}\, h(X_i).$$
Idea: estimate $1/C$ as well. Consider the estimator
$$\hat{\mu} = \frac{\sum_{i=1}^n w(X_i)h(X_i)}{\sum_{i=1}^n w(X_i)}.$$
Now we have that
$$\hat{\mu} = \frac{\sum_{i=1}^n w(X_i)h(X_i)}{\sum_{i=1}^n w(X_i)} = \frac{\sum_{i=1}^n \frac{\pi(X_i)}{g(X_i)}\, h(X_i)}{\sum_{i=1}^n \frac{\pi(X_i)}{g(X_i)}},$$
so $\hat{\mu}$ does not depend on $C$.
2.3 Importance sampling
The importance sampling algorithm (2)
Algorithm 2.1b: Importance Sampling using self-normalised weights Choose g such that supp(g) ⊃ supp(f · h). 1. For i = 1, . . . , n: i. Generate Xi ∼ g. (Xi ) ii. Set w(Xi ) = fg(X . i)
2. Return
Pn w(Xi )h(Xi ) Pn µ ˆ = i=1 i=1 w(Xi )
as an estimate of Ef (h(X)).
Lecture 3: Importance Sampling
2.3 Importance sampling
Basic properties of the self-normalised estimate µ ˆ is consistent as Pn a.s. w(Xi )h(Xi ) n n→∞ Pn µ ˆ = i=1 −→ Ef (h(X)), w(Xi ) n {z } | i=1{z | } =˜ µ−→Ef (h(X))
−→1
(provided supp(g) ⊃ supp(f · h) and Eg |w(X) · h(X)| < +∞) µ ˆ is biased, but asymptotically unbiased (see theorem below)
Theorem 2.2: Bias and Variance (ctd.) µVarg (w(X)) − Covg (w(X), w(X) · h(X)) + O(n−2 ) n Varg (w(X) · h(X)) − 2µCovg (w(X), w(X) · h(X)) Varg (ˆ µ) = n µ2 Varg (w(X)) + + O(n−2 ) n Eg (ˆ µ) = µ +
Lecture 3: Importance Sampling
2.3 Importance sampling
Finite variance estimators
The importance sampling estimate is consistent for a large choice of g (only need that ...). More important in practice are finite variance estimators, i.e.
$$\mathrm{Var}(\tilde{\mu}) = \mathrm{Var}\left(\frac{\sum_{i=1}^n w(X_i)h(X_i)}{n}\right) < +\infty.$$
Sufficient (albeit very restrictive) conditions for finite variance of $\tilde{\mu}$: $f(x) < M \cdot g(x)$ and $\mathrm{Var}_f(h(X)) < \infty$; or $E$ is compact, $f$ is bounded above on $E$, and $g$ is bounded below on $E$.
Note: if $f$ has heavier tails than $g$, then the weights will have infinite variance!
Lecture 3: Importance Sampling
2.3 Importance sampling
Optimal proposals
Theorem 2.3: Optimal proposal The proposal distribution g that minimises the variance of µ ˜ is g ∗ (x) = R
|h(x)|f (x) . |h(t)|f (t) dt
Theorem of little practical use: the optimal proposal involves R |h(t)|f (t) dt, which is the integral we want to estimate! Practical relevance of theorem 2.3: Choose g such that it is close to |h(x)| · f (x)
Lecture 3: Importance Sampling
2.3 Importance sampling
Super-efficiency of importance sampling For the optimal g ∗ we have that h(X1 ) + . . . + h(Xn ) Varf > Varg? (˜ µ), n if h is not almost surely constant.
Superefficiency of importance sampling The variance of the importance sampling estimate can be less than the variance obtained when sampling directly from the target f . Intuition: Importance sampling allows us to choose g such that we focus on areas which contribute most to the integral R h(x)f (x) dx. Even sub-optimal proposals can be super-efficient. Lecture 3: Importance Sampling
2.3 Importance sampling
Example 2.5: Setup
Compute $\mathrm{E}_f|X|$ for $X \sim t_3$ by:
(a) sampling directly from $t_3$;
(b) using a $t_1$ distribution as instrumental distribution;
(c) using a $\mathrm{N}(0, 1)$ distribution as instrumental distribution.
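The three estimators can be sketched in R as follows (a minimal version of the comparison, not the code used for the figures referred to below):
n <- 1500                                  # number of samples
x.direct <- rt(n, df=3)                    # (a) sample directly from t_3
est.direct <- mean(abs(x.direct))
x.t1 <- rt(n, df=1)                        # (b) t_1 as instrumental distribution
w.t1 <- dt(x.t1, df=3) / dt(x.t1, df=1)    # importance weights
est.t1 <- mean(w.t1 * abs(x.t1))
x.norm <- rnorm(n)                         # (c) N(0,1) as instrumental distribution
w.norm <- dt(x.norm, df=3) / dnorm(x.norm) # weights can behave badly: t_3 has heavier tails than N(0,1)
est.norm <- mean(w.norm * abs(x.norm))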
[Figure: Example 2.5, densities — $|x|\cdot f(x)$ (target), $f(x)$ (direct sampling), $g_{t_1}(x)$ (IS $t_1$), and $g_{\mathrm{N}(0,1)}(x)$ (IS $\mathrm{N}(0,1)$), plotted for $x$ from $-4$ to $4$.]
[Figure: Example 2.5, estimates obtained — the IS estimate over 1500 iterations for sampling directly from $t_3$, IS using $t_1$, and IS using $\mathrm{N}(0,1)$ as instrumental distribution.]
[Figure: Example 2.5, weights — the weights $W_i$ plotted against the samples $X_i$ from the instrumental distribution, for the three approaches.]
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Rejection sampling & Importance sampling Objective: approximate an expectation Ef (h(X)) without having to sample directly from f . Key idea: Sample from an instrumental distribution g and correct for it by the rejection of some values, or by reweighting. Yields an independent sample X (1) ,X (2) , . . . Problem: Finding suitable instrumental distributions is hard in high dimensions.
Markov Chain Monte Carlo methods (MCMC) Key idea: Create a dependent sample, i.e. X (t) depends on the previous value X (t−1) . allows for “local” updates. Only yields an approximate sample from the target distribution. More mathematically speaking: yields a Markov chain with the target distribution f as stationary distribution. Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
4.1 Introduction 4.2 Algorithm 4.3 Hammersley-Clifford theorem
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Example 4.1 Poisson change point model (1)
$$Y_i \sim \mathrm{Poi}(\lambda_1) \quad\text{for } i = 1, \ldots, M, \qquad Y_i \sim \mathrm{Poi}(\lambda_2) \quad\text{for } i = M+1, \ldots, n.$$
[Figure: the observations $y_1, \ldots, y_n$ with rate $\lambda_1$ up to observation $M$ and rate $\lambda_2$ thereafter; the vertical axis ($\lambda$) runs from 0 to 7.]
Objective: (Bayesian) inference about the parameters $\lambda_1$, $\lambda_2$, and $M$ given observed data $Y_1, \ldots, Y_n$.
Example 4.1 Poisson change point model (2) Prior distributions: λj ∼ Gamma(αj , βj ) (j = 1, 2), i.e. 1 α −1 α f (λj ) = λ j βj j exp(−βj λj ). Γ(αj ) j (discrete uniform prior on M, i.e. p(M ) ∝ 1). Likelihood: l(y1 , . . . , yn |λ1 , λ2 , M ) ! ! M n Y Y exp(−λ1 )λy1i exp(−λ2 )λy2i = · yi ! yi ! i=1
i=M +1
Joint distribution f (y1 , . . . , yn , λ1 , λ2 , M ) = l(y1 , . . . , yn |λ1 , λ2 , M ) · f (λ1 ) · f (λ2 ) · p(M ) ! ! M n Y Y exp(−λ1 )λy1i exp(−λ2 )λy2i ∝ · yi ! yi ! i=1
i=M +1
1 1 · λα1 1 −1 β1α1 exp(−β1 λ1 ) · λα2 −1 β2α2 exp(−β2 λ2 ) Γ(α1 ) Γ(α2 ) 2 Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Example 4.1 Poisson change point model (3)
Joint posterior distribution:
$$f(\lambda_1, \lambda_2, M \mid y_1, \ldots, y_n) \propto \lambda_1^{\alpha_1 - 1 + \sum_{i=1}^{M} y_i}\exp(-(\beta_1 + M)\lambda_1)\cdot\lambda_2^{\alpha_2 - 1 + \sum_{i=M+1}^{n} y_i}\exp(-(\beta_2 + n - M)\lambda_2).$$
Conditional on $M$ (i.e. if $M$ was known) we have
$$f(\lambda_1 \mid y_1, \ldots, y_n, M) \propto \lambda_1^{\alpha_1 - 1 + \sum_{i=1}^{M} y_i}\exp(-(\beta_1 + M)\lambda_1),$$
i.e.
$$\lambda_1 \mid Y_1, \ldots, Y_n, M \sim \mathrm{Gamma}\left(\alpha_1 + \sum_{i=1}^{M} y_i,\ \beta_1 + M\right), \qquad \lambda_2 \mid Y_1, \ldots, Y_n, M \sim \mathrm{Gamma}\left(\alpha_2 + \sum_{i=M+1}^{n} y_i,\ \beta_2 + n - M\right),$$
$$p(M \mid \ldots) \propto \lambda_1^{\sum_{i=1}^{M} y_i}\cdot\lambda_2^{\sum_{i=M+1}^{n} y_i}\cdot\exp((\lambda_2 - \lambda_1)\cdot M).$$
Example 4.1 Poisson change point model (4)
This suggests the algorithm: iterate repeatedly
1. Draw $\lambda_1$ from the conditional distribution $\lambda_1 \mid Y_1, \ldots, Y_n, M$, i.e. draw $\lambda_1 \sim \mathrm{Gamma}\left(\alpha_1 + \sum_{i=1}^{M} y_i,\ \beta_1 + M\right)$.
2. Draw $\lambda_2$ from the conditional distribution $\lambda_2 \mid Y_1, \ldots, Y_n, M$, i.e. draw $\lambda_2 \sim \mathrm{Gamma}\left(\alpha_2 + \sum_{i=M+1}^{n} y_i,\ \beta_2 + n - M\right)$.
3. Draw $M$ from the conditional distribution $M \mid Y_1, \ldots, Y_n, \lambda_1, \lambda_2$, i.e. draw from $p(M) \propto \lambda_1^{\sum_{i=1}^{M} y_i}\cdot\lambda_2^{\sum_{i=M+1}^{n} y_i}\cdot\exp((\lambda_2 - \lambda_1)\cdot M)$.
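A hedged R sketch of one possible implementation of this Gibbs sampler (the simulated data y, the prior parameters, the number of iterations, and the restriction of M to 1, ..., n-1 are illustrative assumptions of ours, not prescribed by the notes):
y <- c(rpois(60, 2), rpois(40, 5))         # simulated counts with a change point after 60 observations
n <- length(y); alpha <- c(1,1); beta <- c(1,1)
n.iter <- 5000
lambda1 <- lambda2 <- 1; M <- n %/% 2      # starting values
out <- matrix(nrow=n.iter, ncol=3)
for (t in 1:n.iter) {
  lambda1 <- rgamma(1, alpha[1] + sum(y[1:M]), beta[1] + M)          # draw lambda1 | y, M
  lambda2 <- rgamma(1, alpha[2] + sum(y[-(1:M)]), beta[2] + n - M)   # draw lambda2 | y, M
  logp <- sapply(1:(n-1), function(m)                                # unnormalised log p(M = m | ...)
    sum(y[1:m])*log(lambda1) + sum(y[-(1:m)])*log(lambda2) + (lambda2 - lambda1)*m)
  p <- exp(logp - max(logp))                                         # stabilise before normalising
  M <- sample(1:(n-1), 1, prob=p)                                    # draw M from its full conditional
  out[t,] <- c(lambda1, lambda2, M)                                  # store the current state
}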
The systematic scan Gibbs sampler
Algorithm 4.1: (Systematic scan) Gibbs sampler (0)
(0)
Starting with (X1 , . . . , Xp ) iterate for t = 1, 2, . . . (t)
(t−1)
(t)
(t)
(t)
(t)
(t)
(t)
1. Draw X1 ∼ fX1 |X−1 (·|X2
(t−1)
, . . . , Xp
).
...
(t−1)
(t−1)
j. Draw Xj ∼ fXj |X−j (·|X1 , . . . , Xj−1 , Xj+1 , . . . , Xp
).
...
p. Draw Xp ∼ fXp |X−p (·|X1 , . . . , Xp−1 ).
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
The random scan Gibbs sampler
Algorithm 4.2: Random scan Gibbs sampler (0)
(0)
Starting with (X1 , . . . , Xp ) iterate for t = 1, 2, . . . 1. Draw an index j from a distribution on {1, . . . , p} (e.g. uniform) (t)
(t−1)
2. Draw Xj ∼ fXj |X−j (·|X1 and set
(t) Xι
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
:=
(t−1) Xι
(t−1)
(t−1)
(t−1)
, . . . , Xj−1 , Xj+1 , . . . , Xp
),
for all ι 6= j.
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Illustration of the Gibbs sampler (3)
(1)
(1)
(X1 , X2 )
(2)
(X1 , X2 )
(2)
(2)
(2)
(1)
(X1 , X2 ) (X1 , X2 ) (6)
(6)
(6)
(5)
X2
(t)
(X1 , X2 ) (X1(4) , X2(4) ) (X1(5) , X2(4) )
(5)
(0)
(5)
(X1 , X2 )
(X1 , X2 ) (0)
(X1 , X2 ) (1) (0) (X1 , X2 )
(3)
(3)
(4)
(X1 , X2 )
(3)
(X1 , X2 )
(t)
X1 Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Important questions to ask
Only the so-called full-conditional distributions Xi |X−i are used in the Gibbs sampler. Do the full conditionals fully specify the joint distribution?
The sequence (X(0) , X(1) , . . .) is a Markov chain. Is the target distribution f (x1 , . . . , xp ) the invariant distribution of this Markov chain? Will the Markov chain converge to this distribution? If so, what can we use for inference: the whole chain (X(0) , X(1) , . . . , X(T ) ) or only the last value X(T ) ?
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
The Hammersley-Clifford theorem Definition 4.1: Positivity condition A distribution with density f (x1 , . . . , xp ) and marginal densities fXi (xi ) is said to satisfy the positivity condition if f (x1 , . . . , xp ) > 0 for all x1 , . . . , xp with fXi (xi ) > 0.
Theorem 4.1: Hammersley-Clifford Let (X1 , . . . , Xp ) satisfy the positivity condition and have joint density f (x1 , . . . , xp ). Then for all (ξ1 , . . . , ξp ) ∈ supp(f ) p Y fXj |X−j (xj |x1 , . . . , xj−1 , ξj+1 , . . . , ξp ) f (x1 , . . . , xp ) ∝ fXj |X−j (ξj |x1 , . . . , xj−1 , ξj+1 , . . . , ξp ) j=1
Note the theorem does not guarantee the existence of a joint distribution for every set of full conditionals! Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Example 4.2 Consider the following “model” X1 |X2 ∼ Expo(λX2 )
X2 |X1 ∼ Expo(λX1 ),
Trying to apply the Hammersley-Clifford theorem, we obtain f (x1 , x2 ) ∝
fX1 |X2 (x1 |ξ2 ) · fX2 |X1 (x2 |x1 ) fX1 |X2 (ξ1 |ξ2 ) · fX2 |X1 (ξ2 |x1 )
∝ exp(−λx1 x2 ) Z Z
exp(−λx1 x2 ) dx1 dx2 = +∞
joint density cannot be normalised. There is no joint density with the above full conditionals. Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
4.4 Convergence properties
Lectures 5 & 6: The Gibbs Sampler Nick Whiteley
4.1 to 4.3 Introduction & Algorithm 4.4 Convergence properties 4.5 Data augmentation
Invariant distribution
Lemma 4.1: The transition kernel of the Gibbs sampler is
K(x^(t-1), x^(t)) = f_{X_1|X_-1}(x_1^(t) | x_2^(t-1), ..., x_p^(t-1)) · f_{X_2|X_-2}(x_2^(t) | x_1^(t), x_3^(t-1), ..., x_p^(t-1)) · ... · f_{X_p|X_-p}(x_p^(t) | x_1^(t), ..., x_{p-1}^(t)).
Proposition 4.1: The joint distribution f(x_1, ..., x_p) is indeed the invariant distribution of the Markov chain (X^(0), X^(1), ...) generated by the Gibbs sampler.
Irreducibility and recurrence
Proposition 4.2: If the joint distribution f(x_1, ..., x_p) satisfies the positivity condition, the Gibbs sampler yields an irreducible, recurrent Markov chain (less strict conditions exist). If the transition kernel is absolutely continuous with respect to the dominating measure, then recurrence even implies Harris recurrence.
Example 4.3: Reducible Gibbs sampler
Consider Gibbs sampling from the uniform distribution on the union of two disjoint discs,
f(x_1, x_2) = (1/2π) · I_{C_1 ∪ C_2}(x_1, x_2), with
C_1 := {(x_1, x_2) : ‖(x_1, x_2) − (1, 1)‖ ≤ 1},  C_2 := {(x_1, x_2) : ‖(x_1, x_2) + (1, 1)‖ ≤ 1}.
[Figure: the two discs C_1 and C_2 in the (X_1^(t), X_2^(t)) plane.]
The resulting Markov chain is not irreducible: it stays forever in either C_1 or C_2.
Ergodic theorem
Theorem 4.2: If the Markov chain generated by the Gibbs sampler is irreducible and recurrent (which is e.g. the case when the positivity condition holds), then for any integrable function h : E → R
lim_{n→∞} (1/n) Σ_{t=1}^{n} h(X^(t)) = E_f(h(X))
for almost every starting value X^(0). If the chain is Harris recurrent, then the above result holds for every starting value X^(0).
Thus we can approximate expectations E_f(h(X)) by their empirical counterparts using a single Markov chain.
Example 4.4 (1)
Consider (X_1, X_2) ∼ N_2((μ_1, μ_2), (σ_1², σ_12; σ_12, σ_2²)).
Associated marginal distributions: X_1 ∼ N(μ_1, σ_1²), X_2 ∼ N(μ_2, σ_2²).
Associated full conditionals:
X_1|X_2 = x_2 ∼ N(μ_1 + σ_12/σ_2² · (x_2 − μ_2), σ_1² − (σ_12)²/σ_2²)
X_2|X_1 = x_1 ∼ N(μ_2 + σ_12/σ_1² · (x_1 − μ_1), σ_2² − (σ_12)²/σ_1²)
The Gibbs sampler consists of iterating for t = 1, 2, ...
1. Draw X_1^(t) ∼ N(μ_1 + σ_12/σ_2² · (X_2^(t-1) − μ_2), σ_1² − (σ_12)²/σ_2²)
2. Draw X_2^(t) ∼ N(μ_2 + σ_12/σ_1² · (X_1^(t) − μ_1), σ_2² − (σ_12)²/σ_1²).
Example 4.4 (2)
Using the ergodic theorem we can estimate P(X_1 ≥ 0, X_2 ≥ 0) by the proportion of samples (X_1^(t), X_2^(t)) with X_1^(t) ≥ 0 and X_2^(t) ≥ 0:
|{(X_1^(τ), X_2^(τ)) : τ ≤ t, X_1^(τ) ≥ 0, X_2^(τ) ≥ 0}| / t.
[Figure: this running proportion plotted against t = 0, ..., 10000.]
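A short Python sketch of Example 4.4 (the values chosen below for μ_1, μ_2, σ_1², σ_2² and σ_12 are illustrative assumptions), including the ergodic estimate of P(X_1 ≥ 0, X_2 ≥ 0) from part (2):

import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = 0.0, 0.0
s1, s2, s12 = 1.0, 1.0, 0.3          # sigma_1^2, sigma_2^2, sigma_12 (assumed values)

T = 10_000
x1, x2 = 0.0, 0.0                    # arbitrary starting value X^(0)
samples = np.empty((T, 2))
for t in range(T):
    # full conditional of X_1 given X_2
    x1 = rng.normal(mu1 + s12 / s2 * (x2 - mu2), np.sqrt(s1 - s12**2 / s2))
    # full conditional of X_2 given X_1
    x2 = rng.normal(mu2 + s12 / s1 * (x1 - mu1), np.sqrt(s2 - s12**2 / s1))
    samples[t] = (x1, x2)

# ergodic estimate of P(X_1 >= 0, X_2 >= 0)
print(np.mean((samples[:, 0] >= 0) & (samples[:, 1] >= 0)))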
Dependency structure of samples from the Gibbs sampler
X^(t-1) and X^(t) are dependent and typically positively correlated (unless the components (X_1^(t), ..., X_p^(t)) are independent for fixed t). The amount of correlation increases with the dependency (correlation) of the components (X_1^(t), ..., X_p^(t)).
Consequence: a sample of size n from a Gibbs sampler can (and in most cases will) contain less information than an i.i.d. sample of size n, especially when the correlation between X^(t-1) and X^(t) is large. This motivates the concept of the “effective sample size”.
Example 4.5: Bivariate Gaussian, ρ(X_1, X_2) = 0.3
[Figure: MCMC sample in the (X_1^(t), X_2^(t)) plane together with the estimated marginal densities of X_1 and X_2.]
Example 4.5: Bivariate Gaussian, ρ(X_1, X_2) = 0.99
[Figure: MCMC sample in the (X_1^(t), X_2^(t)) plane together with the estimated marginal densities of X_1 and X_2.]
4.5 Data augmentation
Data augmentation
Gibbs sampling is only feasible when we can sample easily from the full conditionals. A technique that can help achieve full conditionals that are easy to sample from is demarginalisation: introduce a set of auxiliary random variables Z_1, ..., Z_r such that f is the marginal density of (X_1, ..., X_p, Z_1, ..., Z_r), i.e.
f(x_1, ..., x_p) = ∫ f(x_1, ..., x_p, z_1, ..., z_r) d(z_1, ..., z_r).
In many cases there is a “natural choice” of the completion (Z_1, ..., Z_r).
Example 4.6: Mixture of Gaussians: Model
Consider the following K-population mixture model for data Y_1, ..., Y_n:
f(y_i) = Σ_{k=1}^{K} π_k φ_{(μ_k, 1/τ)}(y_i)
[Figure: density of a three-population mixture together with the densities of the individual populations.]
Objective: Bayesian inference for the parameters (π_1, ..., π_K, μ_1, ..., μ_K).
Example 4.6: Mixture of Gaussians: Priors
The number of components K is assumed to be known. The variance parameter τ is assumed to be known.
(π_1, ..., π_K) ∼ Dirichlet(α_1, ..., α_K), i.e.
f_{(α_1,...,α_K)}(π_1, ..., π_K) = Γ(Σ_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k) · ∏_{k=1}^{K} π_k^{α_k − 1}
μ_k ∼ N(μ_0, 1/τ_0), i.e. f_{(μ_0,τ_0)}(μ_k) ∝ exp(−τ_0 (μ_k − μ_0)²/2).
Example 4.6: Mixture of Gaussians: Joint distribution
f(μ_1, ..., μ_K, π_1, ..., π_K, y_1, ..., y_n) ∝ (∏_{k=1}^{K} π_k^{α_k−1}) · (∏_{k=1}^{K} exp(−τ_0 (μ_k − μ_0)²/2)) · ∏_{i=1}^{n} (Σ_{k=1}^{K} π_k exp(−τ (y_i − μ_k)²/2))
The full conditionals do not seem to come from “nice” distributions. Use data augmentation: include auxiliary variables Z_1, ..., Z_n which indicate which population the i-th individual is from, i.e.
P(Z_i = k) = π_k and Y_i|Z_i = k ∼ N(μ_k, 1/τ).
The marginal distribution of Y_i is as before, so Z_1, ..., Z_n are indeed a completion.
Example 4.6: Mixture of Gaussians: Joint distribution (ctd.)
The joint distribution of the augmented system is
f(y_1, ..., y_n, z_1, ..., z_n, μ_1, ..., μ_K, π_1, ..., π_K) ∝ (∏_{k=1}^{K} π_k^{α_k−1}) · (∏_{k=1}^{K} exp(−τ_0 (μ_k − μ_0)²/2)) · ∏_{i=1}^{n} π_{z_i} exp(−τ (y_i − μ_{z_i})²/2)
The full conditionals now come from “nice” distributions.
Example 4.6: Mixture of Gaussians: Full conditionals
P(Z_i = k | Y_1, ..., Y_n, μ_1, ..., μ_K, π_1, ..., π_K) = π_k φ_{(μ_k,1/τ)}(y_i) / Σ_{ι=1}^{K} π_ι φ_{(μ_ι,1/τ)}(y_i)
μ_k | Y_1, ..., Y_n, Z_1, ..., Z_n, π_1, ..., π_K ∼ N((τ Σ_{i: Z_i=k} Y_i + τ_0 μ_0) / (|{i : Z_i = k}| τ + τ_0), 1 / (|{i : Z_i = k}| τ + τ_0))
(π_1, ..., π_K) | Y_1, ..., Y_n, Z_1, ..., Z_n, μ_1, ..., μ_K ∼ Dirichlet(α_1 + |{i : Z_i = 1}|, ..., α_K + |{i : Z_i = K}|).
Example 4.6: Mixture of Gaussians: Gibbs sampler
Starting with initial values μ_1^(0), ..., μ_K^(0), π_1^(0), ..., π_K^(0), iterate the following steps for t = 1, 2, ...
1. For i = 1, ..., n: draw Z_i^(t) from the discrete distribution on {1, ..., K} specified by
   p(Z_i^(t) = k) = π_k^(t-1) φ_{(μ_k^(t-1),1/τ)}(y_i) / Σ_{ι=1}^{K} π_ι^(t-1) φ_{(μ_ι^(t-1),1/τ)}(y_i).
2. For k = 1, ..., K: draw
   μ_k^(t) ∼ N((τ Σ_{i: Z_i^(t)=k} Y_i + τ_0 μ_0) / (|{i : Z_i^(t) = k}| τ + τ_0), 1 / (|{i : Z_i^(t) = k}| τ + τ_0)).
3. Draw
   (π_1^(t), ..., π_K^(t)) ∼ Dirichlet(α_1 + |{i : Z_i^(t) = 1}|, ..., α_K + |{i : Z_i^(t) = K}|).
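A Python sketch of this data-augmentation Gibbs sampler; the simulated data, the choice K = 2 and the hyperparameter values below are illustrative assumptions, not part of the example.

import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-1.0, 1.0, 150), rng.normal(2.0, 1.0, 100)])  # assumed data
n, K, tau = len(y), 2, 1.0            # tau (the known precision), K assumed
alpha = np.ones(K)                    # Dirichlet prior parameters
mu0, tau0 = 0.0, 0.1                  # prior on the means

mu = rng.normal(mu0, 1.0, K)          # initial values
pi = np.full(K, 1.0 / K)
for t in range(5000):
    # 1. allocation variables Z_i from their discrete full conditionals
    logw = np.log(pi) - 0.5 * tau * (y[:, None] - mu[None, :])**2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=w[i]) for i in range(n)])
    # 2. means mu_k from their Gaussian full conditionals
    for k in range(K):
        nk, sk = np.sum(z == k), y[z == k].sum()
        prec = nk * tau + tau0
        mu[k] = rng.normal((tau * sk + tau0 * mu0) / prec, 1.0 / np.sqrt(prec))
    # 3. weights from the Dirichlet full conditional
    pi = rng.dirichlet(alpha + np.bincount(z, minlength=K))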
Lecture 7: The Metropolis-Hastings Algorithm
What we have seen last time: the Gibbs sampler
Key idea: generate a Markov chain by updating the components of (X_1, ..., X_p) in turn, drawing from the full conditionals:
X_j^(t) ∼ f_{X_j|X_-j}(·|X_1^(t), ..., X_{j-1}^(t), X_{j+1}^(t-1), ..., X_p^(t-1))
Two drawbacks:
- Requires that it is possible / easy to sample from the full conditionals.
- Can yield a slowly mixing chain if (some of) the components of (X_1, ..., X_p) are highly correlated.
What we will see today: the Metropolis-Hastings algorithm
Key idea: use a rejection mechanism with a “local proposal”: we let the newly proposed X depend on the previous state of the chain X^(t-1). The samples (X^(0), X^(1), ...) form a Markov chain (as with the Gibbs sampler).
5.1 Algorithm
The Metropolis-Hastings algorithm
Algorithm 5.1: Metropolis-Hastings
Starting with X^(0) := (X_1^(0), ..., X_p^(0)) iterate for t = 1, 2, ...
1. Draw X ∼ q(·|X^(t-1)).
2. Compute
   α(X|X^(t-1)) = min{ 1, f(X) · q(X^(t-1)|X) / (f(X^(t-1)) · q(X|X^(t-1))) }.
3. With probability α(X|X^(t-1)) set X^(t) = X, otherwise set X^(t) = X^(t-1).
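A generic Python sketch of Algorithm 5.1 (not from the notes); q_sample and q_density are assumed user-supplied functions for drawing from and evaluating the proposal q(·|·), and f may be unnormalised since only ratios of f enter α. In practice one would usually work with log densities to avoid numerical underflow.

import numpy as np

def metropolis_hastings(f, q_sample, q_density, x0, n_iter, rng=None):
    # f: target density up to a constant; q_sample(x, rng) proposes X ~ q(.|x);
    # q_density(x_new, x_old) evaluates q(x_new | x_old).
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = [x.copy()]
    for _ in range(n_iter):
        prop = np.asarray(q_sample(x, rng), dtype=float)
        ratio = (f(prop) * q_density(x, prop)) / (f(x) * q_density(prop, x))
        if rng.uniform() < min(1.0, ratio):      # accept with probability alpha
            x = prop
        chain.append(x.copy())
    return np.array(chain)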
Illustration of the Metropolis-Hastings method
[Figure: a Metropolis-Hastings trajectory in the (X_1^(t), X_2^(t)) plane; when proposals are rejected the chain repeats its current value, e.g. x^(0) = x^(1) = x^(2) and x^(3) = x^(4) = ... = x^(7).]
Basic properties of the Metropolis-Hastings algorithm
The probability that a newly proposed value is accepted given X^(t-1) = x^(t-1) is
a(x^(t-1)) = ∫ α(x|x^(t-1)) q(x|x^(t-1)) dx.
The probability of remaining in state X^(t-1) is
P(X^(t) = X^(t-1) | X^(t-1) = x^(t-1)) = 1 − a(x^(t-1)).
The probability of acceptance does not depend on the normalisation constant: if f(x) = C · π(x), then
α(X|X^(t-1)) = min{ 1, π(X) · q(X^(t-1)|X) / (π(X^(t-1)) · q(X|X^(t-1))) }.
The Metropolis-Hastings Transition Kernel
Lemma 5.1: The transition kernel of the Metropolis-Hastings algorithm is
K(x^(t-1), x^(t)) = α(x^(t)|x^(t-1)) q(x^(t)|x^(t-1)) + (1 − a(x^(t-1))) δ_{x^(t-1)}(x^(t)),
where δ_{x^(t-1)}(·) denotes the Dirac mass on {x^(t-1)}.
5.2 Convergence properties
Theoretical properties
Proposition 5.1 The Metropolis-Hastings kernel satisfies the detailed balance condition K(x(t−1) , x(t) )f (x(t−1) ) = K(x(t) , x(t−1) )f (x(t) ). Thus f (x) is the invariant distribution of the Markov chain (X(0) , X(1) , . . .). Furthermore the Markov chain is reversible.
Example 5.1: Reducible Metropolis-Hastings
Consider the target distribution f(x) = (I_{[0,1]}(x) + I_{[2,3]}(x))/2 and the proposal distribution q(·|x^(t-1)):
X|X^(t-1) = x^(t-1) ∼ U[x^(t-1) − δ, x^(t-1) + δ].
[Figure: the target density f(·), of height 1/2 on [0,1] ∪ [2,3], and the uniform proposal density of height 1/(2δ) centred at x^(t-1).]
The chain is reducible if δ ≤ 1: it stays either in [0, 1] or in [2, 3].
Further theoretical properties
The Markov chain (X^(0), X^(1), ...) is irreducible if q(x|x^(t-1)) > 0 for all x, x^(t-1) ∈ supp(f): every state can be reached in a single step (less strict conditions can be obtained; see e.g. Roberts & Tweedie, 1996).
The chain is aperiodic if there is positive probability that the chain remains in the current state, i.e. P(X^(t) = X^(t-1)) > 0.
An ergodic theorem
Theorem 5.1: If the Markov chain generated by the Metropolis-Hastings algorithm is irreducible, then for any integrable function h : E → R
lim_{n→∞} (1/n) Σ_{t=1}^{n} h(X^(t)) = E_f(h(X))
for every starting value X^(0).
Interpretation: we can approximate expectations by their empirical counterparts using a single Markov chain.
5.3 Random-walk Metropolis 5.4 Choosing the proposal distribution
Random-walk Metropolis: Idea
In the Metropolis-Hastings algorithm the proposal is X ∼ q(·|X^(t-1)). A popular choice for the proposal is q(x|x^(t-1)) = g(x − x^(t-1)) with g a symmetric distribution, thus
X = X^(t-1) + ε,  ε ∼ g.
The probability of acceptance becomes
min{ 1, f(X) · g(X − X^(t-1)) / (f(X^(t-1)) · g(X^(t-1) − X)) } = min{ 1, f(X)/f(X^(t-1)) }.
We accept ...
- every move to a more probable state with probability 1;
- moves to less probable states with probability f(X)/f(X^(t-1)) < 1.
Random-walk Metropolis: Algorithm
Starting with X^(0) := (X_1^(0), ..., X_p^(0)) and using a symmetric random walk proposal g, iterate for t = 1, 2, ...
1. Draw ε ∼ g and set X = X^(t-1) + ε.
2. Compute α(X|X^(t-1)) = min{ 1, f(X)/f(X^(t-1)) }.
3. With probability α(X|X^(t-1)) set X^(t) = X, otherwise set X^(t) = X^(t-1).
Popular choices for g are (multivariate) Gaussians or t-distributions (the latter having heavier tails).
Example 5.2: Bayesian probit model (1)
Medical study on infections resulting from birth by Caesarean section. Three influence factors:
- indicator of whether the Caesarean was planned or not (z_i1),
- indicator of whether additional risk factors were present at the time of birth (z_i2),
- indicator of whether antibiotics were given as a prophylaxis (z_i3).
Response variable: number of infections Y_i that were observed amongst n_i patients having the same covariates.

  infections y_i / births n_i | planned z_i1 | risk factors z_i2 | antibiotics z_i3
  11 / 98                     | 1            | 1                 | 1
   1 / 18                     | 0            | 1                 | 1
   0 /  2                     | 0            | 0                 | 1
  23 / 26                     | 1            | 1                 | 0
  28 / 58                     | 0            | 1                 | 0
   0 /  9                     | 1            | 0                 | 0
   8 / 40                     | 0            | 0                 | 0
Example 5.2: Bayesian probit model (2)
Model for Y_i:
Y_i ∼ Bin(n_i, π_i),  π_i = Φ(z_i′β),
where z_i = [1, z_i1, z_i2, z_i3]′ and Φ(·) is the CDF of a N(0, 1).
Prior on the parameter of interest β: β ∼ N(0, I/λ).
The posterior density of β is
f(β|y_1, ..., y_n) ∝ (∏_{i=1}^{N} Φ(z_i′β)^{y_i} · (1 − Φ(z_i′β))^{n_i − y_i}) · exp(−(λ/2) Σ_{j=0}^{3} β_j²).
Example 5.2: Bayesian probit model (3)
Use the following random walk Metropolis algorithm (50,000 samples). Starting with any β^(0) iterate for t = 1, 2, ...:
1. Draw ε ∼ N(0, Σ) and set β = β^(t-1) + ε.
2. Compute α(β|β^(t-1)) = min{ 1, f(β|Y_1, ..., Y_n) / f(β^(t-1)|Y_1, ..., Y_n) }.
3. With probability α(β|β^(t-1)) set β^(t) = β, otherwise set β^(t) = β^(t-1).
(For the moment we use Σ = 0.08 · I and λ = 10.)
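A Python sketch of this sampler, using the data as tabulated in part (1); working on the log scale is an implementation choice (mathematically equivalent to the ratio above), and the posterior is evaluated only up to a constant.

import numpy as np
from scipy.stats import norm

y = np.array([11, 1, 0, 23, 28, 0, 8])            # infections, as tabulated above
n = np.array([98, 18, 2, 26, 58, 9, 40])          # births per covariate pattern
Z = np.array([[1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0],
              [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]], dtype=float)
lam = 10.0

def log_post(beta):
    # log posterior up to a constant: binomial probit likelihood + Gaussian prior
    p = np.clip(norm.cdf(Z @ beta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (n - y) * np.log1p(-p)) - 0.5 * lam * np.sum(beta**2)

rng = np.random.default_rng(2)
Sigma = 0.08 * np.eye(4)
beta = np.zeros(4)
chain = np.empty((50_000, 4))
for t in range(50_000):
    prop = beta + rng.multivariate_normal(np.zeros(4), Sigma)
    if np.log(rng.uniform()) < log_post(prop) - log_post(beta):   # random walk Metropolis accept step
        beta = prop
    chain[t] = beta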
Example 5.2: Bayesian probit model (4)
[Figure: trace plots of the components β_j^(t).] Convergence of the β_j^(t) is to a distribution, not a value!
Example 5.2: Bayesian probit model (5)
[Figure: cumulative averages Σ_{τ=1}^{t} β_j^(τ)/t.] Convergence of the cumulative averages is to a value.
Example 5.2: Bayesian probit model (6)
[Figure]
Example 5.2: Bayesian probit model (7)

                        Posterior mean   95% credible interval
  intercept      β_0    -1.0952          (-1.4646, -0.7333)
  planned        β_1     0.6201          ( 0.2029,  1.0413)
  risk factors   β_2     1.2000          ( 0.7783,  1.6296)
  antibiotics    β_3    -1.8993          (-2.3636, -1.4710)
Choosing a good proposal distribution
Ideally: a Markov chain with small correlation ρ(X^(t-1), X^(t)) between subsequent values, and fast exploration of the support of the target f.
Two sources for this correlation:
- the correlation between the current state X^(t-1) and the newly proposed value X ∼ q(·|X^(t-1)) (can be reduced using a proposal with high variance);
- the correlation introduced by retaining a value X^(t) = X^(t-1) because the newly generated value X has been rejected (can be reduced using a proposal with small variance).
There is a trade-off between fast exploration of the space (good mixing behaviour) and obtaining a large probability of acceptance.
For multivariate distributions: the covariance of the proposal should reflect the covariance structure of the target.
Example 5.3: Choice of proposal (1)
Target distribution we want to sample from: N(0, 1) (i.e. f(·) = φ_{(0,1)}(·)).
We want to use a random walk Metropolis algorithm with ε ∼ N(0, σ²). What is the optimal choice of σ²?
We consider four choices: σ² = 0.1², 1, 2.38², 10².
Example 5.3: Choice of proposal (2) & (3)
[Figure: trace plots of the random walk Metropolis chains for σ² = 0.1², σ² = 1, σ² = 2.38² and σ² = 10².]
Example 5.3: Choice of proposal (4)

              Autocorrelation ρ(X^(t-1), X^(t))    Probability of acceptance α(X, X^(t-1))
              Mean     95% CI                      Mean     95% CI
  σ² = 0.1²   0.9901   (0.9891, 0.9910)            0.9694   (0.9677, 0.9710)
  σ² = 1      0.7733   (0.7676, 0.7791)            0.7038   (0.7014, 0.7061)
  σ² = 2.38²  0.6225   (0.6162, 0.6289)            0.4426   (0.4401, 0.4452)
  σ² = 10²    0.8360   (0.8303, 0.8418)            0.1255   (0.1237, 0.1274)

Suggests: the optimal choice is 2.38² > 1.
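A sketch of how such a comparison might be carried out in Python (a single replication only; the confidence intervals in the table above would require repeated runs and are not reproduced here):

import numpy as np

def rwm_normal(sigma, T=10_000, rng=None):
    # random walk Metropolis chain targeting N(0,1) with N(0, sigma^2) increments
    rng = np.random.default_rng() if rng is None else rng
    x, chain, accepted = 0.0, np.empty(T), 0
    for t in range(T):
        prop = x + rng.normal(0.0, sigma)
        if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):   # log f(prop) - log f(x)
            x, accepted = prop, accepted + 1
        chain[t] = x
    return chain, accepted / T

for s2 in [0.1**2, 1.0, 2.38**2, 10.0**2]:
    chain, acc = rwm_normal(np.sqrt(s2))
    rho = np.corrcoef(chain[:-1], chain[1:])[0, 1]           # lag-1 autocorrelation
    print(f"sigma^2 = {s2:7.2f}: acceptance {acc:.3f}, autocorrelation {rho:.3f}")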
Example 5.4: Bayesian probit model (revisited)
So far we used Var(ε) = 0.08 · I.
Better choice: let Var(ε) reflect the covariance structure of the target. Frequentist asymptotic theory gives Var(β̂_m.l.e.) = (Z′DZ)^{-1}, where D is a suitable diagonal matrix. This suggests Var(ε) = 2 · (Z′DZ)^{-1}.
This increases the rate of acceptance from 13.9% to 20.0% and reduces the autocorrelation:

                                                  β_0      β_1      β_2      β_3
  Σ = 0.08 · I:          ρ(β_j^(t-1), β_j^(t))    0.9496   0.9503   0.9562   0.9532
  Σ = 2 · (Z′DZ)^{-1}:   ρ(β_j^(t-1), β_j^(t))    0.8726   0.8765   0.8741   0.8792

(In this example det(0.08 · I) = det(2 · (Z′DZ)^{-1}).)
Lectures 8 & 9: Combining Kernels, Convergence Diagnostics & Simulated Annealing
(Recap from Lecture 7: choosing the proposal distribution, Example 5.3 and Example 5.4.)
5.5 Composing kernels: Mixtures and Cycles
Composing kernels: Idea
An MCMC algorithm (Gibbs sampler, Metropolis-Hastings) is uniquely identified by its transition kernel. So far we have used only one type of update in the Metropolis-Hastings algorithm.
Question: can we combine different MCMC updates? Assume there are r possible MCMC updates characterised by kernels K^(1)(·,·), ..., K^(r)(·,·), and that f is the invariant distribution of each kernel K^(ρ).
Two possibilities of combining the r MCMC updates:
- Cycle: perform the MCMC updates in a deterministic order.
- Mixture: pick an MCMC update at random.
Cycles
Cycle of MCMC updates K^(1), ..., K^(r): Starting with X^(0) iterate for t = 1, 2, ...
1. Set ξ^(t,0) := X^(t-1).
2. For ρ = 1, ..., r: obtain ξ^(t,ρ) from ξ^(t,ρ-1) by performing an MCMC update corresponding to the kernel K^(ρ).
3. Set X^(t) := ξ^(t,r).
This is similar to the (systematic scan) Gibbs sampler. The corresponding transition kernel is
K°(x^(t-1), x^(t)) = ∫ ··· ∫ K^(1)(x^(t-1), ξ^(t,1)) K^(2)(ξ^(t,1), ξ^(t,2)) ··· K^(r)(ξ^(t,r-1), x^(t)) dξ^(t,r-1) ··· dξ^(t,1)
f is the invariant distribution of K° if f is the invariant distribution of all the K^(ρ).
Mixtures
Mixture of MCMC updates K^(1), ..., K^(r): Starting with X^(0) iterate for t = 1, 2, ...
1. Draw ρ from {1, ..., r} with probabilities (w_1, ..., w_r).
2. Obtain X^(t) from X^(t-1) by performing an MCMC update corresponding to the kernel K^(ρ).
This is similar to the random scan Gibbs sampler. The corresponding transition kernel is
K⁺(x^(t-1), x^(t)) = Σ_{ρ=1}^{r} w_ρ K^(ρ)(x^(t-1), x^(t)).
f is the invariant distribution of K⁺ if f is the invariant distribution of all the K^(ρ).
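A small Python sketch of these two ways of composing kernels; each kernel is represented as an assumed function (x, rng) -> x' performing one f-invariant MCMC update.

import numpy as np

def cycle(kernels):
    # apply the kernels in a fixed, deterministic order (systematic scan)
    def K(x, rng):
        for k in kernels:
            x = k(x, rng)
        return x
    return K

def mixture(kernels, weights):
    # pick one kernel at random according to the weights and apply it (random scan)
    def K(x, rng):
        k = kernels[rng.choice(len(kernels), p=weights)]
        return k(x, rng)
    return K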
Example 5.5: One-at-a-time Metropolis-Hastings: Idea
The Metropolis-Hastings algorithm 5.1 updates all components of X^(t) in a single step. Can we update each component X_j^(t) separately? This yields the one-at-a-time Metropolis-Hastings algorithm.
It can be seen as a composition of p transition kernels K^(1), ..., K^(p), where kernel K^(j) is a Metropolis-Hastings update of X_j^(t). Two possibilities of combining the kernels: cycle (“systematic scan”) or mixture (“random scan”).
Example 5.5: One-at-a-time MH (cycle, systematic scan)
Starting with X^(0) = (X_1^(0), ..., X_p^(0)) iterate for t = 1, 2, ... For j = 1, ..., p:
i. Draw X_j ∼ q_j(·|X_1^(t), ..., X_{j-1}^(t), X_j^(t-1), ..., X_p^(t-1)).
ii. Compute
α_j = min{ 1, [f(X_1^(t), ..., X_{j-1}^(t), X_j, X_{j+1}^(t-1), ..., X_p^(t-1)) · q_j(X_j^(t-1) | X_1^(t), ..., X_{j-1}^(t), X_j, X_{j+1}^(t-1), ..., X_p^(t-1))] / [f(X_1^(t), ..., X_{j-1}^(t), X_j^(t-1), X_{j+1}^(t-1), ..., X_p^(t-1)) · q_j(X_j | X_1^(t), ..., X_{j-1}^(t), X_j^(t-1), X_{j+1}^(t-1), ..., X_p^(t-1))] }.
iii. With probability α_j set X_j^(t) = X_j, otherwise set X_j^(t) = X_j^(t-1).
(This corresponds to setting ξ^(t,j) = (X_1^(t), ..., X_j^(t), X_{j+1}^(t-1), ..., X_p^(t-1)).)
Example 5.5: One-at-a-time MH (mixture, random scan)
Starting with X^(0) = (X_1^(0), ..., X_p^(0)) iterate for t = 1, 2, ...
1. Draw an index j from a distribution on {1, ..., p} (e.g. uniform).
2. Draw X_j ∼ q_j(·|X_1^(t-1), ..., X_p^(t-1)).
3. Compute
α_j = min{ 1, [f(X_1^(t-1), ..., X_{j-1}^(t-1), X_j, X_{j+1}^(t-1), ..., X_p^(t-1)) · q_j(X_j^(t-1) | X_1^(t-1), ..., X_{j-1}^(t-1), X_j, X_{j+1}^(t-1), ..., X_p^(t-1))] / [f(X_1^(t-1), ..., X_{j-1}^(t-1), X_j^(t-1), X_{j+1}^(t-1), ..., X_p^(t-1)) · q_j(X_j | X_1^(t-1), ..., X_{j-1}^(t-1), X_j^(t-1), X_{j+1}^(t-1), ..., X_p^(t-1))] }.
4. With probability α_j set X_j^(t) = X_j, otherwise set X_j^(t) = X_j^(t-1).
5. Set X_ι^(t) := X_ι^(t-1) for all ι ≠ j.
The Gibbs sampler as a Metropolis-Hastings algorithm
Remark 5.2: The Gibbs sampler for a p-dimensional distribution is a special case of a one-at-a-time Metropolis-Hastings algorithm: the (systematic scan) Gibbs sampler is a cycle of p kernels, and the random scan Gibbs sampler is a mixture of these kernels. The proposal q_j corresponding to the j-th kernel consists of drawing X_j^(t) ∼ f_{X_j|X_-j}. The corresponding probability of acceptance is uniformly equal to 1.
7 Convergence diagnostics
Practical considerations: Burn-in period
Theory (ergodic theorems) allows for the use of the entire chain (X^(0), X^(1), ...). However, the distribution of X^(t) for small t might still be far from the stationary distribution f. It can therefore be beneficial to discard the first iterations X^(t), t = 1, ..., T_0 (the burn-in period). The optimal T_0 depends on the mixing properties of the chain.
Practical considerations: Thinning (1)
MCMC methods typically yield a positively correlated chain: ρ(X^(t), X^(t+τ)) is large for small τ. Idea: build a subchain by only keeping every m-th value, i.e. consider the Markov chain (Y^(t))_{t=1,...,⌊T/m⌋} with Y^(t) = X^(m·t) instead of (X^(t))_{t=1,...,T} (thinning).
(Y^(t))_t exhibits less autocorrelation than (X^(t))_t, i.e.
ρ(Y^(t), Y^(t+τ)) = ρ(X^(t), X^(t+m·τ)) < ρ(X^(t), X^(t+τ)),
if the correlation ρ(X^(t), X^(t+τ)) decreases monotonically in τ. The price we have to pay: the length of (Y^(t))_{t=1,...,⌊T/m⌋} is only (1/m)-th of the length of (X^(t))_{t=1,...,T}.
Practical considerations: Thinning (2)
If X^(t) ∼ f and the corresponding variances exist,
Var( (1/T) Σ_{t=1}^{T} h(X^(t)) ) ≤ Var( (1/⌊T/m⌋) Σ_{t=1}^{⌊T/m⌋} h(Y^(t)) ),
i.e. thinning cannot be justified when the objective is estimating E_f(h(X)). Thinning can nonetheless be a useful concept
- if the computer has insufficient memory;
- for convergence diagnostics: (Y^(t))_{t=1,...,⌊T/m⌋} is closer to an i.i.d. sample than (X^(t))_{t=1,...,T}.
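A trivial helper, shown only to fix notation for the diagnostics below: discard a burn-in period and keep every m-th value of a stored chain (the variable names are of course arbitrary).

import numpy as np

def burn_and_thin(chain, burn_in, m):
    # drop the first `burn_in` iterations, then keep X^(burn_in), X^(burn_in + m), ...
    return np.asarray(chain)[burn_in::m]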
The need for convergence diagnostics
The theory we have studied guarantees (under certain conditions) the convergence of the Markov chain X^(t) to the desired distribution. This does not imply that a finite sample from such a chain yields a good approximation to the target distribution. The validity of the approximation must be confirmed in practice; convergence diagnostics help answer this question. Convergence diagnostics are not perfect and should be treated with a good amount of scepticism.
Different diagnostic tasks
- Convergence to the target distribution: Does X^(t) yield a sample from the target distribution? Has (X^(t))_t reached a stationary regime? Does (X^(t))_t cover the support of the target distribution?
- Convergence of the averages: Does Σ_{t=1}^{T} h(X^(t))/T provide a good approximation to the expectation E_f(h(X)) under the target distribution?
- Comparison to i.i.d. sampling: How much information is contained in the sample from the Markov chain compared to i.i.d. sampling?
Pathological example 1: potentially slowly mixing
Gibbs sampler from a bivariate Gaussian with correlation ρ(X_1, X_2).
[Figure: sample paths in the (X_1^(t), X_2^(t)) plane for ρ(X_1, X_2) = 0.3 and ρ(X_1, X_2) = 0.99.]
For correlations ρ(X_1, X_2) close to ±1 the chain can be poorly mixing.
Pathological example 2: no central limit theorem
The following MCMC algorithm has the Beta(α, 1) distribution as stationary distribution: Starting with any X (0) iterate for t = 1, 2, . . . 1. With probability 1 − X (t−1) , set X (t) = X (t−1) . 2. Otherwise draw X (t) ∼ Beta(α + 1, 1).
Markov chain converges very slowly (no central limit theorem applies).
Pathological example 3: nearly reducible chain
Metropolis-Hastings sample from a mixture of two well-separated Gaussians, i.e. the target is
f(x) = 0.4 · φ_{(−1,0.2²)}(x) + 0.6 · φ_{(2,0.3²)}(x).
[Figure: the bimodal target density.]
If the variance of the proposal is too small, the chain cannot move from one population to the other.
Basic plots
- Plot the sample paths (X_j^(t))_t: they should be oscillating very fast and show very little structure.
- Plot the cumulative averages (Σ_{τ=1}^{t} X_j^(τ)/t)_t: they should be converging to a value.
- Alternatively plot the CUSUM (X̄_j − Σ_{τ=1}^{t} X_j^(τ)/t)_t with X̄_j = Σ_{τ=1}^{T} X_j^(τ)/T: it should be converging to 0.
Only very obvious problems are visible in these plots, and it is difficult to assess multivariate distributions from univariate projections.
Basic plots for pathological example 1 (ρ(X_1, X_2) = 0.3)
[Figure: sample path and cumulative average of X_1 over 1000 samples.]
Looks OK.
Basic plots for pathological example 1 (ρ(X_1, X_2) = 0.99)
[Figure: sample path and cumulative average of X_1 over 1000 samples.]
Slow mixing speed can be detected.
Basic plots for pathological example 2
[Figure: sample path and cumulative average of X over 1000 samples.]
Slow convergence of the mean can be detected.
Basic plots for pathological example 3
[Figure: sample path and cumulative average of X over 1000 samples.]
We cannot detect that the sample only covers one part of the distribution. (“you’ve only seen where you’ve been”)
Non-parametric tests of convergence
Partition the chain into 3 blocks:
- burn-in (X^(t))_{t=1,...,⌊T/3⌋}
- first block (X^(t))_{t=⌊T/3⌋+1,...,2⌊T/3⌋}
- second block (X^(t))_{t=2⌊T/3⌋+1,...,T}
The distribution of X^(t) in both blocks should be identical. Idea: use a non-parametric test to test whether the two distributions are identical. Problem: such tests are designed for i.i.d. samples, so we resort to a (less correlated) thinned chain Y^(t) = X^(m·t).
Kolmogorov-Smirnov test
Two i.i.d. populations: Z_{1,1}, ..., Z_{1,n} and Z_{2,1}, ..., Z_{2,n}. Estimate the empirical CDF in each population:
F̂_k(z) = (1/n) Σ_{i=1}^{n} I_{(−∞,z]}(Z_{k,i})
The test statistic is the maximum difference between the two empirical CDFs:
K = sup_{x∈R} |F̂_1(x) − F̂_2(x)|
For n → ∞ the CDF of √n · K converges to the CDF
R(k) = 1 − Σ_{i=1}^{+∞} (−1)^{i−1} exp(−2i²k²).
In our case the two populations are the thinned first block (Y^(t))_{t=⌊T/(3m)⌋+1,...,2⌊T/(3m)⌋} and the thinned second block (Y^(t))_{t=2⌊T/(3m)⌋+1,...,⌊T/m⌋}. Even the thinned chain (Y^(t))_t is autocorrelated, so the test is invalid from a formal point of view. The standardised test statistic √⌊T/(3m)⌋ · K can still be used as a heuristic tool.
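A sketch of this heuristic in Python, using scipy's two-sample KS statistic (the p-value is deliberately ignored, since the formal test is invalid here):

import numpy as np
from scipy.stats import ks_2samp

def ks_diagnostic(chain, m=10):
    # thin the chain, drop the first third as burn-in, and compare the two remaining blocks
    y = np.asarray(chain)[::m]
    b = len(y) // 3
    stat, _ = ks_2samp(y[b:2 * b], y[2 * b:3 * b])
    return np.sqrt(b) * stat          # standardised KS statistic, to be monitored over the run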
KS test for pathological example 1
[Figure: standardised KS statistics computed along the run, for ρ(X_1, X_2) = 0.3 and ρ(X_1, X_2) = 0.99.]
Slow mixing speed can be detected for the highly correlated chain.
KS test for pathological example 2
[Figure: standardised KS statistic computed along the run.]
Problems can be detected.
KS test for pathological example 3
[Figure: standardised KS statistic computed along the run.]
We cannot detect that the sample only covers one part of the distribution. (“you’ve only seen where you’ve been”)
Comparing multiple chains
Compare L > 1 chains (X^(1,t))_t, ..., (X^(L,t))_t, initialised using overdispersed starting values X^(1,0), ..., X^(L,0).
Idea: the variance and range of each chain (X^(l,t))_t should equal the range and variance of all chains pooled together.
- Compare the basic plots for the different chains.
- Quantitative measure: compute the distance δ_α^(l) between the α and (1 − α) quantiles of (X_k^(l,t))_t, and the distance δ_α^(·) between the α and (1 − α) quantiles of the pooled data. The ratio Ŝ_α^interval = (Σ_{l=1}^{L} δ_α^(l) / L) / δ_α^(·) should be around 1.
- Alternative: compare the variance within each chain with the pooled variance estimate.
Choosing suitable initial values X^(1,0), ..., X^(L,0) is difficult in high dimensions.
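A sketch of the interval-based ratio Ŝ_α^interval for a single scalar quantity of interest:

import numpy as np

def interval_ratio(chains, alpha=0.05):
    # chains: list of 1-d arrays, one per chain, all for the same component
    within = [np.quantile(c, 1 - alpha) - np.quantile(c, alpha) for c in chains]
    pooled = np.concatenate(chains)
    delta_pooled = np.quantile(pooled, 1 - alpha) - np.quantile(pooled, alpha)
    return np.mean(within) / delta_pooled     # should be close to 1 at stationarity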
Comparing multiple chains: plots for pathological example 3
[Figure: sample paths and cumulative averages of several chains started from overdispersed initial values.]
Ŝ_α^interval = 0.2703 ≪ 1, so we can detect that the sample only covers one part of the distribution (provided the chains are initialised appropriately).
Riemann sums and control variates
Consider the order statistics X^[1] ≤ ... ≤ X^[T]. Provided (X^[t])_{t=1,...,T} covers the support of the target, the Riemann sum
Σ_{t=2}^{T} (X^[t] − X^[t−1]) f(X^[t])
converges to ∫ f(x) dx = 1. Thus if Σ_{t=2}^{T} (X^[t] − X^[t−1]) f(X^[t]) ≪ 1, the Markov chain has failed to explore all of the support of the target.
- Requires that the target density f is available inclusive of normalisation constants.
- Only effective in 1D.
- Riemann sums can be seen as a special case of control variates.
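A one-line sketch of this check; f is assumed to be the normalised one-dimensional target density.

import numpy as np

def riemann_sum_check(chain, f):
    # order the sample and evaluate the Riemann sum; values well below 1
    # indicate that part of the support has not been visited
    x = np.sort(np.asarray(chain))
    return np.sum(np.diff(x) * f(x[1:]))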
Riemann sums for pathological example 3
For the chain stuck in the population with mean 2 we obtain
Σ_{t=2}^{T} (X^[t] − X^[t−1]) f(X^[t]) = 0.598 ≪ 1,
so we can detect that we have not explored the whole distribution.
Effective sample size
MCMC algorithms yield a positively correlated sample (X^(t))_{t=1,...,T}; an MCMC sample of size T thus contains less information than an i.i.d. sample of size T. Question: how much less information?
Approximate (h(X^(t)))_{t=1,...,T} by an AR(1) process, i.e. assume that ρ(h(X^(t)), h(X^(t+τ))) = ρ^|τ|. The variance of the estimator is then
Var( (1/T) Σ_{t=1}^{T} h(X^(t)) ) ≈ (1+ρ)/(1−ρ) · (1/T) · Var(h(X^(t))),
which is the same variance as that of an i.i.d. sample of size T · (1−ρ)/(1+ρ). Thus define T · (1−ρ)/(1+ρ) as the effective sample size.
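Under this AR(1) approximation, ρ can be estimated by the lag-1 autocorrelation of h(X^(t)); a minimal sketch:

import numpy as np

def effective_sample_size(h_chain):
    # AR(1) approximation: ESS = T (1 - rho) / (1 + rho), rho = lag-1 autocorrelation
    h = np.asarray(h_chain)
    rho = np.corrcoef(h[:-1], h[1:])[0, 1]
    return len(h) * (1 - rho) / (1 + rho)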
Effective sample size for pathological example 1
Rapidly mixing chain (ρ(X_1, X_2) = 0.3), 10,000 samples: ρ(X_1^(t-1), X_1^(t)) = 0.078, so the ESS for estimating E_f(X_1) is 8,547.
Slowly mixing chain (ρ(X_1, X_2) = 0.99), 10,000 samples: ρ(X_1^(t-1), X_1^(t)) = 0.979, so the ESS for estimating E_f(X_1) is 105.
[Figure: the two MCMC samples and the corresponding estimated marginal densities of X_1.]
8 Simulated Annealing
Finding the mode of a distribution
Our objective so far: estimate E(h(X)). Our new objective: estimate the (global) mode(s) of a distribution: {ξ : f(ξ) ≥ f(x) ∀x}.
The best we can do so far: choose the X^(t) with maximal density f(X^(t)). This is an inefficient solution: the Markov chain explores the whole support of the target distribution, which is unnecessary here.
Idea: transform the distribution such that it is more concentrated around the mode(s). Consider f_(β)(x) ∝ (f(x))^β for very large values of β. For β → +∞ the distribution f_(β)(·) will be concentrated on the (global) modes.
Example 8.1: Normal distribution (1)
Consider the N(μ, σ²) distribution with density
f_{(μ,σ²)}(x) = 1/√(2πσ²) · exp(−(x − μ)²/(2σ²)) ∝ exp(−(x − μ)²/(2σ²)).
The mode of the N(μ, σ²) distribution is μ. For increasing β the distribution is more and more concentrated around its mode μ, as
(f_{(μ,σ²)}(x))^β ∝ exp(−β(x − μ)²/(2σ²)) = exp(−(x − μ)²/(2σ²/β)) ∝ f_{(μ,σ²/β)}(x).
Increasing β corresponds to reducing the variance.
Example 8.1: Normal distribution (2)
[Figure: φ_{(0,1)}(x) and its powers (φ_{(0,1)}(x))^10 ∝ φ_{(0,1/10)}(x), (φ_{(0,1)}(x))^100 ∝ φ_{(0,1/100)}(x), (φ_{(0,1)}(x))^1000 ∝ φ_{(0,1/1000)}(x).]
Another example
[Figure: a multimodal density f(x) and its powers (f(x))^3, (f(x))^9 and (f(x))^27.]
Sampling from f_(β)(·)
We can sample from f_(β)(·) using a random walk Metropolis algorithm. The probability of acceptance becomes
min{ 1, f_(β)(X)/f_(β)(X^(t-1)) } = min{ 1, (f(X)/f(X^(t-1)))^β }.
For β → +∞ the probability of acceptance converges to 1 if f(X) ≥ f(X^(t-1)), and to 0 if f(X) < f(X^(t-1)).
For large β the chain (X^(t))_t converges to a local maximum of f(·). Whether the chain can escape from local maxima of the density depends on whether it can reach the (global) mode within a single step (unrealistic for complex models).
Example 8.2
Assume we want to find the mode of
p(x) = 0.4 for x = 2, 0.3 for x = 4, 0.1 for x = 1, 3, 5,
using a random walk Metropolis algorithm that can only move one to the left or one to the right.
For β → +∞ the probability of accepting a move from 4 to 3 converges to 0, as p(4) > p(3); thus the chain cannot escape from the local maximum at 4.
Sampling from f(β) (·) is difficult
For large β the distribution f(β) (·) is increasingly concentrated around the (global) modes. For large β sampling from f(β) gets increasingly difficult. (Getting stuck in a local maximum of the density increasingly likely for large β)
Remedy: Start with a small β0 and let βt increase slowly over time. Speed at which βt increases determines whether we can escape local extrema.
Simulated Annealing: Minimising an arbitrary function
More general objective: find the global minima of a function H(x) over x ∈ E. Idea: consider the distribution
f(x) ∝ exp(−H(x)) for x ∈ E, yielding f_(β_t)(x) = (f(x))^{β_t} ∝ exp(−β_t · H(x)) for x ∈ E.
This brings us back to the framework of the previous slides. In this context β_t is often referred to as the inverse temperature.
Simulated Annealing: Algorithm
Algorithm 8.1: Simulated Annealing
Starting with X^(0) := (X_1^(0), ..., X_p^(0)) and β_0 > 0 iterate for t = 1, 2, ...
1. Increase β_{t−1} to β_t (see below for different annealing schedules).
2. Draw X ∼ q(·|X^(t-1)).
3. Compute
   α(X|X^(t-1)) = min{ 1, exp(−β_t (H(X) − H(X^(t-1)))) · q(X^(t-1)|X) / q(X|X^(t-1)) }.
4. With probability α(X|X^(t-1)) set X^(t) = X, otherwise set X^(t) = X^(t-1).
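A one-dimensional Python sketch of Algorithm 8.1, using a symmetric Gaussian random walk proposal (so the q-ratio in step 3 cancels) and geometric tempering; the tuning values beta0, alpha and step are illustrative assumptions.

import numpy as np

def simulated_annealing(H, x0, n_iter, beta0=0.1, alpha=1.005, step=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, best, beta = float(x0), float(x0), beta0
    for t in range(n_iter):
        beta *= alpha                                # step 1: increase the inverse temperature
        prop = x + rng.normal(0.0, step)             # step 2: local proposal
        if np.log(rng.uniform()) < -beta * (H(prop) - H(x)):   # steps 3-4: accept or reject
            x = prop
        if H(x) < H(best):                           # keep track of the best value visited
            best = x
    return best

# e.g. simulated_annealing(lambda x: (x - 1)**2, x0=3.0, n_iter=5000)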
Annealing schedules
As before, X^(t) converges for β_t → ∞ to a local minimum of H(·). Convergence to a global minimum depends on the annealing schedule:
- Logarithmic tempering: β_t = log(1 + t)/β_0. The inverse temperature β_t increases slowly enough that convergence to the global minimum of H(·) can be shown in special cases (e.g. finite support E, ...); these theoretical results are however practically irrelevant.
- Geometric tempering: β_t = α^t · β_0 for some α > 1. A popular choice, with no theoretical convergence results.
In practice: expect simulated annealing to find a “good” local minimum, but don't expect it to find the global minimum!
Example 8.3 (1)
Minimise
H(x) = (x − 1)² − 1 + 3 · s(11.56 · x²)
with
s(x) = |x| mod 2 for 2k ≤ |x| ≤ 2k + 1, k ∈ N_0,
s(x) = 2 − (|x| mod 2) for 2k + 1 ≤ |x| ≤ 2(k + 1), k ∈ N_0.
[Figure: the objective function H(x) on the interval [−1, 3].]
Example 8.3 (2)
[Figure: sample path X^(t) and objective values H(X^(t)) of the simulated annealing algorithm over t = 0, ..., 1000.]