Springer Series on
Signals and Communication Technology
Václav Šmídl
Anthony Quinn
The Variational Bayes Method in Signal Processing With 65 Figures
Dr. Václav Šmídl
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic, Department of Adaptive Systems
PO Box 18, 18208 Praha 8, Czech Republic
E-mail:
[email protected]
Dr. Anthony Quinn
Department of Electronic and Electrical Engineering
University of Dublin, Trinity College
Dublin 2, Ireland
E-mail:
[email protected]
ISBN-10 3-540-28819-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-28819-0 Springer Berlin Heidelberg New York

Library of Congress Control Number: 2005934475

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media. springer.com

© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting and production: SPI Publisher Services
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper
Do mo Thuismitheoirí (To my Parents)
A.Q.
Preface
Gaussian linear modelling cannot address current signal processing demands. In modern contexts, such as Independent Component Analysis (ICA), progress has been made specifically by imposing non-Gaussian and/or non-linear assumptions. Hence, standard Wiener and Kalman theories, long the computational engines for these problems, no longer enjoy their traditional hegemony in the field. In their place, diverse principles have been explored, leading to a consequent diversity in the implied computational algorithms. The traditional on-line and data-intensive preoccupations of signal processing continue to demand that these algorithms be tractable.

Increasingly, full probability modelling (the so-called Bayesian approach)—or partial probability modelling using the likelihood function—is the pathway for design of these algorithms. However, the results are often intractable, and so the area of distributional approximation is of increasing relevance in signal processing. The Expectation-Maximization (EM) algorithm and Laplace approximation, for example, are standard approaches to handling difficult models, but these approximations (certainty equivalence, and Gaussian, respectively) are often too drastic to handle the high-dimensional, multi-modal and/or strongly correlated problems that are encountered. Since the 1990s, stochastic simulation methods have come to dominate Bayesian signal processing. Markov Chain Monte Carlo (MCMC) sampling, and related methods, are appreciated for their ability to simulate possibly high-dimensional distributions to arbitrary levels of accuracy. More recently, the particle filtering approach has addressed on-line stochastic simulation. Nevertheless, the wider acceptability of these methods—and, to some extent, Bayesian signal processing itself—has been undermined by the large computational demands they typically make.

The Variational Bayes (VB) method of distributional approximation originates—as does the MCMC method—in statistical physics, in the area known as Mean Field Theory. Its method of approximation is easy to understand: conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD). The exact—but intractable—multivariate distribution is therefore factorized into a product of tractable marginal distributions, the so-called VB-marginals. This straightforward proposal for approximating a distribution enjoys certain optimality
properties. What is of more pragmatic concern to the signal processing community, however, is that the VB-approximation conveniently addresses the following key tasks:

1. The inference is focused (or, more formally, marginalized) onto selected subsets of parameters of interest in the model: this one-shot (i.e. off-line) use of the VB method can replace numerically intensive marginalization strategies based, for example, on stochastic sampling.

2. Parameter inferences can be arranged to have an invariant functional form when updated in the light of incoming data: this leads to feasible on-line tracking algorithms involving the update of fixed- and finite-dimensional statistics. In the language of the Bayesian, conjugacy can be achieved under the VB-approximation. There is no reliance on propagating certainty equivalents, stochastically-generated particles, etc.

Unusually for a modern Bayesian approach, then, no stochastic sampling is required for the VB method. In its place, the shaping parameters of the VB-marginals are found by iterating a set of implicit equations to convergence. This Iterative Variational Bayes (IVB) algorithm enjoys a decisive advantage over the EM algorithm whose computational flow is similar: by design, the VB method yields distributions in place of the point estimates emerging from the EM algorithm. Hence, in common with all Bayesian approaches, the VB method provides, for example, measures of uncertainty for any point estimates of interest, inferences of model order/rank, etc.

The machine learning community has led the way in exploiting the VB method in model-based inference, notably in inference for graphical models. It is timely, however, to examine the VB method in the context of signal processing where, to date, little work has been reported. In this book, at all times, we are concerned with the way in which the VB method can lead to the design of tractable computational schemes for tasks such as (i) dimensionality reduction, (ii) factor analysis for medical imagery, (iii) on-line filtering of outliers and other non-Gaussian noise processes, (iv) tracking of non-stationary processes, etc. Our aim in presenting these VB algorithms is not just to reveal new flows-of-control for these problems, but—perhaps more significantly—to understand the strengths and weaknesses of the VB-approximation in model-based signal processing. In this way, we hope to dismantle the current psychology of dependence in the Bayesian signal processing community on stochastic sampling methods. Without doubt, the ability to model complex problems to arbitrary levels of accuracy will ensure that stochastic sampling methods—such as MCMC—will remain the golden standard for distributional approximation. Notwithstanding this, our purpose here is to show that the VB method of approximation can yield highly effective Bayesian inference algorithms at low computational cost. In showing this, we hope that Bayesian methods might become accessible to a much broader constituency than has been achieved to date.

Praha, Dublin
October 2005
Václav Šmídl Anthony Quinn
Contents
1 Introduction
   1.1 How to be a Bayesian
   1.2 The Variational Bayes (VB) Method
   1.3 A First Example of the VB Method: Scalar Additive Decomposition
       1.3.1 A First Choice of Prior
       1.3.2 The Prior Choice Revisited
   1.4 The VB Method in its Context
   1.5 VB as a Distributional Approximation
   1.6 Layout of the Work
   1.7 Acknowledgement
2 Bayesian Theory
   2.1 Bayesian Benefits
       2.1.1 Off-line vs. On-line Parametric Inference
   2.2 Bayesian Parametric Inference: the Off-Line Case
       2.2.1 The Subjective Philosophy
       2.2.2 Posterior Inferences and Decisions
       2.2.3 Prior Elicitation
             2.2.3.1 Conjugate priors
   2.3 Bayesian Parametric Inference: the On-line Case
       2.3.1 Time-invariant Parameterization
       2.3.2 Time-variant Parameterization
       2.3.3 Prediction
   2.4 Summary
3 Off-line Distributional Approximations and the Variational Bayes Method
   3.1 Distributional Approximation
   3.2 How to Choose a Distributional Approximation
       3.2.1 Distributional Approximation as an Optimization Problem
       3.2.2 The Bayesian Approach to Distributional Approximation
   3.3 The Variational Bayes (VB) Method of Distributional Approximation
       3.3.1 The VB Theorem
       3.3.2 The VB Method of Approximation as an Operator
       3.3.3 The VB Method
       3.3.4 The VB Method for Scalar Additive Decomposition
   3.4 VB-related Distributional Approximations
       3.4.1 Optimization with Minimum-Risk KL Divergence
       3.4.2 Fixed-form (FF) Approximation
       3.4.3 Restricted VB (RVB) Approximation
             3.4.3.1 Adaptation of the VB method for the RVB Approximation
             3.4.3.2 The Quasi-Bayes (QB) Approximation
       3.4.4 The Expectation-Maximization (EM) Algorithm
   3.5 Other Deterministic Distributional Approximations
       3.5.1 The Certainty Equivalence Approximation
       3.5.2 The Laplace Approximation
       3.5.3 The Maximum Entropy (MaxEnt) Approximation
   3.6 Stochastic Distributional Approximations
       3.6.1 Distributional Estimation
   3.7 Example: Scalar Multiplicative Decomposition
       3.7.1 Classical Modelling
       3.7.2 The Bayesian Formulation
       3.7.3 Full Bayesian Solution
       3.7.4 The Variational Bayes (VB) Approximation
       3.7.5 Comparison with Other Techniques
   3.8 Conclusion

4 Principal Component Analysis and Matrix Decompositions
   4.1 Probabilistic Principal Component Analysis (PPCA)
       4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model
       4.1.2 Marginal Likelihood Inference of A
       4.1.3 Exact Bayesian Analysis
       4.1.4 The Laplace Approximation
   4.2 The Variational Bayes (VB) Method for the PPCA Model
   4.3 Orthogonal Variational PCA (OVPCA)
       4.3.1 The Orthogonal PPCA Model
       4.3.2 The VB Method for the Orthogonal PPCA Model
       4.3.3 Inference of Rank
       4.3.4 Moments of the Model Parameters
   4.4 Simulation Studies
       4.4.1 Convergence to Orthogonal Solutions: VPCA vs. FVPCA
       4.4.2 Local Minima in FVPCA and OVPCA
       4.4.3 Comparison of Methods for Inference of Rank
   4.5 Application: Inference of Rank in a Medical Image Sequence
   4.6 Conclusion
5 Functional Analysis of Medical Image Sequences
   5.1 A Physical Model for Medical Image Sequences
       5.1.1 Classical Inference of the Physiological Model
   5.2 The FAMIS Observation Model
       5.2.1 Bayesian Inference of FAMIS and Related Models
   5.3 The VB Method for the FAMIS Model
   5.4 The VB Method for FAMIS: Alternative Priors
   5.5 Analysis of Clinical Data Using the FAMIS Model
   5.6 Conclusion
6 On-line Inference of Time-Invariant Parameters
   6.1 Recursive Inference
   6.2 Bayesian Recursive Inference
       6.2.1 The Dynamic Exponential Family (DEF)
       6.2.2 Example: The AutoRegressive (AR) Model
       6.2.3 Recursive Inference of non-DEF models
   6.3 The VB Approximation in On-Line Scenarios
       6.3.1 Scenario I: VB-Marginalization for Conjugate Updates
       6.3.2 Scenario II: The VB Method in One-Step Approximation
       6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation
       6.3.4 The VB Method in the On-Line Scenarios
   6.4 Related Distributional Approximations
       6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios
       6.4.2 Global Approximation via the Geometric Approach
       6.4.3 One-step Fixed-Form (FF) Approximation
   6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models
       6.5.1 The VB Method for AR Mixtures
       6.5.2 Related Distributional Approximations for AR Mixtures
             6.5.2.1 The Quasi-Bayes (QB) Approximation
             6.5.2.2 One-step Fixed-Form (FF) Approximation
       6.5.3 Simulation Study: On-line Inference of a Static Mixture
             6.5.3.1 Inference of a Many-Component Mixture
             6.5.3.2 Inference of a Two-Component Mixture
       6.5.4 Data-Intensive Applications of Dynamic Mixtures
             6.5.4.1 Urban Vehicular Traffic Prediction
   6.6 Conclusion
7 On-line Inference of Time-Variant Parameters
   7.1 Exact Bayesian Filtering
   7.2 The VB-Approximation in Bayesian Filtering
       7.2.1 The VB method for Bayesian Filtering
   7.3 Other Approximation Techniques for Bayesian Filtering
       7.3.1 Restricted VB (RVB) Approximation
       7.3.2 Particle Filtering
       7.3.3 Stabilized Forgetting
             7.3.3.1 The Choice of the Forgetting Factor
   7.4 The VB-Approximation in Kalman Filtering
       7.4.1 The VB method
       7.4.2 Loss of Moment Information in the VB Approximation
   7.5 VB-Filtering for the Hidden Markov Model (HMM)
       7.5.1 Exact Bayesian filtering for known T
       7.5.2 The VB Method for the HMM Model with Known T
       7.5.3 The VB Method for the HMM Model with Unknown T
       7.5.4 Other Approximate Inference Techniques
             7.5.4.1 Particle Filtering
             7.5.4.2 Certainty Equivalence Approach
       7.5.5 Simulation Study: Inference of Soft Bits
   7.6 The VB-Approximation for an Unknown Forgetting Factor
       7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters
       7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting
             7.6.2.1 Inference of an AR Process with Switching Parameters
             7.6.2.2 Initialization of Inference for a Stationary AR Process
   7.7 Conclusion

8 The Mixture-based Extension of the AR Model (MEAR)
   8.1 The Extended AR (EAR) Model
       8.1.1 Bayesian Inference of the EAR Model
       8.1.2 Computational Issues
   8.2 The EAR Model with Unknown Transformation: the MEAR Model
   8.3 The VB Method for the MEAR Model
   8.4 Related Distributional Approximations for MEAR
       8.4.1 The Quasi-Bayes (QB) Approximation
       8.4.2 The Viterbi-Like (VL) Approximation
   8.5 Computational Issues
   8.6 The MEAR Model with Time-Variant Parameters
   8.7 Application: Inference of an AR Model Robust to Outliers
       8.7.1 Design of the Filter-bank
       8.7.2 Simulation Study
   8.8 Application: Inference of an AR Model Robust to Burst Noise
       8.8.1 Design of the Filter-Bank
       8.8.2 Simulation Study
       8.8.3 Application in Speech Reconstruction
   8.9 Conclusion
9 Concluding Remarks
   9.1 The VB Method
   9.2 Contributions of the Work
   9.3 Current Issues
   9.4 Future Prospects for the VB Method

Required Probability Distributions
   A.1 Multivariate Normal distribution
   A.2 Matrix Normal distribution
   A.3 Normal-inverse-Wishart (NiWA,Ω) Distribution
   A.4 Truncated Normal Distribution
   A.5 Gamma Distribution
   A.6 Von Mises-Fisher Matrix distribution
       A.6.1 Definition
       A.6.2 First Moment
       A.6.3 Second Moment and Uncertainty Bounds
   A.7 Multinomial Distribution
   A.8 Dirichlet Distribution
   A.9 Truncated Exponential Distribution

References

Index
Notational Conventions
Linear Algebra

R, X, Θ∗: Set of real numbers, set of elements x and set of elements θ, respectively.
x: x ∈ R, a real scalar.
A ∈ Rn×m: Matrix of dimensions n × m, generally denoted by a capital letter.
ai, ai,D: ith column of matrix A, AD, respectively.
ai,j, ai,j,D: (i, j)th element of matrix A, AD, respectively, i = 1 . . . n, j = 1 . . . m.
bi, bi,D: ith element of vector b, bD, respectively.
diag (·): A = diag (a), a ∈ Rq, then ai,j = ai if i = j, and ai,j = 0 if i ≠ j, i, j = 1, . . . , q.
a: Diagonal vector of given matrix A (the context will distinguish this from a scalar, a (see 2nd entry, above)).
diag⁻¹ (·): a = diag⁻¹ (A), A ∈ Rn×m, then a = [a1,1, . . . , aq,q]′, q = min (n, m).
A;r, AD;r: Operator selecting the first r columns of matrix A, AD, respectively.
A;r,r, AD;r,r: Operator selecting the r × r upper-left sub-block of matrix A, AD, respectively.
a;r, aD;r: Operator extracting upper length-r sub-vector of vector a, aD, respectively.
A(r) ∈ Rn×m: Subscript (r) denotes matrix A with restricted rank, rank (A) = r ≤ min (n, m).
A′: Transpose of matrix A.
Ir ∈ Rr×r: Square identity matrix.
1p,q, 0p,q: Matrix of size p × q with all elements equal to one, zero, respectively.
tr (A): Trace of matrix A.
a = vec (A): Operator restructuring the elements of A = [a1, . . . , an] into a vector a = [a1′, . . . , an′]′.
A = vect (a, p): Operator restructuring the elements of vector a ∈ Rpn into matrix A ∈ Rp×n, column-by-column, so that the first column of A is [a1, . . . , ap]′, the second is [ap+1, . . . , a2p]′, and the last is [ap(n−1)+1, . . . , apn]′.
A = UA LA VA′: Singular Value Decomposition (SVD) of matrix A ∈ Rn×m. In this monograph, the SVD is expressed in the 'economic' form, where UA ∈ Rn×q, LA ∈ Rq×q, VA ∈ Rm×q, q = min (n, m).
[A ⊗ B] ∈ Rnp×mq: Kronecker product of matrices A ∈ Rn×m and B ∈ Rp×q, whose (i, j)th block of size p × q is ai,j B.
[A ◦ B] ∈ Rn×m: Hadamard product of matrices A ∈ Rn×m and B ∈ Rn×m, whose (i, j)th element is ai,j bi,j.
Set Algebra

{A}c: Set of objects A with cardinality c.
A(i): ith element of set {A}c, i = 1, . . . , c.

Analysis

χX (·): Indicator (characteristic) function of set X.
erf (x): Error function, erf (x) = (2/√π) ∫_0^x exp(−t²) dt.
ln (A), exp (A): Natural logarithm and exponential of matrix A respectively. Both operations are performed on elements of the matrix (or vector), e.g. ln [a1, a2]′ = [ln a1, ln a2]′.
Γ (x): Gamma function, Γ (x) = ∫_0^∞ t^(x−1) exp(−t) dt, x > 0.
ψΓ (x): Digamma (psi) function, ψΓ (x) = ∂/∂x ln Γ (x).
Γr (p/2): Multi-gamma function, Γr (p/2) = π^(r(r−1)/4) ∏_{j=1}^{r} Γ ((p − j + 1)/2), r ≤ p.
0F1 (a, AA): Hypergeometric function, pFq (·), with p = 0, q = 1, scalar parameter a, and symmetric matrix parameter, AA.
δ (x): δ-type function. The exact meaning is determined by the type of the argument, x. If x is a continuous variable, then δ (x) is the Dirac δ-function: ∫_X δ (x − x0) g (x) dx = g (x0), where x, x0 ∈ X. If x is an integer, then δ (x) is the Kronecker function: δ (x) = 1 if x = 0, and δ (x) = 0 otherwise.
p (i): ith elementary vector of Rp, i = 1, . . . , p: p (i) = [δ (i − 1), δ (i − 2), . . . , δ (i − p)]′.
I(a,b]: Interval (a, b] in R.
Probability Calculus

Pr (·): Probability of given argument.
f (x|θ): Distribution of (discrete or continuous) random variable x, conditioned by known θ.
f˘ (x): Variable distribution to be optimized ('wildcard' in functional optimization).
x[i], f [i] (x): x and f (x) in the i-th iteration of an iterative algorithm.
θ̂: Point estimate of unknown parameter θ.
Ef (x) [·]: Expected value of argument with respect to distribution, f (x).
ĝ (x): Simplified notation for Ef (x) [g (x)].
x̄, x̲: Upper bound, lower bound, respectively, on range of random variable x.
Nx (µ, r): Scalar Normal distribution of x with mean value, µ, and variance, r.
Nx (µ, Σ): Multivariate Normal distribution of x with mean value, µ, and covariance matrix, Σ.
NX (M, Σp ⊗ Σn): Matrix Normal distribution of X with mean value, M, and covariance matrices, Σp and Σn.
tNx (µ, r; X): Truncated scalar Normal of x, of type N (µ, r), confined to support set X ⊂ R.
MX (F): Von-Mises-Fisher matrix distribution of X with matrix parameter, F.
Gx (α, β): Scalar Gamma distribution of x with parameters, α and β.
Ux (X): Scalar Uniform distribution of x on the support set X ⊂ R.
List of Acronyms
AR: AutoRegressive (model, process)
ARD: Automatic Rank Determination (property)
CDEF: Conjugate (parameter) distribution to a DEF (observation) model
DEF: Dynamic Exponential Family
DEFS: Dynamic Exponential Family with Separable parameters
DEFH: Dynamic Exponential Family with Hidden variables
EAR: Extended AutoRegressive (model, process)
FA: Factor Analysis
FAMIS: Functional Analysis for Medical Image Sequences (model)
FVPCA: Fast Variational Principal Component Analysis (algorithm)
HMM: Hidden Markov Model
HPD: Highest Posterior Density (region)
ICA: Independent Component Analysis
IVB: Iterative Variational Bayes (algorithm)
KF: Kalman Filter
KLD: Kullback-Leibler Divergence
LPF: Low-Pass Filter
FF: Fixed Form (approximation)
MAP: Maximum A Posteriori
MCMC: Markov Chain Monte Carlo
MEAR: Mixture-based Extension of the AutoRegressive model
ML: Maximum Likelihood
OVPCA: Orthogonal Variational Principal Component Analysis
PCA: Principal Component Analysis
PE: Prediction Error
PPCA: Probabilistic Principal Component Analysis
QB: Quasi-Bayes
RLS: Recursive Least Squares
RVB: Restricted Variational Bayes
SNR: Signal-to-Noise Ratio
SVD: Singular Value Decomposition
TI: Time-Invariant
TV: Time-Variant
VB: Variational Bayes
VL: Viterbi-Like (algorithm)
VMF: Von-Mises-Fisher (distribution)
VPCA: Variational PCA (algorithm)
1 Introduction
1.1 How to be a Bayesian

In signal processing, as in all quantitative sciences, we are concerned with data, D, and how we can learn about the system or source which generated D. We will often refer to learning as inference. In this book, we will model the data parametrically, so that a set, θ, of unknown parameters describes the data-generating system. In deterministic problems, knowledge of θ determines D under some notional rule, D = g(θ). This accounts for very few of the data contexts in which we must work. In particular, when D is information-bearing, then we must model the uncertainty (sometimes called the randomness) of the process.

The defining characteristic of Bayesian methods is that we use probabilities to quantify our beliefs amid uncertainty, and the calculus of probability to manipulate these quantitative beliefs [1–3]. Hence, our beliefs about the data are completely expressed via the parametric probabilistic observation model, f (D|θ). In this way, knowledge of θ determines our beliefs about D, not D themselves. In practice, the result of an observational experiment is that we are given D, and our problem is to use them to learn about the system—summarized by the unknown parameters, θ—which generated them. This learning amid uncertainty is known as inductive inference [3], and it is solved by constructing the distribution f (θ|D), namely, the distribution which quantifies our a posteriori beliefs about the system, given a specific set of data, D. The simple prescription of Bayes' rule solves the implied inverse problem [4], allowing us to reverse the order of the conditioning in the observation model, f (D|θ):

    f (θ|D) ∝ f (D|θ) f (θ).                                                (1.1)
Bayes’ rule specifies how our prior beliefs, quantified by the prior distribution, f (θ), are updated in the light of D. Hence, a Bayesian treatment requires prior quantification of our beliefs about the unknown parameters, θ, whether or not θ is by nature fixed or randomly realized. The signal processing community, in particular, has been resistant to the philosophy of strong Bayesian inference [3], which assigns
probabilities to fixed, as well as random, unknown quantities. Hence, they relegate Bayesian methods to inference problems involving only random quantities [5, 6]. This book adheres to the strong Bayesian philosophy.

Tractability is a primary concern to any signal processing expert seeking to develop a parametric inference algorithm, both in the off-line case and, particularly, on-line. The Bayesian approach provides f (θ|D) as the complete inference of θ, and this must be manipulated in order to solve problems of interest. For example, we may wish to concentrate the inference onto a subset, θ1, by marginalizing over their complement, θ2:

    f (θ1|D) ∝ ∫_Θ2∗ f (θ|D) dθ2.                                           (1.2)

A decision, such as a point estimate, may be required. The mean a posteriori estimate may then be justified:

    θ̂1 = ∫_Θ1∗ θ1 f (θ1|D) dθ1.                                             (1.3)

Finally, we might wish to select a model from a set of candidates, {M1, . . . , Mc}, via computation of the marginal probability of D with respect to each candidate:

    f (Ml|D) ∝ Pr[Ml] ∫_Θl∗ f (D|θl, Ml) dθl.                               (1.4)

Here, θl ∈ Θl∗ are the parameters of the competing models, and Pr[Ml] is the necessary prior on those models.
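To fix ideas, the operations (1.1)–(1.4) can all be carried out by brute force on a discretized parameter space when θ is low-dimensional. The following sketch (in Python) is purely illustrative: the Gaussian toy model, the data values, the grid limits and the two candidate priors are our own assumptions, chosen only to show that marginalization, posterior expectation and model comparison each reduce to sums over the grid.

    # Illustrative sketch (not from the text): marginalization (1.2), posterior mean (1.3)
    # and model comparison (1.4) by brute-force summation over a parameter grid.
    import numpy as np

    D = np.array([1.2, 0.7, 1.9])                    # observed data (assumed values)
    t1 = np.linspace(-5.0, 5.0, 201)                 # grid for theta_1 (unknown mean)
    t2 = np.linspace(0.1, 5.0, 200)                  # grid for theta_2 (unknown std. dev.)
    T1, T2 = np.meshgrid(t1, t2, indexing='ij')
    dt1, dt2 = t1[1] - t1[0], t2[1] - t2[0]

    def log_joint(width):
        # log f(D|theta) + log f(theta): Gaussian likelihood, N(0, width^2) prior on
        # theta_1, flat prior on theta_2 over its grid range (all assumptions of ours)
        loglik = sum(-0.5*np.log(2*np.pi*T2**2) - 0.5*(x - T1)**2/T2**2 for x in D)
        logprior = (-0.5*np.log(2*np.pi*width**2) - 0.5*T1**2/width**2
                    - np.log(t2[-1] - t2[0]))
        return loglik + logprior

    for label, width in [('M1, tight prior', 1.0), ('M2, vague prior', 10.0)]:
        joint = np.exp(log_joint(width))
        evidence = joint.sum()*dt1*dt2               # f(D|Ml); with equal Pr[Ml], cf. (1.4)
        posterior = joint/evidence                   # normalized f(theta|D, Ml), cf. (1.1)
        marg_t1 = posterior.sum(axis=1)*dt2          # f(theta_1|D, Ml), cf. (1.2)
        mean_t1 = (t1*marg_t1).sum()*dt1             # posterior mean of theta_1, cf. (1.3)
        print('%s: evidence %.3e, E[theta_1|D] = %.2f' % (label, evidence, mean_t1))

The same brute-force strategy quickly becomes infeasible as the dimension of θ grows, which is precisely the motivation for the approximations discussed next.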
1.2 The Variational Bayes (VB) Method

The integrations required in (1.2)–(1.4) will often present computational burdens that compromise the tractability of the signal processing algorithm. In Chapter 3, we will review some of the approximations which can help to address these problems, but the aim of this book is to advocate the use of the Variational Bayes (VB) approximation as an effective pathway to the design of tractable signal processing algorithms for parametric inference. These VB solutions will be shown, in many cases, to be novel and attractive alternatives to currently available Bayesian inference algorithms.

The central idea of the VB method is to approximate f (θ|D), ab initio, in terms of approximate marginals:

    f (θ|D) ≈ f˘(θ|D) = f˘(θ1|D) f˘(θ2|D).                                  (1.5)
In essence, the approximation forces posterior independence between subsets of parameters in a particular partition of θ chosen by the designer. The optimal such approximation is chosen by minimizing a particular measure of divergence from f˘(θ|D) to f (θ|D), namely, a particular Kullback-Leibler Divergence (KLD), which we will call KLDVB in Section 3.2.2:
    f˜(θ|D) = arg min_{f˘(θ1|·) f˘(θ2|·)} KL( f˘(θ1|D) f˘(θ2|D) || f (θ|D) ).    (1.6)
In practical terms, functional optimization of (1.6) yields a known functional form for f˜(θ1 |D) and f˜(θ2 |D), which will be known as the VB-marginals. However, the shaping parameters associated with each of these VB-marginals are expressed via particular moments of the others. Therefore, the approximation is possible if all moments required in the shaping parameters can be evaluated. Mutual interaction of VB-marginals via their moments presents an obstacle to evaluation of its shaping parameters, since a closed-form solution is available only for a limited number of problems. However, a generic iterative algorithm for evaluation of VBmoments and shaping parameters is available for tractable VB-marginals (i.e. marginals whose moments can be evaluated). This algorithm—reminiscent of the classical Expectation-Maximization (EM) algorithm—will be called the Iterative Variational Bayes (IVB) algorithm in this book. Hence, the computational burden of the VB-approximation is confined to iterations of the IVB algorithm. The result is a set of moments and shaping parameters, defining the VB-approximation (1.5).
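The flow-of-control of the IVB algorithm can be summarized generically. The following schematic is an illustrative sketch of ours, not an algorithm taken from the text: iterate_vb, update_shaping and update_moments are placeholder names for the model-specific mappings between VB-moments and shaping parameters, which must be supplied for each model. A concrete instance of this loop, for the model of Section 1.3, is given after (1.14).

    # Schematic IVB loop (our own sketch); update_shaping and update_moments are
    # model-specific mappings supplied by the user.
    def iterate_vb(moments, update_shaping, update_moments, tol=1e-10, max_iter=1000):
        for _ in range(max_iter):
            shaping = update_shaping(moments)     # shaping parameters of each VB-marginal
            new = update_moments(shaping)         # moments implied by those VB-marginals
            if max(abs(new[k] - moments[k]) for k in moments) < tol:
                return shaping, new               # converged VB-approximation
            moments = new
        return shaping, moments                   # last iterate if not converged

    # toy usage with two artificial, mutually-coupled 'moments' (illustration only)
    shaping, moments = iterate_vb(
        {'a': 1.0, 'b': 1.0},
        update_shaping=lambda m: {'sa': 1.0 + m['b'], 'sb': 2.0 + m['a']},
        update_moments=lambda s: {'a': 1.0/s['sa'], 'b': 1.0/s['sb']})
    print(moments)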
1.3 A First Example of the VB Method: Scalar Additive Decomposition

Consider the following additive model:

    d = m + e,                                                              (1.7)
    f (e) = Ne(0, ω⁻¹).                                                     (1.8)
The implied observation model is f (d|m, ω) = Nd(m, ω⁻¹). The task is to infer the two unknown parameters—i.e. the mean, m, and precision, ω—of the Normal distribution, N, given just one scalar data point, d. This constitutes a stressful regime for inference. In order to 'be a Bayesian', we assign a prior distribution to m and ω. Given the poverty of data, we can expect our choice to have some influence on our posterior inference. We will now consider two choices for prior elicitation.

1.3.1 A First Choice of Prior

The following choice seems reasonable:

    f (m) = Nm(0, φ⁻¹),                                                     (1.9)
    f (ω) = Gω(α, β).                                                       (1.10)
In (1.9), the zero mean expresses our lack of knowledge of the polarity of m, and the precision parameter, φ > 0, is used to penalize extremely large values. For φ → 0, (1.9) becomes flatter. The Gamma distribution, G, in (1.10) was chosen to reflect the positivity of ω. Its parameters, α > 0 and β > 0, may again be chosen to yield a
non-informative prior. For α → 0 and β → 0, (1.10) approaches Jeffreys' improper prior on scale parameters, 1/ω [7].

Joint inference of the normal mean and precision, m and ω respectively, is well studied in the literature [8, 9]. From Bayes' rule, the posterior distribution is

    f (m, ω|d, α, β, φ) ∝ Nd(m, ω⁻¹) Nm(0, φ⁻¹) Gω(α, β).                   (1.11)

The basic properties of the Normal (N) and Gamma (G) distributions are summarized in Appendices A.2 and A.5 respectively. Even in this simple case, evaluation of the marginal distribution of the mean, m, i.e. f (m|d, α, β, φ), is not tractable. Hence, we seek the best approximation in the class of conditionally independent posteriors on m and ω, by minimizing KLDVB (1.6), this being the VB-approximation. The solution can be found in the following form:

    f˜(m|d, α, β, φ) = Nm( (ω̂ + φ)⁻¹ ω̂ d, (ω̂ + φ)⁻¹ ),                      (1.12)
    f˜(ω|d, α, β, φ) = Gω( α + 1/2, 1/2 ( Ef˜[m²] − 2d m̂ + d² + 2β ) ).      (1.13)
The shaping parameters of (1.12) and (1.13) are mutually dependent via their moments, as follows:

    ω̂ = Ef˜(ω|d,·)[ω] = (α + 1/2) / ( (Ef˜[m²] − 2d m̂ + d² + 2β)/2 ),
    m̂ = Ef˜(m|d,·)[m] = (ω̂ + φ)⁻¹ ω̂ d,                                       (1.14)
    Ef˜[m²] ≡ Ef˜(m|d,·)[m²] = (ω̂ + φ)⁻¹ + m̂².
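A direct numerical rendering of the IVB iteration for (1.14) is given below; the values of d, α, β and φ are arbitrary choices of ours for illustration. Note that eliminating m̂ and Ef˜[m²] from (1.14) leaves a single cubic equation in ω̂, which is the 3rd-order polynomial referred to in the next paragraph; the iteration below converges to its positive root.

    # IVB iteration for (1.14); d and the prior constants are assumed, illustrative values
    d, alpha, beta, phi = 1.0, 0.1, 0.1, 0.01
    omega_hat = 1.0                                   # initial guess for the VB-moment E[omega]
    for _ in range(200):
        m_hat = omega_hat*d/(omega_hat + phi)         # E[m], second line of (1.14)
        m2_hat = 1.0/(omega_hat + phi) + m_hat**2     # E[m^2], third line of (1.14)
        omega_new = (alpha + 0.5)/(0.5*(m2_hat - 2*d*m_hat + d**2 + 2*beta))  # first line
        if abs(omega_new - omega_hat) < 1e-12:
            break
        omega_hat = omega_new
    print('omega_hat %.4f, m_hat %.4f, E[m^2] %.4f' % (omega_hat, m_hat, m2_hat))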
The VB-moments (1.14) fully determine the VB-marginals, (1.12) and (1.13). It can be shown that this set of VB-equations (1.14) has three possible solutions (being roots of a 3rd-order polynomial), only one of which satisfies ω̂ > 0. Hence, the optimized KLDVB has three 'critical' points for this model. The exact distribution and its VB-approximation are compared in Fig. 1.1.

1.3.2 The Prior Choice Revisited

For comparison, we now consider a different choice of the priors:

    f (m|ω) = Nm(0, (γω)⁻¹),                                                (1.15)
    f (ω) = Gω(α, β).                                                       (1.16)
Fig. 1.1. The VB-approximation, (1.12) and (1.13), for the scalar additive decomposition (dash-dotted contour). Full contour lines denote the exact posterior distribution (1.11).

Here, (1.16) is the same as (1.10), but (1.15) has been parameterized differently from (1.9). It still expresses our lack of knowledge of the polarity of m, and it still penalizes extreme values of m if γ → 0. Hence, both prior structures, (1.9) and (1.15), can
express non-informative prior knowledge. However, the precision parameter, γω, of m is now chosen proportional to the precision parameter, ω, of the noise (1.8). From Bayes’ rule, the posterior distribution is now
    f (m, ω|d, α, β, γ) ∝ Nd(m, ω⁻¹) Nm(0, (γω)⁻¹) Gω(α, β),                (1.17)
    f (m, ω|d, α, β, γ) = Nm( (γ + 1)⁻¹ d, ((γ + 1) ω)⁻¹ ) × Gω( α + 1/2, β + γd²/(2(1 + γ)) ).   (1.18)

Note that the posterior distribution, in this case, has the same functional form as the prior, (1.15) and (1.16), namely a product of Normal and Gamma distributions. This is known as conjugacy. The (exact) marginal distributions of (1.17) are now readily available:

    f (m|d, α, β, γ) = Stm( d/(γ + 1), [2α (d²γ + 2β(1 + γ))]⁻¹, 2α ),
    f (ω|d, α, β, γ) = Gω( α + 1/2, β + γd²/(2(1 + γ)) ),

where Stm denotes Student's t-distribution with 2α degrees of freedom.
In this case, the VB-marginals have the following forms:

    f˜(m|d, α, β, γ) = Nm( (1 + γ)⁻¹ d, ((1 + γ) ω̂)⁻¹ ),                     (1.19)
    f˜(ω|d, α, β, γ) = Gω( α + 1, β + 1/2 ( (1 + γ) Ef˜[m²] − 2d m̂ + d² ) ).  (1.20)

The shaping parameters of (1.19) and (1.20) are therefore mutually dependent via the following VB-moments:

    ω̂ = Ef˜(ω|d,·)[ω] = (α + 1) / ( β + 1/2 ( (1 + γ) Ef˜[m²] − 2d m̂ + d² ) ),
    m̂ = Ef˜(m|d,·)[m] = (1 + γ)⁻¹ d,                                          (1.21)
    Ef˜[m²] = Ef˜(m|d,·)[m²] = ((1 + γ) ω̂)⁻¹ + m̂².
In this case, (1.21) has a simple, unique, closed-form solution, as follows:

    ω̂ = (1 + 2α)(1 + γ) / (d²γ + 2β(1 + γ)),
    m̂ = d / (1 + γ),                                                         (1.22)
    Ef˜[m²] = ( d²(1 + γ + 2α) + 2β(1 + γ) ) / ( (1 + γ)²(1 + 2α) ).
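The closed form (1.22) is easily verified against the implicit equations (1.21). The short check below uses arbitrary illustrative values for d, α, β and γ (our own choices) and confirms that (1.22) is a fixed point of (1.21), so no IVB iteration is needed for this prior.

    # Closed-form VB solution (1.22) with a fixed-point check against (1.21)
    d, alpha, beta, gamma = 1.0, 0.1, 0.1, 0.01           # illustrative, assumed values
    omega_hat = (1 + 2*alpha)*(1 + gamma)/(d**2*gamma + 2*beta*(1 + gamma))
    m_hat = d/(1 + gamma)
    m2_hat = (d**2*(1 + gamma + 2*alpha) + 2*beta*(1 + gamma))/((1 + gamma)**2*(1 + 2*alpha))

    # substituting (1.22) back into (1.21) reproduces it (up to rounding error):
    assert abs(m2_hat - (1.0/((1 + gamma)*omega_hat) + m_hat**2)) < 1e-12
    assert abs(omega_hat
               - (alpha + 1)/(beta + 0.5*((1 + gamma)*m2_hat - 2*d*m_hat + d**2))) < 1e-9
    print(omega_hat, m_hat, m2_hat)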
The exact and VB-approximated posterior distributions are compared in Fig. 1.2.

Remark 1.1 (Choice of priors for the VB-approximation). Even in the stressful regime of this example (one datum, two unknowns), each set of priors had a similar influence on the posterior distribution. In more realistic contexts, the distinctions will be even less, as the influence of the data—via f (D|θ) in (1.1)—begins to dominate the prior, f (θ). However, from an analytical point-of-view, the effects of the prior choice can be very different, as we have seen in this example. Recall that the moments of the exact posterior distribution were tractable in the case of the second prior (1.17), but were not tractable in the first case (1.11). This distinction carried through to the respective VB-approximations. Once again, the second set of priors implied a far simpler solution (1.22) than the first (1.14). Therefore, in this book, we will take care to design priors which can facilitate the task of VB-approximation. We will always be in a position to ensure that our choice is non-informative.
Fig. 1.2. The VB-approximation, (1.19) and (1.20), for the scalar additive decomposition (dash-dotted contour), using alternative priors, (1.15) and (1.16). Full contour lines denote the exact posterior distribution (1.17).

1.4 The VB Method in its Context

Statistical physics has long been concerned with high-dimensional probability functions and their simplification [10]. Typically, the physicist is considering a system of
many interacting particles and wishes to infer the state, θ, of this system. Boltzmann’s law [11] relates the energy of the state to its probability, f (θ). If we wish to infer a sub-state, θi , we must evaluate the associated marginal, f (θi ). Progress can be made by replacing the exact probability model, f (θ), with an approximation, f˜ (θ). Typically, this requires us to neglect interactions in the physical system, by setting many such interactions to zero. The optimal such approximate distribution, f˜ (θ), can be chosen using the variational method [12], which seeks a free-form solution within the approximating class that minimizes some measure of disparity between f (θ) and f˜ (θ). Strong physical justification can be advanced for minimization of a KullbackLeibler divergence (1.6), which is interpretable as a relative entropy. The Variational Bayes (VB) approximation is one example of such an approximation, where independence between all θi is enforced (1.5). In this case, the approximating marginals depend on expectations of the remaining states. Mean Field Theory (MFT) [10] generalizes this approach, exploring many such choices for the approximating function, f˜ (θ), and its disparity with respect to f (θ). Once the variational approximation has been obtained, the exact system is studied by means of this approximation [13]. The machine learning community has adopted Mean Field Theory [12] as a way to cope with problems of learning and belief propagation in complex systems such as neural networks [14–16]. Ensemble learning [17] is an example of the use of the VB-approximation in this area. Communication between the machine learning
and physics communities has been enhanced by the language of graphical models [18–20]. The Expectation-Maximization (EM) algorithm [21] is another important point of tangency, and was re-derived in [22] using KLDVB minimization. The EM algorithm has long been known in the signal processing community as a means of finding the Maximum Likelihood (ML) solution in high-dimensional problems— such as image segmentation—involving hidden variables. Replacement of the EM equations with Variational EM (i.e. IVB) [23] equations allows distributional approximations to be used in place of point estimates. In signal processing, the VB method has proved to be of importance in addressing problems of model structure inference, such as the inference of rank in Principal Component Analysis (PCA) [24] and Factor Analysis [20, 25]), and in the inference of the number of components in a mixture [26]. It has been used for identification of non-Gaussian AutoRegressive (AR) models [27, 28], for unsupervised blind source separation [29], and for pattern recognition of hand-written characters [15].
1.5 VB as a Distributional Approximation

The VB method of approximation is one of many techniques for approximation of probability functions. In the VB method, the approximating family is taken as the set of all possible distributions expressed as the product of required marginals, with the optimal such choice made by minimization of a KLD. The following are among the many other approximations—deterministic and stochastic—that have been used in signal processing:

Point-based approximations: examples include the Maximum a Posteriori (MAP) and ML estimates. These are typically used as certainty equivalents [30] in decision problems, leading to highly tractable procedures. Their inability to take account of uncertainty is their principal drawback.

Local approximations: the Laplace approximation [31], for example, performs a Taylor expansion at a point, typically the ML estimate. This method is known to the signal processing community in the context of criteria for model order selection, such as the Schwartz criterion and Bayes' Information Criterion (BIC), both of which were derived using the Laplace method [31]. Their principal disadvantage is their inability to cope with multimodal probability functions.

Spline approximations: tractable approximations of the probability function may be proposed on a sufficiently refined partition of the support. The computational load associated with integrations typically increases exponentially with the number of dimensions.

MaxEnt and moment matching: the approximating distribution may be chosen to match a selected set of the moments of the true distribution [32]. Under the MaxEnt principle [33], the optimal such moment-matching distribution is the one possessing maximum entropy subject to these moment constraints.

Empirical approximations: a random sample is generated from the probability function, and the distributional approximation is simply a set of point masses placed
at these independent, identically-distributed (i.i.d.) sampling points. The key technical challenge is efficient generation of i.i.d. samples from the true distribution. In recent years, stochastic sampling techniques [34]—particularly the class known as Markov Chain Monte Carlo (MCMC) methods [35]—have overtaken deterministic methods as the golden standard for distributional approximation. They can yield approximations to an arbitrary level of accuracy, but typically incur major computational overheads. It can be instructive to examine the performance of any deterministic method—such as the VB method—in terms of the accuracy-vs-complexity trade-off achieved by these stochastic sampling techniques.
The VB method has the potential to offer an excellent trade-off between computational complexity and accuracy of the distributional approximation. This is suggested in Fig. 1.3. The main computational burden associated with the VB method is the need to solve iteratively—via the IVB algorithm—a set of simultaneous equations in order to reveal the required moments of the VB-marginals. If computational cost is of concern, VB-marginals may be replaced by simpler approximations, or the evaluation of moments can be approximated, without, hopefully, diminishing the overall quality of approximation significantly. This pathway of approximation is suggested by the dotted arrow in Fig. 1.3, and will be traversed in some of the signal processing applications presented in this book. Should the need exist to increase accuracy, the VB method is sited in the flexible context of Mean Field Theory, which offers more sophisticated techniques that might be explored.
Fig. 1.3. The accuracy-vs-complexity trade-off in the VB method. (Axes: computational cost vs. quality of approximation; labelled methods include the certainty equivalent, the EM algorithm, Variational Bayes (IVB), mean field theory, sampling methods and deterministic methods.)
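For the conjugate prior of Section 1.3.2, the exact posterior (1.18) can be sampled directly, which provides a simple empirical benchmark of the kind discussed above. In the short experiment below (our own illustration; the values of d, α, β and γ are assumed), the VB solution (1.22) reproduces the sampled posterior means of m and ω up to Monte Carlo error, while the variance of the VB-marginal (1.19) understates the sampled posterior variance of m, a compaction of the approximation relative to the exact posterior that the reader can verify by running the sketch.

    # Sampling the exact conjugate posterior (1.18) as an empirical benchmark for (1.22).
    # All numerical values are assumed, for illustration only.
    import numpy as np
    rng = np.random.default_rng(0)

    d, alpha, beta, gamma = 1.0, 2.0, 1.0, 0.1
    a_n = alpha + 0.5                                     # posterior Gamma shape, cf. (1.18)
    b_n = beta + gamma*d**2/(2*(1 + gamma))               # posterior Gamma rate,  cf. (1.18)
    omega = rng.gamma(shape=a_n, scale=1.0/b_n, size=200_000)          # omega | d
    m = rng.normal(d/(1 + gamma), 1.0/np.sqrt((1 + gamma)*omega))      # m | omega, d

    omega_vb = (1 + 2*alpha)*(1 + gamma)/(d**2*gamma + 2*beta*(1 + gamma))   # (1.22)
    m_vb = d/(1 + gamma)                                                     # (1.22)
    var_vb = 1.0/((1 + gamma)*omega_vb)                   # variance of the VB-marginal (1.19)

    print('E[omega]: sampled %.3f, VB %.3f' % (omega.mean(), omega_vb))
    print('E[m]:     sampled %.3f, VB %.3f' % (m.mean(), m_vb))
    print('var[m]:   sampled %.3f, VB %.3f' % (m.var(), var_vb))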
1.6 Layout of the Work

We now briefly summarize the main content of the Chapters of this book.

Chapter 2. This provides an introduction to Bayesian theory relevant for distributional approximation. We review the philosophical framework, and we introduce basic probability calculus which will be used in the remainder of the book. The important distinction between off-line and on-line inference is outlined.

Chapter 3. Here, we are concerned with the problem of distributional approximation. The VB-approximation is defined, and from it we synthesize an ergonomic procedure for deducing these VB-approximations. This is known as the VB method. Related distributional approximations are briefly reviewed and compared to the VB method. A simple inference problem—scalar multiplicative decomposition—is considered.

Chapter 4. The VB method is applied to the problem of matrix multiplicative decompositions. The VB-approximation for these models reveals interesting properties of the method, such as initialization of the Iterative VB algorithm (IVB) and the existence of local minima. These models are closely related to Principal Component Analysis (PCA), and we show that the VB inference provides solutions to problems not successfully addressed by PCA, such as the inference of rank.

Chapter 5. We use our experience from Chapter 4 to derive the VB-approximation for the inference of physiological factors in medical image sequences. The physical nature of the problem imposes additional restrictions which are successfully handled by the VB method.

Chapter 6. The VB method is explored in the context of recursive inference of signal processes. In this Chapter, we confine ourselves to time-invariant parameter models. We isolate three fundamental scenarios, each of which constitutes a recursive inference task where the VB-approximation is tractable and adds value. We apply the VB method to the recursive identification of mixtures of AR models. The practical application of this work in prediction of urban traffic flow is outlined.

Chapter 7. The time-invariant parameter assumption from Chapter 6 is relaxed. Hence, we are concerned here with Bayesian filtering. The use of the VB method in this context reveals interesting computational properties in the resulting algorithm, while also pointing to some of the difficulties which can be encountered.

Chapter 8. We address a practical signal processing task, namely, the reconstruction of AR processes corrupted by unknown transformation and noise distortions. The use of the VB method in this ambitious context requires synthesis of experience gained in Chapters 6 and 7. The resulting VB inference is shown to be successful in optimal data pre-processing tasks such as outlier removal and suppression of burst noise. An application in speech denoising is presented.

Chapter 9. We summarize the main findings of the work, and point to some interesting future prospects.
1.7 Acknowledgement

The first author acknowledges the support of Grants AV ČR 1ET 100 750 401 and MŠMT 1M6798555601.
2 Bayesian Theory
In this Chapter, we review the key identities of probability calculus relevant to Bayesian inference. We then examine three fundamental contexts in parametric modelling, namely (i) off-line inference, (ii) on-line inference of time-invariant parameters, and (iii) on-line inference of time-variant parameters. In each case, we use the Bayesian framework to derive the formal solution. Each context will be examined in detail in later Chapters.
2.1 Bayesian Benefits

A Bayesian is someone who uses only probabilities to quantify degrees of belief in an uncertain hypothesis, and uses only the rules of probability as the calculus for operating on these degrees of belief [7, 8, 36, 37]. At the very least, this approach to inductive inference is consistent, since the calculus of probability is consistent, i.e. any valid use of the rules of probability will lead to a unique conclusion. This is not true of classical approaches to inference, where degrees of belief are quantified using one of a vast range of criteria, such as relative frequency of occurrence, distance in a normed space, etc. If the Bayesian's probability model is chosen to reflect such criteria, then we might expect close correspondence between Bayesian and classical methods. However, a vital distinction remains. Since probability is a measure function on the space of possibilities, the marginalization operator (i.e. integration) is a powerful inferential tool uniquely at the service of the Bayesian. Careful comparison of Bayesian and classical solutions will reveal that the real added value of Bayesian methods derives from being able to integrate, thereby concentrating the inference onto a selected subset of quantities of interest. In this way, Bayesian methods naturally embrace the following key problems, all problematical for the non-Bayesian:

1. projection into a desired subset of the hypothesis space;
2. reduction of the number of parameters appearing in the probability function (so-called 'elimination of nuisance parameters' [38]);
3. quantification of the risk associated with a data-informed decision;
4. evaluation of expected values and moments;
5. comparison of competing model structures and penalization of complexity (Ockham's Razor) [39, 40];
6. prediction of future data.

All of these tasks require integration with respect to the probability measure on the space of possibilities. In the case of 5. above, competing model structures are measured, leading to consistent quantification of model complexity. This natural engendering of Ockham's razor is among the most powerful features of the Bayesian framework.

Why, then, are Bayesian methods still so often avoided in application contexts such as statistical signal processing? The answer is mistrust of the prior, and philosophical angst about (i) its right to exist, and (ii) its right to influence a decision or algorithm. With regard to (i), it is argued by non-Bayesians that probabilities may only be attached to objects or hypotheses that vary randomly in repeatable experiments [41]. With regard to (ii), the non-Bayesian (objectivist) perspective is that inferences should be based only on data, and never on prior knowledge. Preoccupation with these issues is to miss where the action really is: the ability to marginalize in the Bayesian framework. In our work, we will eschew detailed philosophical arguments in favour of a policy that minimizes the influence of the priors we use, and points to the practical added value over frequentist methods that arise from use of probability calculus.

2.1.1 Off-line vs. On-line Parametric Inference

In an observational experiment, we may wish to infer knowledge of an unknown quantity only after all data, D, have been gathered. This batch-based inference will be called the off-line scenario, and Bayesian methods must be used to update our beliefs given no data (i.e. our prior), to beliefs given D. It is the typical situation arising in database analysis. In contrast, we may wish to interleave the process of observing data with the process of updating our beliefs. This on-line scenario is important in control and decision tasks, for example. For convenience, we refer to the independent variable indexing the occasions (temporal, spatial, etc.) when our inferences must be updated, as time, t = 0, 1, .... The incremental data observed between inference times is dt, and the aggregate of all data observed up to and including time t is denoted by Dt. Hence: Dt = Dt−1 ∪ dt, t = 1, ..., with D0 = {}, by definition. For convenience, we will usually assume that dt ∈ Rp×1, p ∈ N+, ∀t, and so Dt can be structured into a matrix of dimension p × t, with the incremental data, dt, as its columns:

    Dt = [Dt−1, dt].                                                        (2.1)
In this on-line scenario, Bayesian methods are required to update our state of knowledge conditioned by Dt−1 , to our state of knowledge conditioned by Dt . Of
course, the update is achieved using exactly the same ‘inference machine’, namely Bayes’ rule (1.1). Indeed, one step of on-line inference is equivalent to an off-line step, with D = dt , and with the prior at time t being conditioned on Dt−1 . Nevertheless, it will be convenient to handle the off-line and on-line scenarios separately, and we now review the Bayesian probability calculus appropriate to each case.
2.2 Bayesian Parametric Inference: the Off-Line Case

Let the measured data be denoted by D. A parametric probabilistic model of the data is given by the probability distribution, f(D|θ), conditioned by knowledge of the parameters, θ. In this book, the notation f(·) can represent either a probability density function for continuous random variables, or a probability mass function for discrete random variables. We will refer to f(·) as a probability distribution in both cases. In this way a significant harmonization of formulas and nomenclature can be achieved. We need only keep in mind that integrations should be replaced by summations whenever the argument is discrete¹. Our prior state of knowledge of θ is quantified by the prior distribution, f(θ). Our state of knowledge of θ after observing D is quantified by the posterior distribution, f(θ|D). These functions are related via Bayes’ rule,

    f(θ|D) = f(θ, D)/f(D) = f(D|θ) f(θ) / ∫Θ∗ f(D|θ) f(θ) dθ ,    (2.2)
where Θ∗ is the space of θ. We will refer to f(θ, D) as the joint distribution of parameters and data, or, more concisely, as the joint distribution. We will refer to f(D|θ) as the observation model. If this is viewed as a (non-measure) function of θ, it is known as the likelihood function [3, 43–45]:

    l(θ|D) ≡ f(D|θ) .    (2.3)
ζ = f(D) is the normalizing constant, sometimes known as the partition function in the physics literature [46]:

    ζ = f(D) = ∫Θ∗ f(θ, D) dθ = ∫Θ∗ f(D|θ) f(θ) dθ .    (2.4)
Bayes’ rule (2.2) can therefore be re-written as

    f(θ|D) = (1/ζ) f(D|θ) f(θ) ∝ f(D|θ) f(θ) ,    (2.5)

where ∝ means equal up to the normalizing constant, ζ. The posterior is fully determined by the product f(D|θ) f(θ), since the normalizing constant follows from the requirement that f(θ|D) be a probability distribution, i.e. ∫Θ∗ f(θ|D) dθ = 1.
¹ This can also be achieved via measure theory, operating in a consistent way for both discrete and continuous distributions, with probability densities generalized in the Radon-Nikodym sense [42]. The practical effect is the same, and so we will avoid this formality.
Evaluation of ζ (2.4) can be computationally expensive, or even intractable. If the integral in (2.4) does not converge, the distribution is called improper [47]. The posterior distribution with explicitly known normalization (2.5) will be called the normalized distribution. In Fig. 2.1, we represent Bayes’ rule (2.2) as an operator, B, transforming the prior into the posterior, via the observation model, f(D|θ).

Fig. 2.1. Bayes’ rule as an operator.
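The action of the operator B can be sketched numerically. The model below is an illustrative assumption (a scalar Gaussian observation model with a Gaussian prior), not one used elsewhere in this book; the grid summation plays the role of the normalizing constant ζ in (2.4), and the result can be checked against the conjugate closed form.

```python
# A minimal numerical sketch of Bayes' rule (2.2)-(2.5), under an assumed
# observation model f(d|theta) = N(theta, 1) and prior f(theta) = N(0, 10).
import numpy as np
from scipy.stats import norm

d = 1.3                                  # observed datum (hypothetical value)
theta = np.linspace(-10.0, 10.0, 2001)   # grid over Theta*
dx = theta[1] - theta[0]

prior = norm.pdf(theta, loc=0.0, scale=np.sqrt(10.0))   # f(theta)
likelihood = norm.pdf(d, loc=theta, scale=1.0)          # l(theta|d) = f(d|theta)

joint = likelihood * prior               # f(theta, d): the unnormalized posterior (2.5)
zeta = np.sum(joint) * dx                # normalizing constant zeta = f(d), eq. (2.4)
posterior = joint / zeta                 # f(theta|d), eq. (2.2)

# Check against the closed-form conjugate result for this Gaussian pair.
post_var = 1.0 / (1.0 / 10.0 + 1.0)
print(np.sum(theta * posterior) * dx, post_var * d)
```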
2.2.1 The Subjective Philosophy

All our beliefs about θ, and their associated quantifiers via f(θ), f(θ|D), etc., are conditioned on the parametric probability model, f(θ, D), chosen by us a priori (2.2). Its ingredients are (i) the deterministic structure relating D to an unknown parameter set, θ, i.e. the observation model f(D|θ), and (ii) a chosen measure on the space, Θ, of this parameter set, i.e. the prior measure f(θ). In this sense, Bayesian methods are born from a subjective philosophy, which conditions all inference on the prior knowledge of the observer [2, 36]. Jeffreys’ notation [7], I, is used to condition all probability functions explicitly on this corpus of prior knowledge; e.g. f(θ) → f(θ|I). For convenience, we will not use this notation, nor will we forget the fact that this conditioning is always present. In model comparison (1.4), where we examine competing model assumptions, fl(θl, D), l = 1, . . . , c, this conditioning becomes more explicit, via the indicator variable or pointer, l ∈ {1, 2, ..., c}, but once again we will suppress the implied Jeffreys’ notation.

2.2.2 Posterior Inferences and Decisions

The task of evaluating the full posterior distribution (2.5) will be called parameter inference in this book. We favour this phrase over the alternative—density estimation—used in some decision theory texts [48]. The full posterior distribution is a complete description of our uncertainty about the parameters of the observation model (2.3), given prior knowledge, f(θ), and all available data, D. For many practical tasks, we need to derive conditional and marginal distributions of model parameters, and their moments. Consider the (vector of) model parameters to be partitioned into two subsets, θ = [θ1, θ2]. Then, the marginal distribution of θ1 is

    f(θ1|D) = ∫Θ2∗ f(θ1, θ2|D) dθ2 .    (2.6)
Fig. 2.2. The marginalization operator.
In Fig. 2.2, we represent (2.6) as an operator. This graphical representation will be convenient in later Chapters. The moments of the posterior distribution—i.e. the expected or mean value of known functions, g(θ), of the parameter—will be denoted by

    E_f(θ|D)[g(θ)] = ∫Θ∗ g(θ) f(θ|D) dθ .    (2.7)

In general, we will use the notation ĝ(θ) to refer to a posterior point estimate of g(θ). Hence, for the choice (2.7), we have

    ĝ(θ) ≡ E_f(θ|D)[g(θ)] .    (2.8)

The posterior mean (2.7) is only one of many decisions that can be made in choosing a point estimate, ĝ(θ), of g(θ). Bayesian decision theory [30, 48–52] allows an optimal such choice to be made. The Bayesian model, f(θ, D) (2.2), is supplemented by a loss function, L(g, ĝ) ∈ [0, ∞), quantifying the loss associated with estimating g ≡ g(θ) by ĝ ≡ ĝ(θ). The minimum Bayes risk estimate is found by minimizing the posterior expected loss,

    ĝ(θ) = arg min_ĝ E_f(θ|D)[L(g, ĝ)] .    (2.9)

The quadratic loss function, L(g, ĝ) = (g(θ) − ĝ(θ))′ Q (g(θ) − ĝ(θ)), Q positive definite, leads to the choice of the posterior mean (2.7). Other standard loss functions lead to other standard point estimates, such as the maximum and median a posteriori estimates [37]. The Maximum a Posteriori (MAP) estimate is defined as follows:

    θ̂MAP = arg max_θ f(θ|D) .    (2.10)

In the special case where f(θ) = const., i.e. the improper uniform prior, then, from (2.2) and (2.3),

    θ̂MAP = θ̂ML = arg max_θ l(θ|D) .    (2.11)

Here, θ̂ML denotes the Maximum Likelihood (ML) estimate. ML estimation [43] is the workhorse of classical inference, since it avoids the issue of defining a prior over the space of possibilities. In particular, it is the dominant tool for probabilistic methods in signal processing [5, 53, 54]. Consider the special case of an additive Gaussian noise model for vector data, D = d ∈ Rp, with

    d = s(θ) + e,  e ∼ N(0, Σ) ,
where Σ is known, and s(θ) is the (non-linearly parameterized) signal model. In this case, θ̂ML = θ̂LS, the traditional non-linear, weighted Least-Squares (LS) estimate [55] of θ. From the Bayesian perspective, these classical estimators—θ̂ML and θ̂LS—can be justified only to the extent that a uniform prior over Θ∗ might be justified. When Θ∗ has infinite Lebesgue measure, this prior is improper, leading to technical and philosophical difficulties [3, 8]. In this book, it is the strongly Bayesian choice, ĝ(θ) = E_f(θ|D)[g(θ)] (2.8), which predominates. Hence, the notation ĝ ≡ ĝ(θ) will always denote the posterior mean of g(θ), unless explicitly stated otherwise.

As an alternative to point estimation, the Bayesian may choose to describe a continuous posterior distribution, f(θ|D) (2.2), in terms of a region or interval within which θ has a high probability of occurrence. These credible regions [37] replace the confidence intervals of classical inference, and have an intuitive appeal. The following special case provides a unique specification, and will be used in this book.

Definition 2.1 (Highest Posterior Density (HPD) Region). R ⊂ Θ∗ is the 100(1 − α)% HPD region of (continuous) distribution, f(θ|D), where α ∈ (0, 1), if (i) ∫R f(θ|D) dθ = 1 − α, and if (ii) almost surely (a.s.) for any θ1 ∈ R and θ2 ∉ R, then f(θ1|D) ≥ f(θ2|D).
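Definition 2.1 translates directly into a simple numerical recipe: accumulate probability mass over grid points in order of decreasing posterior density until 1 − α is reached. The Gamma posterior used below is purely an illustrative assumption.

```python
# A small sketch of a 95% HPD region (Definition 2.1) for an assumed
# continuous posterior f(theta|D) = Gamma(shape=3, rate=2), computed on a grid.
import numpy as np
from scipy.stats import gamma

a, b, alpha = 3.0, 2.0, 0.05
theta = np.linspace(1e-6, 10.0, 20001)
dx = theta[1] - theta[0]
dens = gamma.pdf(theta, a, scale=1.0 / b)        # f(theta|D) on the grid

order = np.argsort(dens)[::-1]                   # highest posterior density first
mass = np.cumsum(dens[order]) * dx               # accumulated probability
keep = order[: np.searchsorted(mass, 1.0 - alpha) + 1]
region = theta[np.sort(keep)]                    # R; a single interval for this unimodal case
print("95%% HPD region approx. [%.3f, %.3f]" % (region[0], region[-1]))
```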
2.2.3 Prior Elicitation

The prior distribution (2.2) required by Bayes’ rule is a function that must be elicited by the designer of the model. It is an important part of the inference problem, and can significantly influence posterior inferences and decisions (Section 2.2.2). General methods for prior elicitation have been considered extensively in the literature [7, 8, 37, 56], as well as the problem of choosing priors for specific signal models in Bayesian signal processing [3, 35, 57]. In this book, we are concerned with the practical impact of prior choices on the inference algorithms which we develop. The prior distribution will be used in the following ways:

1. To supplement the data, D, in order to obtain a reliable posterior estimate, in cases where there are insufficient data and/or a poorly defined model. This will be called regularization (via the prior);
2. To impose various restrictions on the parameter θ, reflecting physical constraints such as positivity. Note, from (2.2), that if the prior distribution on a subset of the parameter support, Θ∗, is zero, then the posterior distribution will also be zero on this subset;
3. To express prior ignorance about θ. If the data are assumed to be informative enough, we prefer to choose a non-informative prior (i.e. a prior with minimal impact on the posterior distribution). Philosophical and analytical challenges are encountered in the design of non-informative priors, as discussed, for example, in [7, 46].

In this book, we will typically choose our prior from a family of distributions providing analytical tractability during the Bayes update (Fig. 2.1). Notably, we will work with conjugate priors, as defined in the next Section. In such cases, we will design our non-informative prior by choosing its parameters to have minimal impact on the parameters of the posterior distribution.

2.2.3.1 Conjugate priors

In parametric inference, all distributions, f(·), have a known functional form, and are completely determined once the associated shaping parameters are known. Hence, the shaping parameters of the posterior distribution, f(θ|D, s0) (2.5), are, in general, the complete data record, D, and any shaping parameters, s0, of the prior, f0(θ|s0). Hence, a massive increase in the degrees-of-freedom of the inference may occur during the prior-to-posterior update. It will be computationally advantageous if the form of the posterior distribution is identical to the form of the prior, f0(·|s0), i.e. the inference is functionally invariant with respect to Bayes’ rule, and is determined from a finite-dimensional vector shaping parameter: s = s(D, s0), s ∈ Rq, q < ∞, with s({}, s0) ≡ s0, a priori. Then Bayes’ rule (2.5) becomes

    f0(θ|s) ∝ f(D|θ) f0(θ|s0) .    (2.12)
Such a distribution, f0, is known as self-replicating [42], or as the conjugate distribution to the observation model, f(D|θ) [37]. s are known as the sufficient statistics of the distribution, f0. The principle of conjugacy may be used in designing the prior; i.e. if there exists a family of conjugate distributions, Fs, whose elements are indexed by s ∈ Rq, then the prior is chosen as

    f(θ) ≡ f0(θ|s0) ∈ Fs ,    (2.13)
with s0 forming the parameters of the prior. If s0 are unknown, then they are called hyper-parameters [37], and are assigned a hyperprior, s0 ∼ f (s0 ). As we will see in Chapter 6, the choice of conjugate priors is of key importance in the design of tractable Bayesian recursive algorithms, since they confine the shaping parameters to Rq , and prevent a linear increase in the number of degrees-of-freedom with Dt (2.1). From now on, we will not use the subscript ‘0’ in f0 . The fixed functional form will be implied by the conditioning on sufficient statistics s.
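The practical payoff of conjugacy is that the Bayes update (2.12) reduces to an update of finite-dimensional sufficient statistics. The sketch below assumes a Gaussian observation model with known unit variance and a conjugate Gaussian prior on its mean; the names and prior values are illustrative only.

```python
# A minimal sketch of the conjugate update (2.12): Gaussian data d_i ~ N(theta, 1)
# with a conjugate prior theta ~ N(mu0, 1/tau0). The posterior is again Gaussian,
# and its shaping parameters depend on D only through s = (n, sum of the data).
import numpy as np

def posterior_shaping(data, mu0=0.0, tau0=0.1):
    """Return the posterior shaping parameters (mean, precision)."""
    n, s = len(data), float(np.sum(data))   # sufficient statistics of fixed dimension
    tau_n = tau0 + n                        # posterior precision
    mu_n = (tau0 * mu0 + s) / tau_n         # posterior mean
    return mu_n, tau_n

rng = np.random.default_rng(0)
D = rng.normal(1.5, 1.0, size=100)          # synthetic data record
print(posterior_shaping(D))                 # shaping parameters after the Bayes update
```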
2.3 Bayesian Parametric Inference: the On-line Case

We now specialize Bayesian inference to the case of learning in tandem with data acquisition, i.e. we wish to update our inference in the light of incremental data, dt (Section 2.1.1). We distinguish two situations, namely time-invariant and time-variant parameterizations.
2.3.1 Time-invariant Parameterization

The observations at time t, namely dt (2.1), lead to the following update of our knowledge, according to Bayes’ rule (2.2):

    f(θ|dt, Dt−1) = f(θ|Dt) ∝ f(dt|θ, Dt−1) f(θ|Dt−1) , t = 1, 2, ...,    (2.14)
where f(θ|D0) ≡ f(θ), the parameter prior (2.2). This scenario is illustrated in Fig. 2.3.

Fig. 2.3. The Bayes’ rule operator in the on-line scenario with time-invariant parameterization.

The observation model, f(dt|θ, Dt−1), at time t is related to the observation model for the accumulated data, Dt—which we can interpret as the likelihood function of θ (2.3)—via the chain rule of probability:

    l(θ|Dt) ≡ f(Dt|θ) = ∏_{τ=1}^{t} f(dτ|θ, Dτ−1) .    (2.15)
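For a concrete, if standard, illustration of (2.14) and (2.15), consider an assumed Bernoulli observation model with a conjugate Beta prior: each incremental datum dt updates two shaping parameters, and the product over t in (2.15) is accumulated implicitly. The model and numbers are assumptions for illustration only.

```python
# On-line update (2.14) for a time-invariant parameter: assumed Bernoulli data
# f(d_t|theta) = theta**d_t * (1-theta)**(1-d_t) with conjugate prior Beta(a, b).
import numpy as np

rng = np.random.default_rng(1)
a, b = 1.0, 1.0                       # shaping parameters of f(theta|D_0) = f(theta)
for t in range(1, 201):
    d_t = rng.binomial(1, 0.3)        # incremental datum observed at time t
    a, b = a + d_t, b + 1 - d_t       # f(theta|D_t) = Beta(a, b), one Bayes step (2.14)

print("posterior mean of theta after t = 200:", a / (a + b))
```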
Conjugate updates (Section 2.2.3) are essential in ensuring tractable on-line algorithms in this context. This will be the subject of Chapter 6.

2.3.2 Time-variant Parameterization

In this case, new parameters, θt, are required to explain dt, i.e. the observation model, f(dt|θt, Dt−1), t = 1, 2, ..., is an explicitly time-varying function. For convenience, we assume that θt ∈ Rr, ∀t, and we aggregate the parameters into a matrix, Θt, as we did the data (2.1):

    Θt = [Θt−1, θt] ,    (2.16)

with Θ0 = {} by definition. Once again, Bayes’ rule (2.2) is used to update our knowledge of Θt in the light of new data, dt:

    f(Θt|dt, Dt−1) = f(Θt|Dt) ∝ f(dt|Θt, Dt−1) f(Θt|Dt−1)    (2.17)
                   = f(dt|Θt, Dt−1) f(θt|Dt−1, Θt−1) f(Θt−1|Dt−1) , t = 1, 2, ...,

where we have used the chain rule to expand the last term in (2.17), via (2.16). Typically, we want to concentrate the inference into the newly generated parameter, θt, which we do via marginalization (2.6):
    f(θt|Dt) = ∫Θ∗t−1 ··· ∫Θ∗1 f(θt, Θt−1|Dt) dΘt−1    (2.18)
             ∝ ∫Θ∗t−1 ··· ∫Θ∗1 f(dt|Θt, Dt−1) f(θt|Θt−1, Dt−1) f(Θt−1|Dt−1) dΘt−1 .
Note that the dimension of the integration is r(t − 1) at time t. If the integrations need to be carried out numerically, this increasing dimensionality proves prohibitive in real-time applications. Therefore, the following simplifying assumptions are typically adopted [42]:

Proposition 2.1 (Markov observation model and parameter evolution models). The observation model is to be simplified as follows:

    f(dt|Θt, Dt−1) = f(dt|θt, Dt−1) ,    (2.19)

i.e. dt is conditionally independent of Θt−1, given θt. The parameter evolution model is to be simplified as follows:

    f(θt|Θt−1, Dt−1) = f(θt|θt−1) .    (2.20)
In many applications, (2.20) may depend on exogenous (observed) data, ξt, which can be seen as shaping parameters, and need not be explicitly listed in the conditioning part of the notation. This Markov model (2.20) is the required extra ingredient for Bayesian time-variant on-line inference. Employing Proposition 2.1 in (2.18), and noting that ∫Θ∗t−2 f(Θt−1|Dt−1) dΘt−2 = f(θt−1|Dt−1), then the following equations emerge:

The time update (prediction) of Bayesian filtering:

    f(θt|Dt−1) ≡ f(θ1) , t = 1,
    f(θt|Dt−1) = ∫Θ∗t−1 f(θt|θt−1, Dt−1) f(θt−1|Dt−1) dθt−1 , t = 2, 3, ....    (2.21)

The data update of Bayesian filtering:

    f(θt|Dt) ∝ f(dt|θt, Dt−1) f(θt|Dt−1) , t = 1, 2, ....    (2.22)
Note, therefore, that the integration dimension is fixed at r, ∀t (2.21). We will refer to this two-step update for Bayesian on-line inference of θt as Bayesian filtering, in analogy to Kalman filtering which involves the same two-step procedure, and which is, in fact, a specialization to the case of Gaussian observation (2.19) and parameter evolution (2.20) models. On-line inference of time-variant parameters is illustrated in schematic form in Fig. 2.4. In Chapter 7, the problem of designing tractable Bayesian recursive filtering algorithms will be addressed for a wide class of models, (2.19) and (2.20), using Variational Bayes (VB) techniques.
Fig. 2.4. The inferential scheme for Bayesian filtering. The operator ‘×’ denotes multiplication of distributions.
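The Gaussian special case of the two-step recursion (2.21)-(2.22) mentioned above is the Kalman filter. The scalar sketch below assumes a linear-Gaussian parameter evolution model and observation model with illustrative values of F, H, Q and R; it is not a model treated elsewhere in this book.

```python
# Bayesian filtering (2.21)-(2.22) in its Gaussian special case: a scalar Kalman
# filter for the assumed models theta_t = F*theta_{t-1} + w_t, w_t ~ N(0, Q), and
# d_t = H*theta_t + e_t, e_t ~ N(0, R).
import numpy as np

F, H, Q, R = 0.95, 1.0, 0.1, 0.5
m, P = 0.0, 1.0                        # prior on theta_0: N(m, P)

rng = np.random.default_rng(2)
theta = 0.0
for t in range(1, 101):
    theta = F * theta + rng.normal(0.0, np.sqrt(Q))   # simulated hidden parameter
    d_t = H * theta + rng.normal(0.0, np.sqrt(R))     # simulated observation

    # time update (2.21): f(theta_t|D_{t-1}) = N(m_pred, P_pred)
    m_pred, P_pred = F * m, F * F * P + Q
    # data update (2.22): f(theta_t|D_t) = N(m, P)
    K = P_pred * H / (H * H * P_pred + R)
    m, P = m_pred + K * (d_t - H * m_pred), (1.0 - K * H) * P_pred

print("filtered mean and variance at t = 100:", m, P)
```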
2.3.3 Prediction

Our purpose in on-line inference of parameters will often be to predict future data. In the Bayesian paradigm, k-steps-ahead prediction is achieved by eliciting the following distribution:

    dt+k ∼ f(dt+k|Dt) .    (2.23)

This will be known as the predictor. The one-step-ahead predictor (i.e. k = 1 in (2.23)) for a model with time-invariant parameters (2.14) is as follows:

    f(dt+1|Dt) = ∫Θ∗ f(dt+1|θ) f(θ|Dt) dθ    (2.24)
               = ∫Θ∗ f(dt+1|θ) f(θ, Dt) dθ / ∫Θ∗ f(θ, Dt) dθ    (2.25)
               = f(Dt+1)/f(Dt) = ζt+1/ζt ,    (2.26)
using (2.2) and (2.4). Hence, the one-step-ahead predictor is simply a ratio of normalizing constants. Evaluation of the k-steps-ahead predictor, k > 1, involves integration over future data, dt+1 , . . ., dt+k−1 , which may require numerical methods. For models with time-variant parameters (2.17), marginalization over the parameter trajectory, θt+1 , . . . , θt+k−1 , is also required.
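The identity (2.26) can be checked numerically for any conjugate model in which ζt = f(Dt) is available in closed form. The Bernoulli-Beta example below is an assumption chosen for its simple marginal likelihood; both expressions printed at the end are the same one-step-ahead predictive probability.

```python
# One-step-ahead prediction as a ratio of normalizing constants, eq. (2.26),
# for an assumed Bernoulli model with a Beta(a0, b0) prior.
import numpy as np
from scipy.special import betaln

def log_marginal(k, t, a0=1.0, b0=1.0):
    """ln f(D_t) for k ones observed in t Bernoulli trials."""
    return betaln(a0 + k, b0 + t - k) - betaln(a0, b0)

k, t = 7, 20                                                  # e.g. 7 ones in 20 trials
log_pred = log_marginal(k + 1, t + 1) - log_marginal(k, t)    # ln(zeta_{t+1}/zeta_t)
print(np.exp(log_pred), (1.0 + k) / (2.0 + t))                # f(d_{t+1}=1|D_t), two ways
```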
2.4 Summary

In later Chapters, we will study the use of the Variational Bayes (VB) approximation in all three contexts of Bayesian learning reviewed in this Chapter, namely:

1. off-line parameter inference (Section 2.2), in Chapter 3;
2. on-line inference of Time-Invariant (TI) parameters (Section 2.3.1), in Chapter 6;
3. on-line inference of Time-Variant (TV) parameters (Section 2.3.2), in Chapter 7.
Context      Observation model    Model of state evolution   prior                posterior
off-line     f(D|θ)               —                          f(θ)                 f(θ|D)
on-line TI   f(dt|θ, Dt−1)        —                          f(θ|D0) ≡ f(θ)       f(θ|Dt)
on-line TV   f(dt|θt, Dt−1)       f(θt|θt−1, Dt−1)           f(θ1|D0) ≡ f(θ1)     f(θt|Dt)

Table 2.1. The distributions arising in three key contexts of Bayesian inference.
The key probability distributions arising in each context are summarized in Table 2.1. The VB approximation will be employed consistently in each context, but with different effect. Each will imply distinct criteria for the design of tractable Bayesian learning algorithms.
3 Off-line Distributional Approximations and the Variational Bayes Method
In this Chapter, we formalize the problem of approximating intractable parametric distributions via tractable alternatives. Our aim will be to generate good approximations of the posterior marginals and moments which are unavailable from the exact distribution. Our main focus will be the Variational Bayes method for distributional approximation. Among the key deliverables will be (i) an iterative algorithm, called IVB, which is guaranteed to converge to a local minimizer of the disparity function; and (ii) the VB method, which provides a set of clear and systematic steps for calculating VB-approximations for a wide range of Bayesian models. We will compare the VB-approximation to related and rival alternatives. Later in this Chapter, we will apply the VB method to an insightful toy problem, namely the multiplicative decomposition of a scalar.
3.1 Distributional Approximation

Tractability of the full Bayesian analysis—i.e. application of Bayes’ rule (2.2), normalization (2.4), marginalization (2.6), and evaluation of moments of posterior distributions (2.7)—is assured only for a limited class of models. Numerical integration can be used, but it is often computationally expensive, especially in higher dimensions. The problem can be overcome by approximating the true posterior distribution by a distribution that is computationally tractable:

    f(θ|D) ≈ A[f(θ|D)] ≡ f˜(θ|D) .    (3.1)
In Fig. 3.1, we interpret the task of distributional approximation as an operator, A. Once f (θ|D) is replaced by f˜ (θ|D) (3.1), then, notionally, all the inferential operations listed above may be performed tractably. Many approximation strategies have been developed in the literature. In this Chapter, we review some of those most relevant in signal processing. Note that in the off-line context of this Chapter, all distributional approximations will operate on the posterior distribution, i.e. after the
Fig. 3.1. Distributional approximation as an operator.
update by Bayes’ rule (2.5). Hence, we will not need to examine the prior and observation models separately, and we will not need to invoke the principle of conjugacy (Section 2.2.3.1).
3.2 How to Choose a Distributional Approximation

It will be convenient to classify all distributional approximation methods into one of two types:

Deterministic distributional approximations: the approximation, f˜(θ|D) (3.1), is obtained by application of a deterministic rule; i.e. f˜(θ|D) is uniquely determined by f(θ|D). The following are deterministic methods of distributional approximation: (i) certainty equivalence [30], which includes maximum likelihood and Maximum a posteriori (MAP) point inference [47] as special cases; (ii) the Laplace approximation [58]; (iii) the MaxEnt approximation [59, 60]; and (iv) fixed-form minimization [32]. The latter will be reviewed in Section 3.4.2, and the others in Section 3.5.

Stochastic distributional approximations: the approximation is developed via a random sample of realizations from f(θ|D). The fundamental distributional approximation in this class is the empirical distribution from nonparametric statistics [61] (see (3.59)). The main focus of attention is on the numerically efficient generation of realizations from f(θ|D). An immediate consequence of stochastic approximation is that f˜(θ|D) will vary with repeated use of the method. We briefly review this class of approximations in Section 3.6.

Our main focus of attention will be the Variational Bayes (VB) approximation, which—as we will see in Section 3.3—is a deterministic, free-form distributional approximation.

3.2.1 Distributional Approximation as an Optimization Problem

In general, the task is to choose an optimal distribution, f˜(θ|D) ∈ F, from the space, F, of all possible distributions. f˜(θ|D) should be (i) tractable, and (ii) ‘close’ to the true posterior, f(θ|D), in some sense. The task can be formalized as an optimization problem requiring the following elements:

1. A subspace of distributions, Fc ⊂ F, such that all functions, f˘ ∈ Fc, are regarded as tractable. Here, f˘(θ|D) denotes a ‘wildcard’ or candidate tractable distribution from the space Fc.
2. A proximity measure, ∆(f||f˘), between the true distribution and any tractable approximation. ∆(f||f˘) must be defined on F × Fc, such that it accepts two distributions, f ∈ F and f˘ ∈ Fc, as input arguments, yields a positive scalar as its value, and has f˘ = f(θ|D) as its (unique) minimizer.

Then, the optimal choice of the approximating function must satisfy

    f˜(θ|D) = arg min_{f˘∈Fc} ∆(f(θ|D)||f˘(θ|D)) ,    (3.2)

where we denote the optimal distributional approximation by f˜(θ|D).

3.2.2 The Bayesian Approach to Distributional Approximation

From the Bayesian point of view, choosing an approximation, f˜(θ|D) (3.1), can be seen as a decision-making problem (Section 2.2.2). Hence, the designer chooses a loss function (2.9) (negative utility function [37]) measuring the loss associated with choosing each possible f˘(θ|D) ∈ Fc, when the ‘true’ distribution is f(θ|D). In [62], a logarithmic loss function was shown to be optimal if we wish to extract maximum information from the data. Use of the logarithmic loss function leads to the Kullback-Leibler (KL) divergence [63] (also known as the cross-entropy) as an appropriate assignment for ∆ in (3.2):

    ∆(f(θ|D)||f˘(θ|D)) = KL(f(θ|D)||f˘(θ|D)) .    (3.3)

The Kullback-Leibler (KL) divergence from f(θ|D) to f˘(θ|D) is defined as:

    KL(f(θ|D)||f˘(θ|D)) = ∫Θ∗ f(θ|D) ln [ f(θ|D)/f˘(θ|D) ] dθ = E_f(θ|D)[ ln (f(θ|D)/f˘(θ|D)) ] .    (3.4)

It has the following properties:

1. KL(f(θ|D)||f˘(θ|D)) ≥ 0;
2. KL(f(θ|D)||f˘(θ|D)) = 0 iff f(θ|D) = f˘(θ|D) almost everywhere;
3. KL(f(θ|D)||f˘(θ|D)) = ∞ iff, on a set of positive measure, f(θ|D) > 0 and f˘(θ|D) = 0;
4. KL(f(θ|D)||f˘(θ|D)) ≠ KL(f˘(θ|D)||f(θ|D)) in general, and the KL divergence does not obey the triangle inequality.

Given 4., care is needed in the syntax describing KL(·). We say that (3.4) is from f(θ|D) to f˘(θ|D). This distinction will be important in what follows. For future purposes, we therefore distinguish between the two possible orderings of the arguments in the KL divergence:
KL divergence for Minimum Risk (MR) calculations, as defined in (3.4):

    KLDMR ≡ KL(f(θ|D)||f˘(θ|D)) .    (3.5)

KL divergence for Variational Bayes (VB) calculations:

    KLDVB ≡ KL(f˘(θ|D)||f(θ|D)) .    (3.6)

The notations KLDMR and KLDVB imply the order of their arguments, which are, therefore, not stated explicitly.
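The distinction between the two orderings is easy to see numerically. In the sketch below, the 'true' posterior and its candidate approximation are two arbitrary Gaussians (an illustrative assumption), and the two divergences are evaluated on a grid; they differ, as Property 4 states.

```python
# Numerical illustration of the asymmetry of the KL divergence (Property 4):
# KLD_MR = KL(f||g) and KLD_VB = KL(g||f) for two assumed Gaussians f and g.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-15.0, 15.0, 20001)
dx = theta[1] - theta[0]
f = norm.pdf(theta, 0.0, 2.0)          # plays the role of the true posterior
g = norm.pdf(theta, 1.0, 1.0)          # plays the role of the approximation

def kl(p, q):
    """Discretized KL divergence from p to q on the common grid."""
    return float(np.sum(p * np.log(p / q)) * dx)

print("KLD_MR = KL(f||g) =", kl(f, g))
print("KLD_VB = KL(g||f) =", kl(g, f))
```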
3.3 The Variational Bayes (VB) Method of Distributional Approximation

The Variational Bayes (VB) method of distributional approximation is an optimization technique (Section 3.2.1) with the following elements:

The space of tractable distributions: Fc is chosen as the space of conditionally independent distributions:

    Fc ≡ {f(θ1, θ2|D) : f(θ1, θ2|D) = f(θ1|D) f(θ2|D)} .    (3.7)

A necessary condition for applicability of the VB approximation is therefore that Θ be multivariate.

The proximity measure is assigned as (3.6):

    ∆(f(θ|D)||f˘(θ|D)) = KL(f˘(θ|D)||f(θ|D)) = KLDVB .    (3.8)

Since the divergence, KLDMR (3.4), (3.5), is not used, the VB approximation, f˜(θ|D), defined from (3.2) and (3.8), is not the minimum Bayes risk distributional approximation. A schematic illustrating the VB method of distributional approximation is given in Fig. 3.2.
3.3.1 The VB Theorem

Theorem 3.1 (Variational Bayes). Let f(θ|D) be the posterior distribution of multivariate parameter, θ. The latter is partitioned into q sub-vectors of parameters:

    θ = [θ1, θ2, . . . , θq] .    (3.9)

Let f˘(θ|D) be an approximate distribution restricted to the set of conditionally independent distributions for θ1, θ2, . . . , θq:

    f˘(θ|D) = f˘(θ1, θ2, . . . , θq|D) = ∏_{i=1}^{q} f˘(θi|D) .    (3.10)
Fig. 3.2. Schematic illustrating the VB method of distributional approximation. The minimum Bayes’ risk approximation is also illustrated for comparison.
Then, the minimum of KLDVB, i.e.

    f˜(θ|D) = arg min_{f˘(·)} KL(f˘(θ|D)||f(θ|D)) ,    (3.11)

is reached for

    f˜(θi|D) ∝ exp( E_f˜(θ/i|D)[ln(f(θ, D))] ) , i = 1, . . . , q,    (3.12)

where θ/i denotes the complement of θi in θ, and f˜(θ/i|D) = ∏_{j=1, j≠i}^{q} f˜(θj|D). We will refer to f˜(θ|D) (3.11) as the VB-approximation, and f˜(θi|D) (3.12) as the VB-marginals.

Proof: KLDVB can be rewritten, using the definition (3.4), as follows:
    KL(f˘(θ|D)||f(θ|D)) = ∫Θ∗ f˘(θi|D) f˘(θ/i|D) ln [ f˘(θi|D) f˘(θ/i|D) f(D) / ( f(θ|D) f(D) ) ] dθ
        = ∫Θ∗ f˘(θi|D) f˘(θ/i|D) ln f˘(θi|D) dθ
          − ∫Θ∗ f˘(θi|D) f˘(θ/i|D) ln f(θ, D) dθ
          + ∫Θ∗ f˘(θi|D) f˘(θ/i|D) [ ln f˘(θ/i|D) + ln f(D) ] dθ
        = ∫Θi∗ f˘(θi|D) ln f˘(θi|D) dθi + ln f(D) + ηi
          − ∫Θi∗ f˘(θi|D) [ ∫Θ/i∗ f˘(θ/i|D) ln f(θ, D) dθ/i ] dθi .    (3.13)

Here,

    ηi = E_f˘(θ/i|D)[ ln f˘(θ/i|D) ] .

For any known non-zero scalars, ζi ≠ 0, i = 1, . . . , q, it holds that

    KL(f˘(θ|D)||f(θ|D)) = ∫Θi∗ f˘(θi|D) ln f˘(θi|D) dθi + ln f(D) + ηi
          − ∫Θi∗ f˘(θi|D) ln [ (1/ζi) ζi exp( E_f˘(θ/i|D)[ln f(θ, D)] ) ] dθi
        = ∫Θi∗ f˘(θi|D) ln [ f˘(θi|D) / ( (1/ζi) exp( E_f˘(θ/i|D)[ln f(θ, D)] ) ) ] dθi
          + ln f(D) − ln(ζi) + ηi .    (3.14)

(3.14) is true ∀i ∈ {1, . . . , q}. We choose each ζi, i = 1, . . . , q respectively, as the following normalizing constant for exp E·[·] in the denominator of (3.14):

    ζi = ∫Θi∗ exp( E_f˘(θ/i|D)[ln f(θ, D)] ) dθi , i = 1, . . . , q.

Then, the last equality in (3.14) can be rewritten in terms of a KL divergence, ∀i ∈ {1, . . . , q}:

    KL(f˘(θ|D)||f(θ|D)) = KL( f˘(θi|D) || (1/ζi) exp( E_f˘(θ/i|D)[ln f(θ, D)] ) )
          + ln f(D) − ln(ζi) + ηi .    (3.15)
The only term on the right-hand side of (3.15) dependent on f˘(·) is the KL divergence. Hence, minimization of (3.15) with respect to f˘(θi|D), ∀i ∈ {1, . . . , q}, keeping f˘(θ/i|D) fixed, is achieved by minimization of the first term. Invoking non-negativity (Property 1) of the KL divergence (Section 3.2.2), the minimum of the first term is zero. The minimizer is, almost surely, f˘(θi|D) = f˜(θi|D) ∝ exp( E_f˜(θ/i|D)[ln(f(θ, D))] ), i.e. (3.12), via the second property of the KL divergence (Section 3.2.2).

We note the following:

• The VB-approximation (3.11) is a deterministic, free-form distributional approximation, as asserted in Section 3.2. The term in italics refers to the fact that no functional form for f˜ is prescribed a priori.
• The posterior distribution of the parameters, f(θ|D), and the joint distribution, f(θ, D), differ only in the normalizing constant, ζ (2.4), as seen in (2.5). Furthermore, ζ is independent of θ. Hence, (3.12) can also be written in terms of ln f(θi, θ/i|D), in place of ln f(θi, θ/i, D). We prefer to use the latter, as it emphasizes the fact that the normalizing constant does not need to be known, and, in fact, the VB method can be used with improper distributions.
• Theorem 3.1 can be proved in many other ways. See, for example, [26] and [29]. Our proof was designed so as to use only basic probabilistic calculus, and the stated properties of the KL divergence.
• Uniqueness of the VB-approximation (3.12)—i.e. uniqueness of the minimizer of the KL divergence (3.11)—is not guaranteed in general [64]. Therefore, an extra test for uniqueness may be required. This will be seen in some of the applications addressed later in the book.
• We emphasize the fact that the VB-approximation, (3.10) and (3.12), i.e.

    f˜(θ|D) = f˜(θ1, θ2, . . . , θq|D) = ∏_{i=1}^{q} f˜(θi|D) ,    (3.16)

enforces posterior conditional independence between the partitioned parameters. Hence:
  – the VB approximation can only be used for multivariate models;
  – cross-correlations between the θi’s are not present in the approximation.
The degree of partitioning, q, must therefore be chosen judiciously. The larger it is, the more correlation between parameters will be lost in approximation. Hence, the achieved minimum of KLDVB will be greater (3.11), and the approximation will be poorer. The guiding principle must be to choose q sufficiently large to achieve tractable VB-marginals (3.12), but no larger.

Remark 3.1 (Lower bound on the marginal of D via the VB-approximation). The VB-approximation is often interpreted in terms of a lower bound on f(D), the marginal distribution of the observed data [24, 25]. For an arbitrary approximating distribution, f˘(θ|D), it is true that
    ln f(D) = ln ∫Θ∗ f(θ, D) dθ = ln ∫Θ∗ f˘(θ|D) [ f(θ, D) / f˘(θ|D) ] dθ
            ≥ ∫Θ∗ f˘(θ|D) ln [ f(θ|D) f(D) / f˘(θ|D) ] dθ ≡ ln f˜(D) ,    (3.17)
    ln f˜(D) = ln f(D) − KL(f˘(θ|D)||f(θ|D)) ,    (3.18)
using Jensen’s inequality [24]. Minimizing the KL divergence on the right-hand side of (3.17)—e.g. using the result of Theorem 3.1—the error in the approximation is minimized.

The main computational problem of the VB-approximation (3.12) is that it is not given in closed form. For example, with q = 2, we note from (3.12) that f˜(θ1|D) is needed for evaluation of f˜(θ2|D), and vice-versa. A solution of (3.12) is usually found iteratively. In such a case the following general result can be established.

Algorithm 1 (Iterative VB (IVB) algorithm). Consider the q = 2 case for convenience, i.e. θ = [θ1, θ2]. Then cyclic iteration of the following steps, n = 2, 3, . . ., monotonically decreases KLDVB (3.6) in (3.11):

1. Compute the current update of the VB-marginal of θ2 at iteration n, via (3.12):

    f˜[n](θ2|D) ∝ exp( ∫Θ1∗ f˜[n−1](θ1|D) ln f(θ1, θ2, D) dθ1 ) .    (3.19)

2. Use the result of the previous step to compute the current update of the VB-marginal of θ1 at iteration n, via (3.12):

    f˜[n](θ1|D) ∝ exp( ∫Θ2∗ f˜[n](θ2|D) ln f(θ1, θ2, D) dθ2 ) .    (3.20)
Here, the initializer, f˜[1](θ1|D), may be chosen freely. Convergence of the algorithm to fixed VB-marginals, f˜[∞](θi|D), ∀i, was proven in [26]. This Bayesian alternating algorithm is clearly reminiscent of the EM algorithm from classical inference [21], which we will review in Section 3.4.4. In the EM algorithm, maximization is used in place of one of the expectation steps in Algorithm 1. For this reason, Algorithm 1 is also known as (i) an ‘EM-like algorithm’ [19], (ii) the ‘VB algorithm with E-step and M-step’ [26], which is misleading, and (iii) the ‘Variational EM (VEM)’ algorithm [23]. We will favour the nomenclature ‘IVB algorithm’. Algorithm 1 is an example of a gradient descent algorithm, using the natural gradient technique [65]. In general, the algorithm requires q steps—one for each θi, i = 1, . . . , q—in each iteration.

3.3.2 The VB Method of Approximation as an Operator

The VB method of approximation is a special case of the distributional approximation expressed in operator form in Fig. 3.1. Therefore, in Fig. 3.3, we represent the
VB method of approximation via the operator V.

Fig. 3.3. The VB method of distributional approximation, represented as an operator, V, for q = 2. The operator ‘×’ denotes multiplication of distributions.

It is the principal purpose of this book to examine the consequences of replacing A by V. The conditional independence enforced by the VB-approximation (3.3) has the convenient property that marginalization of the VB-approximation is achieved simply via selection of the relevant VB-marginal(s). The remaining VB-marginals are then ignored, corresponding to the fact that they integrate to unity in (3.16). Graphically, we simply ‘drop off’ the VB-marginals of the marginalized parameters from the inferential schematic, as illustrated in Fig. 3.4. We call this VB-marginalization. Throughout this book, we will approximate the task of (intractable) marginalization in this way.
Fig. 3.4. VB-marginalization via the VB-operator, V, for q = 2.
3.3.3 The VB Method

In this Section, we present a systematic procedure for applying Theorem 3.1 in problems of distributional approximation for Bayesian parametric inference. It is specifically this 8-step procedure—represented by the flowchart in Fig. 3.5—that we will be invoking from now on, when we refer to the Variational Bayes (VB) method. Our aim is to formulate the VB method with the culture and the needs of signal processing in mind.

Step 1: Choose a Bayesian (probability) model: Construct the joint distribution of model parameters and observed data, f(θ, D) (2.2). This step embraces the choice
of an observation model, f(D|θ) (2.3), and a prior on its parameters, f(θ). Recall, from Section 3.3.1, that the VB method is applicable to improper joint distributions. We assume that analytical marginalization of the posterior distribution (2.6) is intractable, as is evaluation of posterior moments (2.7). This creates a ‘market’ for the VB-approximation which follows.

Step 2: Partition the parameters: Partition θ into q sub-vectors (3.9). For convenience, we will assume that q = 2 in this Section. Check if

    ln f(θ1, θ2, D) = g(θ1, D)′ h(θ2, D) ,    (3.21)

where g(θ1, D) and h(θ2, D) are p-dimensional vectors (p < ∞) of compatible dimension. If (3.21) holds, the joint distribution, f(θ, D), is said to be a member of the separable-in-parameters family. This step is, typically, the crucial test which must be satisfied if a Bayesian model is to be amenable to VB approximation. If the logarithm of the joint distribution cannot be written in the form (3.21)—i.e. as a scalar product of g(θ1, D) and h(θ2, D)—the VB method will not be tractable. This will become evident in the following steps. In such a case, the model must be reformulated or approximated by other means.

Step 3: Write down the VB-marginals: Application of Theorem 3.1 to (3.21) is straightforward. The VB-marginals are:

    f˜(θ1|D) ∝ exp( E_f˜(θ2|D)[ln f(θ1, θ2, D)] ) ∝ exp( g(θ1, D)′ ĥ(θ2, D) ) ,    (3.22)
    f˜(θ2|D) ∝ exp( E_f˜(θ1|D)[ln f(θ1, θ2, D)] ) ∝ exp( ĝ(θ1, D)′ h(θ2, D) ) .    (3.23)

The induced expectations are (2.7)

    ĝ(θ1, D) ≡ E_f˜(θ1|D)[g(θ1, D)] ,    (3.24)
    ĥ(θ2, D) ≡ E_f˜(θ2|D)[h(θ2, D)] .    (3.25)
Step 4: Identify standard distributional forms: Identify the functional forms of (3.22) and (3.23) as those of standard parametric distributions. For this purpose, we take the expectations, ĥ(θ2, D) and ĝ(θ1, D), as constants in (3.22) and (3.23) respectively. These standard distributions are denoted by

    f˜(θ1|D) ≡ f(θ1|{a}r1) ,    (3.26)
    f˜(θ2|D) ≡ f(θ2|{b}r2) ,    (3.27)

where {a}r1 and {b}r2 will be called the shaping parameters of the respective VB-marginals. They depend on the arguments of those VB-marginals as follows:

    a(j) = a(j)(ĥ, D) , j = 1, . . . , r1 ,    (3.28)
    b(j) = b(j)(ĝ, D) , j = 1, . . . , r2 ,    (3.29)

where ĝ and ĥ are shorthand notation for (3.24) and (3.25) respectively. This step can be difficult in some situations, since it requires familiarity with the catalogue of standard distributions. Note that the form of the VB-marginals yielded by Step 3 may be heavily disguised versions of the standard form. In the sequel, we will assume that standard parametric distributions can be identified. If this is not the case, we can still proceed using symbolic or numerical integration, or, alternatively, via further approximation of the VB-marginals generated in Step 3.

Step 5: Formulate necessary VB-moments: Typically, the shaping parameters (3.28) and (3.29) are functions of only a subset of the expectations (3.24), (3.25), being ĝ(θ1), where g = [gi, i ∈ Ig ⊆ {1, . . . , p}], and ĥ(θ2), h = [hj, j ∈ Ih ⊆ {1, . . . , p}]. These necessary moments are, themselves, functions of the shaping parameters, (3.28) and (3.29), and are typically listed in tables of standard parametric distributions:

    ĝ(θ1) = g({a}r1) ,    (3.30)
    ĥ(θ2) = h({b}r2) .    (3.31)

These necessary moments, (3.30) and (3.31), will be what we refer to as the VB-moments. The set of VB-equations (3.28)–(3.31) fully determines the VB-approximation (3.12). Hence, any solution of this set—achieved by any technique—yields the VB-approximation (3.12).

Step 6: Reduce the VB-equations: Reduce the equations (3.28)–(3.31) to a set providing an implicit solution for a reduced number of unknowns. The remaining unknowns therefore have an explicit solution. Since the shaping parameters, (3.28) and (3.29), are explicit functions of the VB-moments, (3.30) and (3.31), and vice versa, we can always reformulate the equations in terms of either shaping parameters alone, or VB-moments alone, simply by substitution. Usually, there will be a choice in how far we wish to go in reducing the VB-equations. The choice will be influenced by numerical considerations, namely, an assessment of the computational load and convenience associated with solving a reduced set of VB-equations. The reduction techniques will be problem-specific, and will typically require some knowledge of solution methods for sets of non-linear equations. If no reduction is possible, we can proceed to the next step. In rare cases, a full analytical solution of the VB-equations (3.28)–(3.31) can be found. This closed-form solution must be tested to see if it is, indeed, the global minimizer of (3.11). This is because KLDVB (3.6) can exhibit local minima or maxima, as well as saddle points. In all cases, (3.12) is satisfied [64].
Step 7: Run the IVB Algorithm (Algorithm 1): The reduced set of VB-equations must be solved iteratively. Here, we can exploit Algorithm 1, which guides us in the order for evaluating the reduced equations. Assuming iteration on the unreduced VB-equations (3.28)–(3.31) for convenience, then, at the nth iteration, we evaluate the following:

Shaping parameters of f˜[n](θ2|{b}[n]r2):

    ĝ(θ1)[n−1] = g({a}[n−1]r1) ,    (3.32)
    b(j)[n] = b(j)( ĝ(θ1)[n−1], D ) , j = 1, . . . , r2 .    (3.33)

Shaping parameters of f˜[n](θ1|{a}[n]r1):

    ĥ(θ2)[n] = h({b}[n]r2) ,    (3.34)
    a(j)[n] = a(j)( ĥ(θ2)[n], D ) , j = 1, . . . , r1 .    (3.35)

Special care is required when choosing the initial shaping parameters {a}[1]r1 (3.28) (the remaining unknowns do not have to be initialized). In most of the literature in this area, e.g. [20, 24], these are chosen randomly. However, we will demonstrate that a carefully designed choice of initial values may lead to significant computational savings in the associated IVB algorithm (Chapter 4). In general, the IVB algorithm is initialized by setting all of the shaping parameters of just one of the q VB-marginals.

Step 8: Report the VB-marginals: Report the VB-approximation (3.16) in the form of shaping parameters, (3.28) and (3.29), and/or the VB-moments, (3.30) and (3.31). Note, therefore, an intrinsic convenience of the VB method: its output is in the form of the approximate marginal distributions and approximate posterior moments for which we have been searching.

Remarks on the VB method:

(i) The separable parameter requirement of (3.21) forms a family of distributions that is closely related to the exponential family with hidden variables (also known as ‘hidden data’) [26, 65], which we will encounter in Section 6.3.3. This latter family of distributions is revealed via the assignment: g(θ1, D) = g(θ1). In this case, θ2 constitutes the hidden variable in (3.21). In the VB method, no formal distinction is imposed between the parameters and the hidden data. Family (3.21) extends the exponential family with hidden variables by allowing dependence of g(θ1) on D.
(ii) Requirement (3.21) may not be the most general case for which the VB theorem can be applied. However, if a distribution cannot be expressed in this form, all subsequent operations are far less systematic: i.e. the VB method as defined above cannot be used.
(iii) Step 3 of the VB method makes clear how easy it is to write down the functional form of the VB-marginals. All that is required is to expand the joint distribution into separable form (3.21) and then to use a ‘cover-up rule’ to hide all terms associated with θ2, immediately revealing f˜(θ1|D) (3.22). When q > 2, we cover up all terms associated with θ2, . . . , θq. The same is true when writing down f˜(θ2|D), etc. This ‘cover-up’ corresponds to the substitution of those terms by their VB-moments (Step 5).
(iv) The IVB algorithm works by propagating VB-statistics from each currently-updated VB-marginal into all the others. The larger the amount of partitioning, q (3.9), the less correlation is available to ‘steer’ the update towards a local minimum of KLDVB (3.11). Added to this is the fact that the number of IVB equations—of the type (3.19) and (3.20)—increases as q. For these reasons, convergence of the IVB algorithm will be negatively affected by a large choice of q. Once again, the principle should be to choose q no larger than is required to ensure tractable VB-marginals (i.e. standard forms with available VB-moments).
(v) In the remainder of this book, we will follow the steps of the VB method closely, but we will not always list all the implied mathematical objects. Notably, we will often not define the auxiliary functions g(θ1) and h(θ2) in (3.21).
(vi) The outcome of the VB method is a set of evaluated shaping parameters, (3.28) and (3.29). It should be remembered, however, that these parameters fully determine the VB-marginals, and so the VB method reports distributions on all considered unknowns. If we are interested in moments of these distributions, some of them may have been provided by the set of necessary VB-moments, (3.30) and (3.31), while others can be evaluated via the shaping parameters using standard results. Note that VB-moments alone will usually not be sufficient for full description of the posteriors.
(vii) If the solution was obtained using the IVB algorithm (Algorithm 1), then we do not have to be concerned with local maxima or saddle points, as is the case for an analytical solution (see remark in Step 6 above). This is because the IVB algorithm is a gradient descent method. Of course, the algorithm may still converge to a local minimum of KLDVB (3.6).
(viii) The flow of control for the VB method is displayed in Fig. 3.5.

3.3.4 The VB Method for Scalar Additive Decomposition

Consider the model for scalar decomposition which we introduced in Section 1.3.2. We now verify the 8 steps of the VB method in this case.

Step 1: From (1.17), θ = [m, ω], and the joint distribution is

    f(m, ω, d|·) ∝ ω^α exp( −½(d − m)²ω − ½m²γω − βω ) ,
where the conditioning is on the prior parameters (α, β, γ), which, for conciseness, we will not show in the following equations.

Fig. 3.5. Flowchart of the VB method.

Step 2: The form (3.21) is revealed for the assignments θ1 = m, θ2 = ω, and
    g(m, d) = [ α, −½(d² + 2β), md, −½(1 + γ)m² ]′ ,
    h(ω, d) = [ ln(ω), ω, ω, ω ]′ .

Step 3: (3.22)–(3.23) immediately have the form:
    f˜(m|d) ∝ exp( −½ [ −2md + m²(1 + γ) ] ω̂ ) ,    (3.36)
    f˜(ω|d) ∝ exp( α ln ω − ½ ω [ d² + 2β − 2d m̂ + (1 + γ) m̂² ] ) ,    (3.37)

where m̂, m̂² and ω̂ denote the VB-moments E_f˜(m|d)[m], E_f˜(m|d)[m²] and E_f˜(ω|d)[ω], respectively.

Step 4: (3.36) can be easily recognized to be Normal: f˜(m|d) = N(a1, a2). Here, the shaping parameters are a1 (the mean) and a2 (the variance). Similarly, (3.37) is in the form of the Gamma distribution: f˜(ω|d) = G(b1, b2), with shaping parameters b1 and b2. The shaping parameters are assigned from (3.36) and (3.37), as follows:

    a1 = (1 + γ)⁻¹ d ,
    a2 = ( (1 + γ) ω̂ )⁻¹ ,
    b1 = α + 1 ,
    b2 = β + ½( (1 + γ) m̂² − 2 m̂ d + d² ) .

Step 5: The required VB-moments are summarized in (1.21). Note that only the moments ω̂, ĝ3 and ĝ4 are required; the remaining elements of ĝ and ĥ do not enter the shaping parameters.

Step 6: Equations (1.21) can be analytically reduced to a single linear equation which yields the solution in the form of (1.22).

Step 7: In the case of the first variant of the scalar decomposition model (Section 1.3), the set (1.14) can be reduced to a cubic equation. The solution is far more complicated, and, furthermore, we have to test which of the roots minimizes (3.11). If an iterative solution is required—as was the case in the first variant of the model (Section 1.3)—the VB-equations are evaluated in the order given by (3.32)–(3.35); i.e. m̂ and m̂² are evaluated in the first step, and ω̂ in the second step. In this case, we can initialize a1[1] = d, and a2[1] = (2φ)⁻¹.

Step 8: Report the shaping parameters a1, a2, b1, b2 and/or the VB-moments (1.21) if they are of interest.
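The VB-equations above are simple enough to iterate directly. The sketch below runs the IVB cycles (3.32)-(3.35) for this model; the data value d, the prior parameters (α, β, γ) and the initial value of a2 are illustrative assumptions, and the final line checks the fixed point against the closed form obtained by eliminating a2 from the shaping-parameter equations (the reduction referred to in Step 6).

```python
# IVB iterations (3.32)-(3.35) for the scalar additive decomposition: VB-marginals
# f(m|d) = N(a1, a2) and f(omega|d) = G(b1, b2), with VB-moments
# m_hat = a1, m2_hat = a2 + a1**2 and omega_hat = b1/b2.
import numpy as np

d, alpha, beta, gamma = 2.0, 1.0, 1.0, 0.1   # assumed datum and prior parameters

a1, a2 = d, 1.0                              # initial shaping parameters of f(m|d)
for n in range(50):                          # IVB cycles (Algorithm 1)
    m_hat, m2_hat = a1, a2 + a1 ** 2         # VB-moments of m
    b1 = alpha + 1.0                         # shaping parameters of f(omega|d)
    b2 = beta + 0.5 * ((1.0 + gamma) * m2_hat - 2.0 * m_hat * d + d ** 2)
    omega_hat = b1 / b2                      # VB-moment of omega
    a1 = d / (1.0 + gamma)                   # shaping parameters of f(m|d)
    a2 = 1.0 / ((1.0 + gamma) * omega_hat)

print("a1, a2, b1, b2 =", a1, a2, b1, b2)
print("omega_hat =", omega_hat,
      " closed form:", (alpha + 0.5) / (beta + 0.5 * gamma * d ** 2 / (1.0 + gamma)))
```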
3.4 VB-related Distributional Approximations

3.4.1 Optimization with Minimum-Risk KL Divergence

In Section 3.2.2, we explained that minimization of KLDMR (3.5) provides the minimum Bayes risk distributional approximation. Optimization of KLDMR under the assumption of conditional independence (3.10) has the following solution.

Remark 3.2 (Conditional independence approximation under KLDMR). Consider minimization of the following KL divergence for any i ∈ {1, . . . , q}:
    f˜(θi|D) = arg min_{f˘(θi|D)} KL( f(θ|D) || f˘(θi|D) f˘(θ/i|D) )    (3.38)
             = arg min_{f˘(θi|D)} ∫Θi∗ ∫Θ/i∗ f(θi, θ/i|D) ln [ 1 / f˘(θi|D) ] dθ/i dθi
             = arg min_{f˘(θi|D)} KL( f(θi|D) || f˘(θi|D) )
             = f(θi|D) .

Hence, the best conditionally independent approximation of the posterior under KLDMR is the product of the analytical marginals. This, in general, will differ from the VB-approximation (3.12), as suggested in Fig. 3.2. While this result is intuitively appealing, it may nevertheless be computationally intractable. Recall that the need for an approximation arose when operations such as normalization and marginalization on the true posterior distribution proved intractable. This situation was already demonstrated in the first model for scalar decomposition (Section 1.7), and will be encountered again.

3.4.2 Fixed-form (FF) Approximation

Another approach to KLDMR minimization is to choose Fc = Fβ in (3.7), being a family of tractable parametric distributions, with members f˘(θ|D) ≡ f0(θ|β). Here, the approximating family members are indexed by an unknown (shaping) parameter, β, but their distributional form, f0(·), is set a priori. The optimal approximation, f˜(θ|D) = f0(θ|β̂), is then determined via

    β̂ = arg min_β KL( f(θ|D) || f0(θ|β) ) .    (3.39)
In many applications, measures other than KLDMR (3.39) are used for specific problems. Examples include the Levy, chi-squared and L2 norms. These are reviewed in [32].

3.4.3 Restricted VB (RVB) Approximation

The iterative evaluation of the VB-approximation via the IVB algorithm (Algorithm 1) may be prohibitive, e.g. in the on-line scenarios which will follow in Chapters 6 and 7. Therefore, we now seek a modification of the original VB-approximation that yields a closed-form solution.

Corollary 3.1 (of Theorem 3.1: Restricted Variational Bayes (RVB)). Let f(θ|D) be the posterior distribution of multivariate parameter θ = [θ1, θ2], i.e. we consider a binary partitioning of the global parameter set, θ. Let f̄(θ2|D) be a posterior distribution of θ2 of fixed functional form. Let f˘(θ|D) be a conditionally-independent approximation of f(θ|D) of the kind
    f˘(θ|D) = f˘(θ1, θ2|D) = f˘(θ1|D) f̄(θ2|D) .    (3.40)

Then, the minimum of KLDVB (3.6)—i.e. KL(f˘(θ|D)||f(θ|D))—is reached for

    f˜(θ1|D) ∝ exp( E_f̄(θ2|D)[ln(f(θ, D))] ) .    (3.41)

Proof: Follows from the proof of Theorem 3.1.

Note that Corollary 3.1 is equivalent to the first step of the IVB algorithm (Algorithm 1). However, with the distribution f̄(θ2|D) being known, equation (3.41) now constitutes a closed-form solution. Furthermore, since it is chosen by the designer, its moments, required for (3.41), will be available. The RVB approximation can greatly reduce the computational load needed for distributional approximation, since there is now no IVB algorithm, i.e. no iterations are required. Note, however, that, since f̄(θ2|D) is fixed, the minimum value of KLDVB achieved by the RVB-approximation (3.41) will be greater than or equal to that achieved by the VB-approximation (3.12). In this sense, RVB can be seen as a sub-optimal approximation.

The quality of the RVB approximation (3.41) strongly depends on the choice of the fixed approximating distribution f̄(θ2|D) in (3.40). If f̄(θ2|D) is chosen close to the VB-optimal posterior (3.12), i.e. f̄(θ2|D) ≈ f˜(θ2|D) (3.16), then just one step of the RVB algorithm can replace many iterations of the original IVB algorithm. In Section 3.4.3.2, we will outline one important strategy for the choice of f̄(θ2|D). First, however, we review the steps of the VB method (Section 3.3.3), specialized to the RVB approximation.

3.4.3.1 Adaptation of the VB Method for the RVB Approximation
The choice of the restricted distribution f̄(θ\1|D) can be a powerful tool, not only in reducing the number of IVB iterations, but any time the VB method yields excessively complicated results (e.g. difficult evaluation of VB-moments). Various restrictions can be imposed, such as neglecting some of the terms in (3.22) and (3.23). One such scenario for RVB approximation is described next, using the steps designed for the VB method (Section 3.3.3).

Step 1: We have assumed that the chosen Bayesian model is intractable. If the analytical form of one of the marginals is available, we can use it as the restricted marginal f̄(θ\1|D).

Step 2: We assume the Bayesian model is separable in parameters (3.21).

Step 3: Write down the VB-marginals. In this step, we replace some of the VB-marginals by a known standard form. For simplicity, we take q = 2, and

    f˜(θ2|D) = f̄(θ2|D) = f(θ2|{b}r2) .

Since the distribution is known, we can immediately evaluate its moments,

    ĥ(θ2, D) = h({b}r2) .

Step 4: Identify standard forms. Only the form of f˜(θ1|D) = f(θ1|{a}r1) needs to be identified. Its shaping parameters can be evaluated in closed form as follows:

    a(j) = a(j)(ĥ, D) , j = 1, . . . , r1 .

Note that the form of the shaping parameters is identical to the VB-solution (3.28). What have changed are the values of ĥ only.

Step 5: Note that no moments of θ1 are required, i.e. Ig = {}.

Steps 6–7: Do not arise.

Step 8: Report the shaping parameters, {a}r1, {b}r2.

Remark 3.3 (Partial Restriction of the VB Method). Consider the multiple partitioning, θ = [θ1, θ2, . . . , θq], q > 2, of the global parameter set, θ (3.9). There exists a range of choices for partial free-form approximation of f(θ|D), between the extreme cases of (full) VB-approximation (3.16) and the RVB-approximation (3.41). Specifically, we can search for an optimal approximation, minimizing KLDVB (3.8), within the class

    f˘(θ|D) = [ ∏_{i∈I} f˘(θi|D) ] f̄(∪_{i∉I} θi|D) ,    (3.42)

where I ⊆ {1, . . . , q}. When I = {1, . . . , q}, the full VB approximation is generated via the VB method (3.16). When I is a singleton, the RVB approximation (3.41) is produced in closed form, i.e. without IVB cycles. In intermediate cases, the optimized distributional approximation is

    f˜(θ|D) = [ ∏_{i∈I} f˜(θi|D) ] f̄(∪_{i∉I} θi|D) ,    (3.43)

generated via a reduced set of IVB equations, where unknown moments of ∪_{i∉I} θi in the (full) VB method have been replaced by the known moments of the fixed distribution, f̄(∪_{i∉I} θi|D). In this manner, the designer has great flexibility in trading an optimal approximation of parameter subsets for a reduced number of IVB equations. This flexibility can be important in on-line scenarios where the computational load of the IVB algorithm must be strictly controlled. While all these choices (3.42) can be seen as restricted VB-approximations, we will reserve the term RVB-approximation for the closed-form choice (3.41).

3.4.3.2 The Quasi-Bayes (QB) Approximation

The RVB solution (3.41) holds for any tractable choice of distribution, f̄(θ2|D). We seek a reasonable choice for this function, such that the minimum of KLDVB
(3.6) achieved by the RVB-approximation approaches that achieved by the VB-approximation (3.11). Hence, we rewrite the KL divergence in (3.11) as follows, using (3.10):

    KL(f˘(θ|D)||f(θ|D)) = ∫Θ∗ f˘(θ1|D) f˘(θ2|D) ln [ f˘(θ1|D) f˘(θ2|D) / ( f(θ1|θ2, D) f(θ2|D) ) ] dθ
        = ∫Θ∗ f˘(θ1|D) f˘(θ2|D) ln [ f˘(θ1|D) / f(θ1|θ2, D) ] dθ
          + ∫Θ2∗ f˘(θ2|D) ln [ f˘(θ2|D) / f(θ2|D) ] dθ2 .    (3.44)

We note that the second term in (3.44) is KL(f˘(θ2|D)||f(θ2|D)), which is minimized for the restricted assignment (see (3.40))

    f̄(θ2|D) ≡ f(θ2|D) = ∫Θ1∗ f(θ|D) dθ1 ,    (3.45)

i.e. the exact marginal distribution of the joint posterior f(θ|D) (2.6). The global minimum of (3.44) with respect to f˘(θ2|D) is not reached for this choice, since the first term in (3.44) is also dependent on f˘(θ2|D). Therefore we consider (3.45) to be the best analytical choice for the restricted assignment, f̄(θ2|D), that we can make. It is also consistent with the minimizer of KLDMR (Remark 3.2). From (3.41) and (3.45), the Quasi-Bayes (QB) approximation is therefore

    f˜(θ1|D) ∝ exp( E_f(θ2|D)[ln(f(θ, D))] ) .    (3.46)

The name Quasi-Bayes (QB) was first used in the context of finite mixture models [32], to refer to this type of approximation. In [32], the marginal for θ1 was approximated by conditioning the joint posterior distribution on θ2, which was assigned as the true posterior mean of θ2:

    θ̂2 = E_f(θ2|D)[θ2] .    (3.47)

Returning, for a moment, to the RVB approximation of Corollary 3.1, we note that iff ln f(θ1, θ2, D) is linear in θ2, then, using (3.46), the RVB approximation, f˜(θ1|D), is obtained by replacing all occurrences of θ2 by its expectation, E_f(θ2|D)[θ2]. In this case, therefore, the RVB approximation is the following conditional distribution: f˜(θ1|D) ≡ f(θ1|D, θ̂2). This corresponds to a certainty equivalence approximation for inference of θ1 (Section 3.5.1). The choice (3.45) yields (3.47) as the certainty equivalent for θ2, this being the original definition of the QB approximation in [32]. In this sense, the RVB setting for QB (3.46) is the generalization of the QB idea expressed in [32].
44
3 Off-line Distributional Approximations and the Variational Bayes Method
3.4.4 The Expectation-Maximization (EM) Algorithm The Expectation-Maximization (EM) algorithm is a well known algorithm for Maximum Likelihood (ML) estimation (2.11)—and, by extension, for MAP estimation (2.10)—of subset θ2 of the model parameters θ = [θ1 , θ2 ] [21]. Here, we follow an alternative derivation of EM via distributional approximations [22]. The task is to estimate parameter θ2 by maximization of the (intractable) marginal posterior distribution: θˆ2 = arg max f (θ2 |D) . θ2
This task can be reformulated as an optimization problem (Section 3.2.1) for the constrained distributional family: ! Fc = f (θ1 , θ2 |D) : f (θ1 , θ2 |D) = f (θ1 |D, θ2 ) δ θ2 − θˆ2 . Here, δ(·) denotes the Dirac δ-function, δ (x − x0 ) g (x) dx = g(x0 ),
(3.48)
X
if x ∈ X is a continuous variable, and the Kronecker function, 1, if x = 0 , δ (x) = 0, otherwise if x is integer. We optimize over the family Fc with respect to KLDVB (3.6) [22]. Hence, we recognize that this method of distributional approximation is a special case of the VB approximation (Theorem the functional restrictions 3.1), with ˘ ˆ ˘ f (θ1 |D) = f (θ1 |D, θ2 ) and f (θ2 |D) = δ θ2 − θ2 . The functional optimization ˘ for is trivial since, in (3.12), all moments and expectations with respect to f (θ1 |D) ˆ δ θ2 − θ2 simply result in replacement of θ2 by θˆ2 in the joint distribution. The resulting distributional algorithm is then a cyclic iteration (alternating algorithm) of two steps: Algorithm 2 (The Expectation-Maximization (EM) Algorithm). E-step: Compute the approximate marginal distribution of θ1 , at iteration i: [i−1] . f˜[i] (θ1 |D) = f θ1 |D, θˆ2
(3.49)
M-step: Use the approximate marginal distribution of θ1 from the E-step to update the certainty equivalent for θ2 : [i] ˆ f˜[i] (θ1 |D) ln f (θ1 , θ2 , D) dθ1 . (3.50) θ2 = arg max θ2
Θ1∗
In the context of uniform priors (i.e. ML estimation (Section 2.11)), it was proved in [66] that this algorithm monotonically increases the marginal likelihood, f (D|θ2 ), of θ2 , and therefore converges to a local maximum [66].
3.5 Other Deterministic Distributional Approximations
45
3.5 Other Deterministic Distributional Approximations 3.5.1 The Certainty Equivalence Approximation In many engineering problems, full distributions (2.2) are avoided. Instead, a point ˆ is used to summarize the full state of knowledge expressed by the posteestimate, θ, rior distribution (Section 2.2.2). The point estimate, θˆ = θˆ (D), can be interpreted as an extreme approximation of the posterior distribution, replacing f (θ|D) by a suitably located Dirac δ-function: ˆ f (θ|D) ≈ f˜ (θ|D) = δ θ − θ(D) , (3.51) where θˆ is the chosen point estimate of parameter θ. The approximation (3.51) is known as the certainty equivalence principle [30], and we have already encountered it in the QB (Section 3.4.3.2) and EM (Section 3.4.4) approximations. It remains to determine an optimal assignment for the point estimate. The Bayesian decision-theoretic framework for design of point estimates was mentioned in Section 2.2.2, where, also, popular choices such as the MAP, ML and mean a posteriori estimates were reviewed. 3.5.2 The Laplace Approximation This method is based on local approximation of the posterior distribution, f (θ|D) , ˆ using a Gaussian distribution. Formally, the posterior around its MAP estimate, θ, distribution (2.2) is approximated as follows: ˆ H −1 . (3.52) f (θ|D) ≈ N θ, θˆ is the MAP estimate (2.10) of θ ∈ Rp , and H ∈ Rp×p is the (negative) Hessian matrix of the logarithm of the joint distribution, f (θ, D) , with respect to θ, evaluated ˆ at θ = θ: 2 ∂ ln f (θ, D) , i, j = 1, . . . , p. (3.53) H=− ∂θi ∂θj θ=θˆ The asymptotic error of approximation was studied in [31]. 3.5.3 The Maximum Entropy (MaxEnt) Approximation The Maximum Entropy Method of distributional approximation [60,67–70] is a freeform method (Section 3.3.1), in common with the VB method, since a known distributional form is not stipulated a priori. Instead, the approximation f˜ (θ|D) ∈ Fc (Section 3.2.1) is chosen which maximizes the entropy, Hf = − ln f (θ|D)dF (θ|D), (3.54) Θ∗
46
3 Off-line Distributional Approximations and the Variational Bayes Method
constrained by any known moments (2.8) of f : mi = g i (θ) = Ef (θ|D) [gi (θ)] =
Θ∗
gi (θ) f (θ|D) dθ.
(3.55)
In the context of MaxEnt, (3.55) are known as the mean constraints. The MaxEnt distributional approximation is of the form " ˜ f (θ|D) ∝ exp − αi (D) gi (θ) , (3.56) i
where the αi are chosen—using, for example, the method of Lagrange multipliers for constrained optimization—to satisfy the mean constraints (3.55) and the normalization requirement for f˜ (θ|D) . Since its entropy (3.54) has been maximized, (3.56) may be interpreted as the smoothest (minimally informative) distribution matching the known moments of f (3.55). The MaxEnt approximation has been widely used in solving inverse problems [71, 72], notably in reconstruction of non-negative data sets, such as in Burg’s method for power spectrum estimation [5, 6] and in image reconstruction [67].
3.6 Stochastic Distributional Approximations A stochastic distributional approximation maps f to a randomly-generated approximation, f˜ (Fig. 3.1), in contrast to all the methods we have reviewed so far, where f˜ is uniquely determined by f and the rules of the approximation procedure. The computational engine for stochastic methods is therefore the generation of an independent, identically-distributed (i.i.d.) sample set (i.e. a random sample), θ(i) ∼ f (θ|D), ! {θ}n = θ(1) , . . . , θ(n) .
(3.57) (3.58)
The classical stochastic distributional approximation is the empirical distribution [61], n 1" ˜ δ(θ − θ(i) ), (3.59) f (θ|D) = n i=1 where δ(·) is the Dirac δ-function (3.48) located at θ(i) . The posterior moments (2.8) of f (θ|D) under the empirical approximation (3.59) are therefore 1 " (i) gj θ . n i=1 n
Ef˜(θ|D) [gj (θ)] =
(3.60)
Note that marginal distributions and measures are also generated with ease under approximation (3.59) via appropriate summations.
3.6 Stochastic Distributional Approximations
47
For low-dimensional θ, it may be possible to generate the i.i.d. set{θ}n using one of a vast range of standard stochastic sampling methods [34, 35]. The real challenge being addressed by modern stochastic sampling techniques is to generate a representative random sample (3.58) for difficult—notably high-dimensional— distributions. Markov-Chain Monte Carlo (MCMC) methods [73,74] refer to a class of stochastic sampling algorithms that generate a correlated sequence $ of samples,
# θ(0) , θ(1) , θ(2) , . . . , θ(k) , . . . , from a first-order (Markov) kernel, f θ(k) |θ(k−1) , D . For mild regularity conditions on f (·|·), then θ(k) ∼ fs (θ|D) as k → ∞, where fs (θ|D) is the stationary distribution of the Markov process with this kernel.. This convergence in distribution is independent of the initialization, θ(0) , of the Markov chain [75]. Careful choice of the kernel can ensure that fs (θ|D) = f (θ|D). Hence, repeated simulation from the Markov process, i = 1, 2, . . . , with n sufficiently large, generates the required random sample (3.58) for construction of the empirical approximation (3.59). Typically, the associated computational burden is large, and can be prohibitive in cases of high-dimensional θ. Nevertheless, the very general and flexible way in which MCMC methods have been defined has helped to establish them as the golden standard for (Bayesian) distributional approximation. In the online scenario (i.e. Bayesian filtering, see Chapter 6), sequential Monte Carlo techniques, such as particle filtering [74], have been developed for recursive updating of the empirical distribution (3.59) via MCMC-based sampling. 3.6.1 Distributional Estimation The problem of distributional approximation is closely related to that of distributional (e.g. density) estimation. In approximation—which is our concern in this book—the parametric distribution, f (θ|D), is known a priori, as emphasized in Fig. 3.1. This means that a known (i.e. deterministic) observation model, f (D|θ) (2.2), parametrized by a finite set of parameters, θ, forms part of our prior knowledge base, I (Section 2.2.1). In contrast, nonparametric inference [61] addresses the more general problem of an unknown distribution, f (D), on the space, D, of observations, D. Bayesian nonparametrics proceeds by distributing unknown f via a nonparametric prior: f ∼ F0 [76, 77]. The distribution is learned via i.i.d. sampling (3.58) from D, which in the Bayesian context yields a nonparametric posterior distribution, f | {D}n ∼ Fn . An appropriate distributional estimate, fˆ(D), can then be generated. Once again, it is the empirical distribution, fˆ = f˜ (3.59), which is the basic nonparametric density estimator [61]. In fact, this estimate is formally justified as the posterior expected distribution under i.i.d. sampling, if F0 = D0 , where D0 is the nonparametric Dirichlet process prior [76]. Hence, the stochastic sampling techniques reviewed above may validly be interpreted as distributional estimation techniques. The MaxEnt method (Section 3.5.3) also has a rôle to play in distributional estimation. Given a set of sample moments, Ef˜(D) [gj (D)] , where f˜(D) is the empirical distribution (3.59) built from then the MaxEnt distributional %i.i.d. samples, Di ∈ D, estimate is fˆ(D) ∝ exp − j αj ({D}n ) gj (D) (3.56).
48
3 Off-line Distributional Approximations and the Variational Bayes Method
For completeness, we note that the VB method (Section 3.3) is a strictly parametric distributional approximation technique, and has no rôle to play in distributional estimation.
3.7 Example: Scalar Multiplicative Decomposition In previous Sections, we reviewed several distributional approximations and formulated the VB method. In this Section, we study these approximations for a simple model. The main emphasis is, naturally, on the VB method. The properties of the VB approximation will be compared to those of competing techniques. 3.7.1 Classical Modelling We consider the following scalar model: d = ax + e.
(3.61)
Model (3.61) is over-parameterized, with three unknown parameters (a, x, e) explaining just one measurement, d. (3.61) expresses any additive-multiplicative decomposition of a real number. Separation of the ‘signal’, ax, from the ‘noise’, e, is not possible without further information. In other words, the model must be regularized. Towards this end, let us assume that e is distributed as N (0, re ). Then, f (d|a, x, re ) = N (ax, re ) ,
(3.62)
where variance re is assumed to be known. The likelihood function for this model, for d = 1 and re = 1, is displayed in the upper row of Fig. 3.6, in both surface plot (left) and contour plot (right) forms. The ML solution (2.11) is located anywhere in the manifold defined by the signal estimate: a x = d. (3.63) This indeterminacy with respect to a and x will be known as scaling ambiguity, and will be encountered again in matrix decompositions in Chapters 4 and 5. Further regularization is clearly required. 3.7.2 The Bayesian Formulation It can also be appreciated from Fig. 3.6 (upper-left), that the volume (i.e. integral) under the likelihood function is infinite. This means that f (a, x|d, re ) ∝ f (d|a, x, re ) f (a, x) is improper (unnormalizable) when the parameter prior f (a, x) is itself improper (e.g. uniform in R2 ). Prior-based regularization is clearly required in order to achieve a proper posterior distribution via Bayes’ rule. Under the assignment,
3.7 Example: Scalar Multiplicative Decomposition
f (a|ra ) = N (0, ra ) , f (x|rx ) = N (0, rx ) ,
49
(3.64) (3.65)
the posterior distribution is
2
1 (ax − d) 1 a2 1 x2 − − f (a, x|d, re , ra , rx ) ∝ exp − 2 re 2 ra 2 rx
.
(3.66)
In what follows, we will generally suppress the notational dependence on the known prior parameters. (3.66) is displayed in the lower row of Fig. 3.6, for d = 1, re = 1, ra = 10, rx = 20. The posterior distribution (3.66) is now normalizable (proper), with point maximizers (MAP estimates) as follows: 1. For d >
√ re , ra rx
then & 1 rx re 2 x ˆ=± d − , ra ra & 1 ra re 2 a ˆ=± d − . rx rx
(3.67) (3.68)
Note that the product of the maxima is a ˆx ˆ=d− √
re . ra rx
(3.69)
Comparing (3.69) to (3.63), we see that the signal estimate has been shifted towards the coordinate origin. For the choice ra re and rx re , the prior strongly influences the posterior and is therefore said to be an informative prior (Section2.2.3). For the choice, ra re and rx re , the prior has negligible influence on the posterior and can be considered as non-informative. Scaling ambiguity (3.63) has been reduced, now, to a sign ambiguity, characteristic of multiplicative decompositions (Chapter 5). 2. For d ≤ √rraerx , then x ˆ=a ˆ = 0. Clearly, then, the quantity d˜MAP = √rraerx constitutes an important inferential breakpoint. For d > d˜MAP , a non-zero signal is inferred, while for d ≤ d˜MAP , the observation is inferred to be purely noise. 3.7.3 Full Bayesian Solution The posterior distribution (3.66) is normalizable, but the normalizing constant cannot be expressed in closed form. Integration of (3.66) over x ∈ R yields the following marginal distribution for a:
50
3 Off-line Distributional Approximations and the Variational Bayes Method Probability surface
Contour plot
a
5
unregularized
0
-5 -5
0
5
x
a
5
regularized (via prior)
0
-5 -5
0
5
x
Fig. 3.6. Illustration of scaling ambiguity in the scalar multiplicative decomposition. Upper row: the likelihood function, f (d|a, x, re ), for d = 1, re = 1 (dash-dotted line denotes manifold of maxima). Lower row: posterior distribution, f (a, x|d, re , ra , rx ), for d = 1 and prior parameters re = 1, ra = 10, rx = 20. Cross-marks denote maxima.
− 12 1 d2 ra + a4 rx + a2 2 2 1 exp − π ra a rx + 1 . 2 2 2 ra (a rx + 1) (3.70) The normalizing constant, ζa , for (3.70) is not available in closed form. Structural symmetry with respect to a and x in (3.61) implies that the marginal inference for x has the same form as (3.70). The maximum of the posterior marginal (3.70) is reached for ⎧ 1 √ + 2 2 ⎪ re ⎨± −ra rx −2re + ra rx (ra rx +4d2 ) ˜ if d > 2rx ra rx + re = dm , (3.71) a ˆ= + ⎪ ⎩ r2 0 if d ≤ ra erx + re . f (a|d, re , ra , rx ) ∝
The same symbol, a ˆ, is used to denote the (distinct) joint (3.68) and marginal (3.71) MAP estimates. No confusion will be encountered. Both cases of (3.71) are illustrated in Fig. 3.7, for d = 1 (left) and d = 2 (right). The curves were normalized by numerical integration. Once again, there remains a sign ambiguity in the estimate of a.
5
5
0
0
-5 -5
a
a
3.7 Example: Scalar Multiplicative Decomposition
0 x
5
-5 -5
0 x
51
5
Fig. 3.7. Analytical marginals (for the distribution in Fig. 3.6). re = 1, ra = 10, rx = 20, for which case the inferential breakpoint is d˜m = 1.0025 (3.71). Both modes of solution are displayed: d = 1 < d˜m (left), d = 2 > d˜m (right).
The unavailability of ζa in closed form means that maximization (3.71) is the only operation that can be carried out analytically on the posterior marginal (3.70). Most importantly, the moments of the posterior must be evaluated using numerical methods. In this sense, (3.70) is intractable. Hence, we now seek its approximation using the VB method of Section 3.3.3. Remark 3.4 (Multivariate extension). Extension of the model (3.61) to the multivariate case yields the model known as the factor analysis model (Chapter 5). The full Bayesian solution presented in this Section can, indeed, be extended to this multivariate case [78]. The multivariate posterior distributions suffer the same difficulties as those of the scalar decomposition. Specifically, normalizing constants of the marginal posteriors and, consequently, their moments must be evaluated using numerical methods, such as MCMC approximations (Section 3.6) [79]. We will study the VBapproximation of related matrix decompositions in Chapters 4 and 5. 3.7.4 The Variational Bayes (VB) Approximation It this Section, we follow the VB method of Section 3.3.3, in order to obtain approximate inferences of the parameters a and x. Step 1: The joint distribution is already available in (3.66). Step 2: Since there are only two parameters in (3.66), the only available partitioning is θ1 = a, θ2 = x. All terms in the exponent of (3.66) are a linear combination of a and x. Hence, (3.66) is in the form (3.21), and the VB-approximation will be straightforward. Step 3: Application of the VB theorem yields 1 2 2 2 ˜ xdra − d ra f (a|d) ∝ exp − ra x + re a − 2a , 2re ra 1 2 rx a + re x2 − 2 axdrx − d2 rx . f˜ (x|d) ∝ exp − 2re rx
52
3 Off-line Distributional Approximations and the Variational Bayes Method
Step 4: The distributions in step 3 are readily recognized to be Normal: f˜ (a|d) = N (µa , φa ) , f˜ (x|d) = N (µx , φx ) .
(3.72)
Here, the shaping parameters, µa , φa , µx and φx , are assigned as follows: −1 x , µa = re−1 d re−1 a2 + ra−1
(3.73)
−1 2 + r−1 µx = re−1 d re−1 x a, x
(3.74)
−1 φa = re−1 a2 + ra−1 ,
(3.75)
−1 2 + r−1 φx = re−1 x . x
(3.76)
Step 5: The necessary moments of the Normal distributions, required in step 4, are readily available: a = µa ,
a2 = φa + µ2a ,
(3.77)
x = µx ,
2 = φx + µ2 . x x
(3.78)
From steps 4 and 5, the VB-approximation is determined by a set of eight equations in eight unknowns. Step 6: In this example, the set of equations can be reduced to one cubic equation, whose three roots are expressed in closed form as functions of the shaping parameters from Step 4, as follows: 1. zero-signal inference: µa = 0,
(3.79)
µx = 0, &
4ra rx 1+ −1 , re & re 4ra rx 1+ −1 . φx = 2ra re re φa = 2rx
2. and 3. non-zero signal inference:
(3.80)
3.7 Example: Scalar Multiplicative Decomposition
µa
µx φa φx
12
√ ra rx − dre =± , drx 12
√ d2 − re ra rx − dre =± , dra &
re ra = sgn d2 − re . d rx &
re rx = sgn d2 − re . d ra d2 − re
53
(3.81)
(3.82) (3.83) (3.84)
Here, sgn (·) returns the sign of the argument. The remaining task is to determine which of these roots is the true minimizer of the KL divergence (3.11). Roots 2. and 3. will be treated as one option, since the value of the KL divergence is clearly identical for both of them. From (3.83), we note that √ solutions 2. and 3. are non-complex if d > re . However, (3.81) collapses to µx = 0 (i.e. to the zero-signal inference) for , √ 1 re + re (re + 4ra rx ) √ re ≈ re + √ > re , (3.85) d = d˜VB = √ 2 2 ra rx ra rx and has complex values for d < d˜VB . Hence, (3.85) denotes the VB-based breakpoint. For d > d˜VB , a non-zero signal is inferred (cases 2. and 3.), while for d ≤ d˜VB the observation is considered to be purely noise. For improved insight, Fig. 3.8 demonstrates graphically how the modes of the VB-approximation arise. d < d˜VB
KLDVB
KLDVB
d > d˜VB
0
f˘(a, x|d)
f˜(a, x|d)
0
f˘(a, x|d)
Fig. 3.8. The notional shape of the KL divergence (KLDVB ) as a function of observed data. The two modes are illustrated.
Step 7: As an alternative to the analytical solution above, the IVB algorithm may be adopted (Algorithm 1). In this case, the equations in Steps 4 and 5 are evaluated in the order suggested by (3.32)–(3.35). The trajectory of the iterations for the posterior VB-means, µa (3.73) and µx (3.74), are shown in Fig. 3.9, for d = 2 (left) and
54
3 Off-line Distributional Approximations and the Variational Bayes Method
d = 1 (right). For the chosen priors, the inferential breakpoint (3.85) is at d˜VB = 1.0025. Hence, these two cases demonstrate the two distinct modes of solution of the equations (3.73)–(3.76). Being a gradient descent algorithm, the IVB algorithm has no difficulty with multimodality of the analytical solution (Fig. 3.8), since the remaining modes are irrational (for d < d˜VB ) or are local maxima (for d > d˜VB ), as discussed in Step 6. above. For this reason, the IVB algorithm converges to the global minimizer of KLDVB (3.6) independently of the initial conditions. Step 8: Ultimately, the VB-approximation, f˜(a, x|d), is given by the product of the VB-marginals in (3.72).
2
2.5 2
1 a
a
1.5 1 0.5
0 -1
0 -2
-0 5 0
1
2 x
-2
-1
0 x
1
2
Fig. 3.9. VB-approximation of the posterior distribution for the scalar multiplicative decomposition, using the IVB algorithm. The dashed line denotes the initial VB-approximation; the full line denotes the converged VB-approximation; the trajectory of the posterior VB-means, µa (3.73) and µx (3.74), is also illustrated. The prior parameters are re = 1, ra = 10, rx = 10. Left: (non-zero-signal mode) d = 2. Right: (zero-signal-mode) d = 1.
3.7.5 Comparison with Other Techniques The VB-approximation can be compared to other deterministic methods considered in this Chapter: KLDMR under the assumption of conditional independence (Section 3.4.1) is minimized for the product of the true marginal posterior distributions (3.70). Since these are not tractable, their evaluation must be undertaken numerically, as was the case in Fig. 3.7. The results are illustrated in Fig. 3.10 (left). QB approximation for the model cannot be undertaken because of intractability of the true marginal distribution (3.70). Laplace approximation (Section 3.5.2) is applied at the MAP estimate, (3.67) and (3.68). The result is displayed in Fig. 3.10 (middle). Unlike the VB-approximation and the KLDMR -based method of Section 3.4.1, the Laplace approximation
3.7 Example: Scalar Multiplicative Decomposition
55
5
5
0
0
0
-5 -5
0 x
5
-5 -5
a
5
a
a
does model cross-correlation between variables. However, it is dependent on the MAP estimates, and so its inferential break-point, d˜MAP (3.7.2), is the same as that for MAP estimation. VB-approximation is illustrated in Fig. 3.10 (right). This result illustrates a key consequence of the VB approximation, namely absence of any cross-correlation between variables, owing to the conditional independence assumption which underlies the approximation (3.10).
0 x
5
-5 -5
0 x
5
Fig. 3.10. Comparison of approximation techniques for the scalar multiplicative decomposition (3.66). re = 1, ra = 10, rx = 20, d = 2. Left: KLDMR -based approximation. Centre: the Laplace approximation. Right: the VB-approximation, which infers a non-zero signal (3.81), since d > d˜VB . In the last two cases, the ellipse corresponding to the 2-standarddeviation boundary of the approximating Normal distribution is illustrated.
These results suggest the following: • The prior distribution is indispensable in regularizing the model, and in ensuring that finite VB-moments are generated. With uniform priors, i.e. ra → ∞ and rx → ∞, none of the derived solutions is valid. • From (3.81) and, (3.82), the ratio of the posterior expected values, a/ x, is fixed by the priors at ra /rx . This is a direct consequence of the scaling ambiguity of the model: the observed data do not bring any information about the ratio of the mean values of a and x. Hence, the scale of these parameters is fixed by the prior. ˜ The inferential breakpoint, d—i.e. the value of d above which a non-zero signal is inferred—is different for each approximation. For the ML and MAP approaches, the inferred signal is non-zero even for very small data: d˜ML = 0, and d˜MAP = 0.07 for the chosen priors. After exact marginalization, the signal is inferred to be non√ zero only if the observed data are greater than d˜m = re = 1 for uniform priors, or d˜m = 1.0025 for the priors used in Fig. 3.7. The VB-approximation infers a non-zero signal only above d˜VB = 1.036.
56
3 Off-line Distributional Approximations and the Variational Bayes Method
3.8 Conclusion The VB-approximation is a deterministic, free-form distributional approximation. It replaces the true distribution with one for which correlation between the partitioned parameters is suppressed. An immediate convenience of the approximation is that its output is naturally in the form of marginals and key moments, answering many of the questions which inspire the use of approximations for intractable distributions in the first place. While an analytical solution may be found in special cases, the general approach to finding the VB-approximation is to iterate the steps of the IVB algorithm. The IVB algorithm may be seen as a Bayesian counterpart of the classical EM algorithm, generating distributions rather than simply point estimates. An advantage of the IVB approach to finding the VB-approximation is that the solution is guaranteed to be a local minimizer of KLDVB . In this Chapter, we systematized the procedure for VB-approximation into the 8step VB method, which will be followed in all the later Chapters. It will prove to be a powerful generic tool for distributional approximation in many signal processing contexts.
4 Principal Component Analysis and Matrix Decompositions
Principal Component Analysis (PCA) is one of the classical data analysis tools for dimensionality reduction. It is used in many application areas, including data compression, denoising, pattern recognition, shape analysis and spectral analysis. For an overview of its use, see [80]. Typical applications in signal processing include spectral analysis [81] and image compression [82]. PCA was originally developed from a geometric perspective [83]. It can also be derived from the additive decomposition of a matrix of data, D, into a low-rank matrix, M(r) , of rank r, representing a ‘signal’ of interest, and noise, E: D = M(r) + E.
(4.1)
If E has a Gaussian distribution, and M(r) is decomposed into a product of two lower-dimensional matrices—i.e. M(r) = AX —then Maximum Likelihood (ML) estimation of M(r) gives the same results as PCA [84,85]. This is known as the Probabilistic PCA (PPCA) model. In this Chapter, we introduce an alternative model, parameterizing M(r) in terms of the Singular Value Decomposition (SVD): i.e. M(r) = ALX . We will call this the Orthogonal PPCA model (see Fig. 4.1). The ML estimation of M(r) again gives the same results as PCA. We will study the Bayesian inference of the parameters of both of these models, and find that the required integrations are intractable. Hence, we will use the VB-approximation of Section 3.3 to achieve tractable Bayesian inference. Three algorithms will emerge: (i) Variational PCA (VPCA), (ii) Fast Variational PCA (FVPCA), and (iii) Orthogonal Variational PCA (OVPCA). (i) and (ii) will be used for inference of PPCA parameters. (iii) will be used for inference of the Orthogonal PPCA model. The layout of the Chapter is summarized in Fig. 4.1. The Bayesian methodology allows us to address important tasks that are not successfully addressed by the ML solution (i.e. PCA). These are: Uncertainty bounds: PCA provides point estimates of parameters. Since the results of Bayesian inference are probability distributions, uncertainty bounds on the inferred parameters can easily be derived.
58
4 Principal Component Analysis and Matrix Decompositions Matrix decomposition D = M(r) + E
PPCA model M(r) = AX
Orthogonal PPCA model M(r) = ALX
VPCA
FVPCA
OVPCA
algorithm
algorithm
algorithm
Fig. 4.1. Models and algorithms for Principal Component Analysis.
Inference of rank: in PCA, the rank, r, of the matrix M(r) must be known a priori. Only ad hoc and asymptotic results are available for guidance in choosing this important parameter. In the Bayesian paradigm, we treat unknown r as a random variable, and we derive its marginal posterior distribution.
4.1 Probabilistic Principal Component Analysis (PPCA) The PPCA observation model [85] is D = AX + E,
(4.2)
as discussed above. Here, D ∈ Rp×n are the observed data, A ∈ Rp×r and X ∈ Rn×r are unknown parameters, and E ∈ Rp×n is independent, identicallydistributed (i.i.d.) Gaussian noise with unknown but common variance, ω −1 : f (E|ω) =
p n
Nei,j 0, ω −1 .
(4.3)
i=1 j=1
ω is known as the precision parameter, and has the meaning of inverse variance: var (ei,j ) = ω −1 . In this Chapter, we will make use of the matrix Normal distribution (Appendix A.2):
(4.4) f (E|ω) = NE 0p,n , ω −1 Ip ⊗ In . This is identical to (4.3). The model (4.2) and (4.4) can be written as follows:
f (D|A, X, ω) = N AX , ω −1 Ip ⊗ In .
(4.5)
4.1 Probabilistic Principal Component Analysis (PPCA)
59
Note that (4.5) can be seen as a Normal distribution with low-rank mean value, M(r) = AX ,
(4.6)
where r is the rank of M(r) , and we assume that r < min (p, n). Matrices A and X are assumed to have full rank; i.e. rank (X) = rank (A) = r. The original PPCA model of [85] contains an extra parameter, µ, modelling a common mean value for the columns, mi , of M(r) . This parameter can be seen as an extra column in A if we augment X by a column 1n,1 . Hence, it is a restriction of X. In this work, we do not impose this restriction of a common mean value; i.e. we assume that the common mean value is µ = 0p,1 . In the sequel, we will often invoke the theorem that any matrix can be decomposed into singular vectors and singular values [86]. We will use the following form of this theorem. Definition 4.1. The Singular Value Decomposition (SVD) of matrix D ∈ Rp×n is defined as follows: (4.7) D = UD LD VD , UD = where UD ∈ Rp×p and VD ∈ Rn×n are orthonormal matrices, such that UD p×n min(p,n) is a matrix with diagonal elements lD ∈ R Ip , VD VD = In and LD ∈ R and zeros elsewhere. The columns of UD and VD are known as the left- and rightsingular vectors, respectively. The elements of lD are known as the singular values. Unless stated otherwise, we will assume (without loss of generality) that r < p ≤ n. This assumption allows us to use the economic SVD, which uses only p right singular vectors of VD . Therefore, in the sequel, we will assume that LD ∈ Rp×p and VD ∈ Rn×p .
Since LD is a square diagonal matrix, we will often work with its diagonal elements only. In general, diagonal elements of a diagonal matrix will be denoted by the same lower-case letter as the original matrix. In particular, we use the notation, LD = diag (lD ) , lD = diag−1 (LD ) , where lD ∈ Rp . 4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model The ML estimates of the parameters of (4.5)—i.e. M(r) and ω—are defined as follows:
pn 1 ˆ 2 M(r) , ω ˆ = arg max ω exp − ωtr D − M(r) D − M(r) . (4.8) M(r) ,ω 2 Here, tr (·) denotes the trace of the matrix. Note that (4.8) is conditioned on a known rank, r.
60
4 Principal Component Analysis and Matrix Decompositions
Using the SVD of the data matrix, D (4.7), the maximum of (4.8) is reached for pn ˆ (r) = UD;r LD;r,r VD;r , ω ˆ = %p M
2 . i=r+1 li,D
(4.9)
Here, UD;r and VD;r are the first r columns of the matrices, UD and VD , respectively, and LD;r,r is the r × r upper-left sub-block of matrix LD . Remark 4.1 (Rotational ambiguity). The ML estimates of A and X in (4.5), using (4.9), are not unique because (4.5) exhibits a multiplicative degeneracy; i.e. ˜ , ˆ = A˜X ˆ (r) = AˆX ˆ = AT ˆ M T −1 X (4.10) for any invertible matrix, T ∈ Rr×r . This is known as rotational ambiguity in the factor analysis literature [84]. Remark 4.2 (Relation of the ML estimate to PCA). Principal Component Analysis (PCA) is concerned with projections of p-dimensional vectors dj , j = 1, . . . , n, into an r-dimensional subspace. Optimality of the projection was studied from both a maximum variation [87], and least squares [83], point-of-view. In both cases, the optimal projection was found via the eigendecomposition of the sample covariance matrix:
S=
1 DD = U ΛU . n−1
(4.11)
Here, Λ = diag (λ), with λ = [λ1 , . . . , λp ] , is a matrix of eigenvalues of S, and U is the matrix of associated eigenvectors. The columns uj , j = 1, . . . , r, of U , corresponding to the largest eigenvalues, λ1 > λ2 . . . > λr , form a basis for the optimal projection sub-space. These results are related to the ML solution (4.9), as follows. From (4.7), = UD LD LD UD . DD = UD LD VD VD LD UD
(4.12)
ˆ as follows: The ML estimate (4.9) can be decomposed into Aˆ and X, ˆ = LD;r,r V . Aˆ = UD;r , X D;r
(4.13)
Hence, comparing (4.11) with (4.12), and using (4.13), the following equalities emerge: 1 1 Aˆ = UD;r = U;r , LD = (n − 1) 2 Λ 2 . (4.14) Equalities (4.14) formalize the relation between PCA and the ML estimation of the PPCA model. Remark 4.3 (Ad hoc choice of rank r). The rank, r, has been assumed to be known a priori. If this is not the case, the ML solution—and, therefore, PCA—fails since the likelihood function in (4.8) increases with r in general. This is a typical problem
4.1 Probabilistic Principal Component Analysis (PPCA)
61
with classical estimation, since no Ockham-based regularization (i.e. penalization of complexity) is available in this non-measure-based approach [3]. Many heuristic methods for selection of rank do, however, exist [80]. One is based on the asymptotic properties of the noise E (4.4). Specifically, from (4.5), + nω −1 Ip . Ef (D|M(r) ,ω) [DD ] = M(r) M(r)
(4.15)
Using the SVD (Definition 4.1), M(r) M(r) = UM L2M UM . Noting the equality, UM UM = Ip , then, from (4.15) and (4.12), = UM L2M UM + nω −1 UM UM . lim UD L2D UD
n→∞
It follows that limn→∞ UD = UM , and that l2 + nω −1 2 lim li,D = i,M−1 n→∞ nω
i ≤ r, i > r.
(4.16)
(4.17)
Hence, the index, r, for which the singular values, limn→∞ li,D , i > r, are constant is considered to be an estimate of the rank r. In finite samples, however, (4.17) is only approximately true. Therefore, the estimate can be chosen by visual examination of the graphed singular values [80], looking for the characteristic ‘knee’ in the graph. 4.1.2 Marginal Likelihood Inference of A An alternative inference of the parameters of the PPCA model (4.5) complements (4.5) with a Gaussian prior on X, f (X) = NX (0n,r , In ⊗ Ir ) ,
(4.18)
and uses this to marginalize over X in the likelihood function (4.5) [84, 85]. The resulting maximum of the marginal likelihood, conditioned by r, is then reached for ω ˆ given by (4.9), and for
(4.19) Aˆ = UD;r L2D;r,r − ω ˆ −1 Ir R. Here, UD;r and LD;r,r are given by (4.7), and R ∈ Rr×r is any orthogonal (i.e. rotation) matrix. In this case, indeterminacy of the model is reduced from an arbitrary invertible matrix, T (4.10), to an orthogonal matrix, R. This reduction is a direct consequence of the restriction imposed on the model by the prior on X (4.18). 4.1.3 Exact Bayesian Analysis Bayesian inference for the PPCA model can be found in [78]. Note that the special case of the PPCA model (4.5) for p = n = r = 1 was studied in Section 3.7. We found that the inference procedure was not tractable. Specifically, the marginals of a and x (scalar versions of (4.6)) could not be normalized analytically, and so moments were unavailable. The same is true of the marginal inferences of A and X derived in [78]. Their numerical evaluation was accomplished using Gibbs sampling in [79, 88].
62
4 Principal Component Analysis and Matrix Decompositions
4.1.4 The Laplace Approximation Estimation of the rank of the PPCA model via the Laplace approximation (Section 3.5.2) was published in [89]. There, the parameter A (4.5) was restricted by the orthogonality constraint, A A = Ir . The parameter X was integrated out, as in Section 4.1.2 above.
4.2 The Variational Bayes (VB) Method for the PPCA Model The VB-approximation for the PPCA model (4.5) was introduced in [24]. Here, we use the VB method to obtain the necessary approximation (Section 3.3.3) and some interesting variants. Note that this development extends the scalar decomposition example of Section 3.7 to the matrix case. Step 1: Choose a Bayesian Model The observation model (4.5) is complemented by the following priors:
f (A|Υ ) = NA 0p,r , Ip ⊗ Υ −1 ,
(4.20)
Υ = diag (υ) , υ = [υ1 , . . . , υr ] , r
f (υ|α0 , β0 ) = Gυi (αi,0 , βi,0 ) ,
(4.21)
i=1
f (X) = NX (0n,r , In ⊗ Ir ) , f (ω|ϑ0 , ρ0 ) = Gω (ϑ0 , ρ0 ) .
(4.22) (4.23)
In (4.21), α0 = [α1,0 , . . . , αr,0 ] and β0 = [β1,0 , . . . , βr,0 ] . Here, Υ ∈ Rr×r is a diagonal matrix of hyper-parameters, υi , distributed as (4.21). The remaining prior parameters, ϑ0 and ρ0 , are known scalar parameters. Complementing (4.5) with (4.20)– (4.23), the joint distribution is f (D, A, X, ω, Υ |α0 , β0 , ϑ0 , ρ0 , r)
= ND AX, ω −1 Ip ⊗ In NA 0p,r , Ip ⊗ Υ −1 NX (0n,r , In ⊗ Ir ) × r
× Gω (ϑ0 , ρ0 ) Gυi (αi,0 , βi,0 ) . (4.24) i=1
In the sequel, the conditioning on α0 , β0 , ϑ0 , ρ0 will be dropped for convenience. Step 2: Partition the parameters The logarithm of the joint distribution is as follows:
4.2 The Variational Bayes (VB) Method for the PPCA Model
63
1 pn ln ω − ωtr (D − AX ) (D − AX ) + 2 2 r r " " p 1 1 + ln υi − tr (Υ A A) − tr (XX ) + (α0 − 1) ln υi + 2 i=1 2 2 i=1
ln f (D, A, X, ω, Υ |r) =
−
r "
β0 υi + (ϑ0 − 1) ln ω − ρ0 ω + γ. (4.25)
i=1
Here, γ gathers together all terms that do not depend on A, X, ω or Υ . We partition the model (4.24) as follows: θ1 = A, θ2 = X, θ3 = ω and θ4 = Υ . Hence, (4.25) is a member of the separable-in-parameters family (3.21). The detailed assignment of functions g (·) and h (·) is omitted for brevity. Step 3: Write down the VB-marginals Any minimizer of KLDVB must have the following form (Theorem 3.1):
1 X A − 1 tr AΥ A D − 1 tr −A ω tr −2AX , f˜ (A|D, r) ∝ exp − ω X 2 2 2
1 A X − 1 tr X X D − 1 tr −X ω A tr −2AX , f˜ (X|D, r) ∝ exp − ω 2
f˜ (υ|D, r) ∝ exp
2
r "
f˜ (ω|D, r) ∝ exp
i=1
2
p 1 + αi,0 − 1 ln υi − βi,0 + ai ai υi 2 2
,
pn + ϑ0 − 1 ln ω + 2
1 AX X X D − DX A + 1 tr A −ω ρ0 + tr DD − A 2 2
.
Step 4: Identify standard forms The VB-marginals from the previous step are recognized to have the following forms: f˜ (A|D, r) = NA (µA , Ip ⊗ ΣA ) , f˜ (X|D, r) = NX (µX , In ⊗ ΣX ) , r
f˜ (υ|D, r) = Gυi (αi , βi ) ,
(4.26) (4.27) (4.28)
i=1
f˜ (ω|D, r) = Gω (ϑ, ρ) . The associated shaping parameters are as follows:
(4.29)
64
4 Principal Component Analysis and Matrix Decompositions
−1 X + Υ ω µA = ω DX X , −1 X + Υ X , ΣA = ω −1 A + I ω A D A , µX = ω r −1 A + I A , ΣX = ω r p α = α0 + 1r,1 , 2 1 A , β = β0 + diag A 2 np , ϑ = ϑ0 + 2 1 AX X . X D − DX A + 1 tr A ρ = ρ0 + tr DD − A 2 2
(4.30) (4.31) (4.32) (4.33) (4.34) (4.35) (4.36) (4.37)
Step 5: Formulate necessary VB-moments Using standard results for the matrix Normal and Gamma distributions (Appendices A.2 and A.5, respectively), the necessary moments of distributions (4.26)–(4.29) can be expressed as functions of their shaping parameters, (4.30)–(4.37), as follows: = µA , A A = pΣ + µ µ , A A A A X = µX , X = nΣ + µ µ , X X X X αi , i = 1, . . . , r, υ i = βi ϑ ω = . ρ
(4.38)
(4.39) (4.40)
(4.39) can be written in vector form as follows: υ = α ◦ β −1 . Here, ‘◦’ denotes the Hadamard product (the ‘.∗’ operator in MATLAB: see Notational Conventions, Page XVI) and β −1 ≡ β1−1 , . . . , βr−1 . This notation will be useful in the sequel. Step 6: Reduce the VB-equations As mentioned in Section (3.3.3), the VB-equations, (4.30)–(4.40), can always be expressed in terms of moments or shaping parameters only. Then, the IVB algorithm
4.2 The Variational Bayes (VB) Method for the PPCA Model
65
can be used to find the solution. This approach was used in [24]. Next, we show that reduction of the VB-equations can be achieved using re-parameterization of the model, resulting in far fewer operations per IVB cycle. The IVB algorithm used to solve the VB-equations is a gradient search method in the multidimensional space of the shaping parameters and moments above. If we can identify a lower-dimensional subspace in which the solution exists, the iterations can be performed in this subspace. The challenge is to define such a subspace. Recall that both the ML solution (4.8) and the marginal ML solution (4.19) of the PPCA model (4.5) have the form of scaled singular vectors, UD;r , of the data matrix, D (4.7). The scaling coefficients for these singular vectors are different for each method of inference. Intuitively, therefore, it makes sense to locate our VBapproximate distributions in the space of scaled singular vectors. In other words, we should search for a solution in the space of scaling coefficients of UD . This idea is formalized in the following conjecture. Conjecture 4.1 (Soft orthogonality constraints). The prior distributions on A and X—i.e. (4.20) and (4.22)—were chosen with diagonal covariance matrices. In other words, the expected values of A A and X X are diagonal. Hence, this choice favours those matrices, A and X, which are orthogonal. We conjecture that the VB-approximation (3.12) converges to posterior distributions with orthogonal mean value, even if the IVB algorithm was initialized with non-orthogonal matrices. If this Conjecture is true, then it suffices to search for a VB-approximation only in the space of orthogonal mean values and diagonal covariance matrices. Validity of the conjecture will be tested in simulations. Proposition 4.1 (Orthogonal solution of the VB-equations). Consider a special case of distributions (4.26) and (4.27) for matrices A and X respectively, in which we restrict the first and second moments as follows: µA = UD;r KA ,
(4.41)
µX = VD;r KX ,
(4.42)
KA = diag (kA ) , KX = diag (kX ) , ΣA = diag (σA ) ,
(4.43)
ΣX = diag (σX ) .
(4.44)
The first moments, µA (4.41) and µX (4.42), are formed from scaled singular vectors of the data matrix, D (4.7), multiplied by diagonal proportionality matrices, KA ∈ Rr×r and KX ∈ Rr×r . The second moments, (4.43) and (4.44), are restricted to have a diagonal form. Then the VB-marginals, (4.26)–(4.29), are fully determined by the following set of equations:
66
4 Principal Component Analysis and Matrix Decompositions
kA = ω lD;r ◦ kX ◦ σA ,
−1 nσX + ω kX ◦ kX + α ◦ β −1 , σA = ω
(4.45)
kX = ω σX ◦ kA ◦ lD;r , −1
σX = ( ω pσA + ω kA ◦ kA + 11,r ) , p αi = α0 + , i = 1, . . . , r, 2
1 2 βi = β0 + pσi,A + ki,A , i = 1, . . . , r, 2 np , ϑ = ϑ0 + 2
1 ρ = ρ0 + (lD − kA ◦ kX ) (lD − kA ◦ kX ) + 2 1 (kX ◦ kX ) + pnσA σX + nσX (kA ◦ kA )) , + (pσA 2 ϑ ω = . ρ
(4.46)
(4.47) (4.48)
(4.49) (4.50)
Proof: From (4.30) and (4.31), ΣA . µA = ω DX
(4.51)
Substituting (4.41), (4.43) and (4.7) into (4.51) it follows that UD;r KA = ω UD LD VD VD;r KX diag (σA ) . Hence, diag (kA ) = ω diag (lD;r ) diag (KX ) diag (σA ) , using orthogonality of matrices UD and VD . Rewriting all diagonal matrices using identities of the kind KA KX = diag (kA ) diag (kX ) = diag (kA ◦ kX ) , and extracting their diagonal elements, we obtain (4.45). The identities (4.46)–(4.49) all follow in the same way. The key result of Proposition 4.1 is that the distributions of A and X are completely determined by the constants of proportionality, kA and kX , and variances, σA and σX , respectively. Therefore, the number of scalar unknowns arising in the VB-equations, (4.41)–(4.50), is now only 5r + 1, being the number of terms in the set kA , kX , σA , σX , β and ρ. In the original VB-equations, (4.30)–(4.40), the number of scalar unknowns was much higher—specifically, r (p + n + 2r) + r + 1—for the parameter set µA , µX , ΣA , ΣX , β and ρ. In fact, the simplification achieved in the VB-equations is even greater than discussed above. Specifically, the vectors σA ,σX , kA and kX now interact with each other only through ω (4.50). Therefore, if ω is fixed, the vector identities in (4.45)– (4.47) decouple element-wise, i = 1, . . . ,r, into scalar identities. The complexity of
4.2 The Variational Bayes (VB) Method for the PPCA Model
67
each such scalar identity is merely that of the scalar decomposition in Section 3.7, which had an analytical solution. Proposition 4.2 (Analytical solution). Let the posterior expected value, ω (4.50), of ω be fixed. Then the scalar identities, i = 1, . . . , r, implied by VB-equations (4.45)– (4.47) have an analytical solution with one of two modes, determined, for each i, by an associated inferential breakpoint, as follows: √ √ + n, ˜li,D = p√ 1 − βi,0 ω . (4.52) ω 1. Zero-centred solution, for each i such that li,D ≤ ˜li,D , where li,D is the ith singular value of D (4.7): ki,A = 0,
(4.53)
ki,X = 0, σi,X =
− 1 2n − (n − p) βi,0 ω
+
2
2 ω βi,0 2 (n − p) + 4βi,0 np ω
2 n (1 − βi,0 ω ) 1 − σi,X , σi,A = σi,X p ω
αi,0 + 12 p (σi,X − 1) . βi = (n (σi,X − 1) + p) σi,X ω
2. Non-zero solution, for each i such that li,D > ˜li,D : . √ −b + b2 − 4ac , ki,A = 2a 2
ω li,D − p ki,A , ki,X = 2 +1 ki,A lD ω
,
(4.54)
(4.55)
ω k2 + 1 i,A , 2 ω ω li,D −p 2
ω li,D − p , σi,X = 2 +1 ki,A ω li,D ω
2 2 ki,A ((1 − ω βi,0 )(n − p) + li,D ) + n + βi,0 ω 2 li,D (αi,0 + p2 ) ω .(4.56) βi = 2 (p − n) − n + ω 2 ω ki,A li,D σi,A =
Note that ki,A in (4.55) is the positive root of the quadratic equation, 2 aki,A + bki,A + c = 0,
whose coefficients are
68
4 Principal Component Analysis and Matrix Decompositions
a = n ω 3 li,D 2 , + 2 p ω 2 li,D 2 + n ω 2 li,D 2 + βi,0 n ω 3 li,D 2 + b = n ω p − p2 ω − ω 3 li,D 4 − βi,0 n ω 2 p − βi,0 ω 3 li,D 2 p + βi,0 ω 2 p2 , c = np − βi,0 ω 3 li,D 4 − βi,0 n ω p + βi,0 ω 2 li,D 2 p + βi,0 n ω 2 li,D 2 . Proof: The identities were established using the symbolic software package, Maple. For further details, see [90]. Note that the element-wise inferential breakpoints, ˜li,D (4.52), differ only with respect to the known prior parameters, βi,0 (4.21). It is reasonable to choose these equal—i.e. β0 = β1,0 1r,1 —in which case the r breakpoints are identical. Step 7: Run the IVB Algorithm From Step 6, there are two approaches available to us for finding the VB-marginals (4.26)–(4.29). If we search for a solution without imposing orthogonality constraints, we must run the IVB algorithm in order to find a solution to the full sets of VBequations (4.30)–(4.40). This was the approach presented in [24], and will be known as Variational PCA (VPCA). In contrast, if we search for an orthogonal solution using Conjecture 4.1, then we can exploit the analytical solution (Proposition 4.2), greatly simplifying the IVB algorithm, as follows. This will be known as Fast Variational PCA (FVPCA). Algorithm 3 (Fast VPCA (FVPCA)). 1. Perform SVD (4.7) of data matrix, D. n (as explained shortly). Set iteration 2. Choose initial value of ω as ω [1] = lp,D counter to j = 1. 3. Evaluate inferential breakpoints, √ √ p + n+ ˜li,D = √ 1 − βi,0 ω [j] . ω [j] ! ! 4. Partition lD into lz = li,D : li,D ≤ ˜li,D and lnz = li,D : li,D > ˜li,D . 5. Evaluate solutions (4.53)–(4.54) for lz , and solutions (4.55)–(4.56) for lnz . 6. Update iteration counter (j = j + 1), and estimate ω [j] = ρϑ[j] , using (4.48) and (4.49). [j−1] > ε, ε small, go to step 3, otherwise 7. Test with stopping rule; e.g. if ω [j] − ω end. Remark 4.4 (Automatic Rank Determination (ARD) Property). The shaping parameters, α and β, can be used for rank determination. It is observed that for some values of the index, i, the posterior expected values, υi = αi /βi (4.39), converge (with the number of IVB iterations) to the prior expected values, υi → αi,0 /βi,0 . This can be understood as a prior dominated inference [8]; i.e. the observations are not informative in those dimensions. Therefore, the rank can be determined as the number of υ i
4.3 Orthogonal Variational PCA (OVPCA)
69
that are significantly different from the prior value, αi,0 /βi,0 . This behaviour will be called the Automatic Rank Determination (ARD) property.1 Remark 4.5 (Ad hoc choice of initial value, ω [1] , of ω ). As n → ∞ (4.2), the singular values, li,D , of D are given by (4.17). In finite samples, (4.17) holds only approximately, as follows: p " 1 l2 ≈ nω −1 , p − r i=r+1 i,D p " i=1
2 li,D ≈
r "
(4.57)
2 li,M + pnω −1 ,
(4.58)
i=1
where r is the unknown rank. From the % ordering of the singular values, l1,D > l2,D > p 1 2 2 < p−r . . . > lp,D , it follows that lp,D i=r+1 li,D (i.e. the mean is greater than the 2 minimal value). From (4.57), it follows that lp,D < nω −1 . From (4.58), it is true that %p 2 −1 . These considerations lead to the following choice of interval i=1 li,D > pnω for ω : n pn <ω < 2 . (4.59) l lD l D p,D Recall that ω is the precision parameter in the PPCA model (4.5). Hence, we initialize ω at the upper bound in (4.59)—i.e. ω [1] = l2n —encouraging convergence p,D
to a higher-precision solution. We will examine this choice in simulation, in Section 4.4.2. Step 8: Report the VB-marginals The VB-marginals are given by (4.26)–(4.29). Their shaping parameters and moments are inferred using either the VPCA algorithm (4.30)–(4.40) or the FVPCA algorithm (Algorithm 3).
4.3 Orthogonal Variational PCA (OVPCA) In Section 4.1, we explained that Maximum Likelihood (ML) estimation of parameters A and X in the PPCA model suffers from rotational ambiguity (Remark 4.1). This is a consequence of inappropriate modelling of low-rank matrix, M(r) , which is clearly over-parameterized. It leads to complications in the Bayesian approach, where the inference of A and X must be regularized via priors, as noted in Section 3.7. From an analytical point-of-view, model (4.6) contains redundant parameters. In this Section, we re-parameterize the model in a more compact way. 1
In the machine learning community, this property is known as the Automatic Relevance Determination property [24]. In our work, the inferred number of relevant parameters is associated with the rank of M(r) .
70
4 Principal Component Analysis and Matrix Decompositions
4.3.1 The Orthogonal PPCA Model We now apply the economic SVD (Definition 4.1) to low rank matrix, M(r) . In this case, p−r singular vectors will be irrelevant, since they will be multiplied by singular values equal to zero. This allows us to write M(r) as the following product: M(r) = ALX .
(4.60)
Here, the matrices A ∈ Rp×r and X ∈ Rn×r have orthogonality restrictions, A A = Ir and X X = Ir . Also, L = diag (l) ,
is a diagonal matrix of non-zero singular values, l = [l1 , . . . , lr ] , ordered, without loss of generality, as (4.61) l1 > l2 > . . . > lr > 0. The decomposition (4.60) is unique, up to the sign of the r singular vectors (i.e. there are 2r possible decompositions (4.60) satisfying the orthogonality and ordering (4.61) constraints, all equal to within a sign ambiguity [86]). From (4.2), (4.4) and (4.60), the orthogonal PPCA model is
(4.62) f (D|A, L, X, ω, r) = N ALX , ω −1 Ip ⊗ In . This model, and its VB inference, were first reported in [91]. The ML estimates of the model parameters, conditioned by known r, are ˆ L, ˆ X, ˆ ω A, ˆ = arg max f (D|A, L, X, ω, r) , A,L,X,ω
with assignments pn ˆ = LD;r,r , X ˆ = VD;r , ω Aˆ = UD;r , L ˆ = %p
2 , i=r+1 li,D
(4.63)
using (4.7). 4.3.2 The VB Method for the Orthogonal PPCA Model Here, we follow the VB method (Section 3.3.3) to obtain VB-marginals of the parameters in the orthogonal PPCA model (4.62). Step 1: Choose a Bayesian Model The reduction of rotational ambiguity to merely a sign-based ambiguity is an advantage gained at the expense of orthogonal restrictions which are generally difficult to handle. Specifically, parameters A and X are restricted to having orthonormal columns, i.e. A A = Ir and X X = Ir , respectively.
4.3 Orthogonal Variational PCA (OVPCA)
71
Intuitively, each column ai , i = 1 . . . r, of A belongs to the unit hyperball in p dimensions, i.e. ai ∈ Hp . Hence, A ∈ Hpr , the Cartesian product of r p-dimensional unit hyperballs. However, the requirement of orthogonality—i.e. ai aj = 0, ∀i = j—confines the space further. The orthonormally constrained subset, Sp,r ⊂ Hpr , is known as the Stiefel manifold [92]. Sp,r has finite area, which will be denoted as α (p, r), as follows: 1
α (p, r) =
2r π 2 pr #1 $. j=1 Γ 2 (p − j + 1)
r 1 r(r−1)
π4
(4.64)
Here, Γ (·) denotes the Gamma function [93]. Both the prior and posterior distributions have a support confined to Sp,r . We choose the priors on A and X to be the least informative, i.e. uniform on Sp,r and Sn,r respectively: f (A) = UA (Sp,r ) = α (p, r)
−1
f (X) = UX (Sn,r ) = α (n, r)
χSp,r (A) ,
−1
χSn,r (X) .
(4.65) (4.66)
There is no upper bound on ω > 0 (4.4). Hence, an appropriate prior is (the improper) Jeffreys’ prior on scale parameters [7]: f (ω) ∝ ω −1 .
(4.67)
Remark 4.6 (Prior on l). Suppose that the sum of squares of elements of D is bounded from above; e.g. p " n "
d2i,j = tr (DD ) ≤ 1.
(4.68)
i=1 j=1
This can easily be achieved, for example, by preprocessing of the data. (4.68) can be expressed, using (4.7), as )= tr (DD ) = tr (UD LD LD UD
p "
2 li,D ≤ 1.
(4.69)
i=1
Note that tr M(r) M(r) ≤ tr (DD ) (4.2). Hence, using (4.69), r " i=1
li2 ≤
p "
2 li,D ≤ 1.
(4.70)
i=1
This, together with (4.61), confines l to the space 0 r / " / 2 Lr = l/l1 > l2 > . . . > lr > 0, li ≤ 1 , i=1
(4.71)
72
4 Principal Component Analysis and Matrix Decompositions
which is a sector of the unit hyperball. Constraint (4.70) forms a full unit hyperball, Hr ⊂ Rr , with hypervolume r r +1 . (4.72) hr = π 2 /Γ 2 Positivity constraints restrict this further to hr /2r , while hyperplanes, {li = lj , ∀i, j = 1 . . . r}, partition the positive sector of the hyperball into r! sectors, each of equal hypervolume, only one of which satisfies condition (4.61). Hence, the hypervolume of the support (4.71) is r
ξr = hr
1 π2
. = r r 2 (r!) Γ 2 + 1 2r (r!)
(4.73)
We choose the prior distribution on l to be non-committal—i.e. uniform—on support (4.71). Using (4.73), (4.74) f (l) = Ul (Lr ) = ξr−1 χLr (l) . Multiplying (4.62) by (4.65), (4.66), (4.67) and (4.74), and using the chain rule of probability, we obtain the joint distribution,
f (D, A, X, L, ω|r) = N ALX , ω −1 Ip ⊗ It × α (p, r)
−1
α (n, r)
−1
ξr−1 ω −1 χΘ∗ (θ) .
(4.75)
Here, θ = {A, X, L, ω} with support Θ∗ = Sp,r × Sn,r × Lr × R+ . Step 2: Partition the Parameters We choose to partition the parameters of model (4.62) as follows: θ1 = A, θ2 = X, θ3 = L, θ4 = ω. The logarithm of the joint distribution, restricted to zero outside the support Θ∗ , is given by ln f (D, A, X, L, ω|r) = pn 1 − 1 ln ω − ωtr (D − ALX ) (D − ALX ) + γ = 2 2 pn 1 1 = − 1 ln ω − ωtr (DD − 2ALX D ) − ωtr (LL ) + γ, (4.76) 2 2 2 using orthogonality of matrices A and X. Once again, γ denotes the accumulation of all terms independent of A, X, L and ω. Note that the chosen priors, (4.65), (4.66), (4.67) and (4.74), do not affect the functional form of (4.76) but, instead, they restrict the support of the posterior. Therefore, the function appears very simple, but evaluation of its moments will be complicated.
4.3 Orthogonal Variational PCA (OVPCA)
73
Step 3: Inspect the VB-marginals Application of the VB theorem to (4.76) yields to following VB-marginals: f˜ (A, X, L, ω|D, r) = f˜ (A|D, r) f˜ (X|D, r) f˜ (L|D, r) f˜ (ω|D, r) ,
(4.77)
X D χS (A) , (4.78) f˜ (A|D, r) ∝ exp ω tr AL p,r χS (X) , f˜ (X|D, r) ∝ exp ω tr X D AL n,r 1 l l χLr (l) , f˜ (l|D, r) ∝ exp ω l diag−1 (X D A) − ω 2 pn 1 L X D + − 1 ln ω − ωtr DD − A f˜ (ω|D, r) ∝ exp 2 2 1 − ω l l χR+ (ω) . (4.79) 2 Recall that the operator diag−1 (L) = l extracts the diagonal elements of the matrix argument into a vector (see Notational Conventions on Page XV). Step 4: Identify standard forms The VB-marginals, (4.78)–(4.79), are recognized to have the following standard forms: f˜ (A|D, r) f˜ (X|D, r) f˜ (l|D, r) f˜ (ω|D, r)
= M (FA ) , = M (FX ) , = tN (µl , φIp ; Lr ) ,
(4.80) (4.81) (4.82)
= G (ϑ, ρ) .
(4.83)
Here, M (·) denotes the von Mises-Fisher distribution (i.e. Normal distribution restricted to the Stiefel manifold (4.64) [92]). Its matrix parameter is FA ∈ Rp×r in (4.80), and FX ∈ Rn×r in (4.81). tN (·) denotes the truncated Normal distribution on the stated support. The shaping parameters of (4.80)–(4.83) are L, DX FA = ω L, FX = ω D A µl = 2diag
−1
(4.84)
, XDA
φ=ω −1 , pn , ϑ= 2 1 L A + 1 l l. ρ = tr DD − 2DX 2 2
(4.85) (4.86) (4.87) (4.88) (4.89)
74
4 Principal Component Analysis and Matrix Decompositions
Step 5: Formulate the necessary VB-moments The necessary VB-moments , where, involved in (4.84)–(4.89) are A, X, l, l l and ω by definition, L = diag l . Moments A and X are expressed via the economic SVD (Definition 4.1) of parameters FA (4.84) and FX (4.85), FA = UFA LFA VF A ,
(4.90)
UFX LFX VF X ,
(4.91)
FX =
with LFX and LFA both in Rr×r . Then, = UF G (p, LF ) VF , A A A A = UF G (n, LF ) VF , X X X X , l = µl + φ ϕ (µl , φ) , , l − φκ (µl , φ) 1r,1 , l l = rφ + µl ω =
ϑ . ρ
(4.92) (4.93) (4.94) (4.95) (4.96)
Moments of tN (·) and M (·)—from which (4.92)–(4.95) are derived—are reviewed in Appendices A.4 and A.6 respectively. Functions G (·, ·), ϕ (·, ·) and κ (·, ·) are also defined there. Note that each of G (·, ·), ϕ (·, ·) and κ (·, ·) returns a multivariate value with dimensions equal to those of the multivariate argument. Multivariate arguments of functions ϕ (·, ·) and κ (·, ·) are evaluated element-wise, using (A.26) and (A.27). Remark 4.7 (Approximate support for l). Moments of the exact VB-marginal for l (4.82) are difficult to evaluate, as Lr (4.71) forms a non-trivial subspace of Rr . Therefore, we approximate the support, Lr , by an envelope, Lr ≈ Lr . Note that (4.70) is maximized for each% i = 1, . . . , r if l1 = l2 = . . . = li , li+1 = li+2 = r . . . = lr = 0. In this case, j=1 lj2 = ili2 ≤ 1, which defines an upper bound, 1 li ≤ li , which is li = i− 2 . Hence, (4.71) has a rectangular envelope, ! 1 Lr = l : 0 < li ≤ li = i− 2 , i = 1 . . . r . (4.97) (4.82) is then approximated by f˜ (l|D, r) =
r
1 tN µi,l , φ; 0, i− 2 .
(4.98)
i=1
Moments of the truncated Normal distribution in (4.98) are available via the error function, erf (·) (Appendix A.4). The error of approximation in (4.97) is largest at the boundaries, li = lj , i = j, i, j ∈ {1 . . . r}, and is negligible when no two li ’s are equal.
4.3 Orthogonal Variational PCA (OVPCA)
75
Step 6: Reduce the VB-equations The set of VB-equations, (4.84)–(4.96), is non-linear and must be evaluated numerically. No analytical simplification is available. One possibility is to run the IVB algorithm (Algorithm 1) on the full set (4.84)–(4.96). It can be shown [91] that initialization of the full IVB algorithm via ML estimates (4.63) yields VB moments, (4.92)–(4.93), that are collinear with the ML solution (4.63). Note that this is the same space that was used for restriction of the PPCA solution (Proposition 4.1), and from which the FVPCA algorithm was derived. Therefore, in this step, we impose the restriction of collinearity with the ML solution on the moments, (4.92)–(4.93), thereby obtaining reduced VB-equations. (4.92) and X (4.93) in the space of Proposition 4.3. We search for a solution of A scaled singular vectors of matrix D (4.7): = UD;r KA , A = VD;r KX . X
(4.99) (4.100)
UD and VD are given by the economic SVD of D (4.7). KA = diag (kA ) ∈ Rr×r and KX = diag (kX ) ∈ Rr×r denote matrix constants of proportionality which must be inferred. Then, distributions (4.84)–(4.96) are fully determined by the following equations: lD;r ◦ kX ◦ l , (4.101) kA = G p, ω kX = G n, ω lD;r ◦ kA ◦ l , (4.102) µl = kX ◦ lD;r ◦ kA , φ=ω −1 , , l = µl + φϕ (µl , φ) , , l l = rφ + µl l − φκ (µl , φ) 1r,1 , ω = pn lD lD − 2 kX ◦ l ◦ kA lD;r + l l
(4.103) (4.104) (4.105) (4.106) −1
.
(4.107)
Proof: Substituting (4.100) into (4.84), and using (4.7), we obtain =ω UD;r LD;r,r KX L. FA = ω (UD LD VD ) VD;r KX L
(4.108)
This is in the form of the SVD of FA (4.90), with assignments r , VF = Ir . UFA = UD;r , LFA = ω LD;r,r KX L A Substituting (4.109) into (4.92), then r Ir . = UD;r G p, ω A LD;r,r KX L
(4.109)
(4.110)
76
4 Principal Component Analysis and Matrix Decompositions
Note that G (·, ·) is a diagonal matrix since its multivariate argument is diagonal. Hence, (4.110) has the form (4.99) under the assignment r . (4.111) LD;r,r KX L KA = G p, ω Equating the diagonals in (4.111) we obtain (4.101). (4.102) is found in exactly the same way. (4.103)–(4.107) can be easily obtained by substituting (4.99)–(4.100) into (4.86), (4.89), (4.94) and (4.95), and exploiting the orthogonality of UD and VD . Step 7: Run the IVB Algorithm Using Proposition 4.3, the reduced set of VB-equations is now (4.101)–(4.107). The IVB algorithm implied by this set will be known as Orthogonal Variational PCA (OVPCA). and X are deterUnder Proposition 4.3, we note that the optimal values of A mined up to the inferred constants of proportionality, kA and kX , by UD and VD respectively. The iterative algorithm is then greatly simplified, since we need only iterate on the 2r degrees-of-freedom constituting KA and KX together, and not on
and X with r p + n − r−1 degrees-of-freedom. The ML solution is a reasonA 2 able choice as an orthogonal initializer, and is conveniently available via the SVD of D (4.63). With this choice, the required initializers of matrices KA and KX are [1] [1] KA = KX = Ir , via (4.99) and (4.100). Remark 4.8 (Automatic Rank Determination (ARD) property of the OVPCA algorithm). Typically, ki,A and ki,X converge to zero for i > r, for some empirical upper bound, r. A similar property was used as a rank selection criterion for the FVPCA algorithm (Remark 4.4). There, the rank was chosen as rˆ = r [24]. This property will be used when comparing OVPCA and FVPCA (Algorithm 3). Nevertheless, the full posterior distribution of r—i.e. f (r|D)—will be derived for the orthogonal PPCA model shortly (see Section 4.3.3). Remark 4.9. Equations (4.101)–(4.103) are satisfied for kA = kX = µl = 0r,1 ,
(4.112)
independently of data, i.e. independently of lD . Therefore, (4.112) will appear as a critical point of KLDVB (3.6) in the VB approximation. (4.112) is appropriate when M(r) = 0p,n (4.1), in which case r = 0. Of course, such an inference is to be avoided when a signal is known to be present. Remark 4.10 (Uniqueness of solution). In Section 4.3.1, we noted 2r cases of the SVD decomposition (4.60), differing only in the signs of the singular vectors. Note, (4.99) and X however, that Proposition 4.3 separates posterior mean values, A (4.100), into orthogonal and proportional terms. Only the proportional terms (kA and kX ) are estimated using the IVB algorithm. Since all diagonal elements of the function G (·, ·) are confined to the interval I[0,1] (Appendix A.6.2), the converged
4.3 Orthogonal Variational PCA (OVPCA)
77
values of kA and kX are always positive. The VB solution is therefore unimodal, approximating only one of the possible 2r modes. This is important, as the all-mode distribution of A is symmetric around the coordinate origin, which would consign = 0p,r . the posterior mean to A Step 8: Report the VB-marginals The VB-marginals are given by (4.80)–(4.83), with shaping parameters FA , FX , µl , are required. φ, ϑ and ρ (4.84)–(4.89). In classical PCA applications, estimates A In our Bayesian context, complete distributions of parameters are available, and so other moments and uncertainty measures can be evaluated. Moreover, we can also report an estimate, rˆ = r, of the number of relevant principal components, using the ARD property (Remark 4.8). More ambitious tasks—such as inference of rank—may be addressed using the inferred VB-marginals. These tasks will be described in the following subsections. 4.3.3 Inference of Rank In the foregoing, we assumed that the rank, r, of the model (4.60) was known a priori. If this is not the case, then inference of this parameter can be made using Bayes’ rule: f (r|D) ∝ f (D|r) f (r) ,
(4.113)
where f (r) denotes the prior on r, typically uniform on 1 ≤ r ≤ p ≤ n. (4.113) is constructed by marginalizing over the parameters of the model, yielding a complexity-penalizing inference. This ‘Ockham sensitivity’ is a valuable feature of the Bayesian approach. The marginal distribution of D—i.e. f (D|r)—can be approximated by a lower bound, using Jensen’s inequality (see Remark 3.1): ln f (D|r) ≈ ln f (D|r) − KL f˜ (θ|D, r) ||f (θ|D, r) (4.114) = Θ∗
f˜ (θ|D, r) ln f (D, θ|r) − ln f˜ (θ|D, r) dθ.
The parameters are θ = {A, X, L, ω}, and f (D, θ|r) is given by (4.75). From (4.114), the optimal approximation, f˜ (θ|D, r), under a conditional independence assumption is the VB-approximation (4.77). Substituting (4.80)–(4.86) into (4.77), and the result—along with (4.75)—into (4.114), then (4.113) yields
78
4 Principal Component Analysis and Matrix Decompositions
-
r r + 1 + ln (r!) + (4.115) f˜ (r|D) ∝ exp − ln π + r ln 2 + ln Γ 2 2 1 + φ−1 µl µl − l + l l + l µl − µl 2 1 1 p, FA FA − ω + ln 0 F1 l ◦ kA lD;r + kX ◦ 2 4 1 1 n, FX FX − ω l ◦ kA lD;r + + ln 0 F1 kX ◦ 2 4 r " µj,l lj − µj,l √ + ln erf + + erf √ 2φ 2φ j=1 0 , +r ln πφ/2 − (ϑ + 1) ln ρ , l, l l and ω are the converged solutions of the OVPCA alwhere kA , kX , µl , φ, gorithm, and FA and FX are functions of these, via, for example, (4.108). lj , j = 1, . . . , r, are the the upper bounds on lj in the envelope L (4.97). One of the main algorithmic advantages of PCA is that a single evaluation of all p eigenvectors, i.e. U (4.11), provides with ease the PCA solution for any rank r < p, via the simple extraction of the first r columns, U;r (4.7), of U = UD (4.14). The OVPCA algorithm also enjoys this property, thanks to the linear dependence of (4.99) on UD . Furthermore, X observes the same property, via (4.100). Therefore, A in the OVPCA procedure, the solution for a given rank is obtained by simple extraction of UD;r and VD;r , followed by iterations involving only scaling coefficients, kA and kX . Hence, p × (p + n) values (those of UD and VD ) are determined rankl, independently via the SVD (4.7), and only 4r + 3 values (those of kA , kX , µl , φ, together) are involved in the rank-dependent iterations (4.101)–(4.107). l l and ω 4.3.4 Moments of the Model Parameters The Bayesian solution provides an approximate posterior distribution of all involved parameters, (4.80)–(4.83) and (4.115). In principle, moments and uncertainty bounds can then be inferred from these distributions, in common with any Bayesian method. The first moments of all involved parameters have already been presented, (4.92)–(4.94) and (4.96), since they are necessary VB-moments. The second noncentral moment of l—i.e. l l—was also generated. Parameter ω is Gamma-distributed (4.83), and so its confidence intervals are available. The difficult task is to determine uncertainty bounds on orthogonal parameters, A and X, which are von Mises-Fisher distributed, (4.80) and (4.81). Such confidence intervals are not available. Therefore, we develop approximate uncertainty bounds in Appendix A.6.3, using a Gaussian approximation. The distribution of X ∈ Rn×r (4.81) is fully determined by the r-dimensional vector, yX , defined as follows:
4.4 Simulation Studies
yX (X) = diag−1 UF X XVFX = diag−1 VD;r X .
79
(4.116)
This result is from (A.35). Therefore, confidence intervals on X can be mapped to confidence intervals on yX , via (4.116), as shown in Appendix A.6.2. The idea is illustrated graphically for p = 2 and r = 1 in Fig. 4.2. It follows that the HPD region (Definition 2.1) of the von Mises-Fisher distribution is bounded by X, where ! X = X : yX (X) = y X ,
(4.117)
using (4.116) and with yX given in Appendix A.6.3. Since A has the same distribution, A is defined analogously.
X
X yX yX
YX X
space of X (thickness is proportional to f (X)) direction of VMF maximum, and also axis of yX maximum of VMF distribution mean value example of projection X → YX confidence interval for f (yX ) projection of uncertainty bounds yX → X
Fig. 4.2. Illustration of the properties of the von-Mises-Fisher (VMF) distribution, X ∼ M (F ), for X, F ∈ R2×1 .
4.4 Simulation Studies We have developed three algorithms for VB inference of matrix decomposition models in this Chapter, namely VPCA, FVPCA and OVPCA (see Fig. 4.1). We will now compare their performance in simulations. Data were generated using model (4.5) with p = 10, n = 100 and r = 3. These data are displayed in Fig. 4.3. Three noise levels, E (4.4), were considered: (i) ω = 100 (SIM1), (ii) ω = 25 (SIM2) and (iii) ω = 10 (SIM3). Note, therefore, that the noise level is increasing from SIM1 to SIM3. 4.4.1 Convergence to Orthogonal Solutions: VPCA vs. FVPCA VB-based inference of the PPCA model (4.5) was presented in Section 4.2. Recall that the PPCA model does not impose any restrictions of orthogonality. However, we have formulated Conjecture 4.1 which states that a solution of the VB-equations can be found in the orthogonal space spanned by the singular vectors of the data matrix D (4.7). In this Section, we explore the validity of this conjecture. Recall, from Step 7 of the VB method in Section 4.2, that two algorithms exist for evaluation of the shaping parameters of the VB-marginals (4.26)–(4.29):
80
4 Principal Component Analysis and Matrix Decompositions Simulated values: a1
Simulated values: x1
1
1
0.5
0
0 0 1
Simulated values: a2
10
Simulated values: a3
10
Simulated values: x2
100
-1
0 5
Simulated values: x3
100
0
0.5 0 0
0 1 0
0.5 0 0 1
-1
p = 10
10
-5
0
n = 100
100
Example of data realization di,: , i = 1 . . . 10 (SIM1) 2 0
-2 -4
0
20
40
60
80
100
Fig. 4.3. Simulated data, D, used for testing the PCA-based VB-approximation algorithms. SIM1 data are illustrated, for which ω = 100.
FVPCA algorithm (Algorithm 3), which uses Conjecture 4.1, and is deterministically initialized via the ML solution (4.13). VPCA algorithm, which does not use Conjecture 4.1, and is initialized randomly. The validity of the conjecture is tested by comparing the results of both algorithms via a Monte Carlo simulation using many random initializations in the VPCA case. If Conjecture 4.1 is true, then the posterior moments, µX (4.32) and µA (4.30), inferred by the VPCA algorithm should converge to orthogonal (but not orthonormal) matrices for any initialization, including non-orthogonal ones. Therefore, a Monte Carlo study was undertaken, involving 100 runs of the VPCA algorithm (4.30)–(4.40). Dur (4.32), via the ing the iterations, we tested orthogonality of the posterior moments, X
4.4 Simulation Studies
assignment
81
Q (µX ) = ||µX µX || ,
where ||A|| ≡ [|ai,j |] , ∀i, j, denotes the matrix of absolute values of its elements. The following criterion of diagonality of Q (·)—being, therefore, a criterion of orthogonality of µX —is then used: q (µX ) =
1r,1 Q (µX ) 1r,1 ; 1r,1 diag−1 (Q (µX ))
(4.118)
i.e. the ratio of the sum of all elements of Q (µX ) over the sum of its diagonal elements. Obviously, q (µX ) = 1 for a diagonal matrix, and q (µX ) > 1 for a nondiagonal matrix, µX . We changed the stopping rule in the IVB algorithm (step 7, Algorithm 3) to (4.119) q (µX ) < 1.01.
counts
Hence, the absolute value of non-diagonal elements must be less than 1% of the diagonal elements for stopping to occur. Note that all the experiments were performed with initial value of ω [1] from Remark 4.5. In all MC runs, (4.119) was satisfied, though it typically took many iterations of the VPCA algorithm. The results are displayed in Fig. 4.4. Histograms of q (µX ) [1] (4.118) for the initializing matrices, µX , are displayed in the left panel, while q (µX ) [m] for the converged µX are displayed in the middle panel. In the right panel, the histogram of the number of iterations required to satisfy the stopping rule is displayed. We conclude that Conjecture 4.1 is verified; i.e. the solution exists in the orthogonal space (4.41)–(4.42).
25
100
20
80
15
60
10
40
5
20
0
1.5
0 1.0082
2 2.5 [1] q(m X )
30
20
10
[m]
1.01
q(m X )
0
0 10 20 30 40 number of iterations (in thousands)
Fig. 4.4. Monte Carlo study (100 trials) illustrating the convergence of the VPCA algorithm to [1]
an orthogonal solution. Left: criterion for initial values, q µX
for converged values, q
[m] µX
(4.118). Middle: criterion
. Right: number of iterations required for convergence.
82
4 Principal Component Analysis and Matrix Decompositions
From now on, we will assume validity of Conjecture 4.1, such that the VPCA and FVPCA algorithms provide the same inference of A, X and ω. In the case of the SIM1 data, we display—in Table 4.1—the median value, kA , obtained by projecting the VPCA inference, µA , into the orthogonal space (4.41)–(4.42). This compares very closely to kA (4.45) inferred directly via the FVPCA algorithm (Algorithm 3). Table 4.1. Comparison of converged values of kA obtained via the VPCA and FVPCA algorithms. VPCA, median FVPCA k1,A k2,A k3,A
9.989 9.956 8.385
9.985 9.960 8.386
In subsequent simulations, we will use only the FVPCA algorithm for inference of the PPCA model, since it is faster and its initialization is deterministic. 4.4.2 Local Minima in FVPCA and OVPCA In this simulation study, we design experiments which reveal the existence of local minima in the VB-approximation for the PPCA model (evaluated by FVPCA (Algorithm 3)), and for the orthogonal PPCA model (evaluated by the OVPCA algorithm (Section 4.3.2, Step 7)). Recall, that these algorithms have the following properties: (i) D enters each algorithm only via its singular values, lD (4.7). (ii) For each setting of ω , the remaining parameters are determined analytically. From (ii), we need to search for a minimum of KLDVB (3.6) only in the onedimensional space of ω . Using the asymptotic properties of PCA (Remark 4.5), we already have a reasonable interval estimate for ω , as given is (4.59). We will test the initialization of both algorithms using values from this interval. If KLDVB for these models is unimodal, the inferences should converge to the same value for all possible initializations. Since all other VB-moments are obtained deterministically, once ω is known, therefore we monitor convergence only via ω . In the case of datasets SIM1 and SIM3, ω converged to the same value for all tested initial values, ω [1] , in interval (4.59). This was true for both the FVPCA and OVPCA algorithms. However, for the dataset SIM2, the results of the two algorithms differ, as displayed in Fig. 4.5. The terminal values, ω [m] , were the same for all [m] initializations using FVPCA. However, the terminal ω using OVPCA exhibits two [m] was almost different modes: (i) for the two lowest tested values of ω [1] , where ω [1] [m] was very close to identical with FVPCA; and (ii) all other values of ω , where ω the simulated value. In (i), the ARD property of OVPCA (Remark 4.8) gave r = 2, while in (ii), r = 3. This result suggests that there are two local minima in KLDVB for the Orthogonal PPCA model (4.62), for the range of initializers, ω [1] , considered (4.59).
4.4 Simulation Studies
83
converged value of ω
25
24
FVPCA OVPCA simulated value
23
22 5
10
15
20 25 initial value ω [1]
30
35
40
Fig. 4.5. Converged value of ω using the dataset SIM2, for different initial values, ω [1] .
For the three datasets, SIM1 to SIM3, there was only one minimizer of KLDVB in the case the PPCA model. We now wish to examine whether, in fact, the VBapproximation of the PPCA model can exhibit multiple modes for other possible datasets. Recall that the data enter the associated FVPCA algorithm only via their singular values lD . Hence, we need only simulate multiple realizations of lD , rather than full observation matrices, D. Hence, in a MonteCarlo study involving 1000 (i) (i) (i) runs, we generated 10-dimensional vectors, lD = exp lM C , where each lM C was 10 drawn from the Uniform distribution, UlM C [0, 1] . For each of the 1000 vectors (i)
of singular values, lD , we generated the converged estimate, ω [m] , using the FVPCA algorithm. As in the previous simulation, we examined the range of initializers, ω [1] , given by (4.59). For about 2% of the datasets, non-unique VB-approximations were obtained. In the light of these two experiments, we conclude the following: • There are local minima in KLDVB for both the PPCA and Orthogonal PPCA models, leading to non-unique VB-approximations. • Non-unique behaviour occurs only rarely. • Each local minimum corresponds to a different inferred r arising from the ARD property of the associated IVB algorithm (Remark 4.4). 4.4.3 Comparison of Methods for Inference of Rank We now study the inference of rank using the FVPCA and OVPCA algorithms. The true rank of the simulated data is r = 3. Many heuristic methods for choice of the number of relevant principal components have been proposed in the literature [80]. These methods are valuable since they provide an intuitive insight into the problem. Using Remark 4.3, we will consider the criterion of cumulative variance:
84
4 Principal Component Analysis and Matrix Decompositions
%i j=1 vi,c = %p j=1
λj λj
× 100%.
(4.120)
Here, λj are the eigenvalues (4.11) of the simulated data, D. This criterion is displayed in the third and fourth columns of Fig. 4.6. As before, we test the methods using the three datasets, SIM1 to SIM3, introduced at the beginning of the Section. eigenvalues l (detail)
eigenvalues l 400 SIM1
6
2 5
10
SIM2
400
0 0
5
10
15
98
80
96
70 0
5
10
94 1
100
100
80
95
2
3
4
2
3
4
2
3
4
10 200 5 0 0
5
10
0
0
5
400 SIM3
90
4
200
0 0
cumulative variance (detail)
cumulative variance 100 100
10
60 0
5
10
90 1
100
100
95
20 80
200
90
10
85 0 0
5
10
0 0
5
60 0 10
5
10
1
Fig. 4.6. Ad hoc methods for estimation of rank for three simulated datasets, each with different variance of noise. The method of visual examination (Remark 4.3) is applied to the eigenvalue graphs, λ = eig (DD ), and to the cumulative variance graphs.
For all datasets, the first two eigenvalues are dominant (first column), while the third eigenvalue is relatively small (it contains only 1% of total variation, as seen in Fig. 4.6 (right)). In the first row, i.e. dataset SIM1, the third eigenvalue is clearly distinct from the remaining ones. In the second dataset (SIM2), the difference is not so obvious, and it is completely lost in the third row (SIM3). Ad hoc choices of rank using (i) visual inspection (Remark 4.3) and (ii) the method of cumulative variance
4.5 Application: Inference of Rank in a Medical Image Sequence
85
are summarized in Table 4.2. This result underlines the subjective nature of these ad hoc techniques. Table 4.2. Estimation of rank in simulated data using ad hoc methods. SIM1 SIM2 SIM3 visual inspection 3 cumulative variance 2-3
2-3 2
2 2
Next, we analyze the same three datasets using formal methods. The results of FVPCA (ARD property, Remark 4.4), OVPCA (ARD property, Remark 4.8, and posterior distribution, f˜(r|D) (4.115)), and the Laplace approximation (the posterior distribution, fL (r|D), as discussed in Section 4.1.4), are compared in Table 4.3. Table 4.3. Comparison of formal methods for inference of rank in simulated data. FVPCA OVPCA ARD ARD f˜ (r|D) , r = 2 3 4 5
Laplace fL (r|D),r = 2 3 4 5
SIM1 SIM2 SIM3
3 3 0 98.2 1.7 0.1 0 82 13 2 2 3 96 3.5 0.2 0 70 25 3 0.5 2 2 97 3.9 0.1 0 94 5 0.5 0.0 Values of f˜ (r|D) and fL (r|D) not shown in the table are very close to zero, i.e. < 0.001.
Note that for low noise levels (SIM1) and high noise levels (SIM3), all methods inferred the true rank correctly. In this case, data were simulated using the underlying model. We therefore regard the results of all methods to be correct. The differences between posterior probabilities caused by different approximations are, in this case, insignificant. The differences will, however, become important for real data, as we will see in next Section.
4.5 Application: Inference of Rank in a Medical Image Sequence PCA is widely used as a dimensionality reduction tool in the analysis of medical image sequences [94]. This will be the major topic of Chapter 5. Here, in this Section, we address just one task of the analysis, namely inference of the number of physiological factors (defined as the rank, r, of matrix M(r) (4.1)) using the PPCA (4.6) and Orthogonal PPCA (4.60) models. For this purpose, we consider a scintigraphic dynamic image sequence of the kidneys. It contains n = 120 images, each of size 64 × 64 pixels. These images were preprocessed as follows:
86
4 Principal Component Analysis and Matrix Decompositions
• A rectangular area of p = 525 pixels was chosen as the region of interest at the same location in each image. • Data were scaled by the correspondence analysis method [95] (see Section 5.1.1 of Chapter 5). With this scaling, the noise on the preprocessed data is approximately additive, isotropic and Gaussian [95], satisfying the model assumptions (4.4). Note that true rank of M(r) is therefore r ≤ min (p, n) = 120. We compare the following three methods for inference of rank: (i) The posterior inference of rank, f˜ (r|D) (4.115), developed using the VBapproximation, and using the OVPCA algorithm. (ii) The ARD property (Remark 4.8) of OVPCA. (iii)The ARD property (Remark 4.8) of FVPCA. The various inferences are compared in Table 4.4. For comparison, we also inferred Table 4.4. Inference of rank for a scintigraphic image sequence (p = 525 and n = 120).
Pr (r Pr (r Pr (r Pr (r
OVPCA OVPCA FVPCA f˜ (r|D) ARD Property ARD Property = 17|D) = 0.0004 = 18|D) = 0.2761 r = 45 r = 26 = 19|D) = 0.7232 = 20|D) = 0.0002
Note: where not listed, Pr (r|D) < 3 × 10−7 .
rank via the criterion of cumulative variance (4.120) (Fig. 4.7), in which case, r = 5 was inferred. It is difficult to compare performance of the methods since no ‘true’ rank is available. From a medical point-of-view, the number of physiological factors, r, should be 4 or 5. This estimate is supported by the ad hoc criterion (Fig. 4.7). From this perspective, the formal methods appear to over-estimate significantly the number of factors. The reason for this can be understood by reconstructing the data using the inferred number of factors recommended by each method (Table 4.5). Four consecutive frames of the actual scintigraphic data are displayed in the first row. Though the signal-to-noise ratio is poor, functional variation is clearly visible in the central part of the left kidney, and in the upper part of the right kidney, which cannot be accounted for by noise. The same frames of the sequence—reconstructed using r = 5 factors, as recommended by medical experts—are displayed in Table 4.5 (second row). This reconstruction fails to capture the observed functional behaviour. In contrast, the functional information is apparent in the sequence reconstructed using the f˜ (r|D) inference of OVPCA (i.e. r = 19 factors). The same is true of sequences reconstructed using r > 19 factors, such as the r = 45 choice suggested by the ARD Property of OVPCA (Table 4.5, last row).
4.6 Conclusion
87
98 cumulative variance
97 96 95 94 93 92 91 90 89 0
5 10 15 number of principal components
20
Fig. 4.7. Cumulative variance for the scintigraphic data. For clarity, only the first 20 of n = 120 points are shown. Table 4.5. Reconstruction of a scintigraphic image sequence using different numbers, r, of factors. number of factors, r
frames 48–51 of the dynamic image sequence
original images (r = 120)
Ad hoc criterion (r = 5)
r = 19) f˜ (r|D) (OVPCA) (
ARD (OVPCA) (r = 45)
4.6 Conclusion The VB method has been used to study matrix decompositions in this Chapter. Our aim was to find Bayesian extensions for Principal Component Analysis (PCA), which is the classical data analysis technique for dimensionality reduction. We explained that PCA is the ML solution for inference of a low-rank mean value, M(r) , in the isotropic Normal distribution (4.4). This mean value, M(r) , can be modelled in two ways (Fig. 4.1): (i) as a product of full rank matrices, M(r) = AX (PPCA model),
88
4 Principal Component Analysis and Matrix Decompositions
or (ii) via its SVD, M(r) = ALX (orthogonal PPCA model). The main drawback of the ML solution is that it does not provide inference of rank, r, nor uncertainty ˆ X, ˆ etc. These tasks are the natural constituency of bounds on point estimates, A, Bayesian inference, but this proves analytically intractable for both models. Therefore, we applied the VB method to the PPCA and orthogonal PPCA models. In each case, the resulting VB-marginals, f˜ (A|D) etc., addressed the weaknesses of the ML solution, providing inference of rank and parameter uncertainties. We should not miss the insights into the nature of the VB method itself, gained as a result of this work. We list some of these now: • The VB approximation is a generic tool for distributional approximation. It has been applied in this Chapter to an ‘exotic’ example, yielding the von MisesFisher distribution. The VB method always requires evaluation of necessary VBmoments, which, in this case, had to be done approximately. • Careful inspection of the implied set of VB-equations (Step 6 of the VB method) revealed that an orthogonality constraint could significantly reduce this set of equations, yielding much faster variants of the IVB algorithm, namely FVPCA and OVPCA. Further study of the VB-equations allowed partial analytical solutions to be found. The resulting inference algorithms then iterated in just one variable ( ω (4.40) in this case). • The VB-approximation is found by minimization of KLDVB (3.6) which is known to have local minima [64] (see Section 3.16). We took care to study this issue in simulation (Section 4.4.2). We discovered local minima in the VBapproximation of both the PPCA model (via FVPCA) and the orthogonal PPCA model (via OVPCA). Hence, convergence to a local (or global) minimum is sensitive to initialization of the IVB algorithm (via ω [1] in this case). Therefore, appropriate choice of initial conditions is an important step in the VB method. • We have shown that the VB-approximation is suitable for high-dimensional problems. With reasonably chosen initial conditions, the IVB algorithm converged within a moderate number of steps. This may provide significant benefits when compared to stochastic sampling techniques, since there is no requirement to draw large numbers of samples from the implied high-dimensional parameter space. In the next Chapter, we will build on this experience, using VB-approximations to infer diagnostic quantities in medical image sequences.
5 Functional Analysis of Medical Image Sequences
Functional analysis is an important area of enquiry in medical imaging. Its aim is to analyze physiological function—i.e. behaviour or activity over time which can convey diagnostic information—of biological organs in living creatures. The physiological function is typically measured by the volume of a liquid involved in the physiological process. This liquid is marked by a contrast material (e.g. a radiotracer in nuclear medicine [96]) and a sequence of images is obtained over time. A typical such sequence is illustrated in Fig. 5.1. The following assumptions are made: (i) There is no relative movement between the camera (e.g. the scintigraphic camera in nuclear medicine) and the imaged tissues. (ii) Physiological organs do not change their shape. (iii)Changes in the volume of a radiotracer within an organ cause a linear response in activity, uniformly across the whole organ. Under these assumptions, the observed images can be modelled as a linear combination of underlying time-invariant organ images [94]. The task is then to identify these organ images, and their changing intensity over time. The idea is illustrated in Fig. 5.1 for a scintigraphic study of human kidneys (a renal study). In Fig. 5.1, the complete sequence of 120 pictures is interpreted as a linear combination of these underlying organ images, namely the left and right kidneys, and the urinary bladder. The activity of the radiotracer in each organ is displayed below the corresponding organ image. In this Chapter, we proceed as follows: (i) we use physical modelling of the dynamic image data to build an appropriate mathematical model, inference of which is intractable; (ii) we replace this model by a linear Gaussian model (called the FAMIS model), exact inference of which remains intractable; and (iii) we use the VB method of Chapter 3 to obtain approximate inference of the FAMIS model parameters; (iv) we examine the performance of the VB-approximation for the FAMIS model in the context of real clinical data.
90
5 Functional Analysis of Medical Image Sequences
0
120
t Functional
Left kidney
Right kidney
Urinary bladder
1
1
0 0
Analysis
t
120
1
0 0
t
120
0 0
t
120
Fig. 5.1. Functional analysis of a medical image sequence (renal study).
5.1 A Physical Model for Medical Image Sequences The task is to analyze a sequence of n medical images taken at times t = 1, . . . , n. Here, we use t as the time-index. The relation of t to real time, τt ∈ R, is significant for the clinician, but we will assume that this mapping to real time is handled separately. Each image is composed of p pixels, stored column-wise as a p-dimensional vector of observations, dt . The entire sequence forms the matrix D ∈ Rp×n . Typically p n even for poor-quality imaging modalities such as scintigraphy [96]. It is assumed that each image in the sequence is formed from a linear combination of r < n < p underlying images, aj , j = 1, . . . , r, of the physiological organs. Formally, dt =
r " j=1
aj xt,j + et .
(5.1)
5.1 A Physical Model for Medical Image Sequences
91
Here, aj , j = 1, . . . , r, are the underlying time-invariant p-dimensional image vectors, known as the factor images, and xt,j is the weight assigned to the jth factor image at time t. The vector, xj = [x1,j , . . . , xn,j ] , of weights over time is known as the factor curve or the activity curve of the jth factor image. The product aj xj is known as the jth factor. Vector et ∈ Rp models the observation noise. The physiological model (5.1) can be written in the form of the following matrix decomposition (4.2): (5.2) D = AX + E, where A ∈ Rp×r = [a1 , . . . , ar ], X ∈ Rn×r = [x1 , . . . , xr ], E ∈ Rp×n = [e1 , . . . , en ]. The organization of the image data into matrix objects is illustrated in Fig. 5.2.
X D
=
+
A
E
Fig. 5.2. The matrix representation of a medical image sequence.
It would appear that the matrix decomposition tools of Chapter 4 could be used here. However, those results cannot be directly applied since the physical nature of the problem imposes extra restrictions on the model (5.2). In the context of nuclear medicine, each pixel of dt is acquired as a count of radioactive particles. This has the following consequences: 1. Finite counts of radioactive particles are known to be Poisson-distributed [95]: ⎞ ⎛ p r n
" (5.3) f (D|A, X) = Po ⎝ ai,j xt,j ⎠ , i=1 t=1
j=1
where Po (·) denotes the Poisson distribution [97]. From (5.2) and (5.3), we conclude that E is a signal-dependent noise, which is characteristic of imaging with finite counts. 2. All pixels, di,t , aggregated in the matrix, D, are non-negative. The factor images, aj —being columns of A—are interpreted as observations of isolated physiological organs, and so the aj s are also assumed to be non-negative. The factor curves, xj —being the columns of X—are interpreted as the variable intensity of the associated factor images, which, at each time t, acts to multiply each pixel of aj by
92
5 Functional Analysis of Medical Image Sequences
the same amount, xt,j . Therefore, the xj s are also assumed to be non-negative. In summary, ai,j ≥ 0, i = 1, . . . , p, j = 1, . . . , r,
(5.4)
xt,j ≥ 0, t = 1, . . . , n, j = 1, . . . , r. 5.1.1 Classical Inference of the Physiological Model The traditional inference procedure for A and X in (5.2) consists of the following three steps [94, 95]: Correspondence analysis: a typical first step in classical inference is scaling. This refers to the pre-processing of the data matrix, D, in order to whiten the observation noise, E. The problem of noise whitening has been studied theoretically for various noise distributions [98–101]. In the case of the Poisson distribution (5.3), the following transformation is considered to be optimal [95]: 1 1 ˜ = diag (D1n,1 )− 2 Ddiag (D 1p,1 )− 2 . (5.5) D the notation v k , v a vector, denotes the vector of powers of elements, In k(5.5), vi (see Notational Conventions, Page XV). Pre-processing the data via (5.5) is called correspondence analysis [95]. Orthogonal analysis is a step used to infer a low-rank signal, AX , from the pre˜ (5.5). PCA is used to decompose D ˜ into orthogonal matrices, processed data, D ˜ (Section 4.1). The corresponding estimates of A and X in the original A˜ and X model (4.5) are therefore 1 ˜ Aˆ = diag (D1n,1 ) 2 A, (5.6) 1 ˆ = diag (D 1p,1 ) 2 X. ˜ X (5.7) Note, however, that these solutions are orthogonal and cannot, therefore, satisfy the positivity constraints for factor images and factor curves (5.4). Instead, they are used as the basis of the space in which the positive solution is to be found. Oblique analysis is the procedure used to find the positive solution in the space ˆ from the orthogonal analysis step above. The exisdetermined by Aˆ and X tence and uniqueness of this solution are guaranteed under conditions given in [94, 102, 103].
5.2 The FAMIS Observation Model The first two steps of the classical inference procedure above are equivalent to ML estimation for the following observation model:
5.2 The FAMIS Observation Model
93
˜ , ω −1 Ip ⊗ In , ˜ A, ˜ X, ˜ ω = N A˜X f D|
(5.8)
for any scalar precision parameter, ω (Section 4.1.1 in Chapter 4). Hence, an addi˜ Using (5.5)–(5.7) in (5.8), the tive, isotropic Gaussian noise model is implied for D. observation model for D is 1 , (5.9) f (D|A, X, ω) ∝ exp − ωtr Ωp (D − AX ) Ωn (D − AX ) 2 −1
Ωp = diag (D1n,1 )
Ωn = diag (D 1p,1 )
−1
,
(5.10)
.
(5.11)
Note that D enters (5.9) through the standard matrix Normal distributional form and via the data-dependent terms, Ωp and Ωn . If we now choose Ωp and Ωn to be parameters of the model, independent of D, then we achieve two goals: (i) The tractable matrix Normal distribution is revealed. (ii) These constants can be relaxed via prior distributions, yielding a flexible Bayesian model for analysis of medical image sequences. In particular, (5.10) and (5.11) can then be understood as ad hoc estimates for which, now, Bayesian alternatives will be available. To summarize, the appropriate observation model inspired by classical inference of medical image sequences is as follows:
(5.12) f (D|A, X, Ωp , Ωn ) = N AX , Ωp−1 ⊗ Ωn−1 , Ωp = diag (ωp ) , (5.13) Ωn = diag (ωn ) . p×r
n×r
p
(5.14) n
Here, A ∈ (R+ ) , X ∈ (R+ ) , ωp ∈ (R+ ) and ωn ∈ (R+ ) are the unknown parameters of the model. The positivity of A and X reflects their positivity in the physiological model (5.4). We note the following: • Model (5.12), with its associated positivity constraints, will be known as the model for Functional Analysis of Medical Image Sequences (the FAMIS model). It is consistent with the matrix multiplicative decomposition (5.2) with white Gaussian noise, E. • The imposed structure of the covariance matrix in (5.12) has the following interpretation. Each pixel of each image in the sequence, di,t , i = 1, . . . , p, t = 1, . . . , n, is independently distributed as ⎞ ⎛ r " −1 −1 ⎠ ai,j xt,j , ωt,n ωi,p . f (di,t |A, X, ωi,p , ωt,n ) = N ⎝ j=1
However, there is dependence in the covariance structure. Specifically, the variance profile is the same across the pixels of every image, dt , being a scaled version of ωp−1 . Similarly, the variance profile is the same for the intensity over time of every pixel, di,: , being a scaled version of ωn−1 .
94
5 Functional Analysis of Medical Image Sequences
• We have confined ourselves to diagonal precision matrices and a Kronecker product structure. This involves just p + n degrees-of-freedom in the noise covariance matrix. Inference of the model with unrestricted precision matrices is not feasible because the number of parameters is then higher than the number, pn, of available data. • As mentioned at the beginning of this Chapter, the model (5.1) is tenable only if the images are captured under the same conditions. This is rarely true in practice, resulting in the presence of artifacts in the inferred factors. The relaxation of known Ωp and Ωn , (5.10) and (5.11), allows these artifacts to be captured and modelled as noise. This property will be illustrated on clinical data later in this Chapter. 5.2.1 Bayesian Inference of FAMIS and Related Models Exact Bayesian inference for the FAMIS model is not available. However, Bayesian solutions for related models can be found in the literature, and are briefly reviewed here. Factor analysis model: this model [84] is closely related to the the FAMIS model, with the following simplifications: (i) positivity of A and X is relaxed; and (ii) matrix Ωp is assumed to be fixed at Ωp = Ip . Bayesian inference of the factor analysis model was presented in [78]. The solution suffers the same difficulties as those of the scalar multiplicative decomposition (Section 3.7). Specifically, the marginal posterior distributions, and their moments, are not tractable. Independent Component Analysis (ICA) model: in its most general form, noisy ICA [104] is a generalization of the FAMIS model. In fact, the implied matrix decomposition, D = AX + E (4.2), is identical. In ICA, A is known as the mixing matrix and the rows of X are known as sources. Therefore, in a sense, any method—such as the procedures which follow in this Chapter—may be called ICA. However, the keyword of ICA is the word ‘independent’. The aim of ICA is to infer statistically independent sources, X: f (X|D) =
r
f (xj |D) .
(5.15)
j=1
Note that this assumption does not imply any particular functional form of the probability distribution. In fact, for a Gaussian distribution, the model (5.15) is identical to the one implied by PCA (4.4). A VB-approximation for (5.15), with priors, f (xj ), chosen as a mixture of Gaussian distributions, was presented in [29].
5.3 The VB Method for the FAMIS Model The FAMIS observation model (5.12) is a special case of the matrix decomposition model (4.2) with additional assumptions. We can therefore expect similarities to the VB method of Section 4.2.
5.3 The VB Method for the FAMIS Model
95
Step 1: Choose a Bayesian model We must complement the observation model (5.12) with priors on the model parameters. The prior distributions on the precision parameters are chosen as follows: f (ωp |ϑp0 , ρp0 ) = f (ωn |ϑn0 , ρn0 ) =
p
i=1 n
Gωi,p (ϑi,p0 , ρi,p0 ) ,
(5.16)
Gωt,n (ϑt,n0 , ρt,n0 ) ,
t=1
with vector shaping parameters, ϑp0 = [ϑ1,p0 , . . . , ϑp,p0 ] , ρp0 = [ρ1,p0 , . . . , ρp,p0 ] , ϑn0 = [ϑ1,n0 , . . . , ϑn,n0 ] and ρn0 = [ρ1,n0 , . . . , ρn,n0 ] . These parameters can be chosen to yield a non-informative prior. Alternatively, asymptotic properties of the noise (Section 5.1.1) can be used to elicit the prior parameters. These shaping parameters can be seen as ‘knobs’ to tune the method to suit clinical practice. The parameters, A and X, are modelled in the same way as those in PPCA, with the additional restriction of positivity. The prior distributions, (4.20) and (4.22), then become p×r , (5.17) f (A|υ) = tN 0p,r , Ip ⊗ Υ −1 , R+ Υ = diag (υ) , υ = [υ1 , . . . , υr ] , f (υj ) = G (αj,0 , βj,0 ) , j = 1, . . . , r, n×r . f (X) = tN 0n,r , In ⊗ Ir , R+
(5.18) (5.19)
The hyper-parameter, Υ , plays an important rôle in inference of rank via the ARD property (Remark 4.4). The shaping parameters, α0 = [α1,0 , . . . , αr,0 ] and β0 = [β1,0 , . . . , βr,0 ] , are chosen to yield a non-informative prior. Step 2: Partition the parameters We partition the parameters into θ1 = A, θ2 = X, θ3 = ωp , θ4 = ωn and θ5 = υ. The logarithm of the joint distribution is then ln f (D, A, X, ωp , ωn , υ) = p 1 n + ln |Ωp | + ln |Ωn | − tr Ωp (D − AX ) Ωn (D − AX ) + 2 2 2 r " p 1 1 + ln |Υ | − tr (AΥ A ) − tr (XX ) + ((α0 − 1) ln υi − β0 υi ) + 2 2 2 i=1 +
n " i=1
((ϑn0 − 1) ln ωi,n − ρn0 ωi,n ) +
p "
((ϑp0 − 1) ln ωi,p − ρp0 ωi,p ) + γ,
i=1
(5.20) where γ denotes all those terms independent of the model parameters, θ.
96
5 Functional Analysis of Medical Image Sequences
Step 3: Write down the VB-marginals The five VB-marginals—for A, X, ωp , ωp and υ—can be obtained from (5.20) by replacement of all terms independent on the inferred parameter by expectations. This step is easy, but yields repeated lengthy distributions, which will not be written down here. Step 4: Identify standard forms In the VB-marginals of Step 3, the Kronecker-product form of the prior covariance matrices for A (5.17) and X (5.19) has been lost. Therefore, we identify the VBmarginals of A and X in their vec forms. The conversion of the matrix Normal distribution into its vec form is given in Appendix A.2. We use the notation a = vec (A) and x = vec (X) (see Notational Conventions on Page XV). The VB-marginals are recognized to have the following standard forms: pr
, (5.21) f˜ (A|D) = f˜ (a|D) = tNA µA , ΣA , R+ nr
f˜ (X|D) = f˜ (x|D) = tNX µX , ΣX , R+ , (5.22) f˜ (υ|D) =
r
Gυj (αj , βj ) ,
(5.23)
Gωi,p (ϑi,p , ρi,p ) ,
(5.24)
Gωt,n (ϑt,n , ρt,n ) .
(5.25)
j=1
f˜ (ωp |D) =
p
i=1
f˜ (ωn |D) =
n
t=1
The associated shaping parameters are as follows:
5.3 The VB Method for the FAMIS Model
97
p DΩ , n X µA = ΣA vec Ω −1 p + Υ ⊗ Ip n X ⊗ Ω ΣA = Ef˜(X|D) X Ω , p DΩ Ω n , µX = ΣX vec A −1 p A ⊗ Ω n + Ir ⊗ In ΣX = Ef˜(A|D) A Ω , 1 α = α0 + p1r,1 , 2 1 β = β0 + diag−1 Ef˜(A|D) [A A] , 2 n ϑp = ϑp0 + 1p,1 , 2 1 n D − DΩ A + n D − A X Ω n X ρp = ρp0 + diag−1 DΩ 2 n X A , +Ef˜(A|D) AEf˜(X|D) X Ω p ϑn = ϑn0 + 1n,1 , 2 1 p D − D Ω p A X − X p D + A Ω ρn = ρn0 + diag−1 D Ω 2 p A X . +Ef˜(X|D) XEf˜(A|D) A Ω
(5.26) (5.27) (5.28) (5.29)
(5.30)
(5.31)
Note that ΣA (5.27) is used in (5.26), and ΣX (5.29) is used in (5.28), for conciseness. Step 5: Formulate necessary VB-moments The necessary VB-moments of the Gamma distributions, (5.23)–(5.25), are υ = α ◦ β −1 , ω p = ω n =
ϑp ◦ ρ−1 p , ϑn ◦ ρ−1 n .
(5.32) (5.33) (5.34)
Hence, the moments of the associated full matrices are Υ = diag ( υ) , Ωp = diag ( ωp ) , Ωn = diag ( ωn ) . The necessary moments of the truncated multivariate Normal distributions, (5.21) and (5.22), are
98
5 Functional Analysis of Medical Image Sequences
A,
X,
n X , p A , Ef˜(A|D) A Ω Ef˜(X|D) X Ω n X A , E ˜ XE A Ω A X . Ef˜(A|D) AEf˜(X|D) X Ω ˜ p f (X|D) f (A|D) (5.35) These are not available in closed-form. Therefore, we do not evaluate (5.35) with respect to (5.21) and (5.22), but rather with respect to the following independence approximations: pr
f˜ (a|D) ≈ tNA µA , diag (σA ) , R+ , (5.36) + nr
, (5.37) f˜ (x|D) ≈ tNX µX , diag (σX ) , R
where σA = diag−1 (ΣA ) , σX = diag−1 (ΣX ) . Recall that the operator, diag−1 (ΣA ) = σA , extracts the diagonal elements of the matrix argument into a vector (see Notational Conventions on Page XV). This choice neglects all correlation between the pixels in A, and between the weights in X. Hence, for example, (5.36) can be written as the product of scalar truncated Normal distributions: f˜ (a|D) ≈
pr
tNak µk,A , σk,A , R+ ,
k=1
with moments 1
− a = µA − σA 2 ◦ ϕ (µA , σA ) ,
a ◦ a = σA + µA ◦ a−
−1 σA 2
(5.38)
◦ κ (µA , σA ) ,
where functions ϕ (·) and κ (·) are defined in Appendix A.4, and ‘◦’ is the Hadamard product. Using (5.38), the necessary matrix moments listed in (5.35) can all be constructed from the following identities: = vect A a, p , (5.39) Z A. (5.40) ◦ a, p diag−1 (Z) + A Ef˜(A|D) [A ZA] = diag vect a Here, Z denotes any of the constant matrices arising in (5.35). Exactly the same identities hold for (5.37). Step 6: Reduce the VB-equations No simplification of the VB-equations was found in this case. Hence, the full set of VB-equations is (5.26)–(5.31), (5.32)–(5.34), and the VB-moments (5.35), where the latter are evaluated via (5.38)–(5.40).
5.4 The VB Method for FAMIS: Alternative Priors
99
Step 7: Run the IVB algorithm The IVB algorithm is run on the full set of VB-equations from the previous step. and X (5.39) can be initialized with any matrices of appropriate dimensions with A positive elements. One such choice are the results of classical inference (Section 5.1.1). Simulation studies suggest that the resulting VB-approximation is insensitive to this initialization. However, special care should be taken in choosing the initial n (5.34). Various choices of initialization were tested in p (5.33) and Ω values of Ω simulation, as we will see in Section 5.5. The most reliable results were achieved using the results of classical inference (Section 5.1.1) as the initializers. Overall, inin using the classical inference results yields significant X, Ω p and Ω tialization of A, computational savings [105]. Remark 5.1 (Automatic Rank Determination (ARD) Property). The shaping parameters, α and β, fulfil the same rôle as those in PPCA, i.e. inference of the number of relevant factors (Remark 4.4). Step 8: Report the VB-marginals The VB-marginals are given by (5.21)–(5.25). Their shaping parameters and moments are inferred using the IVB algorithm in Step 7. Typically, in medical applica and X, of the factor images and the factor curves tions, only the expected values, A respectively, are of interest. From Remark 5.1, values of r can also be reported.
5.4 The VB Method for FAMIS: Alternative Priors The choice of covariance matrix, Ip ⊗Υ −1 in (5.17) and In ⊗Ir in (5.19), is intuitively appealing. It is a simple choice which imposes the same prior independently on each pixel of the factor image, aj , and the same prior on all weights in X. However, evaluation of the posterior distributions, (5.21) and (5.22), under this prior choice is computationally expensive, as we will explain in the next paragraph. Therefore, we seek another functional form for the priors, for which the VB-approximation yields a computationally simpler posterior (Remark 1.1). This will we achieved using properties of the Kronecker product [106]. Consider a typical Kronecker product, e.g. C ⊗ F , arising in the matrix Normal distribution. Here, C ∈ Rp×p and F ∈ Rn×n are both invertible. The Kronecker product is a computationally advantageous form since the following identity holds: −1
(C ⊗ F )
= C −1 ⊗ F −1 .
(5.41)
Hence, the inverse of a (potentially large) pn × pn matrix—arising, for instance, in the FAMIS observation model (5.12)—can be replaced by inversions of two smaller matrices of dimensions p × p and n × n respectively. This computational advantage is lost in (5.27) and (5.29), since they require inversion of a sum of two Kronecker products, which cannot be reduced. Specifically, the posterior covariance structure for A (5.27) is
100
5 Functional Analysis of Medical Image Sequences
−1 p + Υ ⊗ Ip ΣA = E (·) ⊗ Ω .
(5.42)
Hence, identity (5.41) cannot be used and the full pn × pn inversion must be evaluated. Note, however, that the second term in the sum above is the precision matrix of the prior (5.17) which can be chosen by the designer. Hence, a computationally advantageous form of the posterior can be restored if we replace Ip in (5.42) by Ωp . This replacement corresponds to replacement of Ip ⊗ Υ −1 in (5.17) by Ωp−1 ⊗ Υ −1 . Under this choice, the posterior covariance (5.42) is as follows:
p + Υ ⊗ Ω p E (·) ⊗ Ω
−1
=
−1 −1 p −1 . E (·) + Υ ⊗ Ω = E (·) + Υ ⊗Ω p
This is exactly the multivariate case of the alternative prior which we adopted in the scalar decomposition (Section 1.3), in order to facilitate the VB-approximation. In both cases (Section 1.3.2 and above), computational savings are achieved if the prior precision parameter of m or µA —i.e. φ (1.9) or Ip ⊗ Υ (5.17) respectively— are replaced by precision parameters that are proportional to the precision parameter of the observation model. Hence, we replace φ by γω (1.15) in the scalar case of Section 1.3.2, and Ip by Ωp above. The same rationale can be used to adapt the prior covariance structure of X from In ⊗ Ir in (5.19) to Ωn−1 ⊗ Ir . We now re-apply the VB method using these analytically convenient priors. Step 1: Choose a Bayesian Model Given the considerations above, we replace the priors, (5.17) and (5.19), by the following: p×r , (5.43) f (A|υ, Ωp ) = tN 0p,r , Ωp−1 ⊗ Υ −1 , R+ n×r , f (X|Ωn ) = tN 0n,r , Ωn−1 ⊗ Ir , R+
(5.44)
where Ωp and Ωn are parameters now common to the observation model (5.12) and the priors. The rest of the model, i.e. (5.16) and (5.18), is unchanged. Step 2: Partition the parameters The same partitioning is chosen as in Section 5.3. The logarithm of the joint distribution is now
5.4 The VB Method for FAMIS: Alternative Priors
101
ln f (D, A, X, ωp , ωn , υ) = p+r 1 n+r ln |Ωp | + ln |Ωn | − tr Ωp (D − AX ) Ωn (D − AX ) + + 2 2 2 r " p 1 1 + ln |Υ | − tr (Ωp AΥ A ) − tr (X Ωn X) + ((αj,0 − 1) ln υj,i − βj,0 υj ) + 2 2 2 j=1 +
n "
((ϑt,n0 − 1) ln ωt,n − ρt,n0 ωt,n )+
t=1
p "
((ϑi,p0 − 1) ln ωi,p − ρi,p0 ωi,p )+γ.
i=1
(5.45) Step 3: Write down the VB-marginals Once again, this step is easy, yielding five distributions, which we will omit for brevity. Their standard forms are identified next. Step 4: Identify standard forms The posterior distributions, (5.23)–(5.25), are unchanged, while (5.21) and (5.22) are now in the form of the truncated matrix Normal distribution (Appendix A.2): + p×r −1 f˜ (A|D) = tNA MA , Φ−1 , (5.46) Ap ⊗ ΦAr , R
n×r −1 + f˜ (X|D) = tNX MX , Φ−1 . (5.47) Xn ⊗ ΦXr , R The shaping parameters of the VB-marginals are now as follows: Φ−1 , n X MA = D Ω Ar n X + diag ( υ) , ΦAr = Ef˜(X|D) X Ω
(5.48)
p , ΦAp = Ω MX = Φ−1 Xr A Ωp D, p A + Ir , A Ω ΦXr = E ˜ f (A|D)
n , ΦXn = Ω
1 p A , β = β0 + diag−1 Ef˜(A|D) A Ω 2 1 n D − A n D − DΩ A + X Ω n X ρp = ρp0 + diag−1 DΩ 2 n X + Υ A , AE ˜ X Ω +E ˜ f (A|D)
f (X|D)
1 p D − D Ω p A X − X p D + A Ω ρn = ρn0 + diag−1 D Ω 2 p A + Ir X . +Ef˜(X|D) XEf˜(A|D) A Ω
(5.49)
102
5 Functional Analysis of Medical Image Sequences
Step 5: Formulate necessary VB-moments The moments of (5.23)–(5.25)—i.e. (5.32)–(5.34)—are unchanged. The moments of the truncated Normal distributions, (5.21) and (5.22), must be evaluated via independence approximations of the kind in (5.36) and (5.37). Steps 6-8: As in Section 5.3.
5.5 Analysis of Clinical Data Using the FAMIS Model In this study, a radiotracer has been administered to the patient to highlight the kidneys and bladder in a sequence of scintigraphic images. This scintigraphic study was already considered in Section 4.5 of the previous Chapter. Our main aim here is to study the performance of the VB-approximation in the recovery of factors (Section 5.1) from this scintigraphic image sequence, via the FAMIS model (Section 4.5). We wish to explore the features of the VB-approximation that distinguish it from classical inference (Section 5.1.1). These are as follows: The precision matrices of the FAMIS observation model (5.12), Ωp and Ωn , are inferred from the data in tandem with the factors. This unifies the first two steps of classical inference (Section 5.1.1). The oblique analysis step of classical inference (Section 5.1.1) is eliminated, since expected values (5.38) of truncated Normal distributions, (5.21) and (5.22), are guaranteed to be positive. Inference of Rank is achieved via the ARD property of the implied IVB algorithm (Remark 5.1).
Fig. 5.3. Scintigraphic medical image sequence of the human renal system, showing every 4th image from a sequence of n = 120 images. These are displayed row-wise, starting in the upper-left corner.
The scintigraphic image sequence is displayed in Fig. 5.3. The sequence consists of n = 120 images, each of size 64 × 64, with a selected region-of-interest of size
5.5 Analysis of Clinical Data Using the FAMIS Model
103
p = 525 pixels highlighting the kidneys. Note that we adopt the alternative set of priors, outlined in Section 5.4, because of their analytical convenience. Recall that this choice also constitutes a non-informative prior. Our experience indicates that the effect on the posterior inference of either set of priors (Sections 5.3 and 5.4 respectively) is negligible. In the sequel, we will compare the factor inferences emerging from four cases of the VB-approximation, corresponding to four noise modelling strategies. These refer to four choices for the precision matrices of the FAMIS observation model (5.12): Case 1: precision matrices, Ωp and Ωn , were fixed a priori at Ωp = Ip and Ωn = In . This choice will be known as isotropic noise assumption. −1 Case 2: Ωp and Ωn were again fixed a priori at Ωp = diag (D1n,1 ) and Ωn = −1 diag (D 1p,1 ) . This choice is consistent with the correspondence analysis (5.5) pre-processing step of classical inference. Case 3: Ωp and Ωn were inferred using the VB method. Their initial values in the p[1] = Ip and IVB algorithm were chosen as in case 1; i.e. the isotropic choice, Ω n[1] = In . Ω Case 4: Ωp and Ωn were again inferred using the VB method. Their initial values in the IVB algorithm were chosen as in case 2; i.e. the correspondence analysis n[1] = diag (D 1p,1 )−1 . p[1] = diag (D1n,1 )−1 and Ω choice, Ω In cases 3 and 4, where the precision matrices were inferred from the data, the shaping parameters of the prior distributions (5.16) were chosen as non-committal; i.e. ϑp0 = ρp0 = 10−10 1p,1 , ϑn0 = ρn0 = 10−10 1n,1 . The results of these experiments are displayed in Fig. 5.4. The known-precision cases (Cases 1 and 2) are displayed in the first column, while the unknown-precision cases (Cases 3 and 4) are displayed in the second column. We can evaluate the factor inferences with respect to template factor curves for healthy and pathological kidneys [107], as displayed in Fig. 5.5. For this particular dataset, therefore, we can conclude from Case 4 (Fig. 5.4, bottom-right) that the inferred factors correspond to the following physiological organs and liquids, starting from the top: (i) pathological case of the right pelvis; (ii) a linear combination of the left pelvis and the right parenchyma; (iii) the left parenchyma; and (iv) arterial blood. Turning our attention to the VB-approximation in Fig. 5.4, we conclude the following: The number of relevant factors, r, estimated using the ARD property (Remark 5.1) associated with the FAMIS model (5.12), is r = 3 in Case 1, and r = 4 in all remaining cases (Fig. 5.4). This is much smaller than the rank estimated using the PPCA model (Table 4.4, Section 4.5). Furthermore, it corresponds to the value expected by the medical expert, as expressed by the templates in Fig. 5.5. Note that this reduction from the PPCA result has been achieved for all considered noise modelling strategies. Hence, this result is a consequence of the regularization of the inference achieved by the positivity constraints (5.4). This will be discussed later (Remark 5.2).
[Fig. 5.4 panels: each shows a factor image and its factor curve (intensity vs. time [images]) for Cases 1–4; left column: a priori known Ωp and Ωn (isotropic noise assumption; correspondence analysis); right column: inferred Ωp and Ωn, with the IVB algorithm initialized via the stated noise-modelling strategy.]
Fig. 5.4. Posterior expected factor images and factor curves for four noise models.
[Fig. 5.5 panels: healthy kidney (left) and pathological kidney (right); counts vs. time [min] for the parenchyma, pelvis and arterial blood curves.]
Fig. 5.5. Typical activity curves in a healthy kidney (left) and a pathological kidney (right). In this Figure, factor curves are plotted against the actual time of acquisition of the images and cannot be directly compared to the discrete-time indices in Fig. 5.4. It is the overall shape of the curve that is important in clinical diagnosis.
The precision matrices, Ωp and Ωn, are important parameters in the analysis. If we compare the results achieved using known precision matrices (left column of Fig. 5.4) with the cases where we infer the precision matrices via (5.33) and (5.34) (right column of Fig. 5.4), then we can conclude the following:

• The expected values, Ω̂p and Ω̂n, of the posterior distributions of the precision matrices, (5.24) and (5.25) respectively, are similar for both cases of initialization (Cases 3 and 4), and are, in fact, close to the estimates obtained using correspondence analysis (5.5) from classical inference. This is displayed in Fig. 5.6. This finding supports the assumption underlying classical inference, namely that (5.5) is the optimal pre-processing step for scintigraphic data [95].

• In Cases 1 and 2—i.e. those with known precision matrices—the inferred factor curves have sharp peaks at times t = 25 and t = 37, respectively. This behaviour is not physiologically possible. These peaks are significantly reduced in those cases where the precision is inferred, i.e. in Cases 3 and 4. Note that the precision estimates, ω̂t,n, at times t = 25 and t = 37 are significantly lower than those at other times (Fig. 5.6). Thus, in effect, images observed at times t = 25 and t = 37 will be automatically suppressed in the inference procedure, since they will receive lower weight in (5.48).

• The posterior expected values, Â and X̂, are sensitive to the IVB initialization of the precisions, Ω̂p^[1] and Ω̂n^[1] (Cases 3 and 4). Note that the inferred factor images in Case 3 (Fig. 5.4, top-right) are similar to those in Case 1 (Fig. 5.4, top-left). Specifically, the image of arterial blood (4th image in Cases 2 and 4) is not present in either Case 1 or Case 3. The similarity between the factor images in Cases 2 and 4 (Fig. 5.4, bottom row) is also obvious. This suggests that there are many local minima of KLDVB, and that the initial values of the precisions, ω̂p^[1] and ω̂n^[1], determine which one will be reached by the IVB algorithm. This feature of the VB-approximation has already been studied in simulation for the PPCA model, in Section 4.4.2.
Fig. 5.6. Comparison of the posterior expected values of the precisions, ω̂p (left) and ω̂n (right). In the first row, the result of correspondence analysis (classical inference) is shown. In the second and third rows, the VB-approximations are shown for two different choices of IVB initialization.
It is difficult to compare these results to those of classical inference, since the latter does not have an Automatic Rank Determination (ARD) property (Remark 5.1), nor does it infer the precision matrices. It also requires many tuning knobs; hence, an experienced expert could produce results similar to those presented in Fig. 5.4.

Remark 5.2 (Reduction of r via positivity constraints). In Section 4.5, we used the same dataset as that in Fig. 5.3 to infer the rank of the mean matrix, M (rank unknown), in the matrix decomposition (4.1). All of the estimates of rank displayed in Table 4.4 are significantly higher than the number of underlying physiological objects expected by the medical expert. In contrast, the estimates of rank provided by the FAMIS model (Fig. 5.4) were, in all four cases, in agreement with medical opinion. This result can be informally explained by considering how each model is matched to the actual medical data. The real scintigraphic data are composed of three
elements: D = M + N + E. M and E are the elements modelled by PPCA or FAMIS. In the case of PPCA, M is unconstrained except for an unknown-rank restriction, while M obeys positivity constraints in the FAMIS model (5.12). E is the Normally-distributed white noise (4.4). Informally, N is an unmodelled matrix of non-Gaussian noise and physiological residuals. Inevitably, then, the PPCA and FAMIS models yield estimates of the modelled parameters, M and E, that are corrupted by the residuals, N, as follows:

   M̂ = M + N_M,
   Ê = E + N_E.

N_M and N_E are method-dependent parts of the residuals, such that N = N_M + N_E. If the criteria of separation are (i) rank-restriction of M with unknown rank, r, and (ii) Gaussianity of the noise, E, then only a small part of N fulfils (ii), but a large part of N fulfils (i), since M has unrestricted rank. Consequently, this rank, r, is significantly overestimated. However, if we now impose a third constraint, namely (iii) positivity of the signal M (5.4), we can expect that only a small part of N fulfils (iii), 'pushing' the larger part, N_E, of N into the noise estimate, Ê.
5.6 Conclusion

In this Chapter, we have applied the VB method to the problem of functional analysis of medical image sequences. Mathematical modelling for these medical image sequences was reviewed. Bayesian inference of the resulting model was found to be intractable. Therefore, we introduced the FAMIS model as a suitable approximation. Exact Bayesian inference remained intractable, but the VB method yielded a tractable approximate posterior inference of the model parameters. Our experience with the VB method in this context can be summarized as follows:

Choice of priors: we considered two different choices of non-informative priors. These were (i) isotropic i.i.d. distributions (Section 5.3), and (ii) priors whose parameters were shared with the observation model (Section 5.4). We designed the latter (alternative) prior specifically to facilitate a major simplification in the resulting VB inference. This yielded significant computational savings without any significant influence on the resulting inferences.

IVB initialization: the implied VB-equations do not have a closed-form solution. Hence, the IVB algorithm must be used. Initialization of the IVB algorithm is an important issue, since it determines (i) the speed of convergence of the algorithm, and (ii) which of the (non-unique) local minima of KLDVB is found. Initialization of the IVB algorithm via correspondence analysis from classical inference provides a reliable choice, conferring satisfactory performance in terms of speed and accuracy.
The performance of the VB-approximation for the FAMIS model was tested in the context of a real medical image sequence (a renal scintigraphic study). The approximation provides satisfactory inference of the underlying biological processes. In contrast to classical inference—which is based on the certainty-equivalence approach—the method was able to infer (i) the number of underlying physiological factors, and (ii) uncertainty bounds for these factors.
6 On-line Inference of Time-Invariant Parameters
On-line inference is exactly that: the observer of the data is 'on' the 'line' that is generating the data, and has access to the data items (observations), dt, as they are generated. In this sense, the data-generating system is still in operation while learning is taking place [108]. This contrasts with the batch mode of data acquisition, D, examined in Chapters 4 and 5. In cases where a large archive of data already exists, on-line learning refers to the update of our knowledge in light of sequential retrieval of data items from the database.

Any useful learning or decision algorithm must exploit the opportunities presented by this setting. In other words, inferences should be updated in response to each newly observed data item. In the Bayesian setting, we have already noted the essential technology for achieving this, namely repetitive Bayesian updates (2.14), generating f(θ|Dt) from f(θ|Dt−1) as each new observation, dt, arrives. The ambition to implement such updates at all t will probably be frustrated, for two reasons: (i) we have a strictly limited time resource in which to complete the update, namely ∆t, the sampling period between data items (we have uniformly sampled discrete-time data in mind); and (ii) the number of degrees-of-freedom in the inference, f(θ|Dt), is generally of the order of the number of degrees-of-freedom in the aggregated data record, Dt (2.1), and grows without bound as t → ∞. The requirement to update f(θ|Dt), and to extract moments, marginals and/or other decisions from it, will eventually 'break the bank' of ∆t!

Only a strictly limited class of observation models is amenable to tractable Bayesian on-line inference, and we will review this class (the Dynamic Exponential Family) in Section 6.2.1. Our main aim in this Chapter is to examine how the Variational Bayes (VB) approximation can greatly increase the possibilities for successful implementation of Bayesian on-line learning. We will take care to define these contexts as generally as possible (via the three so-called 'scenarios' of Section 6.3), but also to give worked examples of each context. In this Chapter, we assume stationary parametric modelling via a time-invariant vector of parameters, θ. The time-variant case leads to distinct application scenarios for VB, and will be handled in the next Chapter.
6.1 Recursive Inference

On-line inference and learning algorithms are essential whenever the observer is required to react or interact in light of the sequentially generated observations, dt. Examples can be found whenever time-critical decisions must be made, such as in the following areas:

Prediction: knowledge about future outputs should be improved in light of each new observation [53, 109]; a classical application context is econometrics [56].

Adaptive control: the observer wishes to 'close the loop', and design the best control input for the next sampling time [52, 110]; in predictive control [111], this task is accomplished via effective on-line prediction.

Fault detection: possible faults in the observed system must be recognized as soon as possible [112].

Signal processing: adaptive filters must be designed to suppress noise and reconstruct signals in real time [113–115].

All the above tasks must satisfy the following constraints:

(i) the inference algorithm must use only the data record, Dt, available at the current time; i.e. the algorithm must be causal.
(ii) the computational complexity of the inference algorithm should not grow with time.

These two requirements imply that the data, Dt, must be represented within a finite-dimensional 'knowledge base' as t → ∞. This knowledge base may also summarize side information, prior and expert knowledge, etc. In the Bayesian context, this extra conditioning knowledge, beyond Dt, was represented by Jeffreys' notation, I (Section 2.2.1), and the knowledge base was the posterior distribution, f(θ|Dt, I). In the general setting, the mapping from Dt to the knowledge base should be independent of t, and can be interpreted as a data-compression process. This compression of all available data has been formalized by two concepts in the literature: (a) the state-space approach [108, 116], and (b) sufficient statistics [42, 117]. In both cases, the inference task (parameter inference in our case) must address the following two sub-tasks (in no particular order):

(a) infer the parameters from the compressed data history and the current data record (observation), dt;
(b) assimilate the current observation into the compressed data history.

An algorithm that performs these two tasks is known as a recursive algorithm [108].
6.2 Bayesian Recursive Inference

In this Section, we review models for which exact, computationally tractable, Bayesian recursive inference algorithms exist. The defining equation of Bayesian on-line
inference is (2.14): the posterior distribution at time t − 1, i.e. f(θ|Dt−1), is updated to the posterior distribution at time t, i.e. f(θ|Dt), via the observation model, f(dt|θ, Dt−1), at time t. Invoking the principles for recursive inference above, the following constraints must apply:

1. The observation model must be conditioned on only a finite number, ∂, of past observations. This follows from (2.14). These past observations enter the observation model via an m-dimensional, time-invariant mapping, ψ, of the following kind:

   ψt = ψ(dt−1, . . . , dt−∂), ψt ∈ Rm, (6.1)

   with m < ∞ and ∂ < ∞. Under this condition,

   f(dt|θ, Dt−1) = f(dt|θ, ψt), (6.2)

   and ψt is known as the regressor at time t. An observation model of this type, with regression onto the data (observation) history, is called an autoregressive model. In general, ψt (6.1) can only be evaluated for t > ∂. Therefore, the recursive inference algorithm typically starts at t = ∂ + 1, allowing the regressor, ψ∂+1 = ψ(D∂), to be assembled in the first step. The recursive update (2.14) now has the following form:

   f(θ|Dt) ∝ f(dt|θ, ψt) f(θ|Dt−1), t > ∂. (6.3)

2. The knowledge base, f(·), must be finite-dimensional, ∀t. In the Bayesian paradigm, the knowledge base is represented by f(θ|Dt). The requirement for a time-invariant, finite-dimensional mapping from Dt to the knowledge base is achieved by the principle of conjugacy (Section 2.2.3.1). Therefore, we restate the conjugacy relation (2.12) in the on-line context, and we impose (6.2), as follows:

   f(θ|st) ∝ f(dt|θ, ψt) f(θ|st−1), t > ∂. (6.4)

   The distribution, f(θ|st), is said to be conjugate to the observation model, f(dt|θ, ψt). The notation, f(·|st), denotes a fixed functional form, updated only via the finite-dimensional sufficient statistics, st. The posterior distribution is therefore uniquely determined by st, and the functional recursion (6.3) can be replaced by the following algebraic recursion in st:

   st = s(st−1, ψt, dt), t > ∂. (6.5)

A recursive algorithm has been achieved via compression of the complete current data record, Dt, into a finite-dimensional representation, st. The algebraic recursion (6.5) achieves Bayesian inference of θ, ∀t, as illustrated in Fig. 6.1. Of course, the mappings ψ (6.1) and s (6.5) must be implementable within the sampling period, ∆t, if the recursive algorithm is to be operated in real time. If the observation model does not have a conjugate distribution on its parameters, the computational complexity of full Bayesian inference is condemned to grow with time t. Hence, our main aim is to design Bayesian recursive algorithms by verifying—or restoring—conjugacy.
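To make the roles of ψ and s in (6.1)–(6.5) concrete, the following Python sketch outlines a generic conjugate recursive-inference loop. The names (data_stream, update_statistics, build of the regressor from a bounded history) are hypothetical placeholders, not part of the original text; the structure simply mirrors the two sub-tasks of Section 6.1.

from collections import deque

def recursive_inference(data_stream, update_statistics, s0, depth):
    """Generic conjugate recursion (6.5): s_t = s(s_{t-1}, psi_t, d_t).

    data_stream       : iterable yielding observations d_t, t = 1, 2, ...
    update_statistics : user-supplied map (s, psi, d) -> s, assumed to implement
                        the conjugate update for the chosen observation model
    s0                : prior sufficient statistics
    depth             : memory depth (the quantity denoted by the symbol before (6.1))
    """
    s = s0
    history = deque(maxlen=depth)      # holds d_{t-1}, ..., d_{t-depth}
    for d in data_stream:
        if len(history) == depth:      # regressor is available only once the history is full
            psi = list(history)        # psi_t = psi(d_{t-1}, ..., d_{t-depth})
            s = update_statistics(s, psi, d)   # algebraic recursion (6.5)
        history.appendleft(d)          # assimilate d_t into the compressed data history
    return s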
Conjugacy must hold even in the first step of the recursion, i.e. at t = ∂ + 1 (6.3). Therefore, the prior, f(θ|s∂) ≡ f(θ|s0), must also be conjugate to the observation model (6.2).
Fig. 6.1. Illustration of Bayes’ rule in the recursive on-line scenario with time-invariant parameterization. Upper diagram: update of posterior distribution via (6.3). Lower diagram: update of sufficient statistics (6.5).
6.2.1 The Dynamic Exponential Family (DEF)

A conjugate distribution exists for every observation model belonging to the exponential family [117]:

   f(dt|θ) = exp{q(θ)' u(dt) − ζd(θ)}. (6.6)

This result is not directly useful in recursive on-line inference, since autoregressive models (i.e. models with dependence on a data regressor, ψt (6.2)) are not included. Therefore, in the following proposition, we extend conjugacy to autoregressive models [52].

Proposition 6.1 (Conjugacy for the Dynamic Exponential Family (DEF)). Let the autoregressive observation model (6.2) be of the following form:

   f(dt|θ, ψt) = exp{q(θ)' u(dt, ψt) − ζdt(θ)}. (6.7)

Here, q(θ) and u(dt, ψt) are η-dimensional vector functions, and

   exp ζdt(θ) = ∫_D exp{q(θ)' u(dt, ψt)} d dt, (6.8)

where ζdt(θ) is a normalizing constant, which must be independent of ψt. Then the distribution

   f(θ|st) = exp{q(θ)' vt − νt ζdt(θ) − ζθ(st)}, (6.9)

with vt ∈ Rη, νt ∈ R+, and sufficient statistics st = [vt', νt]' ∈ Rη+1, is conjugate to the observation model (6.7). Here,

   exp ζθ(st) = ∫_Θ* exp{q(θ)' vt − νt ζdt(θ)} dθ,
where ζθ(st) is the normalizing constant for f(θ|st). Inserting (6.9) at time t − 1 and (6.7) into Bayes' rule (6.4), the following update equations emerge for the sufficient statistics of the posterior distribution:

   νt = νt−1 + 1, (6.10)
   vt = vt−1 + u(dt, ψt). (6.11)
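As a concrete instance of (6.10)–(6.11), consider a first-order Markov chain on a finite alphabet, one of the classic DEF members discussed in the notes below: u(dt, ψt) is then the indicator of the observed transition, and the conjugate statistics are transition counts. The following is a minimal Python sketch under these assumptions, with a uniform Dirichlet prior chosen purely for illustration:

import numpy as np

def markov_chain_recursive_update(d, n_states, v0=None):
    """Conjugate recursion (6.10)-(6.11) for a first-order Markov chain.

    d  : sequence of states in {0, ..., n_states-1}
    v0 : prior (Dirichlet) transition pseudo-counts, shape (n_states, n_states)
    """
    V = np.ones((n_states, n_states)) if v0 is None else v0.copy()
    nu = 0
    for t in range(1, len(d)):               # regressor psi_t = d_{t-1} (depth 1)
        V[d[t - 1], d[t]] += 1.0             # v_t = v_{t-1} + u(d_t, psi_t)   (6.11)
        nu += 1                              # nu_t = nu_{t-1} + 1             (6.10)
    trans_mean = V / V.sum(axis=1, keepdims=True)   # posterior mean transition matrix
    return V, nu, trans_mean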
Remark 6.1 (Normalizing constants, ζ). In this book, the symbol ζ will be reserved for normalizing constants. Its subscript will declare the random variable whose full conditional distribution is generated when the joint distribution is divided by this quantity. Thus, for example, f(θ|D) = ζθ^−1 f(θ, D), and so ζθ = f(D). Occasionally, we will denote the functional dependence of ζ on the conditioning quantity explicitly. Hence, in the case considered, we have ζθ ≡ ζθ(D).

Notes:
• The proof of Proposition 6.1 is available in [117].
• In the sequel, the family (6.7) will be known as the Dynamic Exponential Family (DEF). The respective conjugate distributions (6.9) will be known as CDEF parameter distributions. Comparing with (6.6), the DEF family can be understood as an extension of the exponential family to cases with autoregression; (6.6) is recovered for ψt = {}.
• If the observation model satisfies two properties, namely (i) smoothness, and (ii) a support that does not depend on θ, then the (dynamic) exponential family is the only family for which a conjugate distribution exists [117]. If these properties do not hold, then sufficient statistics may still exist. For example, sufficient statistics exist for the uniform distribution parameterized by its boundaries, and for the Pareto distribution [51]. However, there are no generalizable classes of such observation models. Therefore, we will restrict our attention to the DEF family in this book.
• In general, the integral on the right-hand side of (6.8) may depend on the regressor, yielding ζdt(θ, ψt). The DEF family, however, requires that ζdt(θ, ψt) = ζdt(θ), constituting a major restriction. As a result, the following models are (almost) the only members of the family [52]: 1. Normal (Gaussian) linear-in-parameters models in the case of continuous parameters; 2. Markov chain models in the case of discrete variables. These are also the models of greatest relevance to signal processing.
• The regressor ψt (6.1) may also contain known (observed) external variables, ξt; i.e.

   ψt = ψ(dt−1, . . . , dt−∂, ξt). (6.12)

ξt are known as exogenous variables in the control literature [118], and (6.2), with (6.12), defines an AutoRegressive model with eXogenous variables (ARX). In this case, the DEF model can admit ζdt(θ, ψt) = ζ(θ) ν(ξt), where ν(ξt) is a scalar function of ξt. Then, update (6.10) becomes
   νt = νt−1 + ν(ξt).

• (6.9) was defined as the conjugate distribution for (6.7) with sufficient statistics of minimum length. In fact, the following distribution is also conjugate to (6.7):

   f(θ|st) = exp{q̄(θ)' v̄t − νt ζdt(θ) − ζθ(st)}. (6.13)

Here, q̄(θ) = [q(θ)', q0(θ)']' and v̄t = [vt', v0']' is an extended vector of sufficient statistics. Once again, vt (which is now only a subset of the extended sufficient statistics, v̄t) is updated by (6.11). v0 is not, however, updated by the data, and is therefore inherited unchanged from the prior. Therefore, v0 is a function of the prior parameters. This design of the conjugate distribution may be useful if the prior is to fulfil a subsidiary rôle, such as regularization of the model (Section 3.7.2). However, if we wish to design conjugate inference with non-informative priors, we invariably work with the unextended conjugate distribution (6.9).

6.2.2 Example: The AutoRegressive (AR) Model
Fig. 6.2. The signal flowgraph of the AutoRegressive (AR) model.
The univariate (scalar), time-invariant AutoRegressive (AR) model is defined as follows:

   dt = Σ_{k=1}^{m} ak dt−k + σ et, (6.14)
where m ≥ 1, et denotes the input (innovations process), and dt the output (i.e. observation) of the system. The standard signal flowgraph is illustrated in Fig. 6.2 [119]. The problem is to infer recursively the fixed, unknown, real parameters, r = σ² (the variance of the innovations) and a = [a1, . . . , am]'. The Bayesian approach to this problem is based on the assumption that the innovations sequence, et, is i.i.d. (and
therefore white) with a Normal distribution: f(et) = N(0, 1). The fully probabilistic univariate AR observation model is therefore

   f(dt|a, r, ψt) = Ndt(a'ψt, r) (6.15)
                  = (1/√(2πr)) exp{−(1/(2r)) (dt − a'ψt)²}, (6.16)

where t > m = ∂, and ψt = [dt−1, . . . , dt−m]' is the regressor (6.1). The observation model (6.15) belongs to the DEF family (6.7) under the assignments θ = [a', r]', and

   q(θ) = −(1/2) r^−1 vec([1, −a']'[1, −a']), (6.17)
   u(dt, ψt) = vec([dt, ψt']'[dt, ψt']) = vec(ϕt ϕt'), (6.18)
   ζdt(θ) = ln(√(2πr)). (6.19)

In (6.18),

   ϕt = [dt, ψt']' (6.20)

is the extended regressor. We will refer to the outer product of ϕt in (6.18) as a dyad. Using (6.9), the conjugate distribution has the following form:

   f(a, r|vt, νt) ∝ exp{−(1/2) r^−1 vec([1, −a']'[1, −a'])' vt − νt ln(√(2πr))}, (6.21)

where vt ∈ Rη, η = (m + 1)², and st = [vt', νt]' ∈ Rη+1 (Section 6.2.1) is the vector of sufficient statistics. Distribution (6.21) can be recognized as Normal-inverse-Gamma (N iG) [37]. We prefer to write (6.21) in its usual matrix form, using the extended information matrix, Vt ∈ R(m+1)×(m+1): Vt = vect(vt, m + 1), f(a, r|st) = N iG_{a,r}(Vt, νt), where vect(·) denotes the vec-transpose operator (see Notational Conventions on Page XV) [106], st = {Vt, νt}, and

   N iG_{a,r}(Vt, νt) ≡ (1/ζ_{a,r}(Vt, νt)) r^{−0.5νt} exp{−(1/2) r^−1 [−1, a'] Vt [−1, a']'}, (6.22)

   ζ_{a,r}(Vt, νt) = Γ(0.5νt) λt^{−0.5νt} |Vaa,t|^{−0.5} (2π)^{0.5}, (6.23)

   Vt = [ V11,t  Va1,t' ; Va1,t  Vaa,t ],   λt = V11,t − Va1,t' Vaa,t^−1 Va1,t. (6.24)

(6.24) denotes the partitioning of Vt ∈ R(m+1)×(m+1) into blocks, where V11,t is the (1, 1) element. From (6.10) and (6.11), the updates for the statistics are
   Vt = Vt−1 + ϕt ϕt' = V0 + Σ_{i=m+1}^{t} ϕi ϕi', (6.25)
   νt = νt−1 + 1 = ν0 + t − m,   t > m.
Here, the prior has been chosen to be the conjugate distribution, N iG_{a,r}(a, r|V0, ν0). Note, from Appendix A.3, that V0 must be positive definite, and ν0 > m + p + 1. The vector form of the sufficient statistics (6.11) is revealed if we apply the vec(·) operator (see Page XV) on both sides of (6.25). The following posterior inferences will be useful in later work:

Posterior means:

   ât ≡ E_{N iG_{a,r}}[a] = Vaa,t^−1 Va1,t,
   r̂t ≡ E_{N iG_{a,r}}[r] = λt / (νt − m − 4). (6.26)

Predictive distribution: the one-step-ahead predictor (2.23) is given by the ratio of normalizing constants (6.23), a result established in general in Section 2.3.3. In the case of the AR model, then, using (6.23),

   f(dt+1|Dt) = ζ_{a,r}(Vt + [dt+1, ψt+1']'[dt+1, ψt+1'], νt + 1) / (√(2π) ζ_{a,r}(Vt, νt)). (6.27)

This is Student's t-distribution [97] with νt − m − 2 degrees of freedom. The mean value of this distribution is readily found to be

   d̂t+1 = ât' ψt+1, (6.28)

using (6.26).

Remark 6.2 (Classical AR estimation). The classical approach to estimation of a and r in (6.15) is based on the Minimum Mean Squared-Error (MMSE) criterion, also called the Wiener criterion, which is closely related to Least Squares (LS) estimation (Section 2.2.2) [54]. Parameter estimates are obtained by solution of the normal equations. Two principal approaches to the evaluation of these equations are known as the covariance and correlation methods respectively [53]. In fact, the posterior mean (6.26), with the prior parameter V0 (6.25) set to zero, is equivalent to the result of the covariance method of parameter estimation [120]. There are many techniques for the numerical solution of these equations, including recursive ones, such as the Recursive Least Squares (RLS) algorithm [108].
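To illustrate the recursions (6.25)–(6.28) numerically, the following is a minimal Python/numpy sketch, not the authors' reference implementation; the prior values V0 and nu0 are illustrative placeholders.

import numpy as np

def ar_recursive_inference(d, m, V0, nu0):
    """Recursive Bayesian identification of a univariate AR(m) model.

    Implements the dyadic statistic updates (6.25), the posterior means (6.26)
    and the one-step-ahead prediction mean (6.28). V0 must be positive definite
    and nu0 chosen so that nu_t - m - 4 > 0 when r_hat is first evaluated.
    """
    d = np.asarray(d, dtype=float)
    V, nu = V0.astype(float).copy(), float(nu0)
    a_hat, r_hat = np.zeros(m), float("nan")
    predictions = []
    for t in range(m, len(d)):
        psi = d[t - m:t][::-1]                 # regressor psi_t = [d_{t-1}, ..., d_{t-m}]'
        predictions.append(float(a_hat @ psi)) # prediction mean (6.28), using current a_hat
        phi = np.concatenate(([d[t]], psi))    # extended regressor phi_t (6.20)
        V += np.outer(phi, phi)                # V_t = V_{t-1} + phi_t phi_t'   (6.25)
        nu += 1.0                              # nu_t = nu_{t-1} + 1
        Vaa, Va1, V11 = V[1:, 1:], V[1:, 0], V[0, 0]
        a_hat = np.linalg.solve(Vaa, Va1)      # posterior mean of a             (6.26)
        lam = V11 - Va1 @ a_hat                # lambda_t from (6.24)
        r_hat = lam / (nu - m - 4.0)           # posterior mean of r             (6.26)
    return a_hat, r_hat, predictions

For instance, ar_recursive_inference(d, m=2, V0=1e-6*np.eye(3), nu0=7.0) processes a data vector d in a single pass, returning the terminal posterior means and the sequence of one-step predictions.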
Remark 6.3 (Multivariate AutoRegressive (AR) model). Bayesian inference of the AR model can be easily adapted for multivariate (p > 1) observations. Then, the observation model (6.14) can be written in matrix form:

   dt = A ψt + R^{1/2} et. (6.29)
This can be formalized probabilistically as the multivariate Normal distribution (Appendix A.1):

   f(dt|A, R, ψt) = Ndt(A ψt, R). (6.30)

Here, dt ∈ Rp are the vectors of data, A ∈ Rp×m is a matrix of regression coefficients, and ψt ∈ Rm is the regressor at time t, which may be composed of previous observations, exogenous variables (6.12) and constants. R^{1/2} denotes the matrix square root of R; i.e. R = R^{1/2} R^{1/2} ∈ Rp×p is the covariance matrix of the vector innovations sequence, et (6.29). Occasionally, it will prove analytically more convenient to work with the inverse covariance matrix, Ω = R^−1, which is known as the precision matrix.

The posterior distribution on the parameters A and R is of the Normal-inverse-Wishart (N iW) type [37] (Appendix A.3). The sufficient statistics of the posterior distribution are again of the form Vt, νt, updated as given by (6.25), where, now, ϕt = [dt', ψt']' ∈ Rp+m (6.20). One immediate consequence is that Vt ∈ R(p+m)×(p+m), and it is partitioned as follows:

   Vt = [ Vdd,t  Vad,t' ; Vad,t  Vaa,t ],   Λt = Vdd,t − Vad,t' Vaa,t^−1 Vad,t, (6.31)

where Vdd,t is the (p × p) upper-left sub-block of Vt. The objects Λt and Vdd,t, Vad,t are used in the posterior distribution in place of λt and V11,t, V1d,t respectively (Appendix A.3).

6.2.3 Recursive Inference of non-DEF models

If the observation model is not from the DEF family (6.7), the exact posterior parameter inference does not have finite-dimensional statistics, and thus recursive inference is not feasible (Section 6.2). In such cases, we seek an approximate recursive inference procedure. Two principal approaches to approximation are as follows:

Global approximation: this seeks an optimal approximation for the composition of an indeterminate number of Bayesian updates (6.3). In general, this approach leads to analytically complicated solutions. However, in special scenarios, computationally tractable solutions can be found [121, 122].

One-step approximation: this seeks an optimal approximation for just one Bayesian update (6.3), as illustrated in Fig. 6.3. In general, this approach is less demanding from an analytical point of view, but it suffers from the following limitations:
• The error of approximation may grow with time. Typically, the quality of the approximation can only be studied asymptotically, i.e. for t → ∞.
• In the specific case of on-line inference given a set of i.i.d. observations, this approach yields different results depending on the order in which the data are processed [32].

One-step approximation is a special case of local approximation, with the latter embracing two- and multi-step approximations. In all cases, the problems mentioned above still pertain. In this book, we will be concerned with one-step approximations, and, in particular, one-step approximation via the VB method (Section 3.5).
Fig. 6.3. Illustration of the one-step approximation of Bayes’ rule in the on-line scenario with time-invariant parameterization. B denotes the (on-line) Bayes’ rule operator (Fig. 6.1) and A denotes a distributional approximation operator (Fig. 3.1). The double tilde notation is used to emphasize the fact that the approximations are accumulated in each step. From now on, only the single tilde notation will be used, for convenience.
6.3 The VB Approximation in On-Line Scenarios

The VB method of approximation can only be applied to distributions which are separable in parameters (3.21). We have seen how to use the VB operator both for approximation of the full distribution (Fig. 3.3) and for generation of VB-marginals (Fig. 3.4). In the on-line scenario, these can be composed with the Bayes' rule update of Fig. 6.1 (i.e. equation (2.14) or (6.3)) in many ways, each leading to a distinct algorithmic design. We now isolate and study three key scenarios in on-line signal processing where the VB-approximation can add value, and where it can lead to the design of useful, tractable recursive algorithms. The three scenarios are summarized in Fig. 6.4 via the composition of the Bayes' rule (B, see Fig. 6.1) and VB-approximation (V, see Fig. 3.3) operators. We now examine each of these scenarios carefully, emphasizing the following issues:

1. the exact Bayesian inference task which is being addressed;
2. the nature of the resulting approximation algorithm;
3. the family of observation models for which each scenario is feasible;
4. the possible signal processing applications.

6.3.1 Scenario I: VB-Marginalization for Conjugate Updates

Consider the Bayesian recursive inference context (Section 6.2), where (i) sufficient statistics are available, and (ii) we are interested in inferring marginal distributions (2.6) and posterior moments (2.7) at each step t. The formal scheme is given in Fig. 6.5 (top). We assume that q = 2 here (3.9) for convenience. In the lower schematic, the VB operator has replaced the marginalization operators (see Fig. 3.4). Since sufficient statistics are available, the VB method is used in the same way as in the off-line case (Section 3.3.3). In particular, iterations of the IVB algorithm (Algorithm 1) are typically required in each time step, in order to generate the VB-marginals. This approach can be used if the following conditions hold:

(i) The observation model is from the Dynamic Exponential Family (DEF) (6.7):
6.3.1 Scenario I: VB-Marginalization for Conjugate Updates Consider the Bayesian recursive inference context (Section 6.2), where (i) sufficient statistics are available, and (ii) we are interested in inferring marginal distributions (2.6) and posterior moments (2.7) at each step t. The formal scheme is given in Fig. 6.5 (top). We assume that q = 2 here (3.9) for convenience. In the lower schematic, the VB operator has replaced the marginalization operators (see Fig. 3.4). Since sufficient statistics are available, the VB method is used in the same way as in the off-line case (Section 3.3.3). In particular, iterations of the IVB algorithm (Algorithm 1) are typically required in each time step, in order to generate the VBmarginals. This approach can be used if the following conditions hold: (i) The observation model is from the Dynamic Exponential Family (DEF) (6.7):
f (dt |θ1 , θ2 , ψt ) ∝ exp q (θ1 , θ2 ) u (dt , ψt ) .
6.3 The VB Approximation in On-Line Scenarios
119
Inference of marginals
B
B V
V
Propagation of VB-approximation
B
V
×
B
V
×
Inference of marginals for observation models with hidden variables
B
V
B
V
Fig. 6.4. Three possible scenarios for application of the VB operator in on-line inference. f (dt |θ, Dt−1 ) f (θ|Dt−1 )
B
f (θ|Dt )
f (θ1 |Dt )
f (θ2 |Dt )
f (dt |θ, Dt−1 ) f (θ|Dt−1 )
f (θ|Dt )
B V f˜(θ1 |Dt )
f˜(θ2 |Dt )
Fig. 6.5. Generation of marginals (q = 2) in conjugate Bayesian recursive inference for timeinvariant parameters, using the VB-approximation.
120
6 On-line Inference of Time-Invariant Parameters
(ii) The posterior distribution is separable in parameters (3.21):
f (θ1 , θ2 |Dt ) ∝ exp g (θ1 , Dt ) h (θ2 , Dt ) . These two conditions are satisfied iff
f (dt |θ1 , θ2 , ψt ) ∝ exp (q1 (θ1 ) ◦ q2 (θ2 )) u (dt , ψt ) ,
(6.32)
where ◦ denotes the Hadamard product (i.e. an element-wise product of vectors, ‘.∗’ in MATLAB notation; see Page XV). The set of observation models (6.32) will be important in future applications of the VB method in this book, and so we will refer to it as the Dynamic Exponential Family with Separable parameters (DEFS). This scenario can be used in applications where sufficient statistics are available, but where the posterior distribution is not tractable (e.g. marginals and moments of interest are unavailable, perhaps due to the unavailability of the normalizing constant ζθ (2.4)). Example 6.1 (Recursive inference for the scalar multiplicative decomposition (Section 3.7)). We consider the problem of generating—on-line—the posterior marginal distributions and moments of a and x in the model dt = ax + et (3.61). We assume that the residuals et are i.i.d. N (0, re ). Hence, the observation model is f (dt |a, x, re ) = f (dt |θ, re ) = N (θ, re ) ,
(6.33)
i.e. the scalar Normal distribution with unknown mean θ = ax. Note that (6.33) is a member of DEFS family (6.32) since, trivially, θ = a ◦ x, and it is therefore a candidate for inference under scenario I. Indeed, recursive inference of θ in this case is a rudimentary task in statistics [8]. The posterior distribution, t ≥ 1, has the following form: t 1 1" dτ , re f (a, x|ra , rx ) . (6.34) f (a, x|Dt , re , ra , rx ) ∝ Na,x t τ =1 t Under the conjugate prior assignment, (3.64) and (3.65), this posterior distribution is also normal, and is given by (3.66) under the following substitution: 1" dt , t τ =1 t
d→ re →
1 re . t
In this case, the extended sufficient statistics vector, v t = [vt , v0 ] (6.13), is given by t 1" 1 dτ , re , (6.35) vt = t τ =1 t v0 = [ra , rx ] .
6.3 The VB Approximation in On-Line Scenarios
121
This is an example of a non-minimal conjugate distribution, as discussed in Section 6.2.1 (see Eq. 6.13). vt is updated recursively via (6.35), and the VB-marginals and moments, (3.72)–(3.76), are elicited at each time, t, via the VB method (Section 3.3.3), using either the analytical solution of Step 6, or the IVB algorithm of Step 7. This example is representative of the signal processing contexts where scenario I may add value: sufficient statistics for recursive inference are available, but marginalization of the posterior distribution is not tractable. 6.3.2 Scenario II: The VB Method in One-Step Approximation We replace the one-step approximation operator, A, in Fig. 6.3 with the VB-approximation operator, V (Fig. 3.3). The resulting composition of operators is illustrated in Fig. 6.6. f (dt |θ, Dt−1 ) f˜(θ|Dt−1 )
B f (dt |θ, Dt−1 )
f˜(θ|Dt−1 )
B
f˜(θ|Dt )
f˜(θ|Dt )
A
f˜(θ|Dt )
f˜(θ1 |Dt ) V
f˜(θ2 |Dt )
×
f˜(θ|Dt )
Fig. 6.6. One-step approximation of Bayes’ rule in on-line, time-invariant inference, using the VB-approximation (q = 2).
In this scenario, we are propagating the VB-marginals, and so the distribution at time t − 1 has already been factorized by the VB-approximation (3.16). Hence, using (3.12) in (2.14), the joint distribution at time t is f (θ, dt |Dt−1 ) ∝ f (dt |θ, Dt−1 ) f˜ (θ1 |Dt−1 ) f˜ (θ2 |Dt−1 ) .
(6.36)
Using Theorem 3.1, the minimum of KLDVB is reached for f˜ (θi |Dt ) ∝ exp Ef˜(θ\i |Dt ) [ln f (dt |θ, Dt−1 )] + ln f˜ (θi |Dt−1 ) ∝ exp Ef˜(θ\i |Dt ) [ln f (dt |θ, Dt−1 )] f˜ (θi |Dt−1 ) = f˜ (dt |θi , Dt−1 ) f˜ (θi |Dt−1 ) , i = 1, 2, where
f˜ (dt |θi , Dt−1 ) ∝ exp Ef˜(θ\i |Dt ) [ln f (dt |θ, Dt−1 )] .
(6.37) (6.38)
Remark 6.4 ( VB-observation model). The VB-observation model for θi is given by (6.38), i = 1, . . . , q. It is parameterized by the VB-moments of f˜ θ\i |Dt , but does not have to be constructed explicitly as part of the VB-approximation. Instead, it is an object that we inspect in order to design appropriate conjugate priors for recursive inference, as explained in Section 6.2; i.e.:
122
6 On-line Inference of Time-Invariant Parameters
1. We can test if the VB-observation models (6.38) are members of the DEF family (6.7). If they are, then a Bayesian recursive algorithm is available for the proposed one-step VB-approximation (Fig. 6.6). 2. If, for each i = 1, . . . , q, f˜ (θi |Dt−1 ) in (6.37) is conjugate to the respective VBobservation model, f˜ (dt |θi , Dt−1 ) in (6.38), then these VB-marginal functional forms are replicated at all subsequent times, t, t + 1, . . .. In particular, if we choose the prior for each θi to be conjugate to the respective f˜ (dt |θi , Dt−1 ) , then conjugate Bayesian recursive inference of θ = θ1 , θ2 , . . . , θq (3.9) is achieved ∀t. Hence, each VB-marginal can be seen as being generated by an exact Bayes’ rule update, B, using the respective VB-observation model in a bank of such models. The concept is illustrated in Fig. 6.7. f˜(dt |θ1 , Dt−1 ) f˜(θ|Dt−1 )
B <
B
×
f˜(θ|Dt )
Fig. 6.7. The one-step VB-approximation for on-line time-invariant inference, shown as parallel Bayes' rule updates of the VB-marginals (q = 2). Operator '<' denotes factorization ('fanning out') into the VB-marginals available from the previous time-step.
Recall that each VB-observation model (6.38) depends on the VB-moments of all the other VB-marginals. If there exists an analytical solution for the VB-equations, then the computational flow of the recursive algorithm is exactly as implied in Fig. 6.7; i.e. the sufficient statistics of each VB-marginal at time t are expressed recursively via parallel (decoupled) update equations, si,t−1 → si,t , i = 1, . . . , q, using (6.10) and (6.11). Solution of the VB-equations at time t yields the shaping parameters, (3.28) and (3.29), of each VB-marginal in terms of these updated statistics (specifically, the VB-moments). More commonly, however, the VB-moments are evaluated iteratively via cycles of the IVB algorithm (Algorithm 1). In each such cycle, these updated VB-moments must be propagated back through the recursive sufficient statistics updates for each VB-observation model. Essentially, these recursive statistics updates form part of the set of VB-equations. This flow of VB-moments during iterations of the IVB algorithm is illustrated by the dotted lines in Fig. 6.8, and occurs many times in each time-step, once for each cycle of the IVB algorithm. Next, we seek the most general form of observation model, f (Dt |θ), for which the one-step VB-approximation (scenario II) of Fig. 6.6 can be realized tractably. From the foregoing, we must satisfy two conditions (assuming q = 2 for convenience):
Fig. 6.8. The one-step VB-approximation for on-line time-invariant inference. The transmission of VB-moments via IVB cycles (q = 2) is indicated by dotted arrows.
(i) Each VB-observation model must be from the DEF family (6.7):
   f̃(dt|θi, Dt) = f̃(dt|θi, ψi,t) ∝ exp{qi(θi)' ui(dt, ψi,t)},   i = 1, 2.

(ii) The joint distribution at time t (6.36) must be separable in parameters (3.21). Hence, the (exact) observation model must fulfil the same condition:
   f(dt|θ1, θ2, Dt−1) ∝ exp{g(θ1, Dt)' h(θ2, Dt)}.

These two conditions are satisfied iff g(θ1, Dt) = q1(θ1) ◦ u1(dt, ψ1,t) and h(θ2, Dt) = q2(θ2) ◦ u2(dt, ψ2,t), which is consistent with the definition of the DEFS family (6.32), under the assignment u(dt, ψt) = u1(dt, ψ1,t) ◦ u2(dt, ψ2,t). Therefore, this scenario can be used for the same class of models as scenario I (Section 6.3.1).

The key distinction between this scenario and scenario I (Section 6.3.1) is that the exact sufficient statistics, (6.10) and (6.11), are not collected. Instead, the approximate statistics, si,t, i = 1, . . . , q, of the parallel VB-observation model updates (Remark 6.4) are collected as part of the VB method at each step. Ultimately, the VB-marginals and their shaping parameters represent all the information that we carry forward about the unknown parameters. This may be useful in situations where the exact sufficient statistics are too large to be collected within the sampling period, ∆t, of the on-line process. The imposition of conditional independence in the VB-approximation (3.16) will, in effect, remove all terms modelling cross-correlations from the statistics. The free-form optimization achieved by the VB method—minimizing KLDVB (3.6)—will adjust the values of the remaining statistics at each time t, in order to emulate the original statistics as closely as possible.

6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation

In this Section, we study the use of the VB-approximation for non-DEF observation models. Specifically, we will study on-line Bayesian inference of θ in an observation model expressed in marginalized form, as follows:
   f(dt|θ, Dt−1) = ∫_{Θ*_{2,t}} f(dt, θ2,t|θ, Dt−1) dθ2,t. (6.39)
The auxiliary parameter, θ2,t, has been used to augment the model. We will refer to θ2,t as the hidden variables in the model, and to the integrand, f(dt, θ2,t|θ, Dt−1), as the augmented model. θ2,t corresponds to the missing-data terms in the EM algorithm [21] (Section 3.4.4). Note that θ2,t is generated locally at each time t, and so correlation between these variables at different times is not modelled; i.e. they are (conditionally) independent.
Fig. 6.9. On-line inference of θ when the observation model has hidden variables, θ2,t . Two equivalent operator compositions are given.
On-line inference of θ (6.39) is illustrated in Fig. 6.9. We note that the hidden variables are 'injected' into the observation model at each time t, but marginalization ensures that they do not appear in the posterior distribution. As discussed in Section 6.7, exact Bayesian recursive inference is available iff (6.39) is a member of the DEF family (6.7). In cases where (6.39) is not a member of the DEF family, it may be possible to achieve a recursive algorithm using an approximation. To this end, we now use the VB-approximation to replace the exact marginalization with VB-marginalization (Fig. 3.4). We concentrate on the operator composition in the lower schematic of Fig. 6.9, which yields the approximate update in Fig. 6.10. In this case, the VB-marginal of θ is propagated to the next step, while that of the hidden variables, θ2,t, is 'dropped off'.

From Fig. 6.10, we apply the VB-approximation to the joint distribution, f(θ, θ2,t|Dt); i.e. we seek its approximation in the family (3.7)

   Fc = {f(θ, θ2,t|Dt) : f(θ, θ2,t|Dt) = f(θ|Dt) f(θ2,t|Dt)}.
Fig. 6.10. The one-step VB-approximation for on-line inference with hidden variables, showing the propagation of the VB-marginal of θ.
Using Theorem 3.1, the minimum of KLDVB (3.6) is reached for

   f̃(θ|Dt) ∝ exp{E_{f̃(θ2,t|Dt)}[ln f(dt, θ2,t|θ, Dt−1)] + ln f̃(θ|Dt−1)}
            ∝ exp{E_{f̃(θ2,t|Dt)}[ln f(dt, θ2,t|θ, Dt−1)]} f̃(θ|Dt−1)
            ∝ f̃(dt|θ, Dt−1) f̃(θ|Dt−1), (6.40)

where

   f̃(dt|θ, Dt−1) ∝ exp{E_{f̃(θ2,t|Dt)}[ln f(dt, θ2,t|θ, Dt−1)]} (6.41)
is the VB-observation model, defined in Remark 6.4. Hence, a recursive inference algorithm for θ emerges if the prior, f̃(θ), is chosen conjugate to (6.41), in which case the VB-posterior, f̃(θ|Dt), is functionally invariant ∀t. The VB-marginal for the hidden variables, θ2,t, i.e.

   f̃(θ2,t|Dt) ∝ exp{E_{f̃(θ|Dt)}[ln f(dt, θ2,t|θ, Dt−1)]}, (6.42)

is not propagated, and no VB-observation model need be formalized for it. Its purpose is purely to provide the VB-moments which are required in formulating the VB-observation model for θ (6.41). The latter is used in exactly the same way as in scenario II: its recursive update equations form part of the set iterated by the IVB algorithm. The implied computational flow is illustrated in Fig. 6.11. Comparing with Fig. 6.9, we see how the VB-approximation has replaced the Bayes' update and integration with just a Bayes' update, with the statistics computed via IVB cycles at each time step.

If the scheme above is to be tractable, then the observation model, f(dt|θ, Dt−1) (6.39), must satisfy the following two conditions:

(i) The VB-observation model (6.41) is from the Dynamic Exponential Family (6.7):
   f̃(dt|θ, Dt−1) = f̃(dt|θ, ψt) ∝ exp{q(θ)' u(dt, ψt)}.

(ii) The true augmented observation model in (6.39) is separable in parameters (3.21):

   f(dt, θ2,t|θ, Dt−1) ∝ exp{g(θ, Dt)' h(θ2,t, Dt)}.
Fig. 6.11. The VB-approximation for on-line inference with hidden variables, illustrating the flow of VB-moments via IVB cycles.
These two conditions are satisfied iff g(θ, Dt) = q(θ) ◦ u(dt, ψt); i.e.

   f(dt, θ2,t|θ, Dt−1) ∝ exp{q(θ)' u(θ2,t, Dt)}, (6.43)
where u(θ2,t, Dt) = h(θ2,t, Dt) ◦ u(dt, ψt). Family (6.43) will be called the Dynamic Exponential Family with Hidden variables (DEFH), since it extends the Exponential Family with Hidden variables (EFH) [26, 65] to autoregressive cases (6.2). In contrast to the DEFS family (6.32) of the previous two scenarios, DEFH observation models do not require separability of θ2,t from the data Dt, as can be seen in the second argument on the right-hand side of (6.43).

This scenario is useful in situations where the observation model does not admit a conjugate distribution and, therefore, sufficient statistics are not available. Then, the VB method of approximation outlined above yields a recursive algorithm if the observation model is amenable to augmentation in such a way that (6.43) is satisfied; i.e. if the observation model (6.39) can be diagnosed as DEFH. This step may be difficult: the required augmentation must be handled on a case-by-case basis.

6.3.4 The VB Method in the On-Line Scenarios

We follow the 8 steps of the VB method, as developed in Section 3.3.3, but we adapt them slightly, as follows, for use in the on-line scenario:

Step 1: Choose a Bayesian model: In the off-line scenario, the full Bayesian model (observation model and prior) was chosen (6.3). In the on-line case, the prior distribution is typically chosen conjugate to the observation model, and the latter may be adapted by later steps in the VB-method. Hence, in this step, we choose only the observation model (6.2). Using this, we decide which scenario (Sections 6.3.1–6.3.3) we will follow. In scenario III, we assume that we have access to the augmented observation model (6.39).

Step 2: Partition the parameters: Since the VB-observation model must be a member of the DEF family (6.7), the requirement of parameter separability (3.21) in the joint distribution is replaced by the following stronger conditions:
• The observation model is in the DEFS family (6.32) for scenarios I and II;
• The observation model is in the DEFH family (6.43) for scenario III.

Step 3: Write down the VB-marginals: This step is scenario-dependent:
Scenario I: Write down the VB-marginals, f̃(θi|Dt).
Scenario II: Write down the VB-observation model, f̃(dt|θi, Dt−1) (6.4), for each θi.
Scenario III: Write down the VB-observation model, f̃(dt|θ, Dt−1) (6.41), for θ = θ1, θ2, . . . , θq, and write down the VB-marginal, f̃(θ2,t|Dt) (6.42), of the hidden variables, θ2,t.

Step 4: Identify standard forms: We identify standard forms for all VB-distributions listed in the previous step. The posterior distribution for each θi is chosen conjugate to the standard form of the respective VB-observation model (Remark 6.4).

Step 5: Unchanged.

Step 6: Unchanged.

Step 7: Run the IVB algorithm: the steps of the IVB algorithm are iterated until a convergence criterion is satisfied. This may be a computationally prohibitive requirement in the on-line case. Therefore, an upper bound on the number of iterations in each time step is typically set. For the DEFH family (6.43), consistency in identifying θ via the IVB algorithm, using just one IVB iteration per time step, was proved in [26]. For time-invariant parameters, θ, it can be expected that—for t large—each new observation, dt, will perturb the statistics only slightly. Hence, as t increases, we can expect the IVB algorithm to converge faster, eventually requiring as few as two cycles.

Step 8: Unchanged.

Note that Steps 1–6 can be completed off-line. The only steps performed on-line are Steps 7 and 8. In signal processing applications, we expect that the on-line VB method presented above can be useful for various extensions of (i) AR processes for continuous observations, and (ii) Markov chain processes for discrete observations. It is significant, however, that the factorized nature of the VB approximation (3.16) allows joint inference of both continuous and discrete variables, a task which is not computationally tractable in exact Bayesian inference [52].
6.4 Related Distributional Approximations

In this Section, we review other techniques for distributional approximation which rely on KLD minimization. First, we outline the Quasi-Bayes (QB) approximation (Section 3.4.3), specialized to the on-line context. Since this is a Restricted VB (RVB) technique, DEF-type observation models are still required. The non-VB-based approximation techniques in Sections 6.4.2 and 6.4.3 can potentially be used for inference with non-DEF observation models.
6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios

The Quasi-Bayes (QB) approximation (Section 3.4.3) was characterized as a special case of the Restricted VB (RVB) approximation (Section 3.4.3). As such, it has a closed-form solution without the need for IVB iterations (Section 3.4.3.1), which may recommend it for use in on-line scenarios where the (unrestricted) VB method is unsuitable. Recall, from (3.45), that the QB-approximation requires that one (or q − 1 in the general case: see Remark 3.3) of the VB-marginals be replaced by the exact marginal. Naturally, this can be done only if (i) such a closed-form marginal is available, and (ii) this marginal is tractable, in the sense that the necessary moments can be evaluated analytically. The QB-approximation can be applied in the three on-line scenarios outlined in Section 6.3, as follows:

Scenario I: If one of the marginals in Fig. 6.5 (upper schematic) is tractable, the QB-approximation can be used for inference of the other.

Scenario II: In this case, both marginals are propagated to the next step. Therefore, the QB-approximation is feasible in this scenario only if the analytical marginal has a conjugate update, i.e. only if it belongs to the CDEF family defined in (6.9). This imposes an additional constraint beyond the usual requirement for a DEF-type observation model (Section 6.3.2).

Scenario III: Since f̃(θ2,t|Dt) is not propagated, it is the natural choice for replacement by the exact marginal:

   f̃(θ2,t|Dt) ≡ f(θ2,t|Dt) = ∫_{Θ*} f(θ, θ2,t|Dt) dθ. (6.44)
In this case, the only requirement on (6.44) is tractability, as in scenario I above. Conjugate updates for f̃(θ|Dt) are guaranteed, as before, for DEFH observation models (6.43).

6.4.2 Global Approximation via the Geometric Approach

The problem of recursive inference with non-DEF observation models, under a limited-memory constraint, was addressed in general in [123, 124]. This geometric approach to approximation is an example of global approximation (Section 6.2.3). The approximate inference procedure is found by projecting the space of posterior distributions at any t into the space of distributions with finite-dimensional statistics, wt ∈ Rη×1:

   f̃(θ|Dt) ≈ f(θ|wt), (6.45)

where η ≥ 1 is assigned a priori. The risk-based Kullback-Leibler divergence, KLDMR (Section 3.2.2), is used as the proximity measure [121]. It was shown in [121, 125] that the only family which can be globally optimized—i.e. the optimum is with respect to the parameter inference at any time t—is a probabilistic mixture of η fixed (known) distributions, f_i(θ), i = 1, . . . , η, weighted by the
elements of wt (6.45). These statistics, wt , are updated by an appropriately chosen functional, l (·):
   wi,t = wi,t−1 + l(f_i(θ), f(dt, θ|Dt−1)),   i = 1, . . . , η. (6.46)

Alternatively, the choice of η fixed distributions, f_i(θ), can be replaced by the choice of η functions, l_i(·), of the data, such that wi,t = wi,t−1 + l_i(Dt). Practical use of the approximation is, however, rather limited. The method requires time- and data-invariant linear operators, l_i(·) (or f_i(θ) and l(·)), to be chosen a priori. Design criteria for these operators are available only in special cases, and the method is feasible only for low-dimensional problems.

6.4.3 One-step Fixed-Form (FF) Approximation

The fixed-form approximation of Section 3.4.2 can be applied after each Bayes' rule update in on-line inference (Section 6.3). As such, it is an example of one-step approximation (Fig. 6.3). The form of the approximate parametric posterior distribution, f̃(θ|Dt) ∈ Fβ = {f0(θ|β), ∀β}, is set a priori, and fixed for all t. It is assumed that the prior is also assigned from this parametric class: f(θ) = f0(θ|β0). After a (non-conjugate) Bayes' rule update (6.3), the posterior distribution is

   f(θ|βt−1, Dt) ∝ f(dt|θ, Dt−1) f0(θ|βt−1). (6.47)
Hence, f(θ|βt−1, Dt) ∉ Fβ, and so an approximation, A[·] (3.1), is found by projecting this posterior distribution into Fβ:

   f̃(θ|Dt) ≡ A[f(θ|βt−1, Dt)] = f0(θ|βt). (6.48)

The approximation (6.48) is used as the prior in the next step. In this on-line scenario, A[·] (and, hence, the update rule for βt) is usually defined in one of two ways [32]:

1. Probability fitting: the approximation (6.48) is optimized with respect to a chosen divergence (Section 3.4.2), such as KLDMR (3.5).

2. Moment fitting (also known as the probabilistic editor [32]): the parameters βt are chosen so that selected moments of the approximating distribution, f0(θ|βt) (6.48), match those of the exactly updated posterior, f(θ|βt−1, Dt) (6.47).

This specialization of the one-step approximation (Fig. 6.3) is illustrated in Fig. 6.12. Note, from Figs. 6.6 and 6.10, that scenarios II and III for on-line application of the VB method are also instances of one-step approximation. The key distinction between the VB-approximation and the one above is the free-form nature of the VB approach: i.e. the VB approximation yields not only the parameters but also the form of the approximating distribution.
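To make the moment-fitting variant concrete, the sketch below (hypothetical Python, not taken from the text) performs one fixed-form step for a scalar parameter: the exact one-step posterior (6.47) is evaluated on a grid, and βt is obtained by matching its mean and variance with a Normal f0(θ|βt).

import numpy as np

def fixed_form_step(log_obs_model, beta_prev, theta_grid):
    """One-step fixed-form (moment-fitting) update with a Normal f0(theta|beta).

    log_obs_model : callable theta -> log f(d_t | theta, D_{t-1}) on the grid
    beta_prev     : (mean, variance) of the Normal prior f0(theta|beta_{t-1})
    theta_grid    : 1-D grid on which the exact posterior (6.47) is evaluated
    """
    m, v = beta_prev
    log_prior = -0.5 * (theta_grid - m) ** 2 / v               # Normal prior, up to a constant
    log_post = log_obs_model(theta_grid) + log_prior           # unnormalized (6.47)
    w = np.exp(log_post - log_post.max())
    w /= np.trapz(w, theta_grid)                                # normalize on the grid
    mean = np.trapz(theta_grid * w, theta_grid)                 # moment fitting:
    var = np.trapz((theta_grid - mean) ** 2 * w, theta_grid)    # match mean and variance
    return (mean, var)                                          # beta_t of f0(theta|beta_t)

Iterating this step over the data stream keeps the posterior within the Normal family Fβ, at the cost of the accumulated one-step approximation error discussed in Section 6.2.3.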
Fig. 6.12. One-step fixed-form approximation in the on-line scenario with time-invariant parameterization.
6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models

The AR model was introduced in Section 6.2.2. Mixtures of these models are used in non-linear time-series modelling [126], control design [127], etc. AR mixtures find wide application in speech recognition [128], classification [129], spectrum modelling [130], etc. Mixtures of Normal distributions are used as a universal approximation for a wide class of distributions in neural computing [131]. These Normal mixtures are a special case of AR mixtures, where the regressors (6.12) are independent of the history of dt. A mixture of c multivariate AR components (i.e. models) is defined as follows (Remark 6.3):
   f(dt|{A}c, {R}c, α, {ψt}c) = Σ_{i=1}^{c} αi N_{dt}(A(i) ψt(i), R(i)). (6.49)

The notation {·}c is used to represent a set of c elements of the same kind; e.g. {A}c = {A(1), . . . , A(c)}. In (6.49), dt is a p-dimensional vector of observations and ψt(i) is an mi-dimensional regressor. A(i) ∈ Rp×mi, i = 1, . . . , c, is the regression coefficient matrix of the ith AR component, and R(i) ∈ Rp×p is the covariance matrix of the innovations process associated with each component. αi ∈ I[0,1] denotes the time-invariant weight of the ith AR component.

Mixture models—such as the AR mixture in (6.49)—do not belong to the DEF family (6.7), since the logarithm of the sum cannot be separated into the required scalar product. As a consequence, the number of terms in the posterior at time t is c^t. This exponential increase is an example of combinatoric explosion, typical of probabilistic inference involving mixtures [132].
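Note that a single evaluation of (6.49) is cheap; the cost of exact inference lies in the c^t posterior terms, not in the likelihood itself. The following sketch (hypothetical Python, using scipy's multivariate Normal density) evaluates the mixture observation model for given component parameters:

import numpy as np
from scipy.stats import multivariate_normal

def ar_mixture_likelihood(d_t, alphas, A_list, R_list, psi_list):
    """Evaluate the AR-mixture observation model (6.49) at a single observation d_t.

    alphas   : component weights alpha_1, ..., alpha_c (summing to one)
    A_list   : regression matrices A^(i), each of shape (p, m_i)
    R_list   : innovation covariances R^(i), each of shape (p, p)
    psi_list : component regressors psi_t^(i), each of length m_i
    """
    return sum(
        alpha * multivariate_normal.pdf(d_t, mean=A @ psi, cov=R)
        for alpha, A, R, psi in zip(alphas, A_list, R_list, psi_list)
    )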
6.5.1 The VB Method for AR Mixtures

We now derive the VB-approximation for posterior inference of the parameters of the mixture model (6.49). Since the observation model (6.49) is not a member of the DEF family, our only choice is to seek an augmented observation model for which (6.49) is its marginal (6.39). This would allow us to proceed with scenario III (Section 6.3.3).

Step 1: (6.49) can be diagnosed as a member of the DEFH family (6.43), as follows:
$$
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \int_{l_t} f(d_t \mid \{A\}_c, \{R\}_c, l_t, \{\psi_t\}_c)\, f(l_t \mid \alpha)\, dl_t,
$$
$$
f(d_t \mid \{A\}_c, \{R\}_c, l_t, \{\psi_t\}_c) = \prod_{i=1}^{c} N_{d_t}\!\bigl(A^{(i)}\psi_t^{(i)}, R^{(i)}\bigr)^{l_{i,t}}, \tag{6.50}
$$
$$
f(l_t \mid \alpha) = Mu_{l_t}(\alpha).
$$
The time-invariant parameters are $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$, and the hidden variable is $\theta_{2,t} = l_t = [l_{1,t}, \dots, l_{c,t}]' \in \{\epsilon_c(1), \dots, \epsilon_c(c)\}$, where $\epsilon_c(i)$ is the $i$th elementary vector in $\mathbb{R}^c$ (see Notational Conventions on page XV). Hence, $l_t$ is the vector pointer into the component which is active at time $t$ (6.50).

Step 2: The logarithm of the augmented observation model is
$$
\ln f(d_t, l_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \sum_{i=1}^{c} \Bigl[ -\tfrac{1}{2} l_{i,t} \ln\bigl|R^{(i)}\bigr| - \tfrac{1}{2} l_{i,t}\bigl(d_t - A^{(i)}\psi_t^{(i)}\bigr)' R^{(i)-1}\bigl(d_t - A^{(i)}\psi_t^{(i)}\bigr) + \ln\Gamma(l_{i,t}) + l_{i,t}\ln(\alpha_i) \Bigr] + \gamma
$$
$$
= \sum_{i=1}^{c} \Bigl[ -\tfrac{1}{2}\,\mathrm{tr}\Bigl( \bigl[I_p, -A^{(i)}\bigr]' R^{(i)-1}\bigl[I_p, -A^{(i)}\bigr]\, l_{i,t}\, \varphi_t^{(i)}\varphi_t^{(i)\prime} \Bigr) - \tfrac{1}{2} l_{i,t}\ln\bigl|R^{(i)}\bigr| + \ln\Gamma(l_{i,t}) + l_{i,t}\ln(\alpha_i) \Bigr] + \gamma. \tag{6.51}
$$
Here, $\gamma$ aggregates all terms which do not depend on the parameters, and $\varphi_t^{(i)} = \bigl[d_t', \psi_t^{(i)\prime}\bigr]' \in \mathbb{R}^{p+m_i}$ is the $i$th extended regressor (see (6.20) and Remark 6.3). From (6.51), we see that the augmented observation model is from the DEFH family (6.43).

Step 3: The VB-observation model for $\theta$ is as follows:
$$
\tilde f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) \propto \exp\Biggl\{ \sum_{i=1}^{c} \Bigl[ -\tfrac{1}{2}\,\overline{l}_{i,t}\ln\bigl|R^{(i)}\bigr| - \tfrac{1}{2}\,\mathrm{tr}\Bigl( \bigl[I_p, -A^{(i)}\bigr]' R^{(i)-1}\bigl[I_p, -A^{(i)}\bigr]\, \overline{l}_{i,t}\, \varphi_t^{(i)}\varphi_t^{(i)\prime} \Bigr) + \overline{l}_{i,t}\ln(\alpha_i) \Bigr] \Biggr\}. \tag{6.52}
$$
The VB-marginal on the hidden variable, $\theta_{2,t} = l_t$, is
$$
\tilde f(l_t|D_t) \propto \exp\Biggl\{ \sum_{i=1}^{c} \Bigl[ -\tfrac{1}{2} l_{i,t}\,\overline{\ln\bigl|R^{(i)}\bigr|}_t + \ln\Gamma(l_{i,t}) - \tfrac{1}{2}\,\mathrm{tr}\Bigl( E_{\tilde f(\theta|D_t)}\Bigl[ \bigl[I_p, -A^{(i)}\bigr]' R^{(i)-1}\bigl[I_p, -A^{(i)}\bigr] \Bigr] \varphi_t^{(i)}\varphi_t^{(i)\prime} \Bigr) l_{i,t} + l_{i,t}\,\overline{\ln(\alpha_i)}_t \Bigr] \Biggr\}, \tag{6.53}
$$
where
$$
E_{\tilde f(\theta|D_t)}\Bigl[ \bigl[I_p, -A^{(i)}\bigr]' R^{(i)-1}\bigl[I_p, -A^{(i)}\bigr] \Bigr] = \begin{bmatrix} \overline{R^{(i)-1}}_t & -\overline{R^{(i)-1}A^{(i)}}_t \\ -\overline{A^{(i)\prime}R^{(i)-1}}_t & \overline{A^{(i)\prime}R^{(i)-1}A^{(i)}}_t \end{bmatrix}. \tag{6.54}
$$
Here we are using the notation $\overline{\theta}_t = E_{\tilde f(\theta|D_t)}[\theta]$ to denote the (time-variant) posterior expectation (i.e. the VB-moment (3.30)) of the time-invariant parameter. When the parameter is time-variant—such as $l_t$—its VB-moment at time $t$ will be denoted by $\overline{l}_t$; i.e. the second $t$-subscript will be omitted.

Step 4: Since we are working in scenario III (Section 6.3.3), we must identify standard forms for (i) the VB-observation model (6.52), and (ii) the VB-marginal (6.53). With respect to the VB-observation model, we note that it can be factorized into $c + 1$ conditionally independent distributions, and therefore written as the following product of standard forms:
$$
\tilde f\bigl(d_t, \overline{l}_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c\bigr) \propto Mu_{\overline{l}_t}(\alpha) \prod_{i=1}^{c} N_{d_t}\!\bigl(A^{(i)}\psi_t^{(i)}, R^{(i)}\bigr)^{\overline{l}_{i,t}}, \tag{6.55}
$$
where $Mu(\cdot)$ is the Multinomial distribution (Appendix A.7). We use the notation $\tilde f(d_t, \overline{l}_t \mid \dots)$ to emphasize the fact that the moments, $\overline{l}_t$, of the hidden variables, $l_t$, are entering (6.55) in the same way as the actual data, $d_t$. This property is not true in general settings for the VB-observation model, since many VB-moments of the hidden variables, $\theta_{2,t}$, may need to be substituted into the exact observation model (6.41). In this particular context of mixture modelling, only the first moment, $\overline{l}_t$, is required. An equivalent behaviour in the classical EM algorithm (Section 3.4.4) has encouraged the use of the phrase ‘missing data’ to describe the hidden variables, $\theta_{2,t}$, in that context [18].

In order to establish a conjugate recursive update of $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$ (Remark 6.4), the VB-marginal is chosen as conjugate to (6.55):
$$
\tilde f(\{A\}_c, \{R\}_c, \alpha|D_t) = Di_\alpha(\kappa_t) \prod_{i=1}^{c} NiW_{A^{(i)},R^{(i)}}\!\bigl(V_t^{(i)}, \nu_t^{(i)}\bigr). \tag{6.56}
$$
Here, $NiW(\cdot)$ denotes the Normal-inverse-Wishart distribution (see Remark 6.3 and Appendix A.3), which is conjugate to $N_{d_t}(\cdot)$ in (6.55), and $Di(\cdot)$ denotes the Dirichlet distribution (Appendix A.8), which is conjugate to $Mu_{\overline{l}_t}(\cdot)$ in (6.55). The VB-marginal of $l_t$ (6.53) is recognized to have the following standard form:
$$
\tilde f(l_t|D_t) = Mu_{l_t}(w_t). \tag{6.57}
$$
The shaping parameters of (6.56) and (6.57) are as follows:
$$
V_t^{(i)} = V_{t-1}^{(i)} + \overline{l}_{i,t}\, \varphi_t^{(i)}\varphi_t^{(i)\prime}, \tag{6.58}
$$
$$
\nu_t^{(i)} = \nu_{t-1}^{(i)} + \overline{l}_{i,t}, \qquad \kappa_t = \kappa_{t-1} + \overline{l}_t, \tag{6.59}
$$
$$
w_{i,t} \propto \exp\Bigl\{ \overline{\ln(\alpha_i)}_t - \tfrac{1}{2}\,\overline{\ln\bigl|R^{(i)}\bigr|}_t - \tfrac{1}{2}\,\mathrm{tr}\Bigl( E_{\tilde f(\theta|D_t)}\Bigl[ \bigl[I_p, -A^{(i)}\bigr]' R^{(i)-1}\bigl[I_p, -A^{(i)}\bigr] \Bigr] \varphi_t^{(i)}\varphi_t^{(i)\prime} \Bigr) \Bigr\}, \tag{6.60}
$$
where $E_{\cdot}[\cdot]$ in (6.60) is given by (6.54). Note that $\nu_t^{(i)}$ and $\kappa_t$ experience the same update, via $\overline{l}_{i,t}$. This is because $\nu_t^{(i)}$ and $\kappa_t$ act as update counters for their respective distributions, and all these distributions are being updated together in (6.56), via (6.55).

Step 5: The required VB-moments of the VB-marginals, (6.56) and (6.57), are as follows:
$$
\overline{A^{(i)}}_t = V_{ad,t}^{(i)\prime}\,\bigl(V_{aa,t}^{(i)}\bigr)^{-1}, \tag{6.61}
$$
$$
\overline{R^{(i)-1}}_t = \nu_t^{(i)}\bigl(\Lambda_t^{(i)}\bigr)^{-1}, \qquad
E_{\tilde f(\theta|D_t)}\bigl[A^{(i)\prime}R^{(i)-1}A^{(i)}\bigr] = p\, I_{m_i} + \overline{A^{(i)}}_t^{\,\prime}\,\overline{R^{(i)-1}}_t\,\overline{A^{(i)}}_t,
$$
$$
\overline{\ln\bigl|R^{(i)}\bigr|}_t = -p\ln 2 - \sum_{j=1}^{p} \psi_\Gamma\!\bigl(\nu_t^{(i)} - m_i - p - j\bigr) + \ln\bigl|\Lambda_t^{(i)}\bigr|, \qquad
\overline{\ln(\alpha_i)}_t = \psi_\Gamma(\kappa_{i,t}) - \psi_\Gamma\!\Bigl(\sum_{j=1}^{c}\kappa_{j,t}\Bigr), \tag{6.62}
$$
$$
\overline{l}_{i,t} = \frac{w_{i,t}}{\sum_{j=1}^{c} w_{j,t}}, \qquad \overline{\alpha_i}_t = \frac{\kappa_{i,t}}{\sum_{j=1}^{c}\kappa_{j,t}}, \tag{6.63}
$$
where $V_{dd,t}^{(i)}$, $V_{ad,t}^{(i)}$, $V_{aa,t}^{(i)}$ and $\Lambda_t^{(i)}$ are submatrices of $V_t^{(i)}$ (see (6.24) and Remark 6.3), and $\psi_\Gamma(\cdot)$ is the digamma (psi) function [93] (Appendix A.8).

Step 6: No reduction of the VB-equations (6.58)–(6.63) was found.

Step 7: The IVB algorithm (Algorithm 1) is therefore used to find an iterative solution of the VB-equations (6.58)–(6.63).

Step 8: We report the VB-marginal (6.56) of $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$, via its shaping parameters (6.58)–(6.59).

6.5.2 Related Distributional Approximations for AR Mixtures

6.5.2.1 The Quasi-Bayes (QB) Approximation

Recall, from Sections 3.4.3.2 and 6.4.1, that the QB method of approximation is a special case of the Restricted VB (RVB) approximation where one of the
VB-marginals is replaced by the true marginal. In Section 6.4.1, we noted that this method may be particularly suitable in on-line inference, since it provides a closed-form distributional approximation for the remaining random variables, obviating the need for iterations of the IVB algorithm at each time step, $t$. We specialized the VB method to RVB approximation in Section 3.4.3.1, and we follow these steps now.

Step 1: We are deriving the QB-approximation for scenario III of on-line inference. Following the recommendation in Section 6.4.1 (see (6.44)), the VB-marginal on the hidden variables, $l_t$, is fixed as the exact marginal:
$$
\tilde f(l_t|D_t) \equiv f(l_t|D_t) = \int_{\Theta^*} f(\theta, l_t|D_t)\, d\theta \propto \int_{\Theta^*} \tilde f(\theta|D_{t-1})\, f(d_t, l_t \mid \theta, D_{t-1})\, d\theta, \tag{6.64}
$$
where $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$ are the AR mixture parameters. The integrand, $f(\theta, l_t|D_t)$, in (6.64) is therefore the one-step updating of $\tilde f(\theta|D_{t-1})$ (i.e. the VB-marginal from the previous step (6.56)) via the exact observation model (6.50). Hence,
$$
f(\theta, l_t|D_t) \propto Di_\alpha(\kappa_{t-1} + l_t) \times \prod_{i=1}^{c} NiW_{A^{(i)},R^{(i)}}\!\bigl(V_{t-1}^{(i)} + \varphi_t^{(i)}\varphi_t^{(i)\prime},\, \nu_{t-1}^{(i)} + 1\bigr)^{l_{i,t}}. \tag{6.65}
$$
Marginalization of (6.65) over the AR parameters, $\theta$, yields
$$
f(l_t|D_t) = Mu_{l_t}(w_t) \propto \prod_{i=1}^{c} w_{i,t}^{l_{i,t}}, \tag{6.66}
$$
$$
w_{i,t} \propto \zeta_\alpha\bigl(\kappa_{t-1} + \epsilon_c(i)\bigr)\, \zeta_{A^{(i)},R^{(i)}}\!\bigl(V_{t-1}^{(i)} + \varphi_t^{(i)}\varphi_t^{(i)\prime},\, \nu_{t-1}^{(i)} + 1\bigr),
$$
where $\zeta_\alpha(\cdot)$ denotes the normalizing constant (Remark 6.1) of the Dirichlet distribution (A.49), and $\zeta_{A^{(i)},R^{(i)}}(\cdot)$ denotes the normalizing constant of the Normal-inverse-Wishart distribution (A.11). Note, finally, that the $w_{i,t}$ are probabilities, and so $\sum_{i=1}^{c} w_{i,t} = 1$, providing the constant of proportionality required in the last expression.

Step 2: Identical to Step 2 of the VB method (Section 6.5.1).

Step 3: $\tilde f(\theta|D_t)$ has the functional form given in (6.56), and $\tilde f(l_t|D_t) \equiv f(l_t|D_t)$ (6.64) is given by (6.66).

Step 4: The standard form of $\tilde f(\theta|D_t)$ is given by (6.56), with shaping parameters (6.58)–(6.59). The standard form of $f(l_t|D_t)$ is Multinomial (6.66), with shaping parameters $w_t$.

Step 5: $f(l_t|D_t)$ (6.66) is tractable, and so its necessary (first-order) moments, $\overline{l}_t$, are available in closed form (from (6.66)):
$$
\overline{l}_t = w_t. \tag{6.67}
$$
Hence, the shaping parameters, (6.58)–(6.59), of $\tilde f(\theta|D_t)$ are updated in closed form with respect to (6.67).

Steps 6–7: Do not arise.

Step 8: Identical to Step 8 of the VB method (Section 6.5.1).

6.5.2.2 One-step Fixed-Form (FF) Approximation

The one-step FF approximation (Section 6.4.3) for a mixture of AR models (6.49) was derived in [133] via minimization of KLD$_{\mathrm{MR}}$ (3.5). The approximating posterior distribution was chosen to be of the same form as the VB-marginal (6.56):
$$
f_0(\{A\}_c, \{R\}_c, \alpha|\beta_t) = Di_\alpha(\kappa_t) \prod_{i=1}^{c} NiW_{A^{(i)},R^{(i)}}\!\bigl(V_t^{(i)}, \nu_t^{(i)}\bigr). \tag{6.68}
$$
Hence, $\beta_t = \{\{V_t\}_c, \{\nu_t\}_c, \kappa_t\}$ are the parameters to be optimized. The Bayes' rule update of (6.68) via (6.49) yields a mixture of $c$ components, which is not in the form of (6.68). Hence, we approximate this exact update by the distribution of the kind (6.68) that is closest to it in the minimum-KLD$_{\mathrm{MR}}$ sense. In this approach, the statistics $V_t$ and $\nu_t$ are not additively updated, as they were in the VB-approximation (6.58)–(6.59). Instead, they are found as solutions of implicit equations [133]. This implies a considerably increased computational load per time step.

Both the VB-approximation (6.56) and the FF-approximation (6.68) are one-step approximations, each projecting the exactly updated posterior distribution into the same family. Hence, the main distinction between the two methods is the criterion of optimality used. The VB-approximation minimizes KLD$_{\mathrm{VB}}$, and is susceptible to local minima (see the remark in Step 6 on Page 35), while the FF-approximation minimizes KLD$_{\mathrm{MR}}$, with a guaranteed global minimum [64]. For these reasons, the two methods of approximation produce different results, as illustrated in Fig. 3.2, and as we will see in the simulations that follow.

6.5.3 Simulation Study: On-line Inference of a Static Mixture

In this Section, we illustrate the properties of the VB-approximation for inference of static mixture models. A time-invariant, static mixture (also known as a mixture of Gaussians [134]) is a special case of the AR mixture model (6.49) under the assignment $m_i = 1$, $\psi_t^{(i)} = 1$, $\forall i = 1, \dots, c$. Then, from (6.49),
$$
f(d_t \mid \{A\}_c, \{R\}_c, \alpha, \{\psi_t\}_c) = \sum_{i=1}^{c} \alpha_i\, N_{d_t}\!\bigl(A^{(i)}, R^{(i)}\bigr), \tag{6.69}
$$
where $A^{(i)} \in \mathbb{R}^{p \times 1}$, and the other symbols have their usual meaning.
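The simulations below use the recursive shaping-parameter updates (6.58)–(6.59) with the closed-form QB weights (6.67). The following sketch is not from the source text: it illustrates the flavour of such a recursive update for the static mixture (6.69), using plug-in point estimates (component means, covariances and weights) in place of the exact Normal-inverse-Wishart and Dirichlet moments, so it is only a simplified stand-in for the book's scheme.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qb_like_static_mixture_step(stats, d):
    """One recursive update for a static Gaussian mixture (6.69), in the
    spirit of the quasi-Bayes scheme: compute closed-form component
    weights for the new datum, then add it (weighted) to each component's
    accumulated statistics.

    Each element of stats holds an effective count 'n' (playing the role
    of nu/kappa), a weighted sum of observations and a weighted scatter
    matrix; these are illustrative, not the book's exact V, nu, kappa."""
    p = d.shape[0]
    mix = np.array([s['n'] for s in stats], dtype=float)
    mix /= mix.sum()
    w = np.empty(len(stats))
    for i, s in enumerate(stats):
        mean = s['sum'] / s['n']
        cov = s['scatter'] / s['n'] - np.outer(mean, mean) + 1e-3 * np.eye(p)
        w[i] = mix[i] * multivariate_normal.pdf(d, mean=mean, cov=cov)
    w /= w.sum()                       # closed-form analogue of (6.66)-(6.67)
    for i, s in enumerate(stats):      # additive update, cf. (6.58)-(6.59)
        s['n'] += w[i]
        s['sum'] += w[i] * d
        s['scatter'] += w[i] * np.outer(d, d)
    return stats, w

# Initialize three components with one pseudo-observation each.
rng = np.random.default_rng(0)
means0 = rng.normal(size=(3, 2))
stats = [{'n': 1.0, 'sum': m.copy(), 'scatter': np.eye(2) + np.outer(m, m)}
         for m in means0]
for d in rng.normal(size=(100, 2)):
    stats, w = qb_like_static_mixture_step(stats, d)
```

In the book's scheme the component weight is instead the ratio of Normal-inverse-Wishart and Dirichlet normalizing constants in (6.66); the plug-in Gaussian likelihood above is only a convenient proxy.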
One of the main challenges in on-line inference arises at the beginning of the procedure, i.e. when $t$ is small. This constitutes, intrinsically, a stressful regime [135] for identification, since the number of available data is much smaller than the number of inferred parameters. This problem is known as initialization of the recursive algorithm [52]. Its Bayesian solution involves careful design of the prior distribution and updates which are robust to early mismodelling.

Bayesian recursive inference for the DEF family involves additive accumulation of the observations into the sufficient statistics (6.11). The same is true of the updates involved in the VB-approximation and QB-approximation (6.58) for AR mixtures. Therefore, any data record that has already been added to the sufficient statistics at an earlier $t$ cannot be removed subsequently. This property is harmful in recursive identification of mixture models if the prior distribution is far from the posterior, since early misclassifications (into the wrong component (6.69)) influence all later parameter inferences. In the following simulations, we examine the influence of initialization on the finite-$t$ performance of the various approximations.

6.5.3.1 Inference of a Many-Component Mixture

A mixture of 30 Gaussian distributions was used to construct the 2-dimensional (i.e. $p = 2$) ‘true’ observation model displayed in Fig. 6.13 (top-left), from which the data were generated. A modelling mismatch was simulated by choosing the number of components in the observation model to be $c = 10$. The prior distribution on the mixture parameters, $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$, was chosen as the conjugate distribution (6.56) for the VB-observation model (6.55), with shaping parameters $\nu_0^{(i)} = 10$, $\forall i$, $\overline{\alpha}_0 = \frac{1}{c}1_{c,1}$, and $V_0^{(i)}$ randomly generated positive-definite matrices for each component, $i = 1, \dots, c$. This ensures that the 10 components are not coincident a priori. If they were, they could not be separated during subsequent learning. The task is to infer the mixture parameters, $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$, on-line, i.e. at each time $t = 1, 2, 3, \dots$.

In Fig. 6.13, we compare the HPD regions (Definition 2.1) of the approximate posterior distributions (6.56), updated via the VB, QB, and FF methods respectively. The inferences at times $t = 5$, 20 and 200 are displayed. As expected, the distinctions between the methods of approximation are most obvious for small $t$. With increasing numbers of observations, all methods approximate the ‘true’ observation model well. Note that all methods were initialized with the same prior, and so manifestly have different prior sensitivity properties. Furthermore, changes in the choice of prior can be expected to lead to different behaviours in all three methods during the small-$t$ phase of on-line learning. We will not pursue this issue further, but the interested reader is referred to [32, 52].

6.5.3.2 Inference of a Two-Component Mixture

This simulation was designed to examine the sensitivity of the approximation methods to unavoidable misclassification in the early stages of on-line inference. We simulated a mixture of $c = 2$ static Gaussian components (6.69), with equal covariances,
Fig. 6.13. Comparison of performance of the VB, QB and FF approximations for recursive inference of the parameters of a 2D static mixture (panels: data, simulated mixture, prior, VB, QB, FF). The approximate inferences after t = 5, 20 and 200 observations are shown in rows 2–4 respectively.
$R^{(1)} = R^{(2)} = I_2$, and with mean values $A_1 = [0, 1]'$, $A_2 = [0, a_2]'$, where we tested various settings of $a_2 \in \mathbb{R}_{[-1,1]}$. Hence, the distance between the component means is $\delta_A = \|A_1 - A_2\| = 1 - a_2 \in \mathbb{R}_{[0,2]}$. Our aim is to test the sensitivity of the VB, QB, and FF approximations to $\delta_A$. These approximations, $\tilde f(\theta|D_t, I_i)$, for $i = 1$ (VB), $i = 2$ (QB) and $i = 3$ (FF), can be interpreted as competing models for the mixture parameters, $\theta = \{\{A\}_c, \{R\}_c, \alpha\}$. Hence, we can compare them using standard model comparison techniques [70]. The competing models are summarized in Table 6.1.
Table 6.1. Approximate inferences of (static) AR mixture parameters.

Method | Model (6.56) | Update of statistics
VB | $\tilde f(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_1)$ | (6.58)–(6.63)
QB | $\tilde f(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_2)$ | (6.58)–(6.62), (6.67)
FF | $\tilde f(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_3)$ | see [133]
The posterior probability of each model is proportional to the predictor of $D_t$, assuming a uniform model prior [42, 70, 136]:
$$
f(I_i|D_t) \propto f(D_t|I_i) = \int_{\Theta^*} f(D_t|\theta)\, \tilde f(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_i)\, d\theta, \quad i = 1, \dots, 3. \tag{6.70}
$$
Exact evaluation of (6.70) is not tractable since $f(D_t|\theta)$ is a mixture of $2^t$ components. Therefore, we approximate (6.70) by
$$
\tilde f(D_t|I_i) \equiv \prod_{\tau=1}^{t} \int_{\Theta^*} f(d_\tau|\theta)\, \tilde f(\theta \mid \{V_t\}_c, \{\nu_t\}_c, \kappa_t, I_i)\, d\theta, \quad i = 1, \dots, 3, \tag{6.71}
$$
i.e. using a step-wise marginalization procedure. In evaluating (6.71), we first elicit the approximate distributions, f˜ (θ| {Vt }c , {νt }c , κt , Ii ) , using all t observations, and then use these terminal approximations in the step-wise marginalizations. This allows the terminal approximations to be compared. Since we are using an approximate marginalization in (6.71), the evaluated probabilities may not be accurate. However, the form of the approximate posterior distribution (6.56) is the same for all methods (Table 6.1), and so it is reasonable to assume that the approximation affects all models in the same way. Therefore, we can at least expect that the ranking of the models, using (6.71), will be reliable. The terminal predictions (6.71) after t = 50 observations, using each of the three approximations, were examined in a Monte Carlo study. For each setting of the inter-cluster distance, δA ∈ R[0,2] , 40 randomly initialized inference cycles were undertaken. The sample mean of (6.71), i = 1, 2, 3, was plotted in Fig. 6.14 as a function of δA . Misclassifications of observations for small t affect the terminal (t = 50) approximate distributions. The amount of misclassification depends on the inter-component distance, δA . For δA < 0.3, the data are generated from a quasi-one-component observation model, misclassifications are rare, and all methods perform well. With increasing δA , low-t misclassifications proliferate, incorporating greater errors in the accumulating statistics. The main finding from Fig. 6.14 is that the three approximations have different sensitivities to these early misclassifications. The QBapproximation is the most sensitive to initialization, while the VB-approximation is more robust. Recall that both of these approximations update their statistics additively (6.58). The FF method of approximation is far more robust to initialization. The main reason for this is non-additive updating of statistics (Section 6.5.2.2).
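The step-wise evaluation of the approximate log-predictor (6.71) can be sketched as follows. This is not the source's code: it assumes, for illustration, that each terminal approximation is summarized by plug-in component weights, means and covariances, so that the per-sample integral in (6.71) is replaced by a plug-in mixture density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def approximate_log_predictor(data, weights, means, covs):
    """Step-wise approximation of ln f~(D_t | I_i), cf. (6.71): sum over
    tau of the log plug-in predictive density of d_tau, evaluated under
    the terminal (e.g. t = 50) parameter summaries of model I_i."""
    logp = 0.0
    for d in data:                         # tau = 1, ..., t
        dens = sum(w * multivariate_normal.pdf(d, mean=m, cov=S)
                   for w, m, S in zip(weights, means, covs))
        logp += np.log(dens)
    return logp
```

Repeating this for the VB, QB and FF summaries, and averaging over Monte Carlo runs, gives curves of the kind shown in Fig. 6.14.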
Fig. 6.14. Comparison of the prediction performance of the VB, QB, and FF methods of approximation, as a function of increasing inter-component distance for a c = 2-component static mixture (approximate log-predictor $\tilde f(D_{50}|I_i)$ versus inter-component distance $\delta_A$; curves for VB ($i = 1$), QB ($i = 2$) and FF ($i = 3$)).
Remark 6.5 (Time-variant parameterization). The problem of initialization can be addressed by treating the parameters $\theta$ as time-variant. Under this assumption, various techniques are available for eliminating statistics introduced by historic (low-$t$) data:
• Discounting the contribution made by historic data [26]; this is a variant of forgetting [52], which will be discussed in Chapter 7.
• Repetitive runs of the inference procedure, with the prior in the next run modified by the posterior from the previous run [52]. This technique is, naturally, appropriate only in off-line mode.

6.5.4 Data-Intensive Applications of Dynamic Mixtures

The mixture model in (6.49) is highly flexible, and can be used to capture correlation between data channels, $d_{i,t}$, $i = 1, \dots, p$, via the component covariance matrices, $R^{(i)}$, as well as temporal correlations, via regression onto the data history (6.12):
$$
\psi_t^{(i)} = \psi_i\bigl(d_{t-1}, \dots, d_{t-\partial_i}, \xi_t^{(i)}\bigr), \quad i = 1, \dots, c. \tag{6.72}
$$
Here, as usual, $\partial_i$ denotes the maximum data memory in the $i$th component, and $\xi_t^{(i)}$ are the exogenous variables. We refer to a component as dynamic if $\partial_i \geq 1$, and static if $\psi_t^{(i)}$ is a function only of exogenous variables, $\xi_t^{(i)}$, and not of past data, an example of which was studied in Section 6.5.3.

The use of multiple components is appropriate when capturing distinct modes in the data. Consider the case of a Process-Operator Loop (POL) where not only the process under operator control is subject to intrinsic changes in behaviour, but the
human operator, too, may typically interact with the system in a finite number of different ways. The resulting switches in the dynamics of the POL can be captured by the mixture model (6.49) with appropriate channel and temporal correlations.

Several data-intensive applications of dynamic mixture modelling were examined by the EU IST project, ProDaCTool [137]. In each application, the number of channels, $p$, was large (typically several dozen). The task was to identify an appropriate structure and parameterization for the dynamic mixture model, sufficient to capture the various inter-channel and inter-sample dependences. To this end, a large off-line database of observations, $D_t = \{d_1, \dots, d_t\}$, was gathered during the off-line phase. Typically, the number of data, $t$, was of the order of thousands. The challenging task of model identification in these high-dimensional, large-database problems was successfully addressed using recursive variational inference methods [127]. In order to avoid IVB iterations (Algorithm 1), the Quasi-Bayes (QB) method of approximation (Section 6.4.1) was chosen. A MATLAB toolbox, called MixTools [138], was developed as part of the ProDaCTool project, for identification in these multicomponent, high-dimensional, data-intensive dynamic mixture modelling problems. It is in data-intensive applications like these that stochastic sampling-based methods are prohibitive.

The estimated mixtures were studied for their ability to predict future data emerging from the POL. If the database was sufficiently extensive to capture all key inter-mode transitions, as well as the significant correlations within each mode, then the inferred mixture model would be capable of accurate prediction of future behaviour. This predictive capability of the mixture was to be used as a resource in designing an advisory system [127]. The advisory system could be used on-line to generate optimized recommendations appropriate for the current mode of operation of the POL. These recommendations were made available to the operator via an appropriately designed Graphical User Interface (GUI) [139]. Experience in the following three ProDaCTool application domains was reported in [139]:
1. Urban vehicular traffic prediction;
2. An advisory system for radiotherapy of thyroid carcinoma;
3. An advisory system for a metal rolling mill.
In all cases, the MixTools suite was used in the off-line phase to identify a dynamic or static mixture model as appropriate, using the QB-approximation. In applications 2 and 3, the inferred model was used in the off-line design of an appropriate advisory system, which then generated recommendations during the on-line phase. In application 1, no advisory design was undertaken, but the ability of the inferred mixture models to predict future traffic states was examined. This application is now briefly reviewed.
6.5.4.1 Urban Vehicular Traffic Prediction

From (6.27) and (6.49), the one-step-ahead predictor for the dynamic mixture is a linear combination of Student's $t$-distributions:
$$
\tilde f(d_{t+1}|D_t) = \sum_{i=1}^{c} \frac{\kappa_{i,t}}{\kappa_t' 1_{c,1}} \cdot \frac{\zeta_{A^{(i)},R^{(i)}}\!\bigl(V_t^{(i)} + \varphi_{t+1}^{(i)}\varphi_{t+1}^{(i)\prime},\, \nu_t^{(i)} + 1\bigr)}{\sqrt{2\pi}\, \zeta_{A^{(i)},R^{(i)}}\!\bigl(V_t^{(i)}, \nu_t^{(i)}\bigr)}, \tag{6.73}
$$
where $\varphi_{t+1}^{(i)} = \bigl[d_{t+1}', \psi_{t+1}^{(i)\prime}\bigr]'$, and the statistics, $\{\{V_t\}_c, \{\nu_t\}_c, \kappa_t\}$, are updated via the QB-approximation, (6.58)–(6.59) and (6.67). On the other hand, static mixtures (Section 6.5.3) cannot, formally, provide temporal predictions. Nevertheless, if the parameters are assumed to be one-step invariant, then the data at time $t + 1$ can be predicted informally via an estimated static observation model (6.5.3) at time $t$:
$$
\tilde f(d_{t+1}|D_t) \approx \sum_{i=1}^{c} \overline{l}_{i,t}\, N_{d_{t+1}}\!\bigl(\hat A^{(i)}_t, \hat R^{(i)}_t\bigr). \tag{6.74}
$$
Here, $\hat A^{(i)}_t$ and $\hat R^{(i)}_t$ are the terminal posterior means (6.26) of the Normal component parameters, and $\overline{l}_t$ are the estimated component weights (6.67) at time $t$.

Fig. 6.15. Traffic intensity and density records for the Strahov tunnel, Prague (left: time course of intensity; right: time course of density).
In this study, a snapshot, dt , of the traffic state in the busy Strahov traffic tunnel is provided via an array of synchronized sensors placed at 5 regularly-spaced points along each of two northbound lanes of the tunnel. Each sensor detects a traffic intensity measure, q, and density measure, ρ (Fig. 6.15). Hence, the number of channels is p = 2 × 2 × 5 = 20. Measurements are recorded every 5 minutes for 4 weeks, generating an archive of t = 8064 observations. The anticipated daily and weekly periodicities are evident in Fig. 6.15. An extra one day’s data (288 items)
are reserved for validation of predictions. The task examined here is to predict the future traffic state (system output), $y_t = [d_{1,t}, d_{2,t}]'$, at the northbound exit of the Strahov tunnel, using this off-line archive. The system ‘input’ is then considered to be the traffic states recorded by the intra-tunnel sensors: $u_t = [d_{3,t}, \dots, d_{20,t}]'$. Clearly, there will be strong correlation (i) between sensors, encouraging a regression model between the channels, $d_{i,t}$, and (ii) from snapshot to snapshot, encouraging a dynamic model. The prospect of capturing multimodal behaviour (Section 6.5.4) via mixtures of these models is examined. The estimation, $\hat c$, of the number of components within the MixTools suite will not be described here. Marginalization over $u_{t+1}$ in (6.73) or (6.74) generates the required tunnel output predictor, $\tilde f(y_{t+1}|D_t)$. These predictors were evaluated in the context of the 288 on-line data items mentioned already.
Fig. 6.16. Static predictions of traffic intensity at the tunnel output. Left: one-component regression; right: mixture of $\hat c$ components.
The performance of the predictors under static modelling (6.74) is displayed in Fig. 6.16. The one-component regression cannot react to changing modes in the data, and simply predicts the conditional mean of the traffic intensity. The mixture model predictor incorporates the posterior mean estimate, $\overline{l}_t$, of the active component label (6.67), permitting far superior prediction of the multimodal data. For static one- and multicomponent models, Table 6.2 shows the Prediction Error (PE) coefficient, expressed as the ratio of the standard deviation of the prediction error to that of the respective output channel. It confirms the enhanced predictive capability of the mixture. For completeness, the $\tau > 1$ step-ahead predictor was also examined. The PE increases, as expected, with $\tau$.

The predictive capabilities of the dynamic mixture model (6.73) were then investigated. In this case, a first-order temporal regression is assumed for all components; i.e. $\partial_i = 1$, $i = 1, \dots, c$ (6.72). A single dynamic component is again compared to a mixture of dynamic components. In Table 6.3, output channel PEs are recorded for both the 2-D predictive marginal and for the full 20-D prediction. Multi-step-ahead predictors are, again, also considered.
Table 6.2. Prediction Error (PE) for τ-step-ahead static predictions of traffic tunnel output.

           c = 1                    c > 1
τ     Intensity  Density      Intensity  Density
1     1.021      1.020        0.418      0.422
6     1.025      1.024        0.512      0.516
12    1.031      1.030        0.634      0.638
18    1.028      1.037        0.756      0.760
Table 6.3. Prediction Error (PE) for τ-step-ahead dynamic predictions of traffic tunnel output (based on marginal prediction, y_{t+τ}, and full state prediction, d_{t+τ}, respectively).

           marginal prediction of output          joint prediction of full tunnel state
           c = 1              c > 1               c = 1              c > 1
τ     Intensity Density   Intensity Density   Intensity Density   Intensity Density
1     0.376     0.378     0.391     0.392     0.330     0.331     0.328     0.329
6     0.526     0.521     0.487     0.483     0.444     0.436     0.421     0.416
12    0.711     0.706     0.614     0.615     0.576     0.566     0.524     0.526
18    0.840     0.836     0.758     0.766     0.697     0.688     0.654     0.644
Once again, the improvement offered by the mixture model over the single-component regression is evident. In general, PE increases with τ, as before. Comparing with Table 6.2, the following conclusions are reached:
• PEs for the dynamic models are significantly lower than for the static ones. The improvement is more marked for the single-component regressions than for the mixtures, and tends to diminish with increasing τ;
• Mixtures lead to better prediction than single regressions under both the static and dynamic modelling assumptions, though the improvement is less marked in the dynamic case (between about 1 and 5% in terms of PE). In off-line identification of the model, some components had small weights, but their omission caused a clearly observable reduction in prediction quality.

The current study clearly indicates the benefits of dynamic mixture modelling for high-dimensional traffic data analysis. Dynamic mixtures are capable of accurate prediction many steps (e.g. 90 minutes) ahead. Hence, it is reasonable to expect that they can serve for effective advisory design, since components arise, not only due to intrinsic changes in traffic state, but also as a result of traffic sequence choices made by the operator [137].
6.6 Conclusion The rôle of the VB-approximation in recursive inference of time-invariant parametric models was explored in this Chapter. Three fundamental scenarios were examined
carefully, leading, respectively, to (i) approximations of intractable marginals at each time-step, (ii) propagation of reduced sufficient statistics for dynamic exponential family (DEF) models, and (iii) achievement of conjugate updates in DEF models with hidden variables (i.e. DEFH models) where exact on-line inference cannot be achieved recursively. The VB-observation model proved to be a key mathematical object arising from use of the VB-approximation in on-line inference. The Bayes’ rule updates in scenarios II and III are with respect to this object, not the original observation model. Conjugate distributions (and, therefore, Bayesian recursive inference) may, therefore, be available in cases (like scenario III) where none exists under an exact analysis. We applied the VB-approximation to on-line inference of mixtures of autoregressive models, this being an important instance of scenario III. In fact, the VB-approximation is a flexible tool, potentially allowing other scenarios to be examined. In Chapter 7, the important extension to time-variant parameters will be studied.
7 On-line Inference of Time-Variant Parameters
The concept of Bayesian filtering was introduced in Section 2.3.2. The Kalman filter is the most famous example of a Bayesian filtering technique [113]. It is widely used in various engineering and scientific areas, including communications, control [118], machine learning [28], economics, finance, and many others. The assumptions of the Kalman filter—namely linear relations between states and a Gaussian distribution for disturbances—are, however, too restrictive for many practical problems. Many extensions of Kalman filtering theory have been proposed. See, for example, [140] for an overview. These extensions are not exhaustive, and new approaches tend to emphasize two priorities: (i) higher accuracy, and (ii) computational simplicity. Recently, the area of non-linear filtering has received a lot of attention and various approaches have been proposed, most of them based on Monte Carlo sampling techniques, such as particle filtering [141]. The main advantage of these sampling techniques is the arbitrarily high accuracy they can achieve given a sufficient amount of sampling. However, the associated computational cost can be excessive. In this Chapter, we study properties of the Variational Bayes approximation in the context of Bayesian filtering. The class of problems to which the VB method will be applied is smaller than that of particle filtering. The accuracy of the method can be undermined if the approximated distribution is highly correlated. Then, the assumption of conditional independence—which is intrinsic to the VB method—is too severe a restriction. In other cases, the method can provide inferential schemes with interesting computational properties. Various adaptations of the VB method—such as the Restricted VB method (Section 3.4.3)—can be used to improve computational efficiency at the expense of minor loss of accuracy.
7.1 Exact Bayesian Filtering All aspects of on-line inference of parameters discussed in Chapter 6—i.e. restrictions on the ‘knowledge base’ of data and computational complexity of each step— are also relevant here. Note that the techniques of Chapter 6 can be seen as a special case of Bayesian filtering, with the trivial parameter-evolution model (2.20), i.e.
$\theta_t = \theta_{t-1}$. In the case of time-variant parameters, the problem of parameter inference becomes more demanding, since it involves one extra operation, i.e. marginalization of the previous parameters (2.21), which we called the time-update or prediction. The computational flow of Bayesian filtering is displayed in Fig. 2.4.

Once again, a recursive algorithm for exact Bayesian filtering is possible if the inference is functionally invariant at each time $t$. This can be achieved if
(i) The observation model (2.19) is a member of the DEF family. Then, there exists a CDEF parameter predictor (Section 6.2.1), $f(\theta_t|D_{t-1})$, which enjoys a conjugate data update.
(ii) The parameter evolution model is such that the filtering output from the previous step, $f(\theta_{t-1}|D_{t-1})$, is updated to a parameter predictor, $f(\theta_t|D_{t-1})$, of the same form.
Then, the same functional form is achieved in both updates of Bayesian filtering (Fig. 2.4). Analysis of models (2.19), (2.20)—which exhibit these properties—was undertaken in [142, 143]. The class of such models is relatively small—the best known example is the Kalman filter (below)—and, therefore, approximate methods of Bayesian filtering must be used for many practical scenarios.

Remark 7.1 (Kalman filter). The famous Kalman filter [113] arises when (2.19) and (2.20) are linear in parameters with Normally distributed noise, as follows:
$$
f(\theta_t|\theta_{t-1}) = N_{\theta_t}(A\theta_{t-1}, R_\theta), \tag{7.1}
$$
$$
f(d_t|\theta_t, D_{t-1}) = N_{d_t}(C\theta_t, R_d), \tag{7.2}
$$
where the matrices $A$, $R_\theta$, $C$, and $R_d$ are shaping parameters, which must be known a priori. Both conditions of tractability above are satisfied, as follows:
(i) The observation model (7.2) is from the DEF family, for which the conjugate distribution is Normal, i.e. the posterior is also Normal,
$$
f(\theta_t|D_t) = N_{\theta_t}(\mu_t, \Sigma_t), \tag{7.3}
$$
with shaping parameters $\mu_t$ and $\Sigma_t$. This implies that the prior $f(\theta_1)$ is typically chosen in the form of (7.3), with prior parameters $\mu_1$ and $\Sigma_1$.
(ii) The distribution $f(\theta_t, \theta_{t-1}|D_{t-1})$ is a Normal distribution, whose marginal, $f(\theta_t|D_{t-1})$, is also Normal, and, therefore, in the form of (7.3).
The task of Bayesian filtering is then fully determined by the evolution of the shaping parameters, $\mu_t$ and $\Sigma_t$, which form the sufficient statistics of $f(\theta_t|D_t)$ (7.3). Evaluation of the distributions can, therefore, be transformed into recursions on $\mu_t$, $\Sigma_t$ [42, 118].

Remark 7.2 (Bayesian filtering and smoothing). As we saw in Section 6.3.3, the integration operator can be used after the Bayes' rule operator, marginalizing the posterior distribution, $f(\theta_t, \theta_{t-1}|D_t)$. The two cases are (i) integration over $\theta_{t-1}$, yielding $f(\theta_t|D_t)$ (this is known as Bayesian filtering); and (ii) integration over $\theta_t$, yielding $f(\theta_{t-1}|D_t)$ (this is known as Bayesian smoothing). Both options are displayed in Fig. 7.1. Note that $f(\theta_{t-1}|D_t)$ is not propagated to the next step.
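Remark 7.1 states that exact Bayesian filtering for the model (7.1)–(7.2) reduces to recursions on the shaping parameters $\mu_t$, $\Sigma_t$. The following sketch is not taken from the source; it uses the standard predict–update form of those recursions to illustrate one such step.

```python
import numpy as np

def kalman_step(mu, Sigma, d, A, R_theta, C, R_d):
    """One recursion on the shaping parameters (mu_t, Sigma_t) of
    f(theta_t | D_t) = N(mu_t, Sigma_t), for the model (7.1)-(7.2).

    Time update: f(theta_t | D_{t-1}) = N(A mu, A Sigma A' + R_theta).
    Data update: standard Kalman-gain correction using the new datum d."""
    # Time update (parameter prediction).
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + R_theta
    # Data update (Bayes' rule with the Normal observation model).
    S = C @ Sigma_pred @ C.T + R_d            # predictive covariance of d_t
    K = Sigma_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    mu_new = mu_pred + K @ (d - C @ mu_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return mu_new, Sigma_new
```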
Fig. 7.1. Bayesian filtering and smoothing: $f(\theta_{t-1}|D_{t-1})$ is multiplied by the parameter evolution model $f(\theta_t|\theta_{t-1}, D_{t-1})$ to give $f(\theta_t, \theta_{t-1}|D_{t-1})$; the Bayes' rule operator, B, then incorporates $f(d_t|\theta_t, D_{t-1})$, yielding $f(\theta_t, \theta_{t-1}|D_t)$; marginalization over $\theta_{t-1}$ gives the filtering distribution, $f(\theta_t|D_t)$, and marginalization over $\theta_t$ gives the smoothing distribution, $f(\theta_{t-1}|D_t)$.
7.2 The VB-Approximation in Bayesian Filtering

We have seen how to use the VB operator both for approximation of the full distribution (Fig. 3.3) and for generation of VB-marginals (Fig. 3.4). In the case of Bayesian filtering, we replace the marginalization operators in Fig. 7.1 with the VB-approximation operator, V (Fig. 3.4). The resulting composition of operators is illustrated in Fig. 7.2.

Fig. 7.2. The VB approximation in Bayesian filtering: the marginalization operators of Fig. 7.1 are replaced by the VB operator, V, producing the VB-marginals $\tilde f(\theta_t|D_t)$ and $\tilde f(\theta_{t-1}|D_t)$ from $\tilde f(\theta_t, \theta_{t-1}|D_t)$.
In this scenario, the joint distribution is given by Proposition 2.1:
$$
f(d_t, \theta_t, \theta_{t-1}|D_{t-1}) = f(d_t|\theta_t, D_{t-1})\, f(\theta_t|\theta_{t-1})\, \tilde f(\theta_{t-1}|D_{t-1}), \tag{7.4}
$$
where we have used the VB-marginal from the previous step, $\tilde f(\theta_{t-1}|D_{t-1})$. We seek VB-marginals of $\theta_t$ and $\theta_{t-1}$ (Fig. 7.2). Using Theorem 3.1, the minimum of KLD$_{\mathrm{VB}}$ is reached for
$$
\tilde f(\theta_t|D_t) \propto \exp\bigl\{ E_{\tilde f(\theta_{t-1}|D_t)}\bigl[\ln f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\bigr] \bigr\}
\propto \exp\bigl\{ E_{\tilde f(\theta_{t-1}|D_t)}\bigl[\ln f(d_t|\theta_t, D_{t-1}) + \ln f(\theta_t|\theta_{t-1}) + \ln \tilde f(\theta_{t-1}|D_{t-1})\bigr] \bigr\}
\propto f(d_t|\theta_t, D_{t-1}) \exp\bigl\{ E_{\tilde f(\theta_{t-1}|D_t)}\bigl[\ln f(\theta_t|\theta_{t-1})\bigr] \bigr\}, \tag{7.5}
$$
$$
\tilde f(\theta_{t-1}|D_t) \propto \exp\bigl\{ E_{\tilde f(\theta_t|D_t)}\bigl[\ln f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\bigr] \bigr\}
\propto \exp\bigl\{ E_{\tilde f(\theta_t|D_t)}\bigl[\ln f(d_t|\theta_t, D_{t-1}) + \ln f(\theta_t|\theta_{t-1}) + \ln \tilde f(\theta_{t-1}|D_{t-1})\bigr] \bigr\}
\propto \exp\bigl\{ E_{\tilde f(\theta_t|D_t)}\bigl[\ln f(\theta_t|\theta_{t-1})\bigr] \bigr\}\, \tilde f(\theta_{t-1}|D_{t-1}). \tag{7.6}
$$
Hence, the VB-approximation for Bayesian filtering—which we will call VB-filtering—can be seen as two parallel Bayes' rule updates:
$$
\tilde f(\theta_t|D_t) \propto f(d_t|\theta_t, D_{t-1})\, \tilde f(\theta_t|D_{t-1}), \tag{7.7}
$$
$$
\tilde f(\theta_{t-1}|D_t) \propto \tilde f(d_t|\theta_{t-1}, D_{t-1})\, \tilde f(\theta_{t-1}|D_{t-1}). \tag{7.8}
$$
The following approximate distributions are involved:
$$
\tilde f(\theta_t|D_{t-1}) \propto \exp\bigl\{ E_{\tilde f(\theta_{t-1}|D_t)}\bigl[\ln f(\theta_t|\theta_{t-1})\bigr] \bigr\}, \tag{7.9}
$$
$$
\tilde f(d_t|\theta_{t-1}, D_{t-1}) \propto \exp\bigl\{ E_{\tilde f(\theta_t|D_t)}\bigl[\ln f(\theta_t|\theta_{t-1})\bigr] \bigr\}. \tag{7.10}
$$
Fig. 7.3. VB-filtering, indicating the flow of VB-moments via IVB cycles: the upper Bayes' rule update combines $f(d_t|\theta_t, D_{t-1})$ with $\tilde f(\theta_t|D_{t-1})$ to give $\tilde f(\theta_t|D_t)$, while the lower update combines $\tilde f(d_t|\theta_{t-1}, D_{t-1})$ with $\tilde f(\theta_{t-1}|D_{t-1})$ to give $\tilde f(\theta_{t-1}|D_t)$.
Remark 7.3 (VB-filtering and VB-smoothing). The objects generated by the VB approximation are as follows:
• VB-parameter predictor, $\tilde f(\theta_t|D_{t-1})$ (7.9). This is generated from the parameter evolution model (2.20) by substitution of VB-moments from (7.8). It is updated by the (exact) observation model to obtain the VB-filtering distribution, $\tilde f(\theta_t|D_t)$ (see the upper part of Fig. 7.3).
• VB-observation model, $\tilde f(d_t|\theta_{t-1}, D_{t-1})$ (7.10). Once again, this is generated from the parameter evolution model (2.20), this time by substitution of VB-moments from (7.7). It has the rôle of the observation model in the lower Bayes' rule update in Fig. 7.3, updating $\tilde f(\theta_{t-1}|D_{t-1})$ from the previous time step to obtain the VB-smoothing distribution, $\tilde f(\theta_{t-1}|D_t)$.

Notes on VB-filtering and VB-smoothing
• The time update and data (Bayes' rule) update of (exact) Bayesian filtering (Fig. 2.4) are replaced by two Bayes' rule updates (Fig. 7.3).
• The VB-smoothing distribution is not propagated, but its moments are used to generate the shaping parameters of the VB-parameter predictor, $\tilde f(\theta_t|D_{t-1})$ (7.9).
• The functional forms of both inputs to the Bayes' rule operators, B, in Fig. 7.3 are fixed. Namely, (i) the VB-parameter predictor (7.9) for the VB-filtering update is determined by the parameter evolution model (2.20), and (ii) the posterior distribution, $\tilde f(\theta_{t-1}|D_{t-1})$, is propagated from the previous step. Therefore, there is no distribution in the VB-filtering scheme (Fig. 7.3) which can be assigned using the conjugacy principle (Section 6.2).
• The same functional form is preserved in the VB-filtering distribution—i.e. (7.7) via (7.9)—at each time step, $t$; i.e. $\tilde f(\theta_{t-1}|D_{t-1})$ is mapped to the same functional form, $\tilde f(\theta_t|D_t)$. This will be known as VB-conjugacy.
• Only the VB-marginals, (7.7) and (7.8), are needed to formulate the VB-equations, which are solved via the IVB algorithm (Algorithm 1).

The VB-approximation yields a tractable recursive inference algorithm if the joint distribution, $\ln f(d_t, \theta_t, \theta_{t-1}|D_{t-1})$ (7.4), is separable in parameters (3.21). However, as a consequence of Proposition 2.1, the only object affected by the VB-approximation is the parameter evolution model, $f(\theta_t|\theta_{t-1})$ (see (7.5)–(7.6)). Hence we require separability only for this distribution:
$$
f(\theta_t|\theta_{t-1}) = \exp\bigl\{ g(\theta_t)'\, h(\theta_{t-1}) \bigr\}. \tag{7.11}
$$
The only additional requirement is that all necessary moments of the VB-smoothing and VB-filtering distributions, (7.7) and (7.8), be tractable.

Remark 7.4 (VB-approximation after time update). In this Section, we have studied the replacement of the marginalization operator in Fig. 7.1 by the VB-approximation (i.e. it follows the Bayes' rule update). Approximation of the marginalization operator before the Bayes' rule update (Fig. 2.4) yields results in the same form as (7.5)–(7.6). The only difference is that the VB-moments are with respect to $\tilde f(\theta_t|D_{t-1})$ and $\tilde f(\theta_{t-1}|D_{t-1})$, instead of $\tilde f(\theta_t|D_t)$ and $\tilde f(\theta_{t-1}|D_t)$, respectively. In this case, the current observation, $d_t$, has no influence on the VB-marginals, its influence being deferred to the subsequent (exact) data update step.

Remark 7.5 (Possible scenarios). We could formulate many other scenarios for VB-filtering using the principles developed in Chapter 6. Specifically, (i) the models could be augmented using auxiliary variables; and (ii) one-step approximations could be introduced at any point in the inference procedure. The number of possible scenarios grows with the number of variables involved in the computations. In applications, the VB-objects—namely the VB-marginals, VB-observation models and VB-parameter predictors—may be used freely, and combined with other approximations.

7.2.1 The VB method for Bayesian Filtering

The VB method was derived in detail for off-line inference in Section 3.3.3. It was adapted for on-line inference of time-invariant parameters in Section 6.3.4. In this Section, we summarize the VB method in Bayesian filtering.
Step 1: Choose a Bayesian Model. We assume that we have (i) the parameter evolution model (2.20), and (ii) the observation model (2.19). The prior on $\theta_t$, $t = 1$—which will be in the same form as the VB-filtering distribution $\tilde f(\theta_t|D_t)$ (7.7)—will be determined by the method in Step 4.

Step 2: Partition the parameters, choosing $\theta_1 = \theta_t$ and $\theta_2 = \theta_{t-1}$ (3.9). However, in what follows, we will use the standard notation, $\theta_t$ and $\theta_{t-1}$. The parameter evolution model, $f(\theta_t|\theta_{t-1})$ (2.20), must have separable parameters (7.11).

Step 3: Write down the VB-marginals. We must inspect the VB-filtering (7.7) and VB-smoothing (7.8) distributions.

Step 4: Identify standard forms. Using Fig. 7.3, the standard forms are identified in the following order:
a) The VB-filtering distribution (7.7) is obtained via the VB-parameter predictor (7.9) and the (exact) observation model (2.19). The prior, $f(\theta_1)$, is chosen with the same functional form as this VB-filtering distribution.
b) The output of a) from the previous time step, $t - 1$, is multiplied by the VB-observation model (7.10), yielding the VB-smoothing distribution (7.8).

Step 5: Unchanged.

Step 6: Unchanged.

Step 7: Run the IVB algorithm. An upper bound on the number of iterations of the IVB algorithm may be enforced, as was the case in Section 6.3.4. However, no asymptotic convergence proof is known for the time-variant parameter context of VB-filtering.

Step 8: Unchanged.
7.3 Other Approximation Techniques for Bayesian Filtering Many methods of approximation exist for Bayesian filtering [140], and extensive research is currently being undertaken in this area [74, 141]. We now review three techniques which we will use for comparison with the VB approximation later in this Chapter.
7.3.1 Restricted VB (RVB) Approximation

The RVB approximation was introduced in Section 3.4.3. Recall that the Quasi-Bayes (QB) approximation is a special case, and this was specialized to the on-line time-invariant case in Section 6.4.1. The key benefit of the RVB approach is that the solution to the VB-equations is found in closed form, obviating the need for the IVB algorithm (Algorithm 1). The challenge is to propose a suitable restriction.
As suggested in Section 6.4.1, we prefer to impose the restriction on the VB-marginal which is not propagated to the next step. In this case, it is the VB-smoothing distribution, $\tilde f(\theta_{t-1}|D_t)$ (7.8), which is not propagated (Fig. 7.3). We propose two possible restrictions:

Non-smoothing restriction: the current observation, $d_t$, is not used to update the distribution of $\theta_{t-1}$; i.e.
$$
\tilde f(\theta_{t-1}|D_t) \equiv \tilde f(\theta_{t-1}|D_{t-1}).
$$
This choice avoids the VB-smoothing update (i.e. the lower part of Fig. 7.3), implying the inference scheme in Fig. 7.4.

Fig. 7.4. The RVB approximation for Bayesian filtering, using the non-smoothing restriction: only the upper Bayes' rule update of Fig. 7.3 is retained, with $\tilde f(\theta_{t-1}|D_{t-1})$ supplying the VB-moments for the VB-parameter predictor.
QB restriction: the true marginal of (7.4) is used in place of the VB-smoothing distribution:
$$
\tilde f(\theta_{t-1}|D_t) \equiv f(\theta_{t-1}|D_t) \propto \int_{\Theta_t^*} f(d_t, \theta_t, \theta_{t-1}|D_{t-1})\, d\theta_t. \tag{7.12}
$$
Note that (7.12) is now the exact smoothing distribution arising from Bayesian filtering (Fig. 7.1). The VB-filtering scheme in Fig. 7.3 is adapted to the form illustrated in Fig. 7.5.

Fig. 7.5. The QB-approximation for Bayesian filtering: the upper Bayes' rule update yields $\tilde f(\theta_t|D_t)$ as before, while the lower branch updates $\tilde f(\theta_{t-1}|D_{t-1})$ via the exact model $f(d_t, \theta_t|\theta_{t-1}, D_{t-1})$ and marginalizes over $\theta_t$ to give $f(\theta_{t-1}|D_t)$.
These restrictions modify the VB method as described in Section 3.4.3.1.
7.3.2 Particle Filtering

The particle filtering technique [141] is also known as the sequential Monte Carlo method [74]. It provides the necessary extension of stochastic approximations (Section 3.6) to the on-line case. Recall that the intractable posterior distribution (2.22) is approximated by the empirical distribution (3.59),
$$
\tilde f(\theta_t|D_t) = \frac{1}{n}\sum_{i=1}^{n} \delta\bigl(\theta_t - \theta_t^{(i)}\bigr), \tag{7.13}
$$
$$
\theta_t^{(i)} \sim f(\theta_t|D_t). \tag{7.14}
$$
Here, the i.i.d. samples, $\{\theta_t\}_n$, are known as the particles. The posterior moments of (7.13), its marginals, etc., are evaluated with ease via summations (3.60):
$$
E_{\tilde f(\theta_t|D_t)}\bigl[g(\theta_t)\bigr] = \int_{\Theta_t^*} g(\theta_t)\, \tilde f(\theta_t|D_t)\, d\theta_t = \frac{1}{n}\sum_{i=1}^{n} g\bigl(\theta_t^{(i)}\bigr). \tag{7.15}
$$
Typically, the true filtering distribution, $f(\theta_t|D_t)$, is intractable, and so the required particles (7.14) cannot be generated from it. Instead, a distribution $f_a(\theta_t|D_t)$ is introduced in order to evaluate the required posterior moments:
$$
E_{f(\theta_t|D_t)}\bigl[g(\theta_t)\bigr] = \int_{\Theta_t^*} g(\theta_t)\, \frac{f(\theta_t|D_t)}{f_a(\theta_t|D_t)}\, f_a(\theta_t|D_t)\, d\theta_t. \tag{7.16}
$$
The distribution $f_a(\theta_t|D_t)$ is known as the importance function, and is chosen—among other considerations—to be tractable, in the sense that i.i.d. sampling from it is possible. By drawing the random sample, $\{\theta_t\}_n$, from $f_a(\theta_t|D_t)$, (7.16) can be approximated by
$$
E_{f(\theta_t|D_t)}\bigl[g(\theta_t)\bigr] \approx \sum_{i=1}^{n} w_{i,t}\, g\bigl(\theta_t^{(i)}\bigr),
$$
where
$$
w_{i,t} \propto \frac{f\bigl(D_t, \theta_t^{(i)}\bigr)}{f_a\bigl(\theta_t^{(i)}|D_t\bigr)}. \tag{7.17}
$$
The constant of proportionality is provided by the constraint $\sum_{i=1}^{n} w_{i,t} = 1$. A recursive algorithm for inference of $w_{i,t}$ can be achieved if the importance function satisfies the following property with respect to the parameter trajectory, $\Theta_t = [\Theta_{t-1}, \theta_t]$ (2.16):
$$
f_a(\Theta_t|D_t) = f_a(\theta_t|D_t, \Theta_{t-1})\, f_a(\Theta_{t-1}|D_{t-1}). \tag{7.18}
$$
This is a weaker restriction than the Markov restriction imposed on $f(\Theta_t|D_t)$ by Proposition 2.1. Using (7.18), the weights (7.17) can be updated recursively, as follows:
$$
w_{i,t} \propto w_{i,t-1}\, \frac{f\bigl(d_t|\theta_t^{(i)}\bigr)\, f\bigl(\theta_t^{(i)}|\theta_{t-1}^{(i)}\bigr)}{f_a\bigl(\theta_t^{(i)}|\Theta_{t-1}^{(i)}, D_t\bigr)}.
$$
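A minimal sketch of a particle filter built on this weight recursion is given below; it is not from the source text, and it makes the common ‘bootstrap’ assumption that the importance function is the parameter evolution model itself, so that the weight update reduces to multiplication by the observation likelihood, with occasional resampling.

```python
import numpy as np

def bootstrap_particle_filter_step(particles, weights, d,
                                   sample_evolution, obs_likelihood,
                                   resample_threshold=0.5):
    """One step of a bootstrap particle filter.

    sample_evolution(theta_prev) draws theta_t^(i) from f(theta_t | theta_{t-1});
    obs_likelihood(d, theta) evaluates f(d_t | theta_t).
    With this importance function, w_{i,t} is proportional to
    w_{i,t-1} * f(d_t | theta_t^(i))."""
    n = len(particles)
    particles = np.array([sample_evolution(p) for p in particles])
    weights = weights * np.array([obs_likelihood(d, p) for p in particles])
    weights = weights / weights.sum()
    # Resample when the effective sample size becomes too small.
    ess = 1.0 / np.sum(weights ** 2)
    if ess < resample_threshold * n:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights
```

The effective-sample-size resampling step is a standard practical addition and is not discussed in the text above.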
This weight update finalizes the particle filtering algorithm. The quality of the approximation is strongly dependent on the choice of the importance function, $f_a(\theta_t|D_t)$. There is a rich literature examining various choices of importance function for different applications [74, 140, 141].

7.3.3 Stabilized Forgetting

In Section 7.1, the two requirements for computational tractability in Bayesian filtering were given as (i) choice of a DEF observation model, and (ii) invariance of the associated CDEF distribution under the time update. This second requirement may be hard to satisfy. In the forgetting technique, the exact time update operator (Fig. 2.4) is replaced by an approximate operator without the need for an explicit parameter-evolution model (2.20). This technique was developed using heuristic arguments [42, 144]. Its Bayesian interpretation, which involves optimization of KLD$_{\mathrm{VB}}$, was presented in [145]. As a result, the time update can be replaced by the following probabilistic operator:
$$
f(\theta_t|D_{t-1}, \phi_t) \propto \bigl[f(\theta_{t-1}|D_{t-1})\bigr]_{\theta_t}^{\phi_t} \times \bigl[\overline{f}(\theta_t|D_{t-1})\bigr]^{1-\phi_t}. \tag{7.19}
$$
The notation $[f(\cdot)]_{\theta_t}$ indicates the replacement of the argument of $f(\cdot)$ by $\theta_t$. $\overline{f}(\cdot)$ is a chosen alternative distribution, expressing auxiliary knowledge about $\theta_t$ at time $t$. The coefficient $\phi_t$, $0 \leq \phi_t \leq 1$, is known as the forgetting factor. The implied approximate Bayesian filtering scheme is given in Fig. 7.6.
Fig. 7.6. Bayesian filtering, with the time-update step approximated by the forgetting operator, ‘φ’: $f(\theta_{t-1}|D_{t-1})$ and the alternative $\overline{f}(\theta_t|D_{t-1})$ are combined by the forgetting operator into $f(\theta_t|D_{t-1})$, which the Bayes' rule operator, B, then updates with $f(d_t|\theta_t, D_{t-1})$ to give $f(\theta_t|D_t)$.
If both $\overline{f}(\theta_t|D_{t-1})$ and $f(\theta_{t-1}|D_{t-1})$ are conjugate to the observation model (2.19), then their geometric mean under (7.19) is also conjugate to (2.19). Then, by definition, both $f(\theta_{t-1}|D_{t-1})$ and $f(\theta_t|D_t)$ have the same functional form.

Remark 7.6 (Forgetting for the DEF family). Consider an observation model from the DEF family (6.7), and associated CDEF distributions, as follows (the normalizing constants, $\zeta$, are defined in the usual way (Proposition 6.1)):
$$
f(d_t|\theta_t, \psi_t) = \exp\bigl\{ q(\theta_t)'\, u(d_t, \psi_t) - \zeta_{d_t}(\theta_t) \bigr\}, \tag{7.20}
$$
$$
f(\theta_{t-1}|s_{t-1}) = \exp\bigl\{ q(\theta_{t-1})'\, v_{t-1} - \nu_{t-1}\,\zeta_{d_t}(\theta_{t-1}) - \zeta_{\theta_{t-1}}(s_{t-1}) \bigr\}, \tag{7.21}
$$
$$
\overline{f}(\theta_t|\overline{s}_{t-1}) = \exp\bigl\{ q(\theta_t)'\, \overline{v}_{t-1} - \overline{\nu}_{t-1}\,\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(\overline{s}_{t-1}) \bigr\}. \tag{7.22}
$$
Approximation of the time update (2.21) via the forgetting operator (7.19) yields
$$
f(\theta_t|s_{t-1}, \overline{s}_{t-1}) = \exp\bigl\{ q(\theta_t)'\bigl(\phi_t v_{t-1} + (1-\phi_t)\overline{v}_{t-1}\bigr) - \bigl(\phi_t \nu_{t-1} + (1-\phi_t)\overline{\nu}_{t-1}\bigr)\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(s_{t-1}, \overline{s}_{t-1}) \bigr\}. \tag{7.23}
$$
The data update step of Bayesian filtering, i.e. using (7.20) and (7.23) in (2.22), yields
$$
f(\theta_t|s_t) = \exp\bigl\{ q(\theta_t)'\, v_t - \nu_t\,\zeta_{d_t}(\theta_t) - \zeta_{\theta_t}(s_t) \bigr\}, \tag{7.24}
$$
with sufficient statistics, $s_t = [v_t, \nu_t]$ (Proposition 6.1), updated via
$$
v_t = \phi_t v_{t-1} + u(d_t, \psi_t) + (1-\phi_t)\overline{v}_{t-1}, \tag{7.25}
$$
$$
\nu_t = \phi_t \nu_{t-1} + 1 + (1-\phi_t)\overline{\nu}_{t-1}. \tag{7.26}
$$
t "
φt−τ u (dτ , ψτ ) + v,
(7.27)
φt−τ + ν,
(7.28)
τ =∂+1
νt =
t " τ =∂+1
for t > ∂. The alternative distribution contributes to the statistics via v, ν at all times t. This regularizes the inference algorithm in the case of poor data. 7.3.3.1 The Choice of the Forgetting Factor The forgetting factor φt is assumed to be known a priori. It can be interpreted as a tuning knob of the forgetting technique. Its approximate inference is described in [146, 147]. Its limits, 0 ≤ φt ≤ 1, are interpreted as follows: For φt = 1, then
f (θt |Dt−1 , φt = 1) = f (θt−1 |Dt−1 )θt .
This is consistent with the assumption that θt = θt−1 , being the time-invariant parameter assumption.
7.4 The VB-Approximation in Kalman Filtering
For φt = 0, then
155
f (θt |Dt−1 , φt = 0) = f (θt |Dt−1 ) .
This is consistent with the choice of independence between θt and θt−1 ; i.e. f (θt , θt−1 |Dt−1 ) = f (θt |Dt−1 ) f (θt−1 |Dt−1 ) . Typically, the forgetting factor is chosen by the designer of the model a priori. A choice of φt close to 1 models slowly varying parameters. A choice of φt close to 0 models rapidly varying parameters. We now review a heuristic technique for setting φ. Remark 7.7 (Heuristic choice of φ). We compare the exponential forgetting technique for time-variant parameters with inference based on a pseudo-stationary assumption on a sliding observation window of length h [148]. Specifically, we consider the following two scenarios: 1. time-invariant parameter inference on a sliding pseudo-stationary window of length h at any time t; 2. inference of time-variant parameters as t → ∞, using forgetting with ν = ν1 . Under 1., the degrees-of-freedom parameter of the posterior distribution is, from (6.10), (7.29) νt = h − ∂ + ν 1 , where ν1 is the prior degrees-of-freedom. Here, we are using the fact that the recursion starts after ∂ observations (Section 6.2). Under 2., the degrees-of-freedom parameter of the posterior distribution is, from (7.26), 1 − φt−∂ 1 t→∞ + ν1 −→ + ν1 . (7.30) νt = 1−φ 1−φ Equating (7.29) and (7.30), then φ=1−
1 . h−∂
(7.31)
This choice of φ yields Bayesian posterior estimates at large t which—under both scenarios—have an equal number of degrees of freedom in their uncertainty. Hence, the choice of φ reflects our prior assumptions about the number of samples, h, for which dt is assumed to be pseudo-stationary.
7.4 The VB-Approximation in Kalman Filtering Recall, from Remark 7.1, that Bayesian filtering is tractable under the modelling assumptions for the Kalman filter (Remark 7.1). Here, we examine the nature of the VB-approximation for this classical model.
156
7 On-line Inference of Time-Variant Parameters
7.4.1 The VB method Step 1: the Bayesian model is given by (7.1)–(7.2). The prior distribution will be determined in Step 4. Step 2: the logarithm of the parameter evolution model (7.1) is 1 ln f (θt |θt−1 ) = − (θt − Aθt−1 ) Rθ−1 (θt − Aθt−1 ) + γ, 2
(7.32)
where γ denotes terms independent of θt and θt−1 . The condition of separability (7.11) is fulfilled in this case. Step 3: the VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively, −1 1 −1 −1 θt A Rθ Aθt − θ , (7.33) f˜ (θt |Dt−1 ) ∝ exp − t−1 A Rθ θt − θt Rθ Aθ t−1 2 1 −θt−1 A Rθ−1 θt − θt Rθ−1 Aθt−1 + θt−1 A Rθ−1 Aθt−1 . f˜ (dt |θt−1 ) ∝ exp − 2
Recall that · denotes expectation of the argument with respect to either f˜ (θt |Dt ) or f˜ (θt−1 |Dt ), as appropriate. Note that the expectation of the quadratic term A Rθ−1 Aθt−1 in (7.32) does not appear in (7.33). This term does not depend θt−1 on θt and thus becomes part of the normalizing constant. Using the above formulae, and (7.2), the VB-marginals, (7.7) and (7.8), are, respectively, −1 1 −1 −1 θt A Rθ Aθt − θ f˜ (θt |Dt ) ∝ exp − t−1 A Rθ θt − θt Rθ Aθ t−1 + 2
1 −1 (7.34) − −θt C Rd dt − dt Rd−1 Cθt + θt C Rd−1 Cθt , 2 1 −θt−1 A Rθ−1 θt − θt Rθ−1 Aθt−1 + θt−1 A Rθ−1 Aθt−1 × f˜ (θt−1 |Dt ) ∝ exp − 2 ×f˜ (θt−1 |Dt−1 ) .
(7.35)
Step 4: we seek standard forms for a) the VB-filtering distribution (7.34), and b) the VB-smoothing distribution (7.35), using a). For a), (7.34) can be recognized as having the following standard form: f˜ (θt |Dt ) = Nθt (µt , Σt ) ,
(7.36)
−1 −1 µt = C Rd−1 C + A Rθ−1 A C Rd dt + Rθ−1 Aθ t−1 ,
−1 . Σt = C Rd−1 C + A Rθ−1 A
(7.37)
with shaping parameters
(7.38)
Hence, the prior is chosen in the form of (7.36), with initial values µ1 , Σ1 . For b), we substitute (7.36) at time t − 1 into (7.35), which is then recognized to have the following form:
$$
\tilde f(\theta_{t-1}|D_t) = N_{\theta_{t-1}}\bigl(\mu_{t-1|t}, \Sigma_{t-1|t}\bigr),
$$
with shaping parameters
$$
\mu_{t-1|t} = \bigl(A' R_\theta^{-1} A + \Sigma_{t-1}^{-1}\bigr)^{-1}\bigl(A' R_\theta^{-1} \overline{\theta}_t + \Sigma_{t-1}^{-1} \mu_{t-1}\bigr), \tag{7.39}
$$
$$
\Sigma_{t-1|t} = \bigl(A' R_\theta^{-1} A + \Sigma_{t-1}^{-1}\bigr)^{-1}. \tag{7.40}
$$
Step 5: the only VB-moments required are the first moments of the Normal distributions: $\overline{\theta}_{t-1} = \mu_{t-1|t}$, $\overline{\theta}_t = \mu_t$.

Step 6: using the VB-moments from Step 5, and substituting (7.37) into (7.39), it follows that
$$
\mu_{t-1|t} = Z\bigl(C' R_d^{-1} d_t + R_\theta^{-1} A \overline{\theta}_{t-1}\bigr) + w_t,
$$
$$
Z = \bigl(A' R_\theta^{-1} A + \Sigma_{t-1}^{-1}\bigr)^{-1} A' R_\theta^{-1}\bigl(C' R_d^{-1} C + A' R_\theta^{-1} A\bigr)^{-1},
$$
$$
w_t = \bigl(A' R_\theta^{-1} A + \Sigma_{t-1}^{-1}\bigr)^{-1} \Sigma_{t-1}^{-1} \mu_{t-1}.
$$
These can be reduced to the following set of linear equations:
$$
\bigl(I - Z R_\theta^{-1} A\bigr)\, \overline{\theta}_{t-1} = Z C' R_d^{-1} d_t + w_t,
$$
with explicit solution
$$
\overline{\theta}_{t-1} = \bigl(I - Z R_\theta^{-1} A\bigr)^{-1}\bigl(Z C' R_d^{-1} d_t + w_t\bigr). \tag{7.41}
$$

Step 7: the IVB algorithm does not arise, since the solution has been found analytically (7.41).

Step 8: report the VB-filtering distribution, $\tilde f(\theta_t|D_t)$ (7.36), via its shaping parameters, $\mu_t$, $\Sigma_t$.

Remark 7.8 (Inconsistency of the VB-approximation for Kalman filtering). The covariance matrices, $\Sigma_t$ and $\Sigma_{t-1|t}$, of the VB-marginals, (7.34) and (7.35) respectively, are data- and time-independent. Therefore, the VB-approximation yields inconsistent inferences for the Kalman filter. This is caused by the enforced conditional independence between $\theta_t$ and $\theta_{t-1}$. These parameters are, in fact, strongly correlated via the (exact) parameter evolution model (7.1). In general, this correlation is incorporated in the VB-approximation via the substitution of VB-moments (Fig. 7.3). However, in the case of the Kalman filter, (7.1) and (7.2), only the first moments are substituted (Step 5 above), and therefore correlations are not propagated.
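A minimal sketch of one VB-filtering step for this model, following the shaping-parameter expressions (7.36)–(7.41) as reconstructed above (so the formulas below inherit any idiosyncrasies of that derivation and are not a general-purpose Kalman filter), is:

```python
import numpy as np

def vb_kalman_step(mu_prev, Sigma_prev, d, A, R_theta, C, R_d):
    """One VB-filtering step for the Kalman model (7.1)-(7.2), using the
    closed-form solution (7.41) for the VB-moment of theta_{t-1} and the
    shaping parameters (7.37)-(7.38) of the VB-filtering distribution."""
    Rth_inv = np.linalg.inv(R_theta)
    Rd_inv = np.linalg.inv(R_d)
    P = C.T @ Rd_inv @ C + A.T @ Rth_inv @ A           # precision in (7.38)
    Sigma_t = np.linalg.inv(P)
    S_prev_inv = np.linalg.inv(Sigma_prev)
    M = np.linalg.inv(A.T @ Rth_inv @ A + S_prev_inv)  # cf. (7.39)-(7.40)
    Z = M @ A.T @ Rth_inv @ Sigma_t
    w = M @ S_prev_inv @ mu_prev
    I = np.eye(mu_prev.shape[0])
    theta_bar_prev = np.linalg.solve(I - Z @ Rth_inv @ A,
                                     Z @ C.T @ Rd_inv @ d + w)        # (7.41)
    mu_t = Sigma_t @ (C.T @ Rd_inv @ d + Rth_inv @ A @ theta_bar_prev)  # (7.37)
    return mu_t, Sigma_t
```

Comparing its output with the exact recursion of Remark 7.1 illustrates the inconsistency noted in Remark 7.8: the returned covariance does not depend on the data.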
7.4.2 Loss of Moment Information in the VB Approximation

Note that the VB-approximation is a sequence of two operations (Theorem 3.1): (i) the expectation operator, and (ii) normalization. Examining the $q = 2$ case, for example, these operations are applied to the logarithm of the joint distribution, $\ln f(D, \theta_1, \theta_2)$. In effect, the normalization operation removes all terms independent of the random variable from the VB-marginal. This can be illustrated with a simple example. Consider the following parameter evolution model:
$$
\ln f(\theta_1|\theta_2) = a\theta_1^2 + b\theta_1\theta_2 + a\theta_2^2. \tag{7.42}
$$
This has the quadratic form characteristic of Gaussian modelling (see, for example, (7.32) for the Kalman filter). Taking expectations, then
$$
\tilde f(\theta_1) \propto \exp\bigl\{ a\theta_1^2 + b\theta_1\overline{\theta_2} + a\overline{\theta_2^2} \bigr\} \propto \exp\bigl\{ a\theta_1^2 + b\theta_1\overline{\theta_2} + \gamma \bigr\},
$$
where the constant term, $\gamma = a\overline{\theta_2^2}$, will be consigned to the normalizing constant. In this way, second-moment information about $\theta_2$—which could have been carried via $\overline{\theta_2^2}$—is lost. Writing (7.42) in the general form, (3.24) and (3.25), then
$$
g(\theta_1) = \bigl[\theta_1^2,\; b\theta_1,\; a\bigr]', \qquad h(\theta_2) = \bigl[a,\; \theta_2,\; \theta_2^2\bigr]'.
$$
Note that the moment information is lost in respect of terms in $g(\cdot)$ or $h(\cdot)$ whose corresponding entry in the other vector is a constant. We conclude that the VB-approximation will perform best with models exhibiting a minimum number of constants in $g(\cdot)$ and $h(\cdot)$.

In fact, the considerations above can provide us with guidelines towards a reformulation of the model which might lead to more successful subsequent VB-approximation. In (7.42), we might relax the constants, $a$, via $f(a)$. Then, the lost moment, $\overline{\theta_2^2}$, will be carried through to $\tilde f(\theta_1)$ via $\overline{a\theta_2^2}$.

In the case of the Kalman filter, the quadratic terms in $\theta_t$ and $\theta_{t-1}$ enter the distribution via the known constant, $R_\theta$ (7.32). Hence, this parameter is a natural candidate for relaxation and probabilistic modelling.
7.5 VB-Filtering for the Hidden Markov Model (HMM) Consider a Hidden Markov Model (HMM) with the following two constituents: (i) a first-order Markov chain on the unobserved discrete (label) variable lt , with c possible states; and (ii) an observation process in which the labels are observed via a
7.5 VB-Filtering for the Hidden Markov Model (HMM)
159
continuous c-dimensional variable dt ∈ Ic(0,1) , which denotes the probability of each state at time t. The task is to infer lt from observations dt . For analytical convenience, we denote each state of lt by a c-dimensional elementary basis vector c (i) (see Notational Conventions on page XV); i.e. lt ∈ { c (1) , . . . , c (c)}. The probability of transition from the jth to the ith state, 1 ≤ i, j ≤ c, is Pr (lt = c (i) |lt−1 = c (j)) = ti,j , where 0 < ti,j < 1, i, j ∈ [1, . . . , c]. These transition probabilities are aggregated into the stochastic transition matrix, T , such that the column sums are unity [149], i.e. ti 1c,1 = 1. The following Bayesian filtering models are implied (known parameters are suppressed from the conditioning): f (lt |lt−1 ) = Mult (T lt−1 ) ∝ lt T lt−1 = exp (lt ln T lt−1 ) ,
f (dt |lt ) = Didt (ρlt + 1c,1 ) ∝ exp ρ ln (dt ) lt .
(7.43) (7.44)
Here Mu (·) denotes the Multinomial distribution, and Di (·) denotes the Dirichlet distribution (see Appendices A.7 and A.8 respectively). In (7.43), ln T denotes the matrix of log-elements, i.e. ln T = [ln ti,j ] ∈ Rc×c . Note that (7.43) and (7.44) satisfy Proposition 2.1. In (7.44), the parameter ρ controls the uncertainty with which lt may be inferred via dt . For large values of ρ, the observed data, dt , have higher probability of being close to the actual labels, lt (see Fig. 7.7).
r=1
r = 10
r=2
1
1
1
0 0
1 d1,t
0 0
1 d1,t
0
0
1 d1,t
Fig. 7.7. The Dirichlet distribution, f (dt |lt ), for c = 2 and ρ = {1, 2, 10}, illustrated via its scalar conditional distribution, f (d1,t |lt = 2 (1)) (full line) and f (d1,t |lt = 2 (2)) (dashed line).
7.5.1 Exact Bayesian filtering for known T Since the observation model (7.44) is from the DEF family (6.7), its conjugate distribution exists, as follows:
160
7 On-line Inference of Time-Variant Parameters
f (lt |Dt ) = Mult (βt ) .
(7.45)
The time update of Bayesian filtering (Fig. 2.4)—i.e. multiplying (7.43) by (7.45) for lt−1 , followed by integration over lt−1 —yields " f (lt |Dt−1 ) ∝ Mult (T lt−1 ) Mult−1 (βt−1 ) lt−1
∝ Mult (T βt−1 ) = Mult βt|t−1 ,
(7.46)
βt|t−1 = T βt−1 . The data update step—i.e. multiplying (7.46) by (7.44), and applying Bayes’ rule— yields
f (lt |Dt ) ∝ Didt (ρlt + 1c,1 ) Mult βt|t−1 = Mult (βt ) , βt = dρt ◦ βt|t−1 = dρt ◦ T βt−1 , (7.47) where ‘◦’ denotes the Hadamard product (the ‘.*’ operation in Matlab). Hence, exact recursive inference has been achieved, but this depends on an assumption of known T. Remark 7.9 (Inference of T when lt are observed). In the case of directly observed labels, i.e. dt = lt , distribution (7.43) can be seen as an observation model: f (dt |T, dt−1 ) = f (lt |T, lt−1 ) = Mult (T lt−1 ) . Since Mu (·) is a member of the DEF family, it has the following conjugate posterior distribution for T : (7.48) f (T |Dt ) = DiT (Qt + 1c,c ) . Here, DiT (Qt ) denotes the matrix Dirichlet distribution (Appendix A.8). Its sufficient statistics, Qt ∈ Rc×c , are updated as follows: Qt = Qt−1 + dt dt−1 = Qt−1 + lt lt−1 .
(7.49)
This result will be useful later, when inferring T . 7.5.2 The VB Method for the HMM Model with Known T In the previous Section, we derived an exact Bayesian filtering inference in the case of the HMM model with known T . Once again, it will be interesting to examine the nature of the VB-approximation in this case. Step 1: the parametric model is given by (7.43) and (7.44). Step 2: the logarithm of the parameter evolution model is ln f (lt |lt−1 , T ) = lt ln T lt−1 .
(7.50)
7.5 VB-Filtering for the Hidden Markov Model (HMM)
161
(7.50) is linear in both lt and lt−1 . Hence, it is separable. However, the explicit forms 2 2 of g (lt ) ∈ Rc ×1 and h (lt−1 ) ∈ Rc ×1 from (3.21) are omitted for brevity. Note that these do not include any constants, and so we can expect a more successful VBapproximation in this case than was possible for the Kalman filter (see Section 7.4.2). Step 3: the implied VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively, f˜ (lt |Dt−1 ) ∝ exp lt ln T l t−1 , f˜ (dt |lt−1 ) ∝ exp lt−1 ln T lt . From (7.7) (using (7.44)) and (7.8) respectively, the implied VB-marginals are (7.51) f˜ (lt |Dt ) ∝ exp (lt ρ ln (dt )) exp lt ln T l t−1 , f˜ (lt−1 |Dt ) ∝ exp lt ln T lt−1 f˜ (lt−1 |Dt−1 ) . (7.52) Step 4: the VB-filtering distribution (7.51) is recognized to have the form f˜ (lt |Dt ) = Mult (αt ) ,
(7.53)
αt ∝ dρt ◦ exp ln T l t−1 ,
(7.54)
with shaping parameter
addressing %c Step 4 a) (see Section 7.2.1). Here, the normalizing constant is found from i=1 αi,t = 1, since αt is a vector of probabilities under the definition of Mu (·). Substituting (7.53) at time t − 1 into (7.52), the VB-smoothing distribution is recognized to have the form f˜ (lt−1 |Dt ) = Mult (βt ) ,
(7.55)
βt ∝ αt−1 ◦ exp ln T lt ,
(7.56)
with shaping parameter
addressing Step 4 b). In this case, note that the posterior distributions, (7.53) and (7.55), have the same form. This is the consequence of symmetry in the observation model (7.50). This property is not true in general. Step 5: From (7.53) and (7.55), the necessary VB-moments are lt = αt , l t−1 = βt ,
(7.57) (7.58)
162
7 On-line Inference of Time-Variant Parameters
using the first moment of the Multinomial distribution (Appendix A.7). Step 6: Substitution of VB-moments, (7.57) and (7.58), into (7.56) and (7.54) respectively yields the following set of VB-equations: αt ∝ dρt ◦ exp (ln T βt ) , βt ∝ αt−1 ◦ exp (ln T αt ) . Step 7: run the IVB algorithm on the two VB equations in Step 6. Step 8: report the VB-filtering distribution, f˜ (lt |Dt ) (7.53), and the VB-smoothing distribution, f˜ (lt−1 |Dt ) (7.55), via their shaping parameters, αt and βt respectively. Their first moments are equal to these shaping parameters. Remark 7.10 (Restricted VB (RVB) for the HMM model with known T ). In Section 7.3.1, we outlined the non-smoothing restriction which, in this case, requires f˜ (lt−1 |Dt ) = f˜ (lt−1 |Dt−1 ) . In this case, βt−1 = αt−1 . Hence, the entire recursive algorithm is replaced by the following equation: (7.59) αt ∝ dρt ◦ exp (ln T αt−1 ) . In (7.47), we note that the weights, βi,t , of the exact solution are updated by the algebraic mean of ti,j , weighted by βt−1 : βi,t = dρi,t
c "
βj,t−1 ti,j , i = 1, . . . , c.
j=1
In contrast, we note that the weights of the RVB-marginal (7.59) are updated by a weighted geometric mean: αi,t = dρi,t
c
α
ti,jj,t−1 , i = 1, . . . , c.
j=1
The distinction between these cases will be studied in simulation, in Section 7.5.5. 7.5.3 The VB Method for the HMM Model with Unknown T Exact Bayesian filtering for the HMM model with unknown T is intractable, and so, on this occasion, there is a ‘market’ for use of the VB approximation. The exact inference scheme is displayed in Fig. 7.8. Note that the model now contains both time-variant and time-invariant parameters. Hence, it does not fit exactly into any of the scenarios of Chapters 6 or 7. We will therefore freely mix VB-objects arising from the various scenarios.
7.5 VB-Filtering for the Hidden Markov Model (HMM)
163
f (lt |lt−1 , T, Dt−1 ) f (lt−1 |Dt−1 )
×
f (lt , lt−1 |T, Dt−1 ) f (T |Dt−1 )
f (dt |lt , Dt−1 )
dlt−1 T
f (lt |Dt )
f (lt , lt−1 , T |Dt ) ×
B f (lt , lt−1 , T |Dt−1 )
dlt lt−1
f (T |Dt )
Fig. 7.8. Exact Bayesian filtering for the HMM model with unknown T .
Step 1: the parametric model is given by (7.43) and (7.44). However, in this case, we show the conditioning of the observation model (7.44) on unknown T : f (lt |lt−1 , T ) = Mult (T lt−1 ) .
(7.60)
The prior distribution on T will be assigned in Step 4. Step 2: the logarithm of the parameter evolution model (7.50) is unchanged, and it is separable in the parameters, lt , lt−1 and T . Step 3: the VB-parameter predictor (7.9) and the VB-observation model (7.10) are, respectively, f˜ (lt |Dt−1 ) ∝ exp lt ln T l t−1 , f˜ (dt |lt−1 ) ∝ exp lt ln T lt−1 . The resulting VB-marginals, f˜ (lt |Dt ) and f˜ (lt−1 |Dt ), have the same form as (7.51) and (7.52) respectively, under the substitution ln T → ln T. The VB-observation model (6.38) for time-invariant parameter T is . (7.61) f˜ (dt |T ) ∝ exp tr lt l t−1 ln T Step 4: using the results from Step 4 of the previous Section, the VB marginals, (7.51) and (7.52), have the form (7.53) and (7.55) respectively, with the following shaping parameters: (7.62) αt ∝ dρt ◦ exp ln T l t−1 , βt ∝ αt−1 ◦ exp ln T lt . (7.63) The VB-observation model (7.61) is recognized to be in the form of the Multinomial distribution of continuous argument (Appendix A.7): f˜ (dt |T ) = Mudt (vec (T )) .
164
7 On-line Inference of Time-Variant Parameters
Hence, we can use the results in Remark 7.9; i.e. the conjugate distribution is matrix Dirichlet (Appendix A.8), f˜ (T |Dt ) = DiT (Qt ) , with shaping parameter
(7.64)
Qt = Qt−1 + lt l t−1 . From (7.64), we set the prior for T to be f (T ) = f˜ (T ) = DiT (Q0 ) (see Step 1 above). Step 5: the necessary VB-moments of f˜ (lt |Dt ) and f˜ (lt−1 |Dt ) are given by (7.57) and (7.58), using shaping parameters (7.62) and (7.63). The only necessary VB moment of the Dirichlet distribution (7.64) is ln T , which has elements (A.51)
(7.65) ln (ti,j ) = ψΓ (qi,j,t ) − ψΓ 1c,1 Qt 1c,1 , where ψΓ (·) is the digamma (psi) function [93] (Appendix A.8). Step 6: the reduced set of VB-equations is as follows: αt ∝ dρt ◦ exp ln T βt , βt ∝ αt−1 ◦ exp ln T αt , Qt = Qt−1 + αt βt , along with (7.65). Step 7: run the IVB algorithm on the four VB equations in Step 6. Step 8: report the VB-filtering distribution, f˜ (lt |Dt ) (7.53), the VB-smoothing distribution, f˜ (lt−1 |Dt ) (7.55), and the VB-marginal of T , i.e. f˜ (T |Dt ) (7.64), via their shaping parameters, αt , βt and Qt , respectively. 7.5.4 Other Approximate Inference Techniques We now review two other approaches to approximate inference of the HMM model with unknown T . 7.5.4.1 Particle Filtering An outline of the particle filtering approach to Bayesian filtering was presented in Section 7.3.2. HMM model inference with unknown T requires combined inference of time-variant and time-invariant parameters. This context was addressed in [74, Chapter 10]. In the basic version presented there, the time-invariant parameter—T in our case—was handled via inference of a time-variant quantity, Tt , with the trivial parameter evolution model, Tt = Tt−1 . The following properties of the HMM model, (7.43) and (7.44) (with T unknown), are relevant:
7.5 VB-Filtering for the Hidden Markov Model (HMM)
165
• The hidden parameter, lt , has only c possible states. Hence, it is sufficient to generate just nl = c particles, {lt }nl . • The space of T is continuous. Hence, we generate nT particles, {T }nT , satisfying the restrictions on T , namely ti,j ∈ I(0,1) , ti 1c,1 = 1, i = 1, . . . , c. • When T is known, Bayesian filtering has an analytical solution (Section 7.5.1) with sufficient statistics βt = βt (T ) (7.47). Hence, the Rao-Blackwellization technique [74] can be used to establish the following recursive updates for the particle weights: wj,t ∝ wj,t−1 f dt |βt−1 , Tˆt−1 , Tˆt =
nT "
wj,t T (j) ,
j=1
βt = dρt ◦ Tˆt βt−1 . Here, wj,t are the particle weights (7.17), and c " f (dt |lt = c (i)) f lt = c (i)|βt−1 , Tˆt−1 , (7.66) f dt |βt−1 , Tˆt−1 = i=1
using (7.44) and (7.60). (7.66) is known as the optimal importance function [74]. Note that the particles, {T }nT , themselves are fixed in this simple implementation of the particle filter, ∀t. 7.5.4.2 Certainty Equivalence Approach In Section 7.5.1, we obtained exact inference of the HMM model in two different cases: (i) when T was known, the sufficient statistics for unknown lt were βt (7.45); (ii) when lt was known, the sufficient statistics for unknown T were Qt (7.48). Hence, a naîve approach to parameter inference of lt and T together is to use the certainty equivalence approach (Section 3.5.1). From Appendices A.8 and A.7 respectively, we use the following first moments as the certainty equivalents:
−1 Tˆt = 1c,1 1c,1 Qt Qt , lt = βt , where the notation v −1 , v a vector, denotes the vector of reciprocals, vi−1 (see Notational Conventions, Page XV). Using these certainty equivalents in cases (i) (i.e. (7.47)) and (ii) (i.e. (7.49)) above, we obtain the following recursive updates: βt = dρt ◦ Tˆt−1 βt−1 . Qt = Qt−1 + βt βt−1 .
166
7 On-line Inference of Time-Variant Parameters
7.5.5 Simulation Study: Inference of Soft Bits In this simulation, we apply all of the inference techniques above to the problem of reconstructing a binary sequence, xt ∈ {0, 1}, from a soft-bit sequence yt ∈ I(0,1) , where yt is defined as the Dirichlet observation process (7.44). The problem is a special case of the HMM, (7.43) and (7.44), where the hidden field has c = 2 states, and where we use the following assignments: lt = [xt , 1 − xt ] ,
(7.67)
dt = [yt , 1 − yt ] . Four settings of the HMM parameters were considered. Each involved one of two settings of T , i.e. T ∈ {T1 , T2 } (7.43), with 0.6 0.2 0.1 0.2 , T2 = , T1 = 0.4 0.8 0.9 0.8 and one of two settings of ρ (7.44), i.e. ρ = 1 and ρ = 2. For each of these settings, we undertook a Monte Carlo study, generating 200 soft-bit sequences, yt , each of length t = 100. For each of the four HMM settings, we examined the following inference methods: Threshold we infer xt by testing if yt is closer to 0 or 1. Hence, the bit-stream estimate is 1 if yt > 0.5, (7.68) x ˆt = round (yt ) = 0 if yt ≤ 0.5. This constitutes ML estimation of xt (2.11), ignoring the Markov chain model for xt (see Fig. 7.7). Unknown T : T is inferred from the observed data, via one of the following techniques: Naîve the certainty equivalence approach of Section 7.5.4.2. VBT the VB-approximation for unknown T , as derived in Section 7.5.3; PF100 particle filtering (Section 7.5.4), with nT = 100 particles; PF500 particle filtering (Section 7.5.4), with nT = 500 particles; Known T : we use the true value, T1 or T2 , with each of the following techniques: Exact exact Bayesian filtering, as derived in Section 7.5.1; VB the VB-approximation with known T , as derived in Section 7.5.2. The performance of each of these methods was quantified via the following criteria, where—in all but the threshold method (7.68)—xt = l 1,t (7.67) denotes the posterior mean of xt : 1. Total Squared Error (TSE): TSE =
t " t=1
2
(xt − xt ) .
7.5 VB-Filtering for the Hidden Markov Model (HMM)
167
In the case of the threshold inference (7.68), for example, this criterion is equal to the Hamming distance between the true and inferred bit-streams. 2. Misclassification counts (M): M=
t "
|round (xt ) − xt | ,
t=1
where round(·) is defined in (7.68).
T T1 T1 T2 T2
T T1 T1 T2 T2
ρ 1 2 1 2
Total Squared Error (TSE) Unknown T Threshold Naîve VBT PF100 PF500 25.37 16.20 15.81 15.68 15.66 12.67 8.25 8.20 8.15 8.11 24.88 13.65 12.29 12.65 12.54 12.34 6.87 6.65 6.86 6.83
Known T Exact VB 14.36 14.40 7.45 7.46 10.69 10.69 6.09 6.09
ρ 1 2 1 2
Misclassification counts (M) Unknown T Threshold Naîve VBT PF100 PF500 25.37 24.60 23.59 23.26 23.12 12.67 11.49 11.37 11.35 11.22 24.88 19.48 17.34 18.10 17.97 12.34 9.27 9.03 9.28 9.23
Known T Exact VB 21.08 21.11 10.33 10.38 14.73 14.75 8.35 8.35
Table 7.1. Performance of bit-stream (HMM) inference methods in a Monte Carlo study.
The results of the Monte Carlo study are displayed in Table 7.1, from which we draw the following conclusions: • The Threshold method—which ignores the Markov chain dynamics in the underlying bit-stream—yields the worst performance. • The performance of the VB-filtering method with known transition matrix, T , is comparable to the exact solution. Therefore, in this case, the VB-moments successfully capture necessary information from previous time steps, in contrast to VB-filtering in the Kalman context (Remark 7.8). • All inference methods with unknown T perform better than the Threshold method. None, however, reaches the performance of the exact solution with known T . Specifically: – For all settings of parameters, the Naîve method exhibits the worst performance. – In the case of T2 (the ‘narrowband’ case, where there are few transitions in xt ), the VB-approximation outperforms the particle filtering approach for both 100 and 500 particles. However, in the case of T1 (the ‘broadband’ case,
168
7 On-line Inference of Time-Variant Parameters
–
where there are many transitions in xt ), the particle filter outperforms the VBapproximation even for 100 particles. In all cases, however, the differences in VB and particle filtering performances are modest. Increasing the number of particles in our particle filtering approach has only a minor influence on performance. This is a consequence of the choice of fixed particles, {T }nT (Section 7.5.4.1), since the weights of most of the particles are inferred close to 0. Clearly, more sophisticated approaches to the generation of particles—such as resampling [74], kernel smoothing [74, Chapter 10], etc.—would improve performance.
7.6 The VB-Approximation for an Unknown Forgetting Factor The technique of stabilized forgetting was introduced in Section 7.3.3 as an approximation of Bayesian filtering for DEF models with slowly-varying parameters. In this technique, the time update of Bayesian filtering (2.21) is replaced by a probabilistic operator (7.19). Parameter evolution is no longer modelled explicitly via (2.20), but by an alternative distribution, f (θt |Dt−1 ), and a forgetting factor, φt . The forgetting factor, φt , is an important tuning knob of the technique, as discussed in Section 7.3.3.1. Typically, it is chosen as a constant, φt = φ, using heuristics of the kind in Remark 7.7. Its choice expresses our belief in the degree of variability of the model parameters, and, consequently, in the stationarity properties of the observation process, dt . In many situations, however, the process will migrate between epochs of slowly non-stationary (or fully stationary) behaviour and rapidly non-stationary behaviour. Hence, we would like to infer the forgetting factor on-line using the observed data. In the context of the Recursive Least Squares (RLS) algorithm, a rich literature on the inference of φt exists [147]. In this Section, we will consider a VB approach to Bayesian recursive inference of φt [150]. In common with all Bayesian techniques, we treat the unknown quantity, φt , as a random variable, and so we supplement model (7.19) with a prior distribution on φt at each time t: (7.69) φt ∼ f (φt |Dt−1 ) ≡ f (φt ) . Here, we assume that (i) the chosen prior is not data-informed [8], and (ii) that φt is an independent, identically-distributed (i.i.d.) process. This yields an augmented parameter predictor, f (θt , φt |Dt−1 ). The posterior distribution on θt is then obtained via a Bayes’ rule update and marginalization over φt , as outlined in Fig. 7.9 (upper schematic). However, the required marginalization is intractable. We overcome this problem by replacing the marginalization with VB-marginalization (Fig. 3.4). The resulting VB inference scheme is given in Fig. 7.9 (lower schematic), where the rôle of φt will be explained shortly. From (7.19) and (7.69), the joint distribution at time t is f (dt , θt , φt |Dt−1 ) ∝ f (dt |θt , Dt−1 )
φt 1 f (θt−1 |Dt−1 )θt × (7.70) ζθt (φt )
1−φt
×f (θt |Dt−1 )
f (φt ) .
(7.71)
7.6 The VB-Approximation for an Unknown Forgetting Factor f (θt |Dt−1 ) f (θt−1 |Dt−1 )
φ
169
f (φt ) f (dt |θt , Dt−1 ) ×
B
dφt
f (θt |Dt )
f (θt , φt |Dt−1 ) f (θt |Dt−1 ) f˜(θt−1 |Dt−1 )
φt
f (φt ) f (dt |θt , Dt−1 ) × B ˜ f (θt , φt |Dt−1 )
V
f˜(θt |Dt ) f˜(φt |Dt )
Fig. 7.9. Bayesian filtering with an unknown forgetting factor, φt , and its VB-approximation.
Note that the normalizing constant, ζθt (φt ), must now be explicitly stated, since it is a function of unknown φt . We require the VB-marginals for θt and φt . Using Theorem 3.1, the minimum of KLDVB (3.6) is reached for f˜ (θt |Dt ) ∝ f (dt |θt , Dt−1 ) × × exp Ef˜(φt |Dt ) φt ln f (θt−1 |Dt−1 )θt − (1 − φt ) ln f (θt |Dt−1 ) φt t 1−φ f (θt |Dt−1 ) , (7.72) ∝ f (dt |θt , Dt−1 ) f (θt−1 |Dt−1 )θt 1 f˜ (φt |Dt ) ∝ f (φt ) × (7.73) ζθt (φt ) × exp Ef˜(θt |Dt ) φt ln f (θt−1 |Dt−1 )θt − (1 − φt ) ln f (θt−1 |Dt−1 ) . Comparing (7.72) with (7.19), we note that the VB-filtering distribution now has the form of the forgetting operator, with the unknown forgetting factor replaced by its VB-moment, φt , calculated from (7.73). No simplification can be found for this VBmarginal on φt . The implied VB inference scheme for Bayesian filtering is given in Fig. 7.10. Note that it has exactly the same form as the stabilized forgetting scheme in Fig. 7.6. Now, however, IVB iterations (Algorithm 1) are required at each time t in order to elicit φt . This inference scheme is tractable for Bayesian filtering with a DEF observation model and CDEF parameter distributions (Remark 7.6). The only additional requirement is that the first moment of the VB-marginal, f˜ (φt |Dt ), be available. 7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters In this Section, we apply the Bayesian filtering scheme with unknown forgetting to an AR model with time-variant parameters. We will exploit the VB inference scheme derived above (Fig. 7.10) in order to achieve recursive identification.
170
7 On-line Inference of Time-Variant Parameters f (dt |θt , Dt−1 ) f (θt |Dt−1 ) ˜ f (θt |Dt−1 ) φt
f˜(θt−1 |Dt−1 )
B
f˜(θt |Dt )
f˜(φt |Dt ) Fig. 7.10. The VB inference scheme for Bayesian filtering with an unknown forgetting factor, t generated via IVB cycles. φt , expressed in the form of stabilized forgetting, with φ
Step 1: Choose a Bayesian model The AR observation model (6.15) is from the DEF family, and so inference with a known forgetting factor is available via Remark 7.6. Furthermore, the conjugate distribution to the AR model is Normal-inverse-Gamma (N iG) (6.21). Using Remark 7.6, the update of statistics is as follows: f (at , rt |Vt (φt ) , νt (φt )) = N iG (Vt (φt ) , νt (φt )) , Vt (φt ) = φt Vt−1 + ϕt ϕt + (1 − φt ) V ,
(7.74)
νt (φt ) = φt νt−1 + 1 + (1 − φt ) ν. The distribution of φt will be revealed in Step 4. Step 2: Partition the parameters We choose θ1 = {at , rt } and θ2 = φt . From (7.74), and using (6.22), the logarithm of the joint posterior distribution is ln f (at , rt , φt |Dt ) = − ln ζat ,rt (Vt (φt ) , νt (φt )) + ln f (φt ) + −0.5 (φt νt−1 + 1 + (1 − φt ) ν) ln rt +
1 − rt−1 [−1, at ] φt Vt−1 + ϕt ϕt + (1 − φt ) V [−1, at ] . 2 This is separable in θ1 and θ2 . Step 3: Write down the VB-marginals Recall, from (7.72), that the VB-marginal of the AR parameters is in the form of the forgetting operator (7.19). Hence, we need only replace φt with φt in (7.74): f˜ (at , rt |Dt ) = N iW Vt φt , νt φt . (7.75) The VB-marginal of φt is
7.6 The VB-Approximation for an Unknown Forgetting Factor
171
f˜ (φt |Dt ) = exp − ln ζat ,rt (Vt (φt ) , νt (φt )) − 0.5φt (νt−1 + ν) ln rt +
1 − tr φt Vt−1 + V Ef˜(at ,rt |Dt ) [−1, at ] rt−1 [−1, at ] 2 + ln f (φt ) ,
(7.76)
where
Ef˜(at ,rt |Dt ) [−1, at ] rt−1
[−1, at ]
=
−1 −1 −r t r t t a . −1 −1 −r t a t rt at t a
(7.77)
Step 4: Identify standard forms From (7.74), the shaping parameters of (7.75) are Vt φt = φt Vt−1 + ϕt ϕt + 1 − φt V , νt φt = φt νt−1 + 1 + 1 − φt ν.
(7.78) (7.79)
However, (7.76) does not have a standard form. This is because of the normalizing constant, ζat ,rt (Vt (φt ) , νt (φt )) ≡ ζat ,rt (φt ). Specifically, we note from (6.23) that ζat ,rt (·) is a function of |Vt (φt )|, which is a polynomial in φt of order m + 1. In order to proceed, we will have to find a suitable approximation, ζ˜at ,rt (φt ). Proposition 7.1 (Approximation of Normalizing Constant, ζat ,rt (·)). We will match the extreme points of ζat ,rt (Vt (φt ) , νt (φt )) at φt = 0 and φt = 1. From (7.78),
ζat ,rt (Vt (0) , νt (0)) = ζat ,rt V , ν , (7.80) ζat ,rt (Vt (1) , νt (1)) = ζat ,rt (Vt−1 + ϕt ϕt , νt−1 + 1) . (7.81) Next, we interpolate ζat ,rt with the following function: ζat ,rt (φt ) ≈ ζ˜at ,rt (φt ) = exp (h1 + h2 φt ) ,
(7.82)
where h1 and h2 are unknown constants. Matching (7.82) at the extrema, (7.80) and (7.81), we obtain
h1 = ln ζat ,rt V , ν , (7.83)
(7.84) h2 = ln ζat ,rt (Vt−1 + ϕt ϕt , νt−1 + 1) − ln ζat ,rt V , ν . Using (7.82) in (7.76), the approximate VB-marginal of φt is f˜ (φt |Dt ) ≈ tExpφt (bt , [0, 1]) f (φt ) ,
(7.85)
172
7 On-line Inference of Time-Variant Parameters
where tExp (bt , [0, 1]) denotes the truncated Exponential distribution with shaping parameter bt , restricted to support I[0,1] (Appendix A.9). From (7.76) and (7.82)– (7.84), bt = ln ζat ,rt (Vt−1 + ϕt ϕt , νt−1 + 1) +
− ln ζat ,rt V , ν − 0.5 (νt−1 + ν) ln rt +
1 − tr Vt−1 + V Ef˜(at ,rt |Dt ) [−1, at ] rt−1 [−1, at ] , (7.86) 2 where E· [·] is given by (7.77). It is at this stage that we can choose the prior, f (φt ) (see Step 1 above). We recognize tExpφt (bt , [0, 1]) in (7.85) as the VB-observation model for φt (Remark 6.4). Its conjugate distribution is f (φt ) = tExpφt (b0 , [0, 1]) ,
(7.87)
with chosen shaping parameter, b0 . Substituting (7.87) into (7.85), then f˜ (φt |Dt ) ≈ tExpφt (bt + b0 , [0, 1]) . Note that, for b0 = 0, the prior (7.87) is a uniform distribution, f (φt ) = Uφt ([0, 1]). This is the non-informative conjugate prior choice (Section 2.2.3). Step 5: Formulate necessary VB-moments From Appendices A.3 and A.9, we can assemble all necessary VB-moments as follows: −1 Va1,t , at = Vaa,t
(7.88)
−1 r = νt Λ−1 t t , −1 −1 t r t , a t rt at = 1 + a t a ln rt = − ln 2 − ψΓ (νt − 1) + ln λt ,
exp (bt + b0 ) (1 − bt − b0 ) − 1 , φt = (bt + b0 ) (1 − exp (bt + b0 ))
(7.89)
where matrix Vt is partitioned as in (6.24). Step 6: Reduce the VB-equations No simplification of the VB equations (7.78), (7.79), (7.86) and (7.88)–(7.89) was found. Step 7: Run the IVB algorithm The IVB algorithm iterates on the equations listed in Step 6.
7.6 The VB-Approximation for an Unknown Forgetting Factor
173
Step 8: Report the VB-marginals The VB-marginal of the AR parameters, f˜ (θt |Dt ) (7.75), is reported via its shaping parameters, Vt and νt . There may also be interest in the VB inference of φt , in which case its VB-marginal, f˜ (φt |Dt ) (7.85), is reported via bt . 7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting At the beginning of Section 7.6, we pointed to the need for a time-variant forgetting factor, φt , in cases where the process, dt , exhibits both pseudo-stationary and non-stationary epochs. There is then no opportunity for prior tuning of the forgetting factor to a constant value, φt = φ, reflecting a notional pseudo-stationary window length, h (Remark 7.7). This is the difficulty we encounter in the important context of changepoints [151], such as arise in speech. Here, the parameters switch rapidly between pseudo-constant values. A related issue is that of initialization in recursive inference of stationary processes (Section 6.5.3). In this problem, the on-line data may initially be incompatible with the chosen prior distribution, and so the distribution must adapt rapidly, via a low forgetting factor, φt . The mismatch diminishes with increasing data, and so the parameter evolution should gradually be switched off, by setting φt → 1. We now design experiments to study these two contexts for unknown timevariant forgetting: 1. Inference of an AR process with switching parameters (changepoints), where we study the parameter tracking abilities of the VB inference scheme in Fig. 7.9. 2. Inference of an AR process with time-invariant parameters (stationary process), where we study the behaviour of the VB inference scheme in Fig. 7.9, in the context of initialization. 7.6.2.1 Inference of an AR Process with Switching Parameters
A univariate, second-order (m = 2, such that ψt = [dt−1 , dt−2 ] ) stable AR model was simulated with parameters rt = r = 1, and 1.8 −0.29 at ∈ , . −0.98 −0.98 Abrupt switching between these two cases occurs every 30 samples. The model was identified via VB-marginals, (7.75) and (7.85), as derived above. The prior distribution is chosen equal to the alternative distribution. Their statistics were chosen as
V1 = V = diag [1, 0.001, 0.001] , ν1 = ν = 10, (7.90) corresponding to the prior estimates, a1 = [0, 0], var (a1,1 ) = var (a2,1 ) = 1000 and r1 = 10. We have tested two variants of the IVB algorithm:
174
7 On-line Inference of Time-Variant Parameters
/ / / [i] [i−1] / (i) the IVB(full) algorithm, which was stopped when /φt − φt / < 0.001. (ii) the IVB(2) algorithm, with the number of iterations set at 2. [1] The forgetting factor was initialized with φt = 0.7 in both cases, ∀t. The identification results are displayed in Fig. 7.11. Note that the VB inference—to within one time-step—correctly detects a change of parameters. At these changepoints, the forgetting factor is estimated as low as φt = 0.05 (which occurred at t = 33). This achieves a ‘memory-dump’, virtually replacing Vt and νt by the alternative (prior) values, V and ν. Thus, identification is restarted. Note that the number of iterations of the IVB algorithm (Algorithm 1) is significantly higher at each changepoint. Therefore, at these points, the expected value of the forgetting factor, φt , obtained using the truncated IVB(2) algorithm, remains too high compared to the converged value obtained by the IVB(full) algorithm (Fig. 7.11). As a result, tracking is sluggish at the changepoints when using the IVB(2) algorithm. The results of identification using a time-invariant forgetting factor, φt = φ = 0.9, are also displayed in Fig. 7.11. Overall, the best parameter tracking is achieved using the VB-marginal, f˜ (at , r|Dt ), evaluated using the IVB(full) algorithm. Identification of the process using the IVB(2) algorithm is acceptable only if the parameter variations are not too rapid.
7.6.2.2 Initialization of Inference for a Stationary AR Process We have already seen in Section 6.5.3 how the prior, f (θ), can damage the sufficient statistics used for identification of time-invariant parameters, and this effect can be felt even for large t. The forgetting technique can help here. If the forgetting factor is low when t is small, the inference of the prior can be removed rapidly. For large t, there is no need for this. Hence, there is a rôle for a time-variant forgetting factor even for time-invariant parameters. We will use the VB-approximation (Fig. 7.9) to infer the time-variant forgetting factor in this context. In other work, discount schedules have been proposed to overcome this problem of initialization. The one proposed in [26] was 1 , (7.91) η1 (t − 2) + η2 where η1 > 0 and η2 > 0 are a priori chosen constants which control the rate → 1 as t → ∞. We will next examine of forgetting for t > 2. Note that φDS t the performance of this pre-set discount schedule in overcoming the initialization problem, and compare it to the performance of the on-line VB inference scheme (Fig. 7.9), where φt is inferred from the data. A stationary, univariate, second-order (m = 2), stable AR model was simulated with parameters a = [1.8, −0.98] and r = 1. The results of parameter identification are displayed in Fig. 7.12. The prior and alternative distributions were once again set [1] via (7.90), and the forgetting factor was initialized with φt = 0.7, ∀t. Identification using the discount factor (7.91) is also illustrated in Fig. 7.12, with η1 = η2 = 1. =1− φDS t
- 50
3 2 1 0 -1
forgetting factor a1,t and a 1,t , IVB iterations ft ((f = 0.9))
(IVB(2))
a1,t and a 1,t , (IVB(full))
process dt
0
3 2 1 0 -1
number of
175
50
a1,t and a 1,t ,
Non-stationary
7.6 The VB-Approximation for an Unknown Forgetting Factor
3 2 1 0 -1 1 IVB(full) IVB(2) fix ed f
0.5 0 20 10 0 0
10
20
30
40
50
60
70
80
90
100
Fig. 7.11. Identification of an AR process exhibiting changepoints, using the VB inference scheme with unknown, time-variant forgetting. In sub-figures 2–4, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds. In sub-figure 6, the number of IVB cycles required for convergence at each time-step is plotted.
7 On-line Inference of Time-Variant Parameters
0
(IVB(full))
1 0.8 0.6 5
a1,t and a 1,t ,
3 2 1 0 -1
IVB iterations
-20
and ft number of
20
f DS t
Stationary process dt
176
VB-inferred forgetting factor Ad-hoc discount factor
4 3 0
10
20
30
40
50
60
70
80
90
100
Fig. 7.12. Identification of a stationary AR process using the VB inference scheme with unknown, time-variant forgetting. In the second sub-figure, the full line denotes simulated values of parameters, the dashed line denotes posterior expected values, and dotted lines denote uncertainty bounds.
Note that, for t < 15, the expected value of the unknown forgetting factor, φt , is very close to the discount factor. However, as t → ∞, φt does not converge to unity but to a smaller, invariant value (0.92 in this simulation, for t > 20). This is a consequence of the stationary alternative distribution, f (·), since V and ν are always present in Vt−1 and νt−1 (see (7.27) and (7.28) for the φt = φ case). The priors which are used in the VB inference scheme can be elicited using expert knowledge, side information, etc. In practical applications, it may well be easier to choose these priors (via V and ν) than to tune a discount schedule via its parameters, η1 and η2 . Note that number of iterations of the IVB(full) algorithm is low (typically four, as seen in Fig. 7.12). Hence, the truncation of the number of iterations to two in the IVB(2) algorithm of the previous Section yields almost identical results to the IVB(full) algorithm.
7.7 Conclusion In this Chapter, we applied the VB-approximation to the important problem of Bayesian filtering. The interesting outcome was that the time update and data (Bayes’
7.7 Conclusion
177
rule) update were replaced by two Bayes’ rule updates. These generated a VBsmoothing and a VB-filtering inference. The VB-approximation imposes a conditional independence assumption. Specifically, in the case of on-line inference of time-variant parameters, the conditional independence was imposed between the parameters at times t and t − 1. The possible correlation between the partitioned parameters is incorporated in the VBapproximation via the interacting VB-moments. We have studied the VB-approximation for two models: (i) the Kalman filter model, and (ii) the HMM model. We experienced variable levels of success. The Kalman filter uses Gaussian distributions with known covariance, and is not approximated well using the VB approach, for reasons explained in Section 7.4.2. Relaxing the covariance matrices via appropriate distributions might well remedy this problem. In the HMM model, no such problems were encountered, and we were able to report a successful VB-approximation. We reviewed the case where the time update of Bayesian filtering is replaced by a forgetting operator. Later in the Chapter, we studied the VB-approximation in the case of an unknown, time-variant forgetting factor, φt . This parameter behaves like a hidden variable, augmenting the observation model. In adopting an independent distribution for φt at each time t, the required VB-approximation is exactly the one used in scenario III of Chapter 6. A potential advantage of this approach is that the independence assumption underlying the VB-approximation is compatible with this independent modelling of φt . The VB inference algorithm with unknown forgetting was successfully applied to inference of an AR model with switching parameters. It also performed well in controlling the influence of the prior in the early stages of inference of a stationary AR process.
8 The Mixture-based Extension of the AR Model (MEAR)
In previous Chapters, we introduced a number of fundamental scenarios for use of the VB-approximation. These enabled tractable (i.e. recursive) on-line inference in a number of important time-invariant and time-variant parameter models. These scenarios were designed to point out the key inferential objects implied by the VBapproximation, and to understand their rôle in the resulting inference schemes. In this Chapter, we apply the VB-approximation to a more substantial signal processing problem, namely, inference of the parameters of an AR model under an unknown transformation/distortion. Our two principal aims will be (i) to show how the VB-approximation leads to an elegant recursive identification algorithm where none exists in an exact analysis; and (ii) to explain how such an algorithm can address a number of significant signal processing tasks, notably • inference of unknown transformations of the data; • data pre-processing, i.e. removal of unwanted artifacts such as outliers. This extended example will reinforce our experience with the VB method in signal processing.
8.1 The Extended AR (EAR) Model On-line inference of the parameters of the univariate (scalar) AutoRegressive (AR) model (6.14) was reviewed in Section 6.2.2. Its signal flowgraph is repeated in Fig. 8.1 for completeness. We concluded that the AR observation model is a member of the Dynamic Exponential Family (DEF), for which a conjugate distribution exists in the form of the Normal-inverse-Gamma (N iG) distribution (6.22). In other words, N iG is a CDEF distribution (Proposition 6.1). However, the N iG distribution is conjugate to a much richer class of observation models than merely the AR model, and so all such models are recursively identifiable. We now introduce this more general class, known as the Extended AR (EAR) model. Let us inspect the N iG distribution, which has the following CDEF form (6.9):
180
8 The Mixture-based Extension of the AR Model (MEAR)
f (a, r|Dt ) ∝ exp q (a, r) vt − νt ζdt (a, r) .
(8.1)
Following Proposition 6.1, the most general associated DEF observation model (6.7) for scalar observation, dt , is built from q (a, r) above:
f (dt |a, r, ψt ) = exp q (a, r) u (dt , ψt ) + ζdt (a, r) . (8.2) Here, the data function, u (dt , ψt ), may be constructed more generally than for the AR model. However, for algorithmic convenience, we will preserve its dyadic form (6.18): u (dt , ψt ) = vec (ϕt ϕt ) .
(8.3)
In the AR model, the extended regressor was ϕt = [dt , ψt ] . We are free, however, to choose a known transformation of the data in defining ϕt : ψt = ψ (dt−1 , . . . , dt−∂ , ξt ) ∈ Rm , ϕ1,t = ϕ1 (dt , . . . , dt−∂ , ξt ) ∈ R,
ϕt = [ϕ1,t , ψt ] ∈ Rm+1 ,
(8.4) (8.5) (8.6)
where ψt is an m-dimensional regressor, ϕt is an (m + 1)-dimensional extended regressor, and ξt is a known exogenous variable. The mapping,
ϕ = [ϕ1 , ψ ] ,
(8.7)
must be known. Using (8.4) in (8.2), the Extended AR (EAR) observation model is therefore defined as f (dt |a, r, ψt , ϕ) = |Jt (dt )| Nϕ1 (dt ) (a ψt , r) ,
(8.8)
where Jt (·) is the Jacobian of the transformation ϕ1 (8.4): Jt (dt ) =
dϕ1 (dt , . . . , dt−∂ , ξt ) . ddt
(8.9)
This creates an additional restriction that ϕ1 (8.5) be a differentiable, one-to-one mapping for each setting of ξt . Moreover, ϕ1 must explicitly be a function of dt in order that Jt = 0. This ensures uncertainty propagation from et to dt (Fig. 8.1): dt = ϕ−1 1 (ϕ1,t , dt−1 , . . . , dt−∂ , ξt ) . The signal flowgraph of the scalar EAR model is displayed in Fig. 8.1 (right). Recall, from Section 6.2.2, that et is the innovations process, and r = σ 2 . In the case of multivariate observations, dt ∈ Rp , with parameters A and R, all identities are adapted as in Remark 6.3. Furthermore, (8.9) is then replaced by a Jacobian matrix of partial derivatives with respect to di,t .
8.1 The Extended AR (EAR) Model ϕ−1 1
et et
dt
σ dt−1 a1
dt
ψ1,t ϕ2
z −1
a1 ψ2,t
dt−2 z
a2
σ
181
ϕ3
−2
a2 ψm,t
dt−m am
ϕm+1
z −m
am
Fig. 8.1. The signal flowgraph of the univariate AR model (left) and the univariate Extended AR (EAR) model (right).
8.1.1 Bayesian Inference of the EAR Model Bayesian inference of the EAR model is, by design, exactly the same as for the AR model (6.15)–(6.24). The only change is that ϕt —used in the dyadic update of the extended information matrix, Vt (6.25)—is now defined more generally via (8.6): Vt = Vt−1 + ϕt ϕt , t > ∂.
(8.10)
The update for νt (6.25) is unchanged: νt = νt−1 + 1, t > ∂. The EAR model class embraces, by design, the AR model (6.14), and the standard AutoRegressive model with eXogenous variables (ARX), where ψt = [dt−1 , . . . , dt−m , ξt ] in (6.15). The following important cases are also included [42]: (i) the ARMA model with a known MA part; (ii) an AR process, ϕ1,t , observed via a known bijective non-linear transformation, dt = ϕ−1 1 (ϕ1,t ); (iii) the incremental AR process with the regression defined on increments of the measurement process. Remark 8.1 (Prediction). Jacobians are not required in EAR identification, but they become important in prediction. The one-step-ahead predictor is given by the ratio of normalizing coefficients (6.23), a result established in general in Section 2.3.3. Hence, we can adapt the predictor for the AR model (6.27) to the EAR case using (8.8) and (8.4):
f (dt+1 |Dt ) = |Jt (dt+1 )| . where ϕt+1 = ϕ1 (dt+1 ) , ψt+1
ζa,r Vt + ϕt+1 ϕt+1 , νt + 1 √ , 2πζa,r (Vt , νt )
(8.11)
182
8 The Mixture-based Extension of the AR Model (MEAR)
8.1.2 Computational Issues A numerically efficient evaluation of (8.10), and subsequent evaluation of moments (6.26), is based on the LD decomposition [86]; i.e. Vt = Lt Tt Lt , where Lt is lowertriangular and Tt is diagonal. The update of the extended information matrix (8.10) is replaced by recursions on Lt and Tt [152]. This approach is superior to accumulation of the full matrix Vt , for the following reasons: 1. Compactness: all operations are performed on triangular matrices involving just 2 (p + m) (p + m) /2 elements, compared to (p + m) for full V t. 2 2. Computational Efficiency: the dyadic update (8.10) requires O (p + m) operations in each step to re-evaluate Lt , Tt , followed by evaluation of the normalizing constant (6.23), with complexity O(p + m). The evaluation of the first 2
moment, at (6.26), involves complexity O (p + m) . In contrast, these op 3 erations are O (p + m) , using the standard inversion of the full matrix Vt . Nevertheless, if the matrix inversion lemma [42, 108] is used, the operations have the same complexity as for the LD decomposition. 3. Regularity: elements of Tt are certain to be positive, which guarantees positivedefiniteness of Vt . This property is unique to the LD decomposition.
8.2 The EAR Model with Unknown Transformation: the MEAR Model We now relax the EAR assumption which requires ϕ (8.7) to be known. Instead, we consider a finite set, {ϕ}c , of possible transformations, called the filter-bank: ! (8.12) {ϕ}c = ϕ(1) , ϕ(2) , . . . , ϕ(c) . We assume that the observation, dt , at each time t, was generated by one element of {ϕ}c . (8.8) can be rewritten as c li,t
(i) f (dt |a, r, {ψt }c , {ϕ}c , lt ) = f dt |a, r, ψt , ϕ(i) .
(8.13)
i=1
Here, the active transformation is labelled by a discrete variable, lt = [l1,t , . . . , lc,t ] ∈ { c (1) , . . . , c (c)}, where c (i) is the i-th elementary vector in Rc (see Notational Conventions on page XVII). lt constitutes a field of hidden variables which we model via a first-order homogeneous Markov chain (7.43), with transition matrix T ∈ Ic×c (0,1) : f (lt |T, lt−1 ) = M ult (T lt−1 ) ∝ exp (lt ln T lt−1 ) .
(8.14)
Here, ln T is a matrix of log-elements of T ; i.e. ln T = [ln ti,j ]. From (8.13) and (8.14), the observation model is augmented by these hidden variables, lt , as follows:
8.3 The VB Method for the MEAR Model
183
f (dt , lt |a, r, T, lt−1 , {ψt }c , {ϕ}c ) = f (dt |a, r, {ψt }c , {ϕ}c , lt ) f (lt |T, lt−1 ) . (8.15) Marginalization over lt yields an observation model in the form of a probabilistic Mixture of EAR components with common AR parameterization, a, r: f (dt |a, r, T, lt−1 , {ψt }c , {ϕ}c ) =
c "
f (dt , lt = ec (i) |a, r, T, lt−1 , {ψt }c , {ϕ}c ) .
i=1
(8.16) This defines the Mixture-based Extended AutoRegressive (MEAR) model. We note the following: • The MEAR model involves both time-variant parameters, lt , and time-invariant parameters, a, r and T . • Model (8.16) is reminiscent of the DEFH model (6.39) of scenario III (Section 6.3.3). However, the MEAR model in not, in fact, a DEFH model (6.43) since the labels, lt , are not independently distributed at each time, t, but evolve as a Markov chain (8.14). • This Markov chain parameter evolution model (8.14) is the same as the one (7.43) studied in Section 7.5.3. However, we have now replaced the Dirichlet observation model (7.44) with the EAR observation model (8.13). • In common with all mixture-based observation models, exact Bayesian recursive identification of the MEAR model is impossible because it is not a member of the DEF family. We have already seen an example of this in Section 6.5. We have lost conjugacy (and, therefore, recursion) in our ambition to extend the AR model. We will now regain conjugacy for this extended model via the VB approximation.
8.3 The VB Method for the MEAR Model The MEAR parameter evolution model (8.14) for lt is the same as the one studied in Section 7.5.3. Therefore, some of the results derived in Section 7.5.3 can be used here. Step 1: Choose a Bayesian model The MEAR model is defined by (8.13) and (8.14). As before, the prior distributions, f (a, r), f (T ) and f (lt ) will be derived in Step 4, using conjugacy. Step 2: Partition the parameters The logarithm of the augmented observation model (8.15), using (8.13) and (8.14), is
184
8 The Mixture-based Extension of the AR Model (MEAR)
ln f (dt , lt |a, r, T, lt−1 , {ψt }c , {ϕ}c ) ∝ / c / " 1 / (i) / 1 (i) (i) ∝ li,t ln /Jt / − ln r − r−1 [−1, a ] ϕt ϕt [−1, a ] +lt ln T lt−1 , 2 2 i=1 (8.17) (i)
where the Jt are given by (8.9). We choose to partition the parameters of (8.17) as θ1 = {a, r, T }, θ2 = lt and θ3 = lt−1 . Step 3: Write down the VB-marginals Using (7.10), the VB-observation model parameterized by the time-variant parameter, lt−1 , is T t lt−1 . f˜ (dt |lt−1 ) ∝ exp lt ln (8.18) Using (6.38), the VB-observation models parameterized by the time-invariant parameters, {a, r} and T , are 1 ˜ (8.19) f (dt |a, r) ∝ exp − ln r × 2
1 × exp − r−1 2
[−1, a ]
c "
(i) (i) l i,t ϕt ϕt
[−1, a ]
,
i=1
. f˜ (dt |T ) ∝ exp tr lt l t−1 ln T
(8.20)
The VB-parameter predictor (7.9) for lt is f˜ (lt |Dt−1 ) ∝ exp lt υt + ln , T t l t−1
(8.21)
where υt ∈ Rc with elements / / 1 (i) (i) / (i) / 1 Ef (a,r|Dt ) [−1, a ] r−1 [−1, a ] , υi,t = ln /Jt / − ln rt − tr ϕt ϕt 2 2 (8.22) and Ef (a,r|Dt ) [−1, a ] r−1 [−1, a ] =
−1 r t
−1 −r t at
−1 ar −1 a − at r t t
.
8.3 The VB Method for the MEAR Model
185
Step 4: Identify standard forms The VB-parameter predictor (8.21) can be recognized as having the following form: f˜ (lt |Dt−1 ) = Mult (αt ) ,
(8.23)
i.e. the Multinomial distribution (Appendix A.7). Using (8.22), its shaping parameter is T t l (8.24) αt ∝ exp υt + ln t−1 . The VB-observation models, (8.18)–(8.20), can be recognized as Multinomial, Normal and Multinomial of continuous argument, respectively. Their conjugate distributions are as follows: f˜ (lt−1 |Dt ) = Dilt−1 (βt ) , f˜ (a, r|Dt ) = N iGa,r (Vt , νt ) , f˜ (T |Dt ) = DiT (Qt ) .
(8.25) (8.26) (8.27)
Therefore, the posteriors at time t − 1 are chosen to have the same forms, (8.25)– (8.27), with shaping parameters Vt−1 and νt−1 for (8.26), Qt−1 for (8.27), and αt−1 for (8.25) (see Section 7.5.2 for reasoning). From (8.19)–(8.22), the posterior shaping parameters are therefore Vt = Vt−1 +
c "
(i) (i) l i,t ϕt ϕt ,
(8.28)
i=1
νt = νt−1 + 1,
Qt = Qt−1 + lt l t−1 , βt ∝ αt−1 ◦ exp ln T t lt .
(8.29) (8.30)
Note, also, that the parameter priors, f (lt ), f (a, r) and f (T ), are chosen to have the same form as (8.25)–(8.27) (see Step 1). Step 5: Formulate necessary VB-moments −1 , ln rt , t , r The necessary VB-moments in (8.24) and (8.28)–(8.30) are at , aa t ln T t , lt and lt−1 . From Appendices A.3, A.7 and A.8, these are as follows:
−1 at = Va1,t Vaa,t , (8.31)
−1 −1 −1 t = Vaa,t + Va1,t Vaa,t Va1,t Vaa,t , aa −1 = ν λ−1 , r t t t rt = − ln 2 − ψΓ (νt − 1) + ln |λt | , ln
ln ti,j = ψΓ (qi,j,t ) − ψΓ 1c,1 Qt 1c,1 , t
lt = αt , l t−1 = βt ,
(8.32)
186
8 The Mixture-based Extension of the AR Model (MEAR)
where λt , Va1,t and Vaa,t are functions of matrix Vt , as defined by (6.24). qi,j,t denotes the i,j-th element of matrix Qt . Step 6: Reduce the VB-equations As in Sections 7.5.2 and 7.5.3, trivial substitutions for expectations of the labels can be made. However, no significant reduction of the VB-equations was found. The VB-equations are therefore (8.24) and (8.28)–(8.32). Step 7: Run the IVB algorithm The IVB algorithm is iterated on the full set of VB-equations from Step 6. We choose to initialize the IVB algorithm by setting the shaping parameters, αt , of f˜ (lt |Dt ). A convenient choice is the vector of terminal values from the previous step. Step 8: Report the VB-marginals The VB-marginals for the MEAR model parameters, i.e. f˜ (a, r|Dt ) (8.26) and f˜ (T |Dt ) (8.27), are reported via their shaping parameters, Vt , νt and Qt respectively. There may also be interest in inferring the hidden field, lt , via f˜ (lt |Dt ) (8.23), with shaping parameter αt .
8.4 Related Distributional Approximations for MEAR 8.4.1 The Quasi-Bayes (QB) Approximation The Quasi-Bayes (QB) approximation was defined in Section 3.4.3 as a special case of the Restricted VB (RVB) approximation. Recall, from Remark 3.3, that we are required to fix all but one of the VB-marginals in order to avoid IVB iterations at each time step. The true marginal, f (lt |Dt ), is available, and so this will be used as one restriction. Furthermore, as discussed in Section 6.4.1, we also wish to fix distributions of quantities that are not propagated to the next time step. For this reason, we will also fix f˜ (lt−1 |Dt ) in the MEAR model. Using these considerations, we now follow the steps of the VB method, adapting them as in Section 3.4.3.1 to this restricted (closed-form) case. Step 1: The MEAR model is defined by (8.13) and (8.14). Step 2: The parameters are partitioned as before. Step 3: We restrict the VB-marginals on the label field as follows: f˜ (lt |Dt ) ≡ f (lt |Dt ) " c ∝ f (dt, , lt |θ, lt−1 = c (i) , Dt−1 ) ×
(8.33)
Θ ∗ i=1
×f˜ (θ|Dt−1 ) f˜ (lt−1 = c (i) |Dt−1 ) dθ, f˜ (lt−1 |Dt ) = f˜ (lt−1 |Dt−1 ) ,
(8.34)
8.4 Related Distributional Approximations for MEAR
187
where θ = {a, r, T }. (8.33) is the exact marginal of the one-step update, being, therefore, a QB restriction (Section 7.3.1). (8.34) is the non-smoothing restriction which we introduced in Section 7.3.1. Using (8.13), (8.14) and (8.23), and summing over lt−1 , we obtain f (dt , θ, lt |Dt−1 ) =
c "
αi,t−1 Diti (qi,t−1 + lt ) ×
i=1
×
c
(8.35)
li,t (i) (i) N iGa,r Vt−1 + ϕt ϕt , νt + 1 .
i=1
Here, ti is the ith column of matrix T , and qi,t−1 is the ith column of matrix Qt−1 . Finally, we marginalize (8.35) over θ, to yield f (lt |Dt ) = Mult (αt ) ∝
c
l
i,t αi,t ,
(8.36)
i=1
αi,t ∝
c "
(i) (i) αj,t−1 ζtj (qj,t−1 + c (i)) ζa,r Vt−1 + ϕt ϕt , νt−1 + 1 .
j=1
ζtj (·) denotes the normalizing constant of the Dirichlet distribution (A.49), and ζa,r (·) denotes normalizing constant of the Normal-inverse-Gamma (N iG) distribution (6.23). Using (8.36), we can immediately write down the required first moments of (8.33) and (8.34), via (A.47): l t−1 = αt−1 . lt = αt .
(8.37) (8.38)
The remaining VB-marginal, i.e. f˜ (θ|Dt ), is the same as the one derived via the VB method in Section 8.3, i.e. (8.26) and (8.27). Step 4: The standard forms of the VB-marginals, (8.26) and (8.27), are as before. However, their shaping parameters, (8.28)–(8.30), are now available in closed form, using (8.37) and (8.38), in place of (8.32). Steps 5–7: Do not arise. Step 8: Identical to Section 8.3. 8.4.2 The Viterbi-Like (VL) Approximation From (8.28), note that Vt is updated via c dyads, by the elements, li,t , weighted 2 of lt . Dyadic updates (8.28) are expensive—O (1 + m) as discussed in Section 8.1.2—especially when the extended regression vectors, ϕt , are long. In situations where one weight, l i,t , is dominant, it may be unnecessary to perform dyadic updates using the remaining c − 1 dyads. This motivates the following ad hoc proposition.
188
8 The Mixture-based Extension of the AR Model (MEAR)
Proposition 8.1 (Viterbi-Like (VL) Approximation). Further simplification of the QB-approximation (Section 8.4.1) may be achieved using an even coarser approximation of the label-field distribution, namely, certainty equivalence (Section 3.5.1): f˜ (lt |Dt ) = δ lt − lt . (8.39) This replaces (8.33). Here, lt is the MAP estimate from (8.36); i.e. lt = arg max f (lt |Dt ) . lt
(8.40)
This corresponds to the choice of one ‘active’ component with index ıt ∈ {1, . . . , c}, ıt ) . The idea is related to the Viterbi algorithm [153]. Substituting such that lt = c ( (8.40) into (8.28)–(8.30), we obtain the shaping parameters of (8.26) and (8.27), as follows: (ı ) (ı ) Vt = Vt−1 + ϕt t ϕt t , νt = νt−1 + 1,
(8.41) (8.42)
Qt = Qt−1 + αt αt−1 .
(8.43)
Note that the update of Vt (8.41) now involves only one outer product. Weights αt (8.38) have already been generated for evaluation of (8.39). Hence, the coarser VL approximation (8.40) is not used in the Qt update (8.29). Instead, Qt is updated using (8.38).
8.5 Computational Issues We have now introduced three approximation methods for Bayesian recursive inference of the MEAR model: 1. the Variational Bayes (VB) inference (Section 8.3); 2. the Quasi-Bayes (QB) inference (Section 8.4.1); 3. the Viterbi-Like (VL) inference (Proposition 8.1). The computational flow is the same for all algorithms, involving updates of statistics Vt , νt and Qt . The recursive scheme for computation of (8.28)–(8.30) via the VBapproximation is displayed in Fig. 8.2. The computational scheme for the QB algorithm is, in principle, the same, but the weights lt are evaluated using (8.36) and (8.37), in place of (8.24) and (8.30), respectively. In the VL approximation, only one of the parallel updates is active in each time. Hence, the main points of interest in comparing the computational loads of the various schemes are (i) the weight evaluation, and (ii) the dyadic updates. Weights are computed via (8.24) for the VB scheme, and by (8.36) for the QB and VL schemes. The operations required for this step are as follows:
(1)
ϕ
dt
(1)
ϕt
(1)
O.P.
(2)
ϕ(2)
ϕt
.. .
ϕt ϕt
(2) (2) ϕt ϕt
βt ×
(8.24)
α2,t ×
.. .
m u x
αt
O.P.
1 1−z −1
1 1−z −1
Qt
Vt
.. .
.. .
O.P.
(8.30)
α1,t
(c) (c) ϕt ϕt
(8.24)
αc,t ×
Fig. 8.2. The recursive summed-dyad signal flowgraph for VB inference of the MEAR model. ‘O.P.’ denotes outer product (dyad). ‘mux’ denotes the assembly of the vector, αt , from its elements αi,t . Accumulators are represented by 1−z1−1 . Equation references are given as appropriate. The transmission of the VB-moments is not shown, for clarity.
8.5 Computational Issues
ϕ
(c) ϕt (c)
O.P.
(1)
(8.24)
189
190
8 The Mixture-based Extension of the AR Model (MEAR)
VB: (i) evaluation of at , rt , Vaa,t (8.31)–(8.32); (ii) evaluation of the c terms in (8.24). All operations in (i), and each element-wise operation in (ii), are 2 of complexity O (1 + m) . Moreover, these evaluations must be repeated for each step of the IVB algorithm. QB: (i) c-fold update of Vt in (8.36), and (ii) c-fold evaluation of the corresponding normalizing constant, ζa,r , in (8.36). The computational complexity of each normalization is O (1 + m). VL: the same computational load as QB, plus one maximization (8.40). The weight updates can be done in parallel for each candidate transformation, in all three cases. Update of Vt is done via dyadic updates (8.28) for the VB and QB schemes, and (8.41) for the VL scheme. We assume that each dyadic update is undertaken using the LD decomposition (Section 8.1.2), where the required update algorithm [42] performs one dyadic update: VB: c-fold dyadic update of Vt , QB: c-fold dyadic update of Vt , VL: one dyadic update of Vt . The dyadic updates must be done sequentially for each component. The overall computational complexity of the three schemes is summarized in Table 8.1. Table 8.1. Computational complexity of Bayesian recursive inference schemes for the MEAR model. Scheme Computational complexity for one time step VB QB VL
n (2c + 1) × O (1 + m)2
2c × O (1 + m)2 + c × O (1 + m)
(c + 1) × O (1 + m)2 + c × O (1 + m) n is the number of iterations of the IVB algorithm (for VB only) c is the number of components in the MEAR model m is the dimension of the regressor
The main drawback of the VB algorithm is that the number of iterations, n, of the IVB algorithm at each step, t, is unknown a priori. Remark 8.2 (Spanning property). Fig. 8.2 assumes linear independence between the candidate filter-bank transformations (8.12), i.e. between the extended regressors, (i) ϕt (8.4), generated by each candidate filter. In this general case, (8.28) is a rank-c update of the extended information matrix. Special cases may arise, depending on (i) the filter-bank candidates that are chosen a priori. For example, if the ϕt are all linearly-dependent, then the update in (8.28) is rank-1. In this case, too, only one outer product operation is required, allowing the c parallel paths in Fig. 8.2 to be
8.6 The MEAR Model with Time-Variant Parameters
191
collapsed, and reducing the number of dyadic updates for update of Vt to one. In all cases though, the convex combination of filter-dependent dyads (8.28) resides in the simplex whose c vertices are these filter-dependent dyads. Thus, the algorithm allows exploration of a space, ϕ, of EAR transformations (8.7) whose implied dyadic updates (8.10) are elements of this simplex. This will be known as the spanning property of MEAR identification.
8.6 The MEAR Model with Time-Variant Parameters Inference of time-variant parameters was the subject of Chapter 7. We now apply these ideas to the MEAR model with time-variant parameters. The MEAR model, (8.13) and (8.14), is adapted as follows: / / / (i) / (i) (i) f dt |at , rt , ψt , ϕ(i) = /Jt (dt )/ Nϕ(i) (dt ) −at ψt , rt , 1
f (lt |Tt , lt−1 ) = M ult (Tt lt−1 ) . The VB-approximation of Section 8.3 achieved conjugate recursive updating of timeinvariant MEAR parameters. Recall, from Section 7.3.3, that the forgetting operator preserves these conjugate updates in the case of time-variant parameters, if the alternative parameter distribution is chosen from the CDEF family (Proposition 6.1). From (8.26) and (8.27), we note that a and r are conditionally independent of T . Hence, we choose distinct, known, time-invariant forgetting factors, φa,r and φT , respectively. Then the parameter predictor of at , rt and Tt at time t − 1 is chosen as
f (at , rt , Tt |Dt−1 ) = f (at−1 , rt−1 |Vt−1 , νt−1 )at ,rt
× f (Tt−1 |Qt−1 )Tt
φT
φa,r
f Tt |Q
f at , rt |V , ν
1−φT
,
1−φa,r
×
(8.44)
where statistics V , ν and Q of the alternative distributions, f at , rt |V , ν and
f Tt |Q , are chosen by the designer. Under this choice, the VB method for inference of time-variant MEAR parameters is the same as the VB method for time-invariant parameters (Section 8.3), under the substitutions a → at , r → rt , T → Tt . Step 4 is modified as follows. 4. The distribution f (at , rt , Tt |Dt−1 ) is chosen in the form (8.44), with known time-invariant forgetting factors, φa,r and φT . This is conjugate to the VBobservation models, (8.19) and (8.20). Therefore, the posterior distributions have
192
8 The Mixture-based Extension of the AR Model (MEAR)
the same form as before, i.e. (8.26) and (8.27), with modified shaping parameters, as follows:
Vt = φa,r Vt−1 +
c "
(i) (i) + (1 − φa,r ) V , l i,t ϕt ϕt
(8.45)
i=1
νt = φa,r νt−1 + 1 + (1 − φa,r ) ν,
(8.46)
Qt = φT Qt−1 + lt l t−1 + (1 − φT ) Q.
(8.47)
These are the required updates for the VB and QB inference schemes with forgetting. For the VL scheme (Proposition 8.1), the sum in (8.45) is reduced to one dyad, as in (8.41).
8.7 Application: Inference of an AR Model Robust to Outliers One of the main limitations of the AR model is the sensitivity of parameter estimates to outliers in the measurements. In this Section, we analyze the problem of inference of a time-invariant, univariate (scalar) AR process (6.14) corrupted by isolated outliers. An isolated outlier is not modelled by the AR observation model because the outlier-affected observed value does not take part in the future regression. Instead, the process is autoregressive in internal (i.e. not directly measured) variable zt , where (8.48) zt = a ψt + σet , ψt = [zt−1 , . . . , zt−m ] . Here, r = σ 2 . The internal variable, zt , is observed via d t = zt + ω t ,
(8.49)
where ωt denotes a possible outlier at time t. For an isolated outlier, it holds that Pr [ωt±i = 0|ωt = 0] = 1, i = 1, . . . , m.
(8.50)
The AR model is identified via f (a, r|Dt ) (i.e. not via f (a, r|Zt )) and so the outlier has influence if and only if it enters the extended regressor, ϕt (6.25). 8.7.1 Design of the Filter-bank Since ϕt is of finite length, m + 1, and since the outliers are isolated, a finite number of mutually exclusive cases can be defined. Each of these cases can be expressed via an EAR model (8.8) and combined together using the MEAR approach, as follows. 1. None of the values in ϕt is affected by an outlier; i.e. dt−i = zt−i , i = 0, . . . m. ϕ(1) is then the identity transformation: (1)
ϕ(1) : ϕt
= [dt , . . . , dt−m ] .
(8.51)
8.7 Application: Inference of an AR Model Robust to Outliers
193
2. The observed value, dt , is affected by an outlier; from (8.50), all delayed values are unaffected; i.e. dt−i = zt−i , i = 1, . . . m. For convenience, ωt can be expressed as ωt = ht σet , where ht is an unknown multiplier of the realized AR residual (6.14). From (6.14), (8.48) and (8.49), dt = a ψt + (1 + ht ) σet . Dividing across by (1 + ht ) reveals the appropriate EAR transformation (8.4): (2)
ϕ(2) : ϕt
=
1 [dt , . . . , dt−m ] . 1 + ht
(8.52)
1 ϕ(2) is parameterized by ht , with constant Jacobian, J2 = 1+h (8.8). t 3. The k-steps-delayed observation, dt−k , is affected by an outlier, k ∈ {1, . . . , m}; in this case, the known transformation should replace this value by an interpolant, zˆt−k , which is known at time t. The set of transformations for each k is then (2+k) = [dt , . . . , zˆt−k , . . . dt−m ] . (8.53) ϕ(2+k) : ϕt
ϕ(2+k) is parameterized by zˆt−k , with Jacobian J2+k = 1. We have described an exhaustive set of c = m + 2 filters, ϕ(i) , transforming the (i) observed data, dt , . . . , dt−m , into EAR regressors, ϕt , for each EAR model in the filter-bank (8.12). Parameters ht and zˆt−k must be chosen. We choose ht to be a known fixed ht = h. Alternatively, if the variance of outliers is known to vary significantly, we can split ϕ(2) into u > 1 candidates with respective fixed parameters h(1) < h(2) < . . . < h(u) . Next, zˆt−k is chosen as the k-steps-delayed value of the following causal reconstruction: zˆt =
c "
Ef (zt |lt =c (j)) [zt ] f˜ (lt = c (j) |Dt , {ϕt }c )
j=1
⎛
= dt ⎝
c "
(8.54)
⎞ (2)
αj,t ⎠ + α2,t a ˆt−1 ψt J2−1 .
(8.55)
j=1,j=2
Here, we are using (6.28), (8.52) and (8.24), and the fact that zt = dt for all transformations except ϕ(2) . 8.7.2 Simulation Study A second-order (i.e. m = 2) stable univariate AR process was simulated, with parameters a = [1.85, −0.95] and r = σ 2 = 0.01. A random outlier was generated at every 30th sample. The total number of samples was t = 100. A segment of the simulated data (t = 55, . . . , 100) is displayed in Fig. 8.3 (dotted line), along with the corrupted data (dots) and the reconstruction (solid line) (8.54). Two outliers occurred during the displayed period: a ‘small’ outlier at t = 60 and a ‘big’ outlier
8 The Mixture-based Extension of the AR Model (MEAR) Signals and
Component weights
reconstructions
(first outlier)
0
- 0.5 -1.5
a 4,t
-1 60
80
70
90
a 1,t
0 1
a 2,t
a 2,t
0.5
a 3,t
VB
1
1
0 1
a 3,t
a 1,t
2 1.5
0 1 0 55
0
- 0.5
a 4,t
-1 80
90
a 1,t
1
a 2,t
0 1 0 1
a 3,t
a 2,t
0.5
a 3,t
QB
1
0 1 0 55
60
t
0
- 0.5
a 4,t
-1 80 t
70
0 1 0 85
90
90
a 1,t
1
a 2,t
0 1 0 1 0 1 0 55
60
65 t
95
100
95
100
95
100
1 0 1 0 1 0 1 0 85
90 t
a 3,t
a 2,t
0.5
a 3,t
VL
1
70
65
a 4,t
a 1,t
2
60
0 1
t
1.5
- 1.5
0 1
t
a 4,t
a 1,t
1.5
70
70
1
t
2
60
65
60
t
-1.5
Component weights (second outlier)
a 4,t
194
70
1 0 1 0 1 0 1 0 85
90 t
Fig. 8.3. Reconstruction of an AR(2) process corrupted by isolated outliers. Results for VB, QB, and VL inference schemes, respectively, are shown. There are outliers at t = 60 and t = 90. In the left column, the uncorrupted AR signal is displayed via the dotted line, the corrupted data, dt , are displayed as dots, and the reconstructed signals are displayed by the full line. Note that the reconstructed signals differ from the uncorrupted AR signal only at the second outlier, t = 90.
8.7 Application: Inference of an AR Model Robust to Outliers
195
at t = 90. The filter-bank of m + 2 = 4 transformations—ϕ(1) (8.51), ϕ(2) (8.52) with ht = h = 10, ϕ(3) and ϕ(4) (8.53)—was used for identification of the AR parameters, a and r. The prior distribution was chosen as N iG (V0 , ν0 ), with ⎡ ⎤ 0.1 0 0 V0 = ⎣ 0 0.001 0 ⎦ , ν0 = 1. 0 0 0.001 This choice of prior implies point estimates with a ˆ0 = [0, 0], and rˆ0 = 0.1. When an outlier occurs, all candidate filters are sequentially used, as seen in Fig. 8.3 (middle and right columns). Thus, the outlier is removed from the shaping parameters (8.28)–(8.30) very effectively. We note that all considered inference schemes—i.e. VB, QB, and VL—performed well when the first outlier occurred. The estimated weights and reconstructed values are almost identical across all three schemes. However, when the second outlier occurred, the VB scheme identified the weights more accurately than the QB and VL schemes. The terminal—i.e. t = t—Highest Posterior Density (HPD) region (Definition 2.1) of a (A.15) is illustrated (via the mean value and 2 standard deviations ellipse) for the various identification methods in the left (overall performance) and right (detail) of Fig 8.4. In the left diagram, the scenarios are (i) AR identification of the - 0.3
simulated value AR, corrupted
- 0.4 - 0.5
- 0.85 a2
- 0.6 a2
simulated value AR, uncorrupted MEAR, VB MEAR, QB MEAR, VL
-0.8
- 0.7 - 0.8
-0.9
-0.95
- 0.9 -1 1.2
-1 1.3
1.4
1.5
1.6 a1
1.7
1.8
1.9
1.7
1.75
1.8 a1
1.85
1.9
Fig. 8.4. Identification of an AR(2) process corrupted by isolated outliers. Left: comparison of the terminal moments, t = 100, of the posterior distribution of a. Right: detail of left, boxed region. HPD regions for QB and VL are very close together.
AR process corrupted by outliers; (ii) AR identification of the AR process uncorrupted by outliers (boxed). In the right diagram, we zoom in on the boxed area surrounding (ii) above, revealing the three MEAR-based identification scenarios: (iii) MEAR identification using the VB approximation; (iv) MEAR identification using the QB approximation; (v) MEAR estimation using the Viterbi-like (VL) approximation. Impressively, the MEAR-based strategies perform almost as well as the AR strategy with uncorrupted data, which is displayed via the full line in the right diagram. The posterior uncertainty in the estimate of a appears, therefore, to be due to the AR process itself, with all deleterious effects of the outlier process removed.
196
8 The Mixture-based Extension of the AR Model (MEAR)
8.8 Application: Inference of an AR Model Robust to Burst Noise The previous example relied on outliers being isolated (8.50), justifying the assump(i) tion that there is at most one outlier in each AR extended regressor, ϕt . In such a case, the additive decomposition (8.49) allowed successful MEAR modelling of dt via a finite number (m + 2) of candidates. A burst noise scenario requires more than one outlier to be considered in the regressor. We transform the underlying AR model (6.14) into state-space form [108], as follows: zt+1 = Azt + bσet , ⎡ ⎡ ⎤ ⎤ −a1 −a2 · · · −am 1 ⎢ 1 ⎢0⎥ ⎥ 0 · · · 0 ⎢ ⎢ ⎥ ⎥ A=⎢ . . , b = ⎢.⎥ , . . . . . ... ⎥ ⎣ .. ⎣ .. ⎦ ⎦ ··· 1
0 such that A ∈ R modelled as
m×m
and b ∈ R
m×1
0
(8.56)
(8.57)
0
. The observation process with burst noise is
dt = c zt + ht σξt ,
(8.58)
where c = [1, 0, . . . , 0] ∈ Rm×1 , and ξt is distributed as N (0, 1), independent of et . The term ht σ denotes the time-dependent standard deviation of the noise, which is assumed strictly positive during any burst, and zero otherwise. Note that (8.56) and (8.57) imply identically the same AR model as in the previous example (8.48). The only modelling difference is in the observation process (8.58), compared to (8.49) and (8.50). 8.8.1 Design of the Filter-Bank We identify a finite number of mutually exclusive scenarios, each of which can be expressed using an EAR model: 1. The AR process is observed without distortion; i.e. ht = ht−1 = . . . = ht−m = (1) 0. Formally, ϕ(1) : ϕt = [dt , . . . , dt−m ] . 2. The measurements are all affected by constant-deviation burst noise; i.e. ht = ht−1 = . . . = ht−m = h. The state-space model, (8.56) and (8.58), is now defined by the joint distribution Azt−1 bb 0 , r . (8.59) f (zt , dt |a, r, zt−1 , h) = Nzt ,dt c z t 0 h2 (8.59) cannot be directly modelled as an EAR observation process because it contains unobserved state vector zt . Using standard Kalman filter (KF) theory [42, 108], as reviewed in Section 7.4, we can multiply terms together of the kind in (8.59), and then integrate over the unobserved trajectory—i.e. over {z1 , . . . , zt }—to obtain the direct observation model:
8.8 Application: Inference of an AR Model Robust to Burst Noise
f (dt |a, r, Dt−1 , h) = Ndt (a zt , rqt ) .
197
(8.60)
The moments in (8.60) are defined recursively as follows: qt = h2 + c St−1 c, Wt = St−1 −
qt−1
(8.61)
(St−1 c) (St−1 c) ,
−2 zt = Az Wt c (dt − c Az t−1 + h t−1 ) ,
St = bb + AWt A .
(8.62) (8.63) (8.64)
(8.60) constitutes a valid EAR model (8.8) if zt and qt are independent of the unknown AR parameters, a and r. From (8.61) and (8.63), however, both qt and zt are functions of A (a) (8.57). In order to obtain a valid EAR model, we replace A (a) in (8.63) and (8.64) by its expected value, A t−1 = A (a t−1 ), using (8.31). Then, (8.60) is a valid EAR model defined by the set of transformations (2)
ϕ(2) : ϕt
1 dt , zt , =√ qt
(8.65)
−1
with time-variant Jacobian, Jt = qt 2 (Dt−1 ), evaluated recursively using (8.61). ϕ(2) is parameterized by unknown h, each setting of which defines a dis(2) tinct candidate transformation. Note that ϕt in (8.65) depends on a t−1 (8.31). Parameter updates are therefore correlated with previous estimates, a t−1 . 3. Remaining cases; cases 1. and 2. above do not address the situation where hk is not constant on a regression interval k ∈ {t − m, . . . , t}. Complete mod elling for such cases is prohibitive, since [ht−m , . . . , ht ] exists in a continuous space. Nevertheless, it is anticipated that such cases might be accommodated via a weighted combination of the two cases above. The final step is to define candidates to represent ϕ(2) (ϕ(1) is trivial). One candidate may be chosen for ϕ(2) if the variance of burst noise is reasonably well known a priori. In other cases, we can partition ϕ(2) with respect to intervals of h. Candidates are chosen as one element from each interval. 8.8.2 Simulation Study A non-stationary AR(2) process was simulated, with a1,t in the interval from −0.98 to −1.8 (as displayed in Fig. 8.5 (top-right)), a2,t = a2 = 0.98, rt = r = 0.01, and t = 200. Realizations are displayed in Fig. 8.5 (top-left, solid line). For t < 95, a1,t is increasing, corresponding to faster signal variations (i.e. increasing bandwidth). Thereafter, a1,t decreases, yielding slower variations. These variations of a1,t do not influence the absolute value of the complex poles of the system, but only their polar angle. The process was corrupted by two noise bursts (samples 50–80 and 130–180), with parameters h = 8 and h = 6 respectively. Realizations of the burst noise process imposed on the simulated signal are displayed in the second row of Fig. 8.5.
8 The Mixture-based Extension of the AR Model (MEAR)
2
a 1,t
1
1 0
0
100
200
0 -1
- 0.5 -1 -1.5 -2 1.5 1 0.5 0
a2,t
Uncorrupted AR
Simulated parameters and their inferences
Component weights
Signals and reconstructions
a1,t
198
-2 0
50
100
150
200
0
50
100
150
200
0
100
200
0
100
200
1 0 -1 -2
0
50
100
150
200 a 1,t
1
a 2,t
0 -1
a 3,t
QB inference
2
-2 0
50
100
150
200 a 1,t
1
a 2,t
0 -1
a 3,t
VL inference
2
-2 0
50
100 t
150
200
0
a1,t
- 0.5 -1 -1.5 -2
a2,t
0 1 0
100
200
1 a1,t
-2
0 1
0 1 0 1 0
0
100
0 1 0 1 0
100 t
0
100
200
0
100
200
0
100
200
0
100
200
0
100
200
0
100 t
200
1.5 1 0.5 0
200
1
0
1.5 1 0.5 0
- 0.5 -1 - 1.5 -2
a2,t
a 2,t
0 -1
1
a1,t
a 1,t
1
a 3,t
VB inference
2
- 0.5 -1 - 1.5 -2
a2,t
Corrupted data
2
200
1.5 1 0.5 0
Fig. 8.5. Reconstruction and identification of a non-stationary AR(2) process corrupted by burst noise, using the KF variant of the MEAR model. In the final column, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds.
8.8 Application: Inference of an AR Model Robust to Burst Noise
1
1 0
0
100
200
a1,t
2
a 1,t
0 -1
a2,t
Uncorrupted AR
Simulated parameters and their inferences -0.5 -1 -1.5 -2 0 100 200 1.5 1 0.5 0 0 100 200
Component weights
Signals and reconstructions
199
-2 0
50
100
150
200
1 0 -1 -2 200
-1 -2 0
50
100
150
200
a 1,t a 2,t
1
a 3,t
0 -1
a 4,t
QB inference
2
-2 0
50
100
150
200
a 1,t a 2,t
1
a 3,t
0 -1
a 4,t
VL inference
2
-2 0
50
100 t
150
200
0 1
a1,t
a 1,t a 2,t a 3,t
0
a 4,t
VB inference
1
1
0 1 0 1 0
0
100
200
1 0 1 0 1 0 1 0
0
100
0 1 0 1 0 1 0
100 t
200
0
100
200
0
100
200
0
100
200
0
100
200
0
100
200
0
100 t
200
1.5 1 0.5 0
- 0.5 -1 - 1.5 -2
200
1
0
- 0.5 -1 - 1.5 -2
a2,t
150
a1,t
100
a2,t
50
a1,t
0 2
1.5 1 0.5 0
- 0.5 -1 - 1.5 -2
a2,t
Corrupted data
2
1.5 1 0.5 0
Fig. 8.6. Reconstruction and identification of a non-stationary AR(2) process corrupted by burst noise, using the KF+LPF variant of the MEAR model. In the final column, full lines denote simulated values of parameters, dashed lines denote posterior expected values, and dotted lines denote uncertainty bounds.
200
8 The Mixture-based Extension of the AR Model (MEAR)
The process was inferred using r = 3 filter candidates; i.e. the unity transformation, ϕ(1) , along with ϕ(2) with h = 5, and ϕ(2) with h = 10. Parameter inferences are displayed in the right column of Fig. 8.5, as follows: in the first row, inference using the AR model with uncorrupted data; in the third, fourth and fifth rows, the VB, QB and VL parameter inferences with corrupted data. Specifically, the 95% HPD region, via (A.18) and (A.21), of the marginal Student’s t-distribution of a1,t and a2,t respectively, is displayed in each case. The process was identified using forgetting factors (8.44) φa,r = 0.92 and φT = 0.9. The non-committal, stationary, alternative N iG distribution, f (a, r) = N iGa,r V , ν was chosen. Furthermore, the matrix parameter, Q, of the stationary, alternative Di distribution, f (T ) (8.44), was chosen to be diagonally dominant with ones on the diagonal. This discourages frequent transitions between filters. Note that all methods (VB, QB and VL) achieved robust identification of the (i) process parameters during the first burst. As already noted, zˆt , i = 2, 3 (which denotes the reconstructed state vector (8.63) with respect to the ith filter), is correlated with a t−1 , which may undermine the tracking of time-variant AR parameters, at . In this case, each Kalman component predicts observations poorly, and receives low weights, α2,t and α3,t (8.24), in (8.45). This means that the first component— which does not pre-process the data—has a significant weight, α1,t . Clearly then, the Kalman components have not spanned the space of necessary preprocessing transformations well (Remark 8.2), and need to be supplemented. Extra filters can be ‘plugged in’ in a naïve manner (in the sense that they may improve the spanning of the pre-processing space, but should simply be rejected, via (8.24), if poorly designed). During the second burst (Fig. 8.5), the process is slowing down. Therefore, we extend the bank of KF filters with a simple arithmetic mean Low-Pass Filter (LPF) on the observed regressors: (3)
ϕ(3) : ϕt
=
1 (1) (1) (1) ϕt + ϕt−1 + ϕt−2 . 3
(8.66)
(8.65) and (8.66) yield EAR models with the same AR parameterization, and so they can be used together in the MEAR filter-bank. Reconstructed values for the KF variant above are derived from (8.54), as follows: " zt = α1,t dt + αi,t at z (8.67) i,t , i=2,3 α
using (8.31). For the KF+LPF variant, the term 34,t (dt + dt−1 + dt−2 ) is added to (8.67), where α4,t is the estimated weight of the LPF component (8.24). Identification and reconstruction of the process using the KF+LPF filter-bank is displayed in Fig. 8.6, in the same layout as in Fig. 8.5. The distinction is most clearly seen in the final column of each. During the second burst, the added LPF filter received high weights, α4,t (see Fig. 8.6, middle column). Hence, identification of the parameter at is improved during the second burst.
8.9 Conclusion
201
8.8.3 Application in Speech Reconstruction The MEAR filter-bank for the burst noise case (KF variant) was applied in the reconstruction of speech [154]. A MEAR model with 4 components was used, involving ϕ(1) , and ϕ(2) with three different choices of h, specifically h ∈ {3, 6, 10}. The speech was modelled as AR with order m = 8 (6.14) [155]. The known forgetting factors (8.44) were chosen as φa,r = φT = 0.95. Once again, a diagonally-dominant Q was chosen for f (T ). During periods of silence in speech, statistics (8.45) are effectively not updated, creating difficulties for adaptive identification. Therefore, we use an informative stationary alternative distribution, f (a, r), of the N iG type (8.26) for the AR parameters in (8.44). To elicit an appropriate distribution, we identify the time-invariant alternative statistics, V and ν, using 1800 samples of unvoiced speech. f (a, r) was then flattened to reduce ν from 1800 to 2. This choice moderately influences the accumulating statistics at each step, via (8.45). Specifically, after a long period of silence, the influence of data in (8.45) becomes negligible, and Vt is reduced to V . Three sections of the bbcnews.wav speech file, sampled at 11 kHz, were corrupted by additive noise. Since we are particularly interested in performance in non-stationary epochs, we have considered three transitional cases: (i) a voiced-tounvoiced transition corrupted by zero-mean, white, Gaussian noise, with a realized Signal-to-Noise Ratio (SNR) of −1 dB during the burst; (ii) an unvoiced-to-voiced transition corrupted by zero-mean, white, uniform noise at −2 dB; and (iii) a silenceto-unvoiced transition corrupted by a click of type 0.25 cos (3t) exp (−0.3t), superimposed on the silence period. Reconstructed values using the VB, QB and VL inference methods respectively are displayed in Fig. 8.7. All three methods successfully suppressed the burst in the first two cases, and the click in the third case. However, the QB and VL methods also had the deteriorious effect of suppressing the unvoiced speech, a problem which was not exhibited by the VB inference.
8.9 Conclusion This is not the first time that we have studied mixtures of AR models and cracked the problem of loss of conjugacy using the VB-approximation. In Chapter 6, a finite mixture of AR components was recognized as belonging to the DEFH family. Therefore, a VB-observation model belonging to the DEF family could be found and effective Bayesian recursive identification of the AR parameters of every component was achieved. In this Chapter, we have proposed a different mixture-based extension of the basic AR model. The same AR parameters appear in each component, and so each component models a different possible non-linear degradation of that AR process. Once again, the VB-approximation provided a recursive identification scheme, which we summarized with the signal flowgraph in Fig. 8.2. A consequence of the shared AR parameterization of each component was that the statistics were updated by c
0.4 0.2 0 -0.2 -0.4
8 The Mixture-based Extension of the AR Model (MEAR) Speech segment 1
VB reconstruction
0.4 0.2 0 -0.2 -0.4
QB reconstruction
0.4 0.2 0 -0.2 -0.4 0.4 0.2 0 -0.2 -0.4
0.5
Speech segment 3
Speech segment 2 0.1
0
0 -0.1
-0.5
-0.2
0.4 0.2 0 -0.2 -0.4
VL reconstruction
Corrupted
Uncorrupted
202
0.5 0.1
0
0 -0.1
-0.5
-0.2 0.5 0.1
0
0 -0.1
-0.5
-0.2 0.5 0.1
0
0 -0.1
-0.5
-0.2 0.5 0.1
0
0 -0.1
-0.5 4200
t
4400
6050
6150 t
6250
-0.2
5550
t
5600
5650
Fig. 8.7. Reconstruction of three sections of the bbcnews.wav speech file. In the second row, dash-dotted vertical lines delimit the beginning and end of each burst.
dyads, instead of the usual one-dyad updates characteristic of AR mixtures (Section 6.5). This is most readily appreciated by comparing (8.28) with (6.58). Each component of the MEAR model can propose a different possible preprocessing of the data in order to recover the underlying AR process. In the applications we considered in this Chapter, careful modelling of the additive noise corrupting our AR process allowed the pre-processing filter-bank to be designed. In the burst noise example, our filter-bank design was not exhaustive, and so we ‘plugged and played’ an additional pre-processing filter—a low-pass filter in this case—in order to improve the reconstruction. This ad hoc design of filters is an attractive feature
8.9 Conclusion
203
of the MEAR model. The VB inference scheme simply rejects unsuccessful proposals by assigning them low inferred component weights. Of course, we want to do all this on-line. A potentially worrying overhead of the VB-approximation is the need to iterate the IVB algorithm to convergence at each time step (Section 6.3.4). Therefore, we derived Restricted VB (RVB) approximate inference schemes (QB and VL) which yielded closed-form recursions. These achieved significant speed-ups (Table 8.1) without any serious reduction in the quality of identification of the underlying AR parameters.
9 Concluding Remarks
The Variational Bayes (VB) theorem seems far removed from the concerns of the signal processing expert. It proposes non-unique optimal approximations for parametric distributions. Our principal purpose in this book has been to build a bridge between this theory and the practical concerns of designing signal processing algorithms. That bridge is the VB method. It comprises eight well-defined and feasible steps for generating a distributional approximation for a designer’s model. Recall that this VB method achieves something quite ambitious in an intriguingly simple way: it generates a parametric, free-form, optimized distributional approximation in a deterministic way. In general, free-form optimization should be a difficult task. However, it becomes remarkably simple under the fortuitous combination of assumptions demanded by the VB theorem: (i) conditional independence between partitioned parameters, and (ii) minimization of a particular choice of Kullback-Leibler divergence (KLDVB ).
9.1 The VB Method The VB theorem yields a set of VB-marginals expressed in implicit functional form. In general, there is still some way to go in writing down a set of explicit tractable marginals. This has been our concern in designing the steps of the VB method. If the clear requirements of each step are satisfied in turn, then we are guaranteed a tractable VB-approximation. If the requirements cannot be satisfied, then we provide guidelines for how to adapt the underlying model in order to achieve a tractable VBapproximation. The requirements of the VB method mean that it can only be applied successfully in carefully defined signal processing contexts. We isolated three key scenarios where VB-approximations can be generated with ease. We studied these scenarios carefully, and showed how the VB-approximation can address important signal processing concerns such as the design of recursive inference procedures, tractable point estimation, model selection, etc.
206
9 Concluding Remarks
9.2 Contributions of the Work Among the key outputs of this work, we might list the following: 1. Practical signal processing algorithms for matrix decompositions and for recursive identification of stationary and non-stationary processes. 2. Definition of the key VB inference objects needed in order to design recursive inference schemes. In time-invariant parameter inference, this was the VBobservation model. In Bayesian filtering for time-variant parameters, this was supplemented by the VB-parameter predictor. 3. These VB inference objects pointed to the appropriate design of priors for tractable and numerically efficient recursive algorithms. 4. We showed that related distributional approximations—such as Quasi-Bayes (QB)—can be handled as restrictions of the VB-approximation, and are therefore amenable to the VB method. The choice of these restrictions sets the trade-off between optimality and computational efficiency. Of course, this gain in computational efficiency using the VB-approximation comes at a cost, paid for in accuracy. Correlation between the partitioned parameters is only approximated by the shaping parameters of the independent VB-marginals. This approximation may not be good enough when correlation is a key inferential quantity. We examined model types where the VB-approximation is less successful (e.g. the Kalman filter), and this pointed the way to possible model adaptations which could improve the performance of the approximation.
9.3 Current Issues The computational engine at the heart of the VB method is the Iterative VB (IVB) algorithm. It requires an unspecified number of iterations to yield converged VBmoments and shaping parameters. This is a potential concern in time-critical on-line signal processing applications. We examined up to three methods for controlling the number of IVB cycles: (i) Step 6 of the VB method seeks an analytical reduction in the number of VBequations and associated unknowns. On occasion, a full analytical solution has been possible. (ii) Careful choice of initial values. (iii)The Restricted VB (RVB) approximation, where a closed-form approximation is guaranteed. Another concern which we addressed is the non-uniqueness of the KLDVB minimizer, i.e. the non-uniqueness of the VB-approximation in many cases. As a consequence, we must be very careful in our choice of initialization for the IVB algorithm. Considerations based on asymptotics, classical estimation results, etc., can be helpful in choosing reasonable prior intervals for the initializers.
9.4 Future Prospects for the VB Method
207
In the case of the Kalman filter, a tractable VB-approximation was possible but it was inconsistent. The problem was that the necessary VB-moments from Step 5 were, in fact, all first-order moments, and therefore could not capture higher order dependence on data. As already mentioned, this insight pointed the way to possible adaptations of the original model which could circumvent the problem.
9.4 Future Prospects for the VB Method The VB method of approximation does not exist in isolation. In the Restricted VB approximation, we are required to fix all but one of the VB-marginals. How we do this is our choice, but it is, once again, a task of distributional approximation. Hence, subsidiary techniques—such as the Laplace approximation, stochastic sampling, MaxEnt, etc.—can be plugged in at this stage. In turn, the VB method itself can be used to address part of a larger distributional approximation problem. For example, we saw how the VB-marginals could be generated at each time step of a recursive scheme to replace an intractable exact marginal (time update), but the exact Bayesian data updates were not affected. Clearly there are many more opportunities for symbiosis between these distributional approximation methods. The conditional independence assumption is characteristic of the VB-approximation and has probably not been exploited fully in this work. The ability to reduce a joint distribution in many variables to a set of optimized independent distributions— involving only a few parameters each—is a powerful facility in the design of tractable inference algorithms. Possible application areas might include the analysis of large distributed communication systems, biological systems, etc.
It is tempting to interpret the IVB algorithm—which lies at the heart of the VBapproximation—as a ‘Bayesian EM algorithm’. Where the EM algorithm converges to the ML solution, and yields point estimates, the IVB algorithm converges to a set of distributions, yielding not only point estimates but their uncertainties. One of the most ergonomic aspects of the VB-approximation is that its natural outputs are parameter marginals and moments—i.e. the very objects whose unavailability forces the use of approximation in the first place. We hope that the convenient pathway to VB-approximation revealed by the VB method will encourage the Bayesian signal processing community to develop practical variational inference algorithms, both in off-line and on-line contexts. Even better, we hope that the VB-approximation might be kept in mind as a convenient tool in developing and exploring Bayesian models.
A Required Probability Distributions
A.1 Multivariate Normal distribution The multivariate Normal distribution of x ∈ Rp×1 is 1 −p −1 Nx (µ, R) = (2π) 2 |R| 2 exp − [x − µ] R−1 [x − µ] . 2
(A.1)
The non-zero moments of (A.1) are x = µ, = R + µµ . xx
(A.2) (A.3)
The scalar Normal distribution is an important special case of (A.1): 1 −1 2 Nx (µ, r) = (2πr) 2 exp − (x − µ) . 2r
(A.4)
A.2 Matrix Normal distribution The matrix Normal distribution of the matrix X ∈ Rp×n is NX (µX , Σp ⊗ Σn ) = (2π)
−pn/2
−n/2
|Σp |
−p/2
|Σn | × (A.5) ! −1 −1 , (X − µX ) × exp −0.5tr Σp (X − µX ) Σn
where Σp ∈ Rp×p and Σn ∈ Rn×n are symmetric, positive-definite matrices, and where ⊗ denotes the Kronecker product (see Notational Conventions, Page XVI)). The distribution has the following properties: • The first moment is EX [X] = µX .
210
A Required Probability Distributions
• The second non-central moments are EX [XZX ] = tr (ZΣn ) Σp + µX ZµX , EX [X ZX] = tr (ZΣp ) Σn + µX ZµX ,
(A.6)
where Z is an arbitrary matrix, appropriately resized in each case. • For any real matrices, C ∈ Rc×p and D ∈ Rn×d , it holds that f (CXD) = NCXD (CµX D, CΣp C ⊗ D Σn D) .
(A.7)
• The distribution of x = vec (X) (see Notational Conventions on Page XV) is multivariate Normal: f (x) = Nx ( µX , Σn ⊗ Σp ) .
(A.8)
Note that the covariance matrix has changed its form compared to the matrix case (A.5). This notation is helpful as it allows us to store the pn × pn covariance matrix in p × p and n × n structures. This matrix Normal convention greatly simplifies notation. For example, if the columns, xi , of matrix X are independently Normally distributed with common covariance matrix, Σ, then f (x1 , x2 , . . . , xn |µX , Σ) =
n
Nxi (µi,X , Σ) ≡ N (µX , Σ ⊗ In ) , (A.9)
i=1
i.e. the matrix Normal distribution (A.5) with µX = [µ1,X , . . . , µn,X ]. Moreover, linear transformations of the matrix argument, X (A.7), preserves the Kroneckerproduct form of the covariance structure.
A.3 Normal-inverse-Wishart (N iWA,Ω ) Distribution The Normal-inverse-Wishart distribution of θ = {A, R}, A ∈ Rp×m , R ∈ Rp×p is 1 −1 exp − R [−Ip , A] V [−Ip , A] , N iWA,Ω (V, ν) ≡ ζA,R (V, ν) 2 − 12 ν
|R|
(A.10)
with normalizing constant, mp 1 − 1 (ν−m−p−1) −1p 1 (ν − m − p − 1) |Λ| 2 ζA,R (V, ν) = Γp |Vaa | 2 2 2 p(ν−p−1) π 2 , 2 (A.11) and parameters, Vdd Vad −1 Vaa Vad . (A.12) V = , Λ = Vdd − Vad Vad Vaa
A.4 Truncated Normal Distribution
211
(A.12) denotes the partitioning of V ∈ R(p+m)×(p+m) into blocks, where Vdd is the upper-left sub-block of size p×p. In (A.11), Γp (·) denotes the multi-gamma function (see Notational Conventions on Page XV). The conditional and marginal distributions of A and R are [42] ˆ R ⊗ V −1 , (A.13) f (A|R, V, ν) = NA A, aa f (R|V, ν) = iWR (η, Λ) , 1 −1 −1 ˆ f (A|V, ν) = StA A, Λ ⊗ Vaa , ν − m + 2 , ν−m+2
(A.14) (A.15)
with auxiliary constants −1 Aˆ = Vaa Vad ,
η = ν − m − p − 1.
(A.16) (A.17)
St (·) denotes the matrix Student’s t-distribution with ν − m + 2 degrees of freedom, and iW (·) denotes the inverse-Wishart distribution [156,157]. The moments of these distributions are Ef (A|R,V,ν) [A] = Ef (A|V,ν) [A] = A, 1 = Λ, Ef (R|V,ν) [R] ≡ R η−p−1 −1 = ηΛ−1 , Ef (R|V,ν) R−1 ≡ R Ef (A|V,ν)
(A.18) (A.19) (A.20)
= A−A A−A
1 −1 , ΛV −1 = RV (A.21) aa η − p − 1 aa p
1 (η − j + 1) + Ef (R|V,ν) [ln |R|] = − ψΓ 2 j=1 + ln |Λ| − p ln 2,
(A.22)
∂ where ψΓ (·) = ∂β ln Γ (·) is the digamma (psi) function. In the special case where p = 1, then (A.10) has the form (6.21), i.e. the Normal-inverse-Gamma (N iG) distribution.
A.4 Truncated Normal Distribution The truncated Normal distribution for scalar random variable, x, is defined as Normal—with functional form Nx (µ, r)—on a restricted support a < x ≤ b. Its distribution is √ 2 1 2 exp − 2r (x − µ) f (x|µ, s, a, b) = √ χ(a,b] (x) , (A.23) πr (erf (β) − erf (α))
212
A Required Probability Distributions
where α =
a−µ √ , 2r
β=
b−µ √ . 2r
Moments of (A.23) are x = µ−
√ r ϕ (µ, r) ,
√ 2 = r + µ x − rκ (µ, r) , x with auxiliary functions, as follows: √
2 exp −β 2 − exp −α2 √ , ϕ (µ, r) = π (erf (β) − erf (α)) √
2 b exp −β 2 − a exp −α2 √ κ (µ, r) = . π (erf (β) − erf (α))
(A.24) (A.25)
(A.26)
(A.27)
In case of vector arguments µ and s, (A.26) and (A.27) are evaluated element-wise. Confidence intervals for this distribution can also be obtained. However, for simplicity, we use the first two moments, (A.24) and (A.25), and we approximate (A.23) by a Gaussian. The Maximum Entropy (MaxEnt) principle [158] ensures that uncertainty bounds on the MaxEnt Gaussian approximation of (A.23) encloses the uncertainty bounds of all distributions with the same first two moments. Hence, + + 2 2 2 2 <x−x . < min b, 2 x − x max a, −2 x − x
(A.28)
A.5 Gamma Distribution The Gamma distribution is as follows: f (x|a, b) = Gx (a, b) =
ba a−1 x exp (−bx) χ[0,∞) (x) , Γ (a)
(A.29)
where a > 0 and b > 0, and Γ (a) is the Gamma function [93] evaluated at a. The first moment is a x = , b and the second central moment is a 2 E (x − x ) = 2 . b
A.6 Von Mises-Fisher Matrix distribution Moments of the von Mises-Fisher distribution are now considered. Proofs of all unproved results are available in [92].
A.6 Von Mises-Fisher Matrix distribution
213
A.6.1 Definition The von Mises-Fisher distribution of matrix random variable, X ∈ Rp×n , restricted to X X = In , is given by 1 exp (tr (F X )) , ζX (p, F F ) 1 1 p, F F C (p, n) , ζX (p, F F ) = 0 F1 2 4
f (X|F ) = M (F ) =
(A.30) (A.31)
where F ∈ Rp×n is a matrix parameter of the same dimensions as X, and p ≥ n. ζX (p, F F ) is the normalizing constant. 0 F1 (·) denotes a Hypergeometric function of matrix argument F F [159]. C (p, r) denotes the area of the relevant Stiefel manifold, Sp,n (4.64). (A.30) is a Gaussian distribution with restriction X X = In , renormalized on Sp,n . It is governed by a single matrix parameter F . Consider the (economic) SVD (Definition 4.1), F = UF LF VF , of the parameter F , where UF ∈ Rp×n , LF ∈ Rn×n , VF ∈ Rn×n . Then the maximum of (A.30) is reached at ˆ = UF V . X (A.32) F
The flatness of the distribution is controlled by LF . When lF = diag−1 (LF ) = 0n,1 , the distribution is uniform on Sp,n [160]. For li,F → ∞, ∀i = 1 . . . n, the ˆ (A.32). distribution is a Dirac δ-function at X A.6.2 First Moment Let Y be the transformed variable, YX = UF XVF . (A.33)
It can be shown that ζX (p, F F ) = ζX p, L2F . The distribution of YX is then f (YX |F ) =
1 1 exp (tr (LF YX )) = exp (lF yX ) , (A.34) ζX (p, L2F ) ζX (p, L2F )
where yX = diag−1 (YX ). Hence, f (YX |F ) ∝ f (yX |lF ) .
(A.35)
The first moment of (A.34) is given by [92] Ef (YX |F ) [YX ] = Ψ, where Ψ = diag (ψ) is a diagonal matrix with diagonal elements
(A.36)
214
A Required Probability Distributions
1 1 2 ∂ p, L ψi = ln 0 F1 . ∂lF,i 2 4 F
(A.37)
We will denote vector function (A.37) by ψ = G (p, lF ) .
(A.38)
The mean value of the original random variable X is then [161] Ef (X|F ) [X] = UF Ψ VF = UF G (p, LF ) VF ,
(A.39)
where G (p, LF ) = diag (G (p, lF )). A.6.3 Second Moment and Uncertainty Bounds The second central moment of the transformed variable, yX = diag−1 (YX ) (A.34), is given by − Ef (YX |F ) [yX ] Ef (YX |F ) [yX ] = Φ, Ef (YX |F ) yX yX with elements φi,j =
1 1 2 ∂ p, LF , i, j = 1, . . . , r. ln 0 F1 ∂li,F ∂lj,F 2 4
(A.40)
(A.41)
Transformation (A.33) is one-to-one, with unit Jacobian. Hence, boundaries of confidence intervals on variables Y and Z can be mutually mapped using (A.33). However, mapping yX = diag−1 (YX ) is many-to-one, and so X → yX is surjective (but not injective). Conversion of second moments (and uncertainty bounds) of yX to X (via (A.33) and (A.34)) is therefore available in implicit form only. For example, the lower bound subspace of X is expressible as follows: $ # X = X| diag−1 (UF XVF ) = yX , where yX is an appropriately chosen lower bound on yX . The upper bound, X, can be constructed similarly via a bound yX . However, due to the topology of the support of X, i.e. the Stiefel manifold (Fig. 4.2), yX projects into the region with highest density of X. Therefore, we consider the HPD region (Definition 2.1) to be bounded by X only. It remains, then, to choose appropriate bound, yX , from (A.34). Exact confidence intervals for this multivariate distribution are not known. Therefore, we use the first two moments, (A.36) and (A.40), to approximate (A.34) by a Gaussian. The Maximum Entropy (MaxEnt) principle [158] ensures that uncertainty bounds on the MaxEnt Gaussian approximation of (A.34) enclose the uncertainty bounds of all distributions with the same first two moments. Confidence intervals for the Gaussian distribution, with moments (A.37) and (A.41), are well known. For example, , , (A.42) Pr −2 φi < (yi,X − ψi ) < 2 φi ≈ 0.95,
A.8 Dirichlet Distribution
where ψi is given by (A.37), and φi by (A.41). Therefore, we choose , yi,X = ψi − 2 φi .
215
(A.43)
The required vector bounds are then constructed as yX = y1,X , . . . , yr,X . The geometric relationship between variables X and yX is illustrated graphically for p = 2 and n = 1 in Fig. 4.2.
A.7 Multinomial Distribution The % Multinomial distribution of the c-dimensional vector variable l, where li ∈ N c and i=1 li = γ, is as follows: 1 li α χNc (l). ζl (α) i=1 i c
f (l|α) = Mul (α) =
Its vector parameter is α = [α1 , α2 , . . . , αc ] , αi > 0, malizing constant is c li ! ζl (α) = i=1 , γ!
%c i=1
(A.44)
αi = 1, and the nor-
(A.45)
where ‘!’ denotes factorial. If the argument l contains positive real numbers, i.e. li ∈ (0, ∞), then we refer to (A.44) as the Multinomial distribution of continuous argument. The only change in (A.44) is that the support is now (0, ∞)c , and the normalizing constant is c Γ (li ) , (A.46) ζl (α) = i=1 Γ (γ) where Γ (·) is the Gamma function [93]. For both variants, the first moment is given by l = α. (A.47)
A.8 Dirichlet Distribution The Dirichlet distribution of the c-dimensional vector variable, α ∈ ∆c , is as follows: 1 βi −1 α χ∆c (α), ζα (β) i=1 i c
f (α|β) = Diα (β) = where
(A.48)
216
A Required Probability Distributions
∆c =
α|αi ≥ 0,
c "
0 αi = 1
i=1
is the probability simplex in Rc . The vector parameter in (A.48) is β = [β1 , β2 , . . . , βc ] , %c βi > 0, i=1 βi = γ. The normalizing constant is c Γ (βi ) , (A.49) ζα (β) = i=1 Γ (γ) where Γ (·) is the Gamma function [93]. The first moment is given by αˆi = Ef (α|β) [αi ] =
βi , i = 1, . . . , c. γ
(A.50)
The expected value of the logarithm is ln αi = Ef (α|β) [ln αi ] = ψΓ (βi ) − ψΓ (γ) ,
(A.51)
∂ ln Γ (·) is the digamma (psi) function. where ψΓ (·) = ∂β For notational simplicity, we define the matrix Dirichlet distribution of matrix variable T ∈ Rp×p as follows:
DiT (Φ) ≡
p
Diti (φi ) ,
i=1
with matrix parameter Φ ∈ Rp×p = [φ1 , . . . , φp ]. Here, ti and φi are the ith columns of T and Φ respectively.
A.9 Truncated Exponential Distribution The truncated Exponential distribution is as follows: k exp (xk) χ(a,b] (x), exp (kb) − exp (ka) (A.52) where a < b are the boundaries of the support. Its first moment is f (x|k, (a, b]) = tExpx (k, (a, b]) =
x =
exp (bk) (1 − bk) − exp (ak) (1 − ak) , k (exp (ak) − exp (bk))
(A.53)
which is not defined for k = 0. The limit at this point is lim x =
k→0
a+b , 2
which is consistent with the fact that the distribution is then uniform on the interval (a, b].
References
1. R. T. Cox, “Probability, frequency and reasonable expectation,” Am. J. Phys., vol. 14, no. 1, 1946. 2. B. de Finneti, Theory of Probability: A Critical Introductory Treatment. New York: J. Wiley, 1970. 3. A. P. Quinn, Bayesian Point Inference in Signal Processing. PhD thesis, Cambridge University Engineering Dept., 1992. 4. E. T. Jaynes, “Bayesian methods: General background,” in The Fourth Annual Workshop on Bayesian/Maximum Entropy Methods in Geophysical Inverse Problems, (Calgary), 1984. 5. S. M. Kay, Modern Spectral Estimation: Theory and Application. Prentice-Hall, 1988. 6. S. L. Marple Jr., Digital Spectral Analysis with Applications. Prentice-Hall, 1987. 7. H. Jeffreys, Theory of Probability. Oxford University Press, 3 ed., 1961. 8. G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis. Addison-Wesley, 1973. 9. P. M. Lee, Bayesian Statistics, an Introduction. Chichester, New York, Brisbane, Toronto, Singapore: John Wiley & Sons, 2 ed., 1997. 10. G. Parisi, Statistical Field Theory. Reading Massachusetts: Addison Wesley, 1988. 11. M. Opper and O. Winther, “From naive mean field theory to the TAP equations,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), The MIT Press, 2001. 12. M. Opper and D. Saad, Advanced Mean Field Methods: Theory and Practice. Cambridge, Massachusetts: The MIT Press, 2001. 13. R. P. Feynman, Statistical Mechanics. New York: Addison–Wesley, 1972. 14. G. E. Hinton and D. van Camp, “Keeping neural networks simple by minimizing the description length of the weights,” in Proceedings of 6th Annual Workshop on Computer Learning Theory, pp. 5–13, ACM Press, New York, NY, 1993. 15. L. K. Saul, T. S. Jaakkola, and M. I. Jordan, “Mean field theory for sigmoid belief networks.,” Journal of Artificial Inteligence Research, vol. 4, pp. 61–76, 1996. 16. D. J. C. MacKay, “Free energy minimization algorithm for decoding and cryptanalysis,” Electronics Letters, vol. 31, no. 6, pp. 446–447, 1995. 17. D. J. C. MacKay, “Developments in probabilistic modelling with neural networks – ensemble learning,” in Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14-15 September 1995, (Berlin), pp. 191–198, Springer, 1995. 18. M. I. Jordan, Learning in graphical models. MIT Press, 1999.
218
References
19. H. Attias, “A Variational Bayesian framework for graphical models.,” in Advances in Neural Information Processing Systems (T. Leen, ed.), vol. 12, MIT Press, 2000. 20. Z. Ghahramani and M. Beal, “Graphical models and variational methods,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), The MIT Press, 2001. 21. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of Royal Statistical Society, Series B, vol. 39, pp. 1– 38, 1977. 22. R. M. Neal and G. E. Hinton, A New View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. NATO Science Series, Dordrecht: Kluwer Academic Publishers, 1998. 23. M. J. Beal and Z. Ghahramani, “The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures,” in Bayesian Statistics 7 (J. M. et. al. Bernardo, ed.), Oxford University Press, 2003. 24. C. M. Bishop, “Variational principal components,” in Proceedings of the Ninth International Conference on Artificial Neural Networks, (ICANN), 1999. 25. Z. Ghahramani and M. J. Beal, “Variational inference for Bayesian mixtures of factor analyzers,” Neural Information Processing Systems, vol. 12, pp. 449–455, 2000. 26. M. Sato, “Online model selection based on the variational Bayes,” Neural Computation, vol. 13, pp. 1649–1681, 2001. 27. S. J. Roberts and W. D. Penny, “Variational Bayes for generalized autoregressive models,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2245–2257, 2002. 28. P. Sykacek and S. J. Roberts, “Adaptive classification by variational Kalman filtering,” in Advances in Neural Information Processing Systems 15 (S. Thrun, S. Becker, and K. Obermayer, eds.), MIT press, 2003. 29. J. W. Miskin, Ensemble Learning for Independent Component Analysis. PhD thesis, University of Cambridge, 2000. 30. J. Pratt, H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision Theory. MIT Press, 1995. 31. R. E. Kass and A. E. Raftery, “Bayes factors,” Journal of American Statistical Association, vol. 90, pp. 773–795, 1995. 32. D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixtures. New York: John Wiley, 1985. 33. E. T. Jaynes, “Clearing up mysteries—the original goal,” in Maximum Entropy and Bayesian Methods (J. Skilling, ed.), pp. 1–27, Kluwer, 1989. 34. M. Tanner, Tools for statistical inference. New York: Springer-Verlag, 1993. 35. J. J. K. O’Ruanaidh and W. J. Fitzgerald, Numerical Bayesian Methods applied to Signal Processing. Springer, 1996. 36. B. de Finetti, Theory of Probability, vol. 2. Wiley, 1975. 37. J. Bernardo and A. Smith, Bayesian Theory. Chichester, New York, Brisbane, Toronto, Singapore: John Wiley & Sons, 2 ed., 1997. 38. G. L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation. SpringerVerlag, 1989. 39. M. Kárný and R. Kulhavý, “Structure determination of regression-type models for adaptive prediction and control,” in Bayesian Analysis of Time Series and Dynamic Models (J. Spall, ed.), New York: Marcel Dekker, 1988. Chapter 12. 40. A. Quinn, “Regularized signal identification using Bayesian techniques,” in Signal Analysis and Prediction, Birkhäuser Boston Inc., 1998. 41. R. A. Fisher, “Theory of statistical estimation,” Proc. Camb. Phil. Soc., vol. 22(V), pp. 700–725, 1925. Reproduced in [162].
References
219
42. V. Peterka, “Bayesian approach to system identification,” in Trends and Progress in System identification (P. Eykhoff, ed.), pp. 239–304, Oxford: Pergamon Press, 1981. 43. A. W. F. Edwards, Likelihood. Cambridge Univ. Press, 1972. 44. J. D. Kalbfleisch and D. A. Sprott, “Application of likelihood methods to models involving large numbers of parameters,” J. Royal Statist. Soc., vol. B-32, no. 2, 1970. 45. R. L. Smith and J. C. Naylor, “A comparison of maximum likelihood and Bayesian estimators for the three-parameter Weibull distribution,” Appl. Statist., vol. 36, pp. 358– 369, 1987. 46. R. D. Rosenkrantz, ed., E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics. D. Reidel, Dordrecht-Holland, 1983. 47. G. E. P. Box and G. C. Tiao, Bayesian Statistics. Oxford: Oxford, 1961. 48. J. Berger, Statistical Decision Theory and Bayesian Analysis. New York: SpringerVerlag, 1985. 49. A. Wald, Statistical Decision Functions. New York, London: John Wiley, 1950. 50. M. DeGroot, Optimal Statistical Decisions. New York: McGraw-Hill, 1970. 51. C. P. Robert, The Bayesian Choice: A Decision Theoretic Motivation. Springer texts in Statistics, Springer-Verlag, 1994. 52. M. Kárný, J. Böhm, T. Guy, L. Jirsa, I. Nagy, P. Nedoma, and L. Tesaˇr, Optimized Bayesian Dynamic Advising: Theory and Algorithms. London: Springer, 2005. 53. J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975. 54. S. M. Kay, Fundamentals of Statistical Signal Processing. Prentice-Hall, 1993. 55. A. P. Quinn, “The performance of Bayesian estimators in the superresolution of signal parameters,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc. (ICASSP), (San Francisco), 1992. 56. A. Zellner, An Introduction to Bayesian Inference in Econometrics. New York: J. Wiley, 1976. 57. A. Quinn, “Novel parameter priors for Bayesian signal identification,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc. (ICASSP), (Munich), 1997. 58. R. E. Kass and A. E. Raftery, “Bayes factors and model uncertainty,” tech. rep., University of Washington, 1994. 59. S. F. Gull, “Bayesian inductive inference and maximum entropy,” in Maximum Entropy and Bayesian Methods in Science and Engineering (G. J. Erickson and C. R. Smith, eds.), Kluwer, 1988. 60. J. Skilling, “The axioms of maximum entropy,” in Maximum Entropy and Bayesian Methods in Science and Engineering. Vol. 1 (G. J. Erickson and C. R. Smith, eds.), Kluwer, 1988. 61. D. Bosq, Nonparametric Statistics for Stochastic Processes: estimation and prediction. Springer, 1998. 62. J. M. Bernardo, “Expected infromation as expected utility,” The Annals of Statistics, vol. 7, no. 3, pp. 686–690, 1979. 63. S. Kullback and R. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–87, 1951. 64. S. Amari, S. Ikeda, and H. Shimokawa, “Information geometry of α-projection in mean field approximation,” in Advanced Mean Field Methods (M. Opper and D. Saad, eds.), (Cambridge, Massachusetts), The MIT Press, 2001. 65. S. Amari, Differential-Geometrical Methods in Statistics. Sringer, 1985. 66. C.F.J.Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, pp. 95–103, 1983.
220
References
67. S. F. Gull and J. Skilling, “Maximum entropy method in image processing,” Proc. IEE, vol. F-131, October 1984. 68. S. F. Gull and J. Skilling, Quantified Maimum Entropy. MemSys5 Users’ Manual. Maximum Entropy Data Consultants Ltd., 1991. 69. A. Papoulis, “Maximum entropy and spectral estimation: a review,” IEEE Trans. on Acoust., Sp., and Sig. Proc., vol. ASSP-29, December 1981. 70. D. J. C. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge Univerzity Press, 2004. 71. G. Demoment and J. Idier, “Problèmes inverses et déconvolution,” Journal de Physique IV, pp. 929–936, 1992. 72. M. Nikolova and A. Mohammad-Djafari, “Maximum entropy image reconstruction in eddy current tomography,” in Maximum Entropy and Bayesian Methods (A. Mohammad-Djafari and G. Demoment, eds.), Kluwer, 1993. 73. W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1997. 74. A. Doucet, N. de Freitas, and N. Gordon, eds., Sequential Monte Carlo Methods in Practice. Springer, 2001. 75. A. F. M. Smith and A. E. Gelfand, “Bayesian statistics without tears: a samplingresampling perspective,” The American Statistician, vol. 46, pp. 84–88, 1992. 76. T. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics, vol. 1, pp. 209–230, 1973. 77. S. Walker, P. Damien, P. Laud, and A. Smith, “Bayesian nonparametric inference for random distributions and related functions,” J. R. Statist. Soc., vol. 61, pp. 485–527, 2004. with discussion. 78. S. J. Press and K. Shigemasu, “Bayesian inference in factor analysis,” in Contributions to Probability and Statistics (L. J. Glesser, ed.), ch. 15, Springer Verlag, New York, 1989. 79. D. B. Rowe and S. J. Press, “Gibbs sampling and hill climbing in Bayesian factor analysis,” tech. rep., University of California, Riverside, 1998. 80. I. Jolliffe, Principal Component Analysis. Springer-Verlag, 2nd ed., 2002. 81. S. M. Kay, Fundamentals Of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993. 82. M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” tech. rep., Aston University, 1998. 83. K. Pearson, “On lines and planes of closest fit to systems of points in space,” The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, vol. 2, pp. 559– 572, 1901. 84. T. W. Anderson, An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, 1971. 85. M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611–622, 1998. 86. G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore – London: The John Hopkins University Press, 1989. 87. H. Hotelling, “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, vol. 24, pp. 417–441, 1933. 88. D. B. Rowe, Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. Boca Raton, FL, USA: CRC Press, 2002. 89. T. P. Minka, “Automatic choice of dimensionality for PCA,” tech. rep., MIT, 2000. 90. V. Šmídl, The Variational Bayes Approach in Signal Processing. PhD thesis, Trinity College Dublin, 2004.
References
221
91. V. Šmídl and A. Quinn, “Fast variational PCA for functional analysis of dynamic image sequences,” in Proceedings of the 3rd International Conference on Image and Signal Processing, ISPA 03, (Rome, Italy), September 2003. 92. C. G. Khatri and K. V. Mardia, “The von Mises-Fisher distribution in orientation statistics,” Journal of Royal Statistical Society B, vol. 39, pp. 95–106, 1977. 93. M. Abramowitz and I. Stegun, Handbook of Mathematical Functions. New York: Dover Publications, 1972. 94. I. Buvat, H. Benali, and R. Di Paola, “Statistical distribution of factors and factor images in factor analysis of medical image sequences,” Physics in Medicine and Biology, vol. 43, no. 6, pp. 1695–1711, 1998. 95. H. Benali, I. Buvat, F. Frouin, J. P. Bazin, and R. Di Paola, “A statistical model for the determination of the optimal metric in factor analysis of medical image sequences (FAMIS),” Physics in Medicine and Biology, vol. 38, no. 8, pp. 1065–1080, 1993. 96. J. Harbert, W. Eckelman, and R. Neumann, Nuclear Medicine. Diagnosis and Therapy. New York: Thieme, 1996. 97. S. Kotz and N. Johnson, Encyclopedia of statistical sciences. New York: John Wiley, 1985. 98. T. W. Anderson, “Estimating linear statistical relationships,” Annals of Statististics, vol. 12, pp. 1–45, 1984. 99. J. Fine and A. Pouse, “Asymptotic study of the multivariate functional model. Application to the metric of choice in Principal Component Analysis,” Statistics, vol. 23, pp. 63–83, 1992. 100. F. Pedersen, M. Bergstroem, E. Bengtsson, and B. Langstroem, “Principal component analysis of dynamic positron emission tomography studies,” Europian Journal of Nuclear Medicine, vol. 21, pp. 1285–1292, 1994. 101. F. Hermansen and A. A. Lammertsma, “Linear dimension reduction of sequences of medical images: I. optimal inner products,” Physics in Medicine and Biology, vol. 40, pp. 1909–1920, 1995. 102. M. Šámal, M. Kárný, H. Surová, E. Maˇríková, and Z. Dienstbier, “Rotation to simple structure in factor analysis of dynamic radionuclide studies,” Physics in Medicine and Biology, vol. 32, pp. 371–382, 1987. 103. M. Kárný, M. Šámal, and J. Böhm, “Rotation to physiological factors revised,” Kybernetika, vol. 34, no. 2, pp. 171–179, 1998. 104. A. Hyvärinen, “Survey on independent component analysis,” Neural Computing Surveys, vol. 2, pp. 94–128, 1999. 105. V. Šmídl, A. Quinn, and Y. Maniouloux, “Fully probabilistic model for functional analysis of medical image data,” in Proceedings of the Irish Signals and Systems Conference, (Belfast), pp. 201–206, University of Belfast, June 2004. 106. J. R. Magnus and H. Neudecker, Matrix Differential Calculus. Wiley, 2001. 107. M. Šámal and H. Bergmann, “Hybrid phantoms for testing the measurement of regional dynamics in dynamic renal scintigraphy.,” Nuclear Medicine Communications, vol. 19, pp. 161–171, 1998. 108. L. Ljung and T. Söderström, Theory and practice of recursive identification. Cambridge; London: MIT Press, 1983. 109. E. Mosca, Optimal, Predictive, and Adaptive Control. Prentice Hall, 1994. 110. T. Söderström and R. Stoica, “Instrumental variable methods for system identification,” Lecture Notes in Control and Information Sciences, vol. 57, 1983. 111. D. Clarke, Advances in Model-Based Predictive Control. Oxford: Oxford University Press, 1994.
112. R. Patton, P. Frank, and R. Clark, Fault Diagnosis in Dynamic Systems: Theory & Applications. Prentice Hall, 1989.
113. R. Kalman, “A new approach to linear filtering and prediction problems,” Trans. ASME, Ser. D, J. Basic Eng., vol. 82, pp. 34–45, 1960.
114. F. Gustafsson, Adaptive Filtering and Change Detection. Chichester: Wiley, 2000.
115. V. Šmídl, A. Quinn, M. Kárný, and T. V. Guy, “Robust estimation of autoregressive processes using a mixture-based filter bank,” Systems & Control Letters, vol. 54, pp. 315–323, 2005.
116. K. Åström and B. Wittenmark, Adaptive Control. Reading, Massachusetts: Addison-Wesley, 1989.
117. R. Koopman, “On distributions admitting a sufficient statistic,” Transactions of the American Mathematical Society, vol. 39, p. 399, 1936.
118. L. Ljung, System Identification: Theory for the User. Englewood Cliffs, NJ: Prentice-Hall, 1987.
119. A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Prentice-Hall, 1989.
120. V. Šmídl and A. Quinn, “Mixture-based extension of the AR model and its recursive Bayesian identification,” IEEE Transactions on Signal Processing, vol. 53, no. 9, pp. 3530–3542, 2005.
121. R. Kulhavý, “Recursive nonlinear estimation: A geometric approach,” Automatica, vol. 26, no. 3, pp. 545–555, 1990.
122. R. Kulhavý, “Implementation of Bayesian parameter estimation in adaptive control and signal processing,” The Statistician, vol. 42, pp. 471–482, 1993.
123. R. Kulhavý, “Recursive Bayesian estimation under memory limitations,” Kybernetika, vol. 26, pp. 1–20, 1990.
124. R. Kulhavý, Recursive Nonlinear Estimation: A Geometric Approach, vol. 216 of Lecture Notes in Control and Information Sciences. London: Springer-Verlag, 1996.
125. R. Kulhavý, “A Bayes-closed approximation of recursive non-linear estimation,” International Journal of Adaptive Control and Signal Processing, vol. 4, pp. 271–285, 1990.
126. C. S. Wong and W. K. Li, “On a mixture autoregressive model,” Journal of the Royal Statistical Society: Series B, vol. 62, pp. 95–115, 2000.
127. M. Kárný, J. Böhm, T. Guy, and P. Nedoma, “Mixture-based adaptive probabilistic control,” International Journal of Adaptive Control and Signal Processing, vol. 17, no. 2, pp. 119–132, 2003.
128. H. Attias, J. C. Platt, A. Acero, and L. Deng, “Speech denoising and dereverberation using probabilistic models,” in Advances in Neural Information Processing Systems, pp. 758–764, 2001.
129. J. Stutz and P. Cheeseman, “AutoClass - a Bayesian approach to classification,” in Maximum Entropy and Bayesian Methods (J. Skilling and S. Sibisi, eds.), Dordrecht: Kluwer, 1995.
130. M. Funaro, M. Marinaro, A. Petrosino, and S. Scarpetta, “Finding hidden events in astrophysical data using PCA and mixture of Gaussians clustering,” Pattern Analysis & Applications, vol. 5, pp. 15–22, 2002.
131. S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
132. K. Warwick and M. Kárný, Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality. Birkhäuser, 1997.
133. J. Andrýsek, “Approximate recursive Bayesian estimation of dynamic probabilistic mixtures,” in Multiple Participant Decision Making (J. Andrýsek, M. Kárný, and J. Kracík, eds.), pp. 39–54, Adelaide: Advanced Knowledge International, 2004.
134. S. Roweis and Z. Ghahramani, “A unifying review of linear Gaussian models,” Neural Computation, vol. 11, pp. 305–345, 1999.
135. A. P. Quinn, “Threshold-free Bayesian estimation using censored marginal inference,” in Signal Processing VI: Proc. of the 6th European Sig. Proc. Conf. (EUSIPCO-’92), vol. 2, (Brussels), 1992.
136. A. P. Quinn, “A consistent, numerically efficient Bayesian framework for combining the selection, detection and estimation tasks in model-based signal processing,” in Proc. IEEE Int. Conf. on Acoust., Sp. and Sig. Proc., (Minneapolis), 1993.
137. “Project IST-1999-12058, decision support tool for complex industrial processes based on probabilistic data clustering (ProDaCTool),” tech. rep., 1999–2002.
138. P. Nedoma, M. Kárný, and I. Nagy, “MixTools, MATLAB toolbox for mixtures: User’s guide,” tech. rep., ÚTIA AV ČR, Praha, 2001.
139. A. Quinn, P. Ettler, L. Jirsa, I. Nagy, and P. Nedoma, “Probabilistic advisory systems for data-intensive applications,” International Journal of Adaptive Control and Signal Processing, vol. 17, no. 2, pp. 133–148, 2003.
140. Z. Chen, “Bayesian filtering: From Kalman filters to particle filters, and beyond,” tech. rep., Adaptive Syst. Lab., McMaster University, Hamilton, ON, Canada, 2003.
141. B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Publishers, 2004.
142. E. Daum, “New exact nonlinear filters,” in Bayesian Analysis of Time Series and Dynamic Models (J. Spall, ed.), New York: Marcel Dekker, 1988.
143. P. Vidoni, “Exponential family state space models based on a conjugate latent process,” J. Roy. Statist. Soc., Ser. B, vol. 61, pp. 213–221, 1999.
144. A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1979.
145. R. Kulhavý and M. B. Zarrop, “On a general concept of forgetting,” International Journal of Control, vol. 58, no. 4, pp. 905–924, 1993.
146. R. Kulhavý, “Restricted exponential forgetting in real-time identification,” Automatica, vol. 23, no. 5, pp. 589–600, 1987.
147. C. F. So, S. C. Ng, and S. H. Leung, “Gradient based variable forgetting factor RLS algorithm,” Signal Processing, vol. 83, pp. 1163–1175, 2003.
148. R. H. Middleton, G. C. Goodwin, D. J. Hill, and D. Q. Mayne, “Design issues in adaptive control,” IEEE Transactions on Automatic Control, vol. 33, no. 1, pp. 50–58, 1988.
149. R. Elliot, L. Assoun, and J. Moore, Hidden Markov Models. New York: Springer-Verlag, 1995.
150. V. Šmídl and A. Quinn, “Bayesian estimation of non-stationary AR model parameters via an unknown forgetting factor,” in Proceedings of the IEEE Workshop on Signal Processing, (New Mexico), pp. 100–105, August 2004.
151. M. H. Vellekoop and J. M. C. Clark, “A nonlinear filtering approach to changepoint detection problems: Direct and differential-geometric methods,” SIAM Journal on Control and Optimization, vol. 42, no. 2, pp. 469–494, 2003.
152. G. Bierman, Factorization Methods for Discrete Sequential Estimation. New York: Academic Press, 1977.
153. G. D. Forney, “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973.
154. J. Deller, J. Proakis, and J. Hansen, Discrete-Time Processing of Speech Signals. Macmillan, New York, 1993.
155. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.
156. J. M. Bernardo, “Approximations in statistics from a decision-theoretical viewpoint,” in Probability and Bayesian Statistics (R. Viertl, ed.), pp. 53–60, New York: Plenum, 1987.
157. N. D. Le, L. Sun, and J. V. Zidek, “Bayesian spatial interpolation and backcasting using Gaussian-generalized inverted Wishart model,” tech. rep., University of British Columbia, 1999.
158. E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
159. A. T. James, “Distribution of matrix variates and latent roots derived from normal samples,” Annals of Mathematical Statistics, vol. 35, pp. 475–501, 1964.
160. K. Mardia and P. E. Jupp, Directional Statistics. Chichester, England: John Wiley and Sons, 2000.
161. T. D. Downs, “Orientational statistics,” Biometrika, vol. 59, pp. 665–676, 1972.
162. R. A. Fisher, Contributions to Mathematical Statistics. John Wiley and Sons, 1950.
Index
activity curve, 91
additive Gaussian noise model, 17
advisory system, 140
augmented model, 124
Automatic Rank Determination (ARD) property, 68, 85, 86, 99, 103
AutoRegressive (AR) model, 111, 112, 114, 130, 173, 179, 193
AutoRegressive model with eXogenous variables (ARX), 113, 181
Bayesian filtering, 47, 146
Bayesian smoothing, 146
certainty equivalence approximation, 43, 45, 188
changepoints, 173
classical estimators, 18
combinatoric explosion, 130
components, 130
conjugate distribution, 19, 111, 120
conjugate parameter distribution to DEF family (CDEF), 113, 146, 153, 179, 191
correspondence analysis, 86, 92, 103
covariance matrix, 117
covariance method, 116
cover-up rule, 37
criterion of cumulative variance, 83, 86
data update, 21, 146, 149
digamma (psi) function, 133, 164, 211
Dirac δ-function, 44
Dirichlet distribution, 132, 134, 159, 164, 215
discount schedules, 139, 174
distributional estimation, 47
dyad, 115, 181, 190
Dynamic Exponential Family (DEF), 113, 153, 179
Dynamic Exponential Family with Hidden variables (DEFH), 126
Dynamic Exponential Family with Separable parameters (DEFS), 120
dynamic mixture model, 140
economic SVD, 59
EM algorithm, 44, 124
empirical distribution, 46, 152
exogenous variables, 113, 139, 180
exponential family, 112
Exponential Family with Hidden variables (EFH), 126
exponential forgetting, 154
Extended AR (EAR) model, 179, 180
extended information matrix, 115, 181
extended regressor, 115, 131, 180
factor analysis, 51
factor curve, 91
factor images, 91
FAMIS model, 93, 94, 102
Fast Variational PCA (FVPCA), 68
filter-bank, 182, 192, 196
forgetting, 139, 191
forgetting factor, 153
Functional analysis of medical image sequences, 89
Gamma distribution, 212
geometric approach, 128
Gibbs sampling, 61
global approximation, 117, 128
Hadamard product, 64, 98, 120, 160
Hamming distance, 167
Hidden Markov Model (HMM), 158
hidden variable, 124, 182
Highest Posterior Density (HPD) region, 18, 79, 195
hyper-parameter, 19, 62, 95
importance function, 152
Independent Component Analysis (ICA), 94
independent, identically-distributed (i.i.d.) noise, 58
independent, identically-distributed (i.i.d.) process, 168
independent, identically-distributed (i.i.d.) sampling, 46, 152
inferential breakpoint, 49
informative prior, 49
initialization, 136, 174
innovations process, 114, 130, 180
inverse-Wishart distribution, 211
Iterative Variational Bayes (VB) algorithm, 32, 53, 118, 122, 134, 174
Jacobian, 180
Jeffreys’ notation, 16, 110
Jeffreys’ prior, 4, 71
Jensen’s inequality, 32, 77
Kalman filter, 146, 155, 196
KL divergence for Minimum Risk (MR) calculations, 28, 39, 128
KL divergence for Variational Bayes (VB) calculations, 28, 40, 121, 147
Kronecker function, 44
Kronecker product, 99
Kullback-Leibler (KL) divergence, 27
Laplace approximation, 62
LD decomposition, 182
Least Squares (LS) estimation, 18, 116
local approximation, 117
Low-Pass Filter (LPF), 200
Maple, 68
Markov chain, 47, 113, 158, 182
Markov-Chain Monte Carlo (MCMC) methods, 47
MATLAB, 64, 120, 140
matrix Dirichlet distribution, 160, 216
matrix Normal distribution, 58, 209
Maximum a Posteriori (MAP) estimation, 17, 44, 49, 188
Maximum Likelihood (ML) estimation, 17, 44, 57, 69
medical imaging, 89
Minimum Mean Squared-Error (MMSE) criterion, 116
missing data, 124, 132
MixTools, 140
Mixture-based Extended AutoRegressive (MEAR) model, 183, 191, 192, 196
moment fitting, 129
Monte Carlo (MC) simulation, 80, 83, 138, 166
Multinomial distribution, 132, 134, 159, 162, 185, 215
Multinomial distribution of continuous argument, 163, 185, 215
multivariate AutoRegressive (AR) model, 116
multivariate Normal distribution, 209
natural gradient technique, 32
non-informative prior, 18, 103, 114, 172
non-minimal conjugate distribution, 121
non-smoothing restriction, 151, 162, 187
nonparametric prior, 47
normal equations, 116
Normal-inverse-Gamma distribution, 115, 170, 179, 211
Normal-inverse-Wishart distribution, 132, 134, 210
normalizing constant, 15, 22, 113, 116, 134, 169, 187
nuclear medicine, 89
observation model, 21, 102, 111, 112, 146, 159, 179, 183
one-step approximation, 117, 121, 129
One-step Fixed-Form (FF) Approximation, 135
optimal importance function, 165
orthogonal PPCA model, 70
Orthogonal Variational PCA (OVPCA), 76
outliers, 192
parameter evolution model, 21, 146
particle filtering, 47, 145, 152
particles, 152
Poisson distribution, 91
precision matrix, 117
precision parameter, 58, 93
prediction, 22, 116, 141, 181
Principal Component Analysis (PCA), 57
probabilistic editor, 129
Probabilistic Principal Component Analysis (PPCA), 58
probability fitting, 129
probability simplex, 216
ProDaCTool, 140
proximity measure, 27, 128
pseudo-stationary window, 155
Quasi-Bayes (QB) approximation, 43, 128, 133, 140, 150, 186
rank, 59, 77, 83, 99
Rao-Blackwellization, 165
recursive algorithm, 110
Recursive Least Squares (RLS) algorithm, 116, 168
regressor, 111, 180
regularization, 18, 48, 114, 154
relative entropy, 7
Restricted VB (RVB) approximation, 128, 133, 186
rotational ambiguity, 60, 70
scalar additive decomposition, 3, 37
scalar multiplicative decomposition, 120
scalar Normal distribution, 120, 209
scaling, 92
scaling ambiguity, 48
separable-in-parameters family, 34, 63, 118
shaping parameters, 3, 34, 36
sign ambiguity, 49, 70
signal flowgraph, 114
Signal-to-Noise Ratio (SNR), 201
Singular Value Decomposition (SVD), 59, 213
spanning property, 191
speech reconstruction, 201
static mixture models, 135
Stiefel manifold, 71, 213
stochastic distributional approximation, 46
stressful regime, 136
Student’s t-distribution, 5, 116, 141, 200, 211
sufficient statistics, 19
time update, 21, 149, 153
transition matrix, 159
truncated Exponential distribution, 172, 216
truncated Normal distribution, 73, 211
uniform prior, 18
Variational Bayes (VB) method, 33, 126, 149
Variational EM (VEM), 32
Variational PCA (VPCA), 68
VB-approximation, 29, 51
VB-conjugacy, 149
VB-equations, 35
VB-filtering, 148
VB-marginalization, 33
VB-marginals, 3, 29, 32, 34
VB-moments, 3, 35, 36
VB-observation model, 121, 125, 148, 172, 184
VB-parameter predictor, 148, 184
VB-smoothing, 148
vec-transpose operator, 115
Viterbi-Like (VL) Approximation, 188
von Mises-Fisher distribution, 73, 212