Information Science and Statistics. Series Editors: M. Jordan, J. Kleinberg, B. Schölkopf, F.P. Kelly, I. Witten
Milan Studený
On Probabilistic Conditional Independence Structures With 42 Illustrations
Milan Studený, RNDr, DrSc, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, CZ-18208, Pod vodárenskou věží 4, Prague 8, Libeň, Czech Republic. Series Editors: Michael Jordan, Division of Computer Science and Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Jon Kleinberg Department of Computer Science Cornell University Ithaca, NY 14853 USA
Frank P. Kelly Statistical Laboratory Centre for Mathematical Sciences Wilberforce Road Cambridge CB3 0WB UK
Ian Witten Department of Computer Science University of Waikato Hamilton New Zealand
Bernhard Schölkopf Max Planck Institute for Biological Cybernetics Spemannstrasse 38 72076 Tübingen Germany
Cover illustration: Details
British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library.
Library of Congress Cataloging-in-Publication Data: A catalog record for this book is available from the Library of Congress.
Mathematics Subject Classification (1991): 62-02, 68-02, 62H05, 68R01, 68T30, 94A17, 06A17, 90C10, 15A99, 52B99
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
ISBN 1-85233-891-1 Springer London Berlin Heidelberg
Springer is a part of Springer Science+Business Media, springeronline.com
© Springer-Verlag London Limited 2005. Printed in the United States of America.
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
Typesetting: Camera-ready by authors. 12/3830-543210. Printed on acid-free paper. SPIN 10943397
Special acknowledgements (translated from the Czech): Here I would like to thank above all my parents for their patience, love, and all-round help and support. My brother and his family, and especially my niece Petra, will surely be pleased to read that I appreciate their moral support.
Preface
The book is devoted to the mathematical description of probabilistic conditional independence structures. The topic of conditional independence, which falls within both the scope of statistics and of artificial intelligence, has been at the center of my research activity for many years – since the late 1980s. I have been primarily influenced by researchers working in the area of graphical models but I gradually realized that the concept of conditional independence is not necessarily bound to the idea of graphical description and may have a broader impact. This observation led me to an attempt to develop a non-graphical method for describing probabilistic conditional independence structures which, in my view, overcomes an inherent limitation of graphical approaches.

The method of structural imsets described in this book can be viewed as an algebraic approach to the description of conditional independence structures although it remains within the framework of discrete mathematics. The basic idea of this approach was already presented in the middle of the 1990s in a series of papers [137]. However, I was not satisfied with the original presentation of the approach for several reasons. First, the series of papers only dealt with the discrete case, which is a kind of imperfection from the point of view of statistics. Second, the main message was dimmed by unnecessary mathematical peculiarities and important ideas were perhaps not pinpointed clearly. Third, the motivation was not explained in detail. I also think that the original series of papers was difficult for researchers in the area of artificial intelligence to read because "practical" implementation aspects of the presented approach were suppressed there. Another point is that the pictorial representation of considered mathematical objects, to which researchers interested in graphical models are accustomed, was omitted.

Within the next six years, further mathematical results were achieved which amended, supplemented and gave more precision to the original idea. I have also deliberated about suitable terminology and the way to present the method of structural imsets which would be acceptable to statisticians and researchers in the area of artificial intelligence, as well as exact from the mathematical point of view. I wrote it up in my DrSc thesis [146], which became
the basis of this monograph. After finishing the thesis, I realized the potential future practical application of the method to learning graphical models and decided to emphasize this by writing an additional chapter. Thus, the aim of this monograph is to present the method of structural imsets in its full (present) extent: the motivation; the mathematical foundations, which I tried to present in a didactic form; indication of the area of application; and open problems.

The motivation is explained in the first chapter. The second chapter recalls basic concepts in the area of probabilistic conditional independence structures. The third chapter is an overview of classic graphical methods for describing conditional independence structures. The core of the method of structural imsets is presented in the next four chapters. The eighth chapter shows application of the method to learning graphical models. Open problems are gathered in the ninth chapter and necessary elementary mathematical notions are provided in the Appendix for the reader's convenience. Then the List of Notation follows. As there are many cross-references to elementary units of the text, like Lemmas, Remarks etc., they are listed with page numbers afterwards. The text is concluded by the References and the Index.

The book is intended for
• mathematicians who may be attracted by this particular application of mathematics in the area of artificial intelligence and statistics;
• researchers in statistics and informatics who may become interested in a deeper understanding of the mathematical basis of the theory of (graphical) models of conditional independence structures;
• advanced PhD students in the fields of mathematics, probability, statistics, informatics and computer science who may find inspiration in the book and perhaps make some progress either by solving open problems or by applying the presented theory in practice.
In particular, I have in mind those PhD students who are thinking about an academic career. They are advised to read the book starting with the Appendix and to utilize the lists at the end of the book.

Many people deserve my thanks for help with this piece of work. In particular, I would like to thank Marie Kolářová for typing the text of the monograph in LaTeX. As concerns expert help I am indebted to my colleagues (and former co-authors) Fero Matúš and Phil Dawid for their remarks (even for some critical ones made by Fero), various pieces of advice, pointers to the literature and discussion which helped me clarify the view on the topic of the book. I have also profited from cooperation with other colleagues: some results presented here were achieved with the help of computer programs written by Pavel Boček, Remco Bouckaert, Tomáš Kočka, Martin Volf and Jiří Vomlel. Moreover, I am indebted to my colleagues Radim Jiroušek, Otakar Kříž and Jiřina Vejnarová for their encouragement in writing my DrSc thesis, which was quite important for me. The cooperation with all of my colleagues mentioned above involved joint theoretical research as well. A preliminary version of the
book was read by my PhD student Petr Šimeček, who gave me several useful comments and recommendations including an important example. I also made minor changes in response to comments given by Tomáš Kroupa and Helen Armstrong, who read some parts of the manuscript. As concerns the technical help I would like to thank Václav Kelar for making special LaTeX fonts for me and Jarmila Pánková for helping me to prepare several pages with special pictures. I am likewise grateful to Cheri Dohnal and Antonín Otáhal for correcting my (errors in) English.

I was very pleased by the positive attitude of Stephanie Harding, who is the Mathematics and Statistics Editor at Springer London; the cooperation with her was smooth and effective. She found suitable reviewers for the book and they gave me further useful comments, which helped me to improve the quality of the book. I am also indebted to other colleagues all over the world whose papers, theses and books inspired me somehow in connection with this monograph. In particular, I would like to mention my PhD supervisor, Albert Perez. However, many other colleagues influenced me in addition to those who were already mentioned above. I will name some of them here: Steen Andersson, Luis de Campos, Max Chickering, Robert Cowell, David Cox, Morten Frydenberg, Dan Geiger, Tomáš Havránek, Jan Koster, Ivan Kramosil, Steffen Lauritzen, Franco Malvestuto, Michel Mouchart, Chris Meek, Azaria Paz, Judea Pearl, Michael Perlman, Jean-Marie Rolin, Thomas Richardson, Jim Smith, Glenn Shafer, Prakash Shenoy, David Spiegelhalter, Peter Spirtes, Wolfgang Spohn, Nanny Wermuth, Joe Whittaker, Raymond Yeung and Zhen Zhang. Of course, the above list is not exhaustive; I apologize to anyone whose name may have been omitted.

Let me emphasize that I profited from meeting several colleagues who gave me inspiration during the seminar, "Conditional Independence Structures", which was held from September 27 to October 17, 1999 in the Fields Institute for Research in Mathematical Sciences, University of Toronto, Canada, and during several events organized within the framework of the ESF program, "Highly Structured Stochastic Systems", in the years 1997–2000. In particular, I wish to thank Hélène Massam and Steffen Lauritzen, who gave me a chance to participate actively in these wonderful events. For example, I remember the stimulating atmosphere of the HSSS research kitchen "Learning conditional independence models", held in Třešť, Czech Republic, in October 2000.

Finally, this monograph was written in the Department of Decision-Making Theory of the Institute of Information Theory and Automation (Academy of Sciences of the Czech Republic) in Prague and was supported by the projects GA AVČR n. K1019101 and GAČR n. 201/01/1482. It is a result of long-term research performed in the institute, which has provided a suitable environment for my work since 1983.
Prague, March 2004
Milan Studený
Contents
1 Introduction
  1.1 Motivational thoughts
  1.2 Goals of the monograph
  1.3 Structure of the book

2 Basic Concepts
  2.1 Conditional independence
  2.2 Semi-graphoid properties
    2.2.1 Formal independence models
    2.2.2 Semi-graphoids
    2.2.3 Elementary independence statements
    2.2.4 Problem of axiomatic characterization
  2.3 Classes of probability measures
    2.3.1 Marginally continuous measures
    2.3.2 Factorizable measures
    2.3.3 Multiinformation and conditional product
    2.3.4 Properties of multiinformation function
    2.3.5 Positive measures
    2.3.6 Gaussian measures
    2.3.7 Basic construction
  2.4 Imsets

3 Graphical Methods
  3.1 Undirected graphs
  3.2 Acyclic directed graphs
  3.3 Classic chain graphs
  3.4 Within classic graphical models
    3.4.1 Decomposable models
    3.4.2 Recursive causal graphs
    3.4.3 Lattice conditional independence models
    3.4.4 Bubble graphs
  3.5 Advanced graphical models
    3.5.1 General directed graphs
    3.5.2 Reciprocal graphs
    3.5.3 Joint-response chain graphs
    3.5.4 Covariance graphs
    3.5.5 Alternative chain graphs
    3.5.6 Annotated graphs
    3.5.7 Hidden variables
    3.5.8 Ancestral graphs
    3.5.9 MC graphs
  3.6 Incompleteness of graphical approaches

4 Structural Imsets: Fundamentals
  4.1 Basic class of distributions
    4.1.1 Discrete measures
    4.1.2 Regular Gaussian measures
    4.1.3 Conditional Gaussian measures
  4.2 Classes of structural imsets
    4.2.1 Elementary imsets
    4.2.2 Semi-elementary and combinatorial imsets
    4.2.3 Structural imsets
  4.3 Product formula induced by a structural imset
    4.3.1 Examples of reference systems of measures
    4.3.2 Topological assumptions
  4.4 Markov condition
    4.4.1 Semi-graphoid induced by a structural imset
    4.4.2 Markovian measures
  4.5 Equivalence result

5 Description of Probabilistic Models
  5.1 Supermodular set functions
    5.1.1 Semi-graphoid produced by a supermodular function
    5.1.2 Quantitative equivalence of supermodular functions
  5.2 Skeletal supermodular functions
    5.2.1 Skeleton
    5.2.2 Significance of skeletal imsets
  5.3 Description of models by structural imsets
  5.4 Galois connection
    5.4.1 Formal concept analysis
    5.4.2 Lattice of structural models

6 Equivalence and Implication
  6.1 Two concepts of equivalence
    6.1.1 Independence and Markov equivalence
  6.2 Independence implication
    6.2.1 Direct characterization of independence implication
    6.2.2 Skeletal characterization of independence implication
  6.3 Testing independence implication
    6.3.1 Testing structural imsets
    6.3.2 Grade
  6.4 Invariants of independence equivalence
  6.5 Adaptation to a distribution framework

7 The Problem of Representative Choice
  7.1 Baricentral imsets
  7.2 Standard imsets
    7.2.1 Translation of DAG models
    7.2.2 Translation of decomposable models
  7.3 Imsets of the smallest degree
    7.3.1 Decomposition implication
    7.3.2 Minimal generators
  7.4 Span
    7.4.1 Determining and unimarginal classes
    7.4.2 Imsets with the least lower class
    7.4.3 Exclusivity of standard imsets
  7.5 Dual description
    7.5.1 Coportraits
    7.5.2 Dual baricentral imsets and global view

8 Learning
  8.1 Two approaches to learning
  8.2 Quality criteria
    8.2.1 Criteria for learning DAG models
    8.2.2 Score equivalent criteria
    8.2.3 Decomposable criteria
    8.2.4 Regular criteria
  8.3 Inclusion neighborhood
  8.4 Standard imsets and learning
    8.4.1 Inclusion neighborhood characterization
    8.4.2 Regular criteria and standard imsets

9 Open Problems
  9.1 Theoretical problems
    9.1.1 Miscellaneous topics
    9.1.2 Classification of skeletal imsets
  9.2 Operations with structural models
    9.2.1 Reductive operations
    9.2.2 Expansive operations
    9.2.3 Cumulative operations
    9.2.4 Decomposition of structural models
  9.3 Implementation tasks
  9.4 Interpretation and learning tasks
    9.4.1 Meaningful description of structural models
    9.4.2 Tasks concerning distribution frameworks
    9.4.3 Learning tasks

A Appendix
  A.1 Classes of sets
  A.2 Posets and lattices
  A.3 Graphs
  A.4 Topological concepts
  A.5 Finite-dimensional subspaces and convex cones
    A.5.1 Linear subspaces
    A.5.2 Convex sets and cones
  A.6 Measure-theoretical concepts
    A.6.1 Measure and integral
    A.6.2 Basic measure-theoretical results
    A.6.3 Information-theoretical concepts
    A.6.4 Conditional probability
  A.7 Conditional independence in terms of σ-algebras
  A.8 Concepts from multivariate analysis
    A.8.1 Matrices
    A.8.2 Statistical characteristics of probability measures
    A.8.3 Multivariate Gaussian distributions
  A.9 Elementary statistical concepts
    A.9.1 Empirical concepts
    A.9.2 Statistical conception
    A.9.3 Likelihood function
    A.9.4 Testing statistical hypotheses
    A.9.5 Distribution framework

List of Notation
List of Lemmas, Propositions etc.
References
Index
1 Introduction
The central topic of this book is how to describe the structures of probabilistic conditional independence in such a way that the corresponding mathematical model both has relevant interpretation and offers the possibility of computer implementation. It is a mathematical monograph which found its motivation in artificial intelligence and statistics. In fact, these two fields are the main areas where the concept of conditional independence has been successfully applied. More specifically, graphical models of conditional independence structure are widely used in:
• the analysis of contingency tables, an area of discrete statistics dealing with categorical data;
• multivariate analysis, a branch of statistics investigating mutual relationships among continuous real-valued variables; and
• probabilistic reasoning, an area of artificial intelligence where decision-making under uncertainty is done on the basis of probabilistic models.

A (non-probabilistic) concept of conditional independence was also introduced and studied in several other calculi for dealing with knowledge and uncertainty in artificial intelligence (e.g. relational databases, possibility theory, Spohn's kappa-calculus, Dempster-Shafer's theory of evidence). Thus, the book has a multidisciplinary flavor. Nevertheless, it certainly falls within the scope of informatics or theoretical cybernetics, and the main emphasis is put on mathematical fundamentals. The monograph uses concepts from several branches of mathematics, in particular measure theory, discrete mathematics, information theory and algebra. Occasional links to further areas of mathematics occur throughout the book, for example to probability theory, mathematical statistics, topology and mathematical logic.
1.1 Motivational thoughts

The following "methodological" considerations are meant to explain my motivation. In this section six general questions of interest are formulated which may arise in connection with any particular method for describing conditional independence structures. I think these questions should be answered in order to judge fairly and carefully the quality and suitability of every considered method.

To be more specific, one can assume a general situation, illustrated by Figure 1.1. One would like to describe conditional independence structures (in short, CI structures) induced by probability distributions from a given fixed class of distributions over a set of variables N . For example, we can consider the class of discrete measures over N (see p. 11), the class of regular Gaussian measures over N (see p. 30), the class of conditional Gaussian (CG) measures over N (see p. 66) or any specific parameterized class of distributions. In other words, a certain distribution framework is specified (see Section A.9.5). In probabilistic reasoning, every particular discrete probability measure over N represents "global" knowledge about a (random) system involving variables of N . That means it serves as a knowledge representative. Thus, one can take an even more general point of view and consider a general class of knowledge representatives within an (alternative) uncertainty calculus of artificial intelligence instead of the class of probability distributions (e.g. a class of possibilistic distributions over N , a class of relational databases over N etc.).
[Figure 1.1 – diagram residue removed; the labels recoverable from the original picture are: "Knowledge representatives (probability distributions)", "Objects of discrete mathematics" and "Formal independence models".]
Fig. 1.1. Theoretical fundamentals (an informal illustration).
Every knowledge representative of this kind induces a formal independence model over N (for definition see p. 12). Thus, the class of induced conditional independence models is defined; in other words, the class of CI structures to be described is specified (the shaded area in Figure 1.1). One has in mind a
method for describing CI structures in which objects of discrete mathematics – for example, graphs, finite lattices and discrete functions – are used to describe CI structures. Thus, a certain universum of objects of discrete mathematics is specified. Typical examples are classic graphical models widely used in multivariate analysis and probabilistic reasoning (for details, see Chapter 3). It is supposed that every object of this type induces a formal independence model over N . The intended interpretation is that the object thus "describes" an induced independence model so that it can possibly describe one of the CI structures that should be described. The definition of the induced formal independence model depends on the type of considered objects. Every particular universum of objects of discrete mathematics has its respective criterion according to which a formal independence model is ascribed to a particular object. For example, various separation criteria for classic graphical models were obtained as a result of evolution of miscellaneous Markov properties (see Remark 3.1 in Section 3.1). The evolution has led to the concept of "global Markov property" which establishes a graphical criterion to determine the maximal set of conditional independence statements represented in a given graph. This set is the ascribed formal independence model. The above-mentioned implicit assumption of the existence of the respective criterion is a basic requirement of consistency, that is, the requirement that every object in the considered universum has unambiguously ascribed a certain formal independence model. Note that some recently developed graphical approaches (see Section 3.5.3) still need to be developed up to the concept of a global Markov property so that they will comply with the basic requirement of consistency.

Under the above situation I can formulate the first three questions of interest which, in my opinion, are the most important theoretical questions in this general context.
• The faithfulness question is whether every object from the considered universum of objects of discrete mathematics indeed describes one of the CI structures.
• The completeness question is whether every CI structure can be described by one of the considered objects. If this is not the case an advanced subquestion occurs, namely the task to characterize conveniently those formal independence models which can be described by the objects from the considered universum.
• The equivalence question involves the task of characterizing equivalent objects, that is, objects describing the same CI structure. An advanced subquestion is whether one can find a suitable representative for every class of equivalent objects.

The phrase "faithfulness" was inspired by terminology used by Spirtes et al. [122], where it has similar meaning for graphical objects. Of course, the above notions depend on the considered class of knowledge representatives so that one can differentiate between faithfulness in a discrete distribution framework (= relative to the class of discrete measures) and faithfulness in a Gaussian
distribution framework. Note that for classic graphical models, the faithfulness is usually ensured while the completeness is not (see Section 3.6). To avoid misunderstanding let me explain that some authors in the area of (classic) graphical models, including myself, also used a traditional term “(strong) completeness of a separation graphical criterion” [44, 90, 141, 73]. However, according to the above classification, results of this type are among the results gathered under the label “faithfulness” (customary reasons for traditional terminology are explained in Remark 3.2 on p. 45). Thus, I distinguish between the “completeness of a criterion” on the one hand and the “completeness of a universum of objects” (for the description of a class of CI structures) on the other hand. Now I will formulate three remaining questions of interest which, in my opinion, are the most important practical questions in this context (for an informal illustrative picture see Figure 1.2).
[Figure 1.2 – diagram residue removed; the labels recoverable from the original picture are: DATA, HUMAN and COMPUTER, connected to the THEORETICAL FUNDAMENTALS (see Figure 1.1) by arrows labeled "Learning", "Interpretation" and "Implementation".]
Fig. 1.2. Practical questions (an informal illustration).
• The interpretability question is whether considered objects of discrete mathematics can be conveyed to humans in an acceptable way. That usually means whether or not they can be visualized so that they are understood easily and interpreted correctly as CI structures.
• The learning question is how to determine the most suitable CI structure either on the basis of statistical data (= testing problem) or on the basis of expert knowledge provided by human experts. An advanced statistical subquestion is the task to determine even a particular probability distribution inducing the CI structure, which is equivalent to the problem of "estimation" of parameters of a statistical model.
• The implementation question is how to manage the corresponding computational tasks. An advanced subquestion is whether or not the acceptance of a particular CI structure allows one to do the respective subsequent calculation with probability distributions effectively, namely whether the considered objects of discrete mathematics give guidance in the calculation.

Classic graphical models are easily accepted by humans; however, their pictorial representation may sometimes lead to another interpretation. For example, acyclic directed graphs can either be interpreted as CI structures or one can prefer a "causal" or "deterministic" interpretation of their edges [122], which is different. Concerning computational aspects, an almost ideal framework is provided by the class of decomposable models, which is a special class of graphical models (see Section 3.4.1). This is the basis of the well-known "local computation method" [66] which is at the core of several working probabilistic expert systems [49, 26].

Of course, the presented questions of interest are connected to each other. For example, structure learning from experts certainly depends on interpretation, while (advanced) distribution learning is closely related to the "parameterization problem" (see p. 210), which also has a strong computational aspect. The goal of these motivational thoughts is the idea that the practical questions are ultimately connected with the theoretical basis. Before inspection of practical questions one should first solve the related theoretical questions, in my opinion. Regrettably, some researchers in artificial intelligence (and to a lesser degree, those in statistics) do not pay enough attention to the theoretical grounds and concentrate mainly on practical issues like simplicity of accepted models, either from the point of view of computation or visualization. They usually settle on a certain class of "nice" graphical models (e.g. Bayesian networks – see p. 46) and do not realize that their later technical problems are caused by this limitation. Even worse, limitation to a small class of models may lead to serious methodological errors.

Let me give an example that is my main source of motivation. Consider a hypothetical situation where one is trying to learn the CI structure induced by a discrete distribution on the basis of statistical data. Suppose, moreover, that one is limited to a certain class of graphical models – say, Bayesian networks. It is known that this class of models is not complete in the discrete distribution framework (see Section 3.6). Therefore one searches for the "best approximation". Some of the learning algorithms for graphical models browse through the class of possible graphs as follows. One starts with a graph with the maximum number of edges, performs certain statistical tests for conditional independence statements and represents the acceptance of these statements by removal of certain edges in the graph. This is a correct procedure in the case where the underlying probability distribution indeed induces a CI structure that can be described by a graph within
6
1 Introduction
the considered universum of graphs. However, in general, this edge removal represents the acceptance of a new graphical model together with all other conditional independence statements that are represented in the “new” graph but which may not be valid with respect to the underlying distribution. Again I emphasize that this erroneous acceptance of additional conditional independence statements is made on the basis of a “correctly recognized” conditional independence statement! Thus, this error is indeed forced by the limitation to a certain universum of graphical models which is not complete. Note that an attitude like this has already been criticized within the community of researchers in artificial intelligence (see [159] and Remark 8.1). In my opinion, these recurring problems in solving practical questions of learning are inevitable consequences of the omission of theoretical grounds, namely the question of completeness. This may have motivated several recent attempts to introduce wider and wider classes of graphs which, however, lose easy interpretation and do not achieve completeness. Therefore, in this book, I propose a non-graphical method for describing probabilistic CI structures which primarily solves the completeness problem and has the potential to take care of practical questions.
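Returning to the edge-removal example above: the following small Python sketch of mine (not a procedure from the literature; the variable names and the toy distribution are invented for the illustration) makes the pitfall concrete. It computes the elementary separation statements represented by an undirected graph and shows that deleting a single edge, even for a correctly detected independence, commits one to a different statement entirely.

```python
from itertools import combinations

def separated(adj, a, b, K):
    """Elementary graph separation: every path between a and b meets K."""
    stack, seen = [a], {a} | set(K)
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w == b:
                return False
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return True

def graph_model(adj):
    """All elementary statements (a, b | K) represented by the graph."""
    nodes = sorted(adj)
    model = set()
    for a, b in combinations(nodes, 2):
        others = [v for v in nodes if v not in (a, b)]
        for r in range(len(others) + 1):
            for K in combinations(others, r):
                if separated(adj, a, b, K):
                    model.add((a, b, frozenset(K)))
    return model

# Hypothetical ground truth: a distribution P over {a, b, c} whose only
# valid elementary statement is a || b | {} (e.g. a, b independent fair
# bits and c = min(a, b)). A test correctly accepts a || b | {} and the
# procedure records it by deleting the edge a-b from the complete graph:
adj = {'a': {'c'}, 'b': {'c'}, 'c': {'a', 'b'}}
print(graph_model(adj))   # {('a', 'b', frozenset({'c'}))}
# The resulting graph represents a || b | {c}, which is NOT valid in P,
# while the correctly detected a || b | {} is not represented at all:
# the error is forced by the incomplete universum of graphs, not by the test.
```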
1.2 Goals of the monograph

The aim of the book is threefold. The first goal is to provide an overview of traditional methods for describing probabilistic CI structures. These methods mainly use graphs whose nodes correspond to variables as basic tools for visualization and interpretation. The overview involves basic results about conditional independence, including those published in my earlier papers. The second goal is to present the mathematical basis of an alternative method for describing probabilistic CI structures. The alternative method of structural imsets removes certain basic defects of the classic methods. The third goal is an outline of those directions in which the presented method needs to be developed in order to satisfy the requirements of practical applicability. It involves the list of open problems and promising directions of research.

The text of the monograph may perhaps seem longer and more detailed than necessary from an expert's perspective. The reason for this is that not only top experts in the field and mathematicians are the expected audience. The intention was to write a report which can be read and understood by advanced PhD students in computer science and statistics. This was the main stimulus which compelled me to resolve the dilemma of "understandability" versus "conciseness" in favor of precision and potential understandability.
1.3 Structure of the book

Chapter 2 is an overview of basic definitions, tools and results concerning the concept of conditional independence. These notions, including the notion of an imset, which is a certain integer-valued discrete function, are supposed to form the theoretical basis for the rest of the book.

Chapter 3 is an overview of graphical methods for describing CI structures. Both classic approaches (undirected graphs, acyclic directed graphs and chain graphs) and recent attempts are included. The chapter argues for the conclusion that a non-graphical method achieving completeness (in the sense mentioned on p. 3) is needed.

Chapter 4 introduces a method of this type, namely the method of structural imsets. A class of distributions to which this method is applicable is specified – it is the class of distributions with finite multiinformation – and the concept of a structural imset is defined. The main result of the chapter (Theorem 4.1) says that three possible ways of associating probability distributions and structural imsets are equivalent.

Chapter 5 compares two different, but equivalent, ways of describing CI structures by means of imsets. An algebraic point of view is emphasized in that chapter. It is shown there that every probabilistic CI structure induced by a distribution with finite multiinformation can be described by the method of structural imsets. Moreover, a duality relationship between those two ways of describing CI structures (by imsets) is established. A unifying point of view provided by the theory of formal concept analysis is offered.

Chapter 6 is devoted to an advanced question of equivalence (in the sense mentioned on p. 3) within the framework of structural imsets. A characterization of equivalent imsets is given there and a lot of attention is devoted to implementation tasks. The respective independence implication of structural imsets is characterized in two different ways. One of them allows one to transform the task of computer implementation of independence implication into a standard task of integer programming. Moreover, the question of adaptation of the method of structural imsets to a particular distribution framework is discussed there (Section 6.5).

Chapter 7 deals with the problem of choosing a suitable representative of a class of equivalent structural imsets. Two approaches to this problem are offered. The concept of a baricentral imset seems to be a good solution from a theoretical point of view in the general context, while the concept of a standard imset for an acyclic directed graph seems to be advantageous in the context of classic graphical models.

Chapter 8 concerns the question of learning. It is more an analytic review of methods for learning graphical models than a pure mathematical text. However, the goal is to show that the method of structural imsets can be applied in this area too. A solution to the problem of characterizing inclusion quasi-ordering is offered and the significance of standard imsets in the context of learning is explicated (Section 8.4).
Chapter 9 is an overview of open problems to be studied in order to tackle practical questions (which were mentioned on pp. 4–5). The Appendix is an overview of concepts and facts which are supposed to be elementary and can be omitted by an advanced reader. They are added for several minor reasons: to clarify and unify terminology, to broaden circulation readership and to make reading comfortable as well. For the reader’s convenience two lists are included after the Appendix: the List of Notation and the List of Lemmas, Propositions etc. The text is concluded by the References and the Index.
2 Basic Concepts
Throughout the book the symbol N will denote a non-empty finite set of variables. The intended interpretation is that the variables correspond to primitive factors described by random variables. In Chapter 3 variables will be represented by nodes of a graph. The set N will also serve as the basic set for non-graphical tools of discrete mathematics introduced in this monograph (semi-graphoids, imsets etc.).

Convention 1. The following conventions will be used throughout the book. Given sets A, B ⊆ N the juxtaposition AB will denote their union A ∪ B. The following symbols will be reserved for sets of numbers: R will denote real numbers, Q rational numbers, Z integers, Z+ non-negative integers (including 0) and N natural numbers (that is, positive integers, excluding 0). The symbol |A| will be used to denote the number of elements of a finite set A, that is, its cardinality. The symbol |x| will also denote the absolute value of a real number x, that is, |x| = max {x, −x}. ♦
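These conventions translate directly into executable terms. The following tiny Python sketch is my own illustration (not part of the book); the variable names are invented, and this representation of variable sets is reused in the later sketches:

```python
# Sets of variables as frozensets; juxtaposition AB as union; |A| as len(A).
A = frozenset({'a', 'b'})
B = frozenset({'c'})
AB = A | B                          # juxtaposition AB denotes the union A ∪ B
assert len(AB) == 3                 # |AB|, the cardinality of AB
assert abs(-2.5) == max(-2.5, 2.5)  # |x| = max {x, -x} for a real number x
```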
2.1 Conditional independence

A basic notion of the monograph is a probability measure over N . This phrase will be used to describe the situation in which a measurable space (Xi , Xi ) is given for every i ∈ N and a probability measure P is defined on the Cartesian product of these measurable spaces (∏i∈N Xi , ∏i∈N Xi ). In this case I will use the symbol (XA , XA ) as a shorthand for (∏i∈A Xi , ∏i∈A Xi ) for every ∅ ≠ A ⊆ N . The marginal of P for ∅ ≠ A ⊂ N , denoted by P A , is defined by the formula

P A (A) = P (A × XN\A )   for A ∈ XA .

Moreover, let us accept two natural conventions. First, the marginal of P for A = N is P itself, that is, P N ≡ P . Second, a fully formal convention is that the marginal of P for A = ∅ is a probability measure on a (fixed appended)
measurable space (X∅ , X∅ ) with a trivial σ-algebra X∅ = {∅, X∅ }. Observe that a measurable space of this kind only admits one probability measure P ∅ .

To give the definition of conditional independence within this framework one needs a certain general understanding of the concept of conditional probability. Given a probability measure P over N and disjoint sets A, C ⊆ N , conditional probability on XA given C (more specifically, given XC ) will be understood as a function of two arguments PA|C : XA × XC → [0, 1] which ascribes an XC -measurable function PA|C (A|·) to every A ∈ XA such that

P AC (A × C) = ∫C PA|C (A|x) dP C (x)   for every C ∈ XC .
Note that no restriction concerning the mappings A → PA|C (A|x), x ∈ XC (often called the regularity requirement – see Section A.6.4, Remark A.1) is needed within this general approach. Let me emphasize that PA|C only depends on the marginal P AC and that it is defined, for a fixed A ∈ XA , uniquely within the equivalence P C -almost everywhere (P C -a.e.). Observe that, owing to the convention above, if C = ∅ then the conditional probability PA|C coincides, in fact, with the marginal for A; that means, one has PA|∅ ≡ P A (because a constant function can be identified with its value).

Remark 2.1. The conventions above are in accordance with the following unifying perspective. Realize that for every ∅ ≠ A ⊂ N the measurable space (XA , XA ) is isomorphic to the space (XN , X̄A ) where X̄A ⊆ XN is the coordinate σ-algebra representing the set A, namely

X̄A = {A × XN\A ; A ∈ XA } = {B ∈ XN ; B = A × XN\A for some A ∈ XA } .
Thus, A ⊆ B ⊆ N is reflected by X̄A ⊆ X̄B and it is natural to require that the empty set ∅ is represented by the trivial σ-algebra X̄∅ over XN and N is represented by X̄N = XN . Using this point of view, the marginal P A corresponds to the restriction of P to X̄A , and PA|C corresponds to the concept of conditional probability with respect to the σ-algebra X̄C . Thus, the existence and the uniqueness of PA|C mentioned above follow from basic measure-theoretical facts. For details see the Appendix, Section A.6.4.

Given a probability measure P over N and pairwise disjoint subsets A, B, C ⊆ N one says that A is conditionally independent of B given C with respect to P , and writes A ⊥⊥ B | C [P ], if for every A ∈ XA and B ∈ XB
for P C -a.e. x ∈ XC .
(2.1)
Observe that in case C = ∅ it collapses to a simple equality P AB (A × B) = P A (A) · P B (B), that is, to a classic independence concept. Note that the validity of (2.1) does not depend on the choice of versions of conditional probability given C since these are determined uniquely within equivalence P C -a.e.
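Before turning to the discrete case, here is a small Python sketch of mine (not from the book) of the marginal P A of a discrete measure, stored as a dictionary from configurations to probabilities; the three-variable example measure is invented for the illustration:

```python
from collections import defaultdict
from itertools import product

# A joint measure P over N = {a, b, c}, coordinates ordered (a, b, c):
# a, b are independent fair bits and c is their parity.
P = {(a, b, (a + b) % 2): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(P, coords):
    """The marginal P^A: sum P over the coordinates outside `coords`,
    mirroring the formula P^A(A) = P(A x X_{N\\A})."""
    out = defaultdict(float)
    for x, p in P.items():
        out[tuple(x[i] for i in coords)] += p
    return out

print(dict(marginal(P, [0, 1])))  # P^{ab}: the uniform distribution on {0,1}^2
print(dict(marginal(P, [])))      # P^{∅}: {(): 1.0}, the trivial marginal
```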
Remark 2.2. Let me specify the definition for the case of discrete measures over N , when Xi is a finite non-empty set and Xi = P(Xi ) is the class of all its subsets for every i ∈ N . Then PA|C is determined uniquely exactly on the set {x ∈ XC ; P C ({x}) > 0} by means of the formula PA|C (A|x) =
P AC (A × {x}) / P C ({x})   for every A ⊆ XA ,
so that A ⊥⊥ B | C [P ] is defined as follows: PAB|C (A × B|x) = PA|C (A|x) · PB|C (B|x) for every A ⊆ XA , B ⊆ XB and x ∈ XC with P C ({x}) > 0. Of course, A and B can be replaced by singletons. Note that the fact that the equality P C -a.e. coincides with the equality on a certain fixed set is a speciality of the discrete case. Other common equivalent definitions of conditional independence are mentioned in Section 2.3.

However, the concept of conditional independence is not exclusively a probabilistic concept. This concept was introduced in several non-probabilistic frameworks, namely in various calculi for dealing with uncertainty in artificial intelligence – for details and an overview see [133, 117, 31]. Formal properties of the respective conditional independence concepts may differ in general, but an important fact is that certain basic properties of conditional independence appear to be valid in all these frameworks.
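The discrete-case definition from Remark 2.2 is directly implementable. The following Python sketch is my own illustration (not from the book): it tests an elementary statement i ⊥⊥ j | K by cross-multiplying marginals, which avoids division and covers the case C = ∅ uniformly; the example measure is the invented parity distribution from the earlier sketch.

```python
from collections import defaultdict
from itertools import product

P = {(a, b, (a + b) % 2): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(P, coords):
    out = defaultdict(float)
    for x, p in P.items():
        out[tuple(x[i] for i in coords)] += p
    return out

def ci(P, i, j, K, tol=1e-9):
    """Check i || j | K: for every x_K with P^K({x_K}) > 0 require
    P^{ijK} * P^K == P^{iK} * P^{jK} pointwise."""
    pk, pijk = marginal(P, K), marginal(P, [i, j] + K)
    pik, pjk = marginal(P, [i] + K), marginal(P, [j] + K)
    vals = lambda c: sorted({x[c] for x in P})
    for xi, xj in product(vals(i), vals(j)):
        for xk, pK in pk.items():
            if pK > 0 and abs(pijk[(xi, xj) + xk] * pK
                              - pik[(xi,) + xk] * pjk[(xj,) + xk]) > tol:
                return False
    return True

print(ci(P, 0, 1, []))    # True:  a || b | {}
print(ci(P, 0, 1, [2]))   # False: a || b | {c}, since c = a XOR b couples them
```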
2.2 Semi-graphoid properties

Several authors independently drew attention to the above-mentioned basic formal properties of conditional independence. In modern statistics, they were first accentuated by Dawid [29], then mentioned by Mouchart and Rolin [93], and van Putten and van Schuppen [103]. Spohn [124] interpreted them in the context of philosophical logic. Finally, their significance in the (probabilistic approach to) artificial intelligence was discerned and highlighted by Pearl and Paz [99]. Their terminology [100] was later widely accepted, so that researchers in artificial intelligence started to call them the semi-graphoid properties.

2.2.1 Formal independence models

Formally, a conditional independence statement over N is a statement of the form "A is conditionally independent of B given C" where A, B, C ⊆ N are pairwise disjoint subsets of N . A statement of this kind should always be understood with respect to a certain mathematical object o over N , for example, a probability measure over N . However, several other objects can occur in place of o; for example, a graph over N (see Chapter 3), a possibility
distribution over N [18, 149], a relational database over N [112] and a structural imset over N (see Section 4.4.1). The notation A ⊥⊥ B | C [o] will be used in those cases, but the symbol [o] can be omitted if it is suitable. Thus, every conditional independence statement corresponds to a disjoint triplet over N , that is, a triplet ⟨A, B|C⟩ of pairwise disjoint subsets of N . Here, the punctuation anticipates the intended role of the component sets. The third component, written after the straight line, is the conditioning set while the two former components are independent areas, usually interchangeable. The formal difference is that a triplet of this kind can be interpreted either as the corresponding independence statement or, alternatively, as its negation, that is, the corresponding dependence statement. Occasionally, I will use the symbol A ⊥̸⊥ B | C [o] to denote the dependence statement which corresponds to ⟨A, B|C⟩. The class of all disjoint triplets over N will be denoted by T (N ).

Having established the concept of conditional independence within a certain framework of mathematical objects over N , every object o of this kind defines a certain set of disjoint triplets over N , namely

Mo = { ⟨A, B|C⟩ ∈ T (N ) ; A ⊥⊥ B | C [o] }.

Let us call this set of triplets the conditional independence model induced by o. This phrase is used to indicate that the involved triplets are interpreted as independence statements, although from a purely mathematical point of view it is nothing but a subset of T (N ). A subset M ⊆ T (N ) interpreted in this way will be called a formal independence model. Thus, the conditional independence model induced by a probability measure P over N (according to the definition from Section 2.1) is a special case. On the other hand, any class M ⊆ T (N ) of disjoint triplets over N can be formally interpreted as a conditional independence model if one defines

A ⊥⊥ B | C [M] ≡ ⟨A, B|C⟩ ∈ M .

The restriction of a formal independence model M over N to a non-empty set ∅ ≠ T ⊆ N will be understood as the set M ∩ T (T ), denoted by MT . Evidently, the restriction of a (probabilistic) conditional independence model is again a conditional independence model (induced by the marginal).

Remark 2.3. I should explain my limitation to disjoint triplets over N , since some authors, e.g. Dawid [33], do not make this restriction at all. For simplicity of explanation consider a discrete probabilistic framework. Indeed, given a discrete probability measure P over N , the statement A ⊥⊥ B | C [P ] can also be defined for non-disjoint triplets A, B, C ⊆ N in a reasonable way [41, 81]. However, then the statement A ⊥⊥ A | C [P ] has a specific interpretation, namely that the variables in A are functionally dependent on the variables in C (with respect to P ), so that it can be interpreted as a functional dependence statement. Let us note (cf. § 2 in [81]) that one can easily derive that

A ⊥⊥ B | C [P ]  ⇔  { A \ C ⊥⊥ B \ AC | C [P ]  and  (A ∩ B) \ C ⊥⊥ (A ∩ B) \ C | C ∪ (B \ A) [P ] }.
Thus, every statement A ⊥⊥ B | C of a general type can be "reconstructed" from functional dependence statements and from pure conditional independence statements described by disjoint triplets. The topic of this monograph is pure conditional independence structures; therefore I limit myself to pure conditional independence statements.

Remark 2.4. To avoid misunderstanding, the reader should be aware that the noun model may have any of three different meanings in this monograph. First, it can be used in its general sense, in which case it is usually used without an adjective. Second, it can be a part of the phrase "(formal) independence model", in which case the word independence indicates that one has in mind the concept introduced in this section. Note that this terminology comes from the area of artificial intelligence – see Pearl [100]. Third, it can be a part of the phrase "statistical model", in which case the adjective statistical indicates that one has in mind the concept mentioned in Section A.9.2, that is, a class of probability measures. Note that this terminology is often used in statistics – see Remark A.3 for a more detailed explanation. However, there is a simple reason why two different concepts are named by the same noun. The reason is that every formal independence model M ⊆ T (N ) can be understood as a statistical model ℳ, provided that a distribution framework Ψ (see Section A.9.5) is fixed. Indeed, one can put

ℳ = { P ∈ Ψ ; A ⊥⊥ B | C [P ] whenever ⟨A, B|C⟩ ∈ M } .
Every statistical model of this kind will be called the statistical model of a CI structure. Note that this concept generalizes the classic concept of a graphical model [157, 70]. Indeed, the reader can learn in Chapter 3 that a graph G having N as the set of nodes usually induces the class ℳG of Markovian measures over N , that is, a statistical model. This graphical statistical model is, however, defined by means of the formal independence model MG . Note that the class ℳG is often introduced in another way – see Section 8.2.1 for equivalent definitions in the case of acyclic directed graphs in terms of recursive factorization and in terms of parameterization.

2.2.2 Semi-graphoids

By a disjoint semi-graphoid over N is understood any set M ⊆ T (N ) of disjoint triplets over N (interpreted as independence statements) such that the following conditions hold for every collection of pairwise disjoint sets A, B, C, D ⊆ N :
1. triviality: A ⊥⊥ ∅ | C [M],
2. symmetry: A ⊥⊥ B | C [M] implies B ⊥⊥ A | C [M],
3. decomposition: A ⊥⊥ BD | C [M] implies A ⊥⊥ D | C [M],
4. weak union: A ⊥⊥ BD | C [M] implies A ⊥⊥ B | DC [M],
5. contraction: A ⊥⊥ B | DC [M] and A ⊥⊥ D | C [M] implies A ⊥⊥ BD | C [M].
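The five properties can be read as inference rules, and for a small N one can compute the semi-graphoid closure of a set of triplets by brute force. The following Python sketch is my own illustration (not an algorithm from the book); it is a naive fixed-point iteration, adequate only for tiny variable sets:

```python
from itertools import combinations

def subsets(s):
    """All subsets of a finite set, as frozensets."""
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def closure(stmts, N):
    """Close a set of disjoint triplets (A, B, C) over N under the five
    semi-graphoid properties."""
    M = set(stmts)
    for A in subsets(N):                            # 1. triviality
        for C in subsets(N - A):
            M.add((A, frozenset(), C))
    while True:
        new = set()
        for A, B, C in M:
            new.add((B, A, C))                      # 2. symmetry
            for D in subsets(B):
                if 0 < len(D) < len(B):
                    new.add((A, D, C))              # 3. decomposition
                    new.add((A, D, C | (B - D)))    # 4. weak union
        for A, B, C1 in M:                          # 5. contraction:
            for A2, D, C in M:                      # A||B|DC and A||D|C
                if A2 == A and D and C1 == C | D and not (D & B):
                    new.add((A, B | D, C))          # give A||BD|C
        if new <= M:
            return M
        M |= new

fs = frozenset
N = fs({'a', 'b', 'c'})
M = closure({(fs({'a'}), fs({'b', 'c'}), fs())}, N)
assert (fs({'a'}), fs({'b'}), fs({'c'})) in M   # by weak union
assert (fs({'c'}), fs({'a'}), fs()) in M        # by decomposition + symmetry
```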
Note that the terminology above was proposed by Pearl [100], who formulated the formal properties above in the form of inference rules, gave them special names and interpretation, and called them the semi-graphoid axioms. Of course, the restriction of a semi-graphoid over N to T (T ) for non-empty T ⊆ N is a semi-graphoid over T . The following fact is important.

Lemma 2.1. Every conditional independence model MP induced by a probability measure P over N is a disjoint semi-graphoid over N .

Proof. This can be derived easily from Corollary A.2 proved in the Appendix (see p. 235). Indeed, having a probability measure P over N defined on a measurable space (XN , XN ) one can identify every subset A ⊆ N with a coordinate σ-algebra X̄A ⊆ XN as described in Remark 2.1. Then, for a disjoint triplet ⟨A, B|C⟩ over N , the statement A ⊥⊥ B | C [P ] is equivalent to the requirement X̄A ⊥⊥ X̄B | X̄C [P ] introduced in Section A.7. Having in mind that X̄AB = X̄A ∨ X̄B for A, B ⊆ N , the rest follows from Corollary A.2.

Note that the above-mentioned fact is not a special feature of the probabilistic framework. Conditional independence models occurring within other uncertainty calculi (in artificial intelligence) mentioned at the end of Section 2.1 are also (disjoint) semi-graphoids. Even various graphs over N induce semi-graphoids, as explained in Chapter 3.

Remark 2.5. The limitation to disjoint triplets in the definition of a semi-graphoid is not substantial. One can introduce an abstract semi-graphoid on a join semi-lattice (S, ∨) as a ternary relation · ⊥⊥ · | · over elements A, B, C, D of S satisfying
• A ⊥⊥ B | C whenever B ∨ C = C,
• A ⊥⊥ B | C iff B ⊥⊥ A | C,
• A ⊥⊥ B ∨ D | C iff [ A ⊥⊥ B | D ∨ C and A ⊥⊥ D | C ].
Taking S = P(N ) one obtains the definition of a non-disjoint semi-graphoid over N . A more complicated example is the semi-lattice of all σ-algebras A ⊆ X in a measurable space (X, X ) and the relation ⊥⊥ of conditional independence for σ-algebras with respect to a probability measure on (X, X ) (see Corollary A.2). Note that the above concept of an abstract semi-graphoid is essentially equivalent to the concept of a separoid introduced by Dawid [33], which is a mathematical structure unifying a variety of notions of "conditional independence" arising in probability, statistics, artificial intelligence and other fields.

Let me conclude this remark by a note which indicates the obstacles that authors in mathematics meet if they want to establish new terminology. Pearl and Paz [99] decided to use the word "graphoid" to name a new concept they introduced (see p. 29 for this concept). However, it appeared that this word had already been "occupied": it was used to name one of the equivalent definitions of a matroid [155]. One of the motives which led Dawid [33] to use the word
2.2 Semi-graphoid properties
15
“separoid” to name his general concept was to avoid a terminological clash. However, it appeared that this word had also been used independently by Strausz [128] to name a certain abstract binary relation between sets whose aim is to generalize geometric separation of sets in Rn by hyperplanes. An interesting observation is that, by coincidence, there is a weak connection between two concepts of a separoid. For example, an undirected graph G and the relation of separation for sets of nodes in G, which is defined as in Section 3.1 but non-disjoint sets are allowed, can give an example of both separoids. The difference is that Dawid’s separoid is a ternary relation A ⊥ ⊥ B | C [G] while a binary relation A ⊥ ⊥ B | ∅ [G] can serve as an example of Strausz’s separoid. 2.2.3 Elementary independence statements To store a semi-graphoid over N in the memory of a computer it is not necessary to allocate all |T (N )| = 4|N | bits. A more economic way of their representation is possible. For example, one can omit trivial statements which correspond to triplets A, B|C over N with A = ∅ or B = ∅. Let us denote the class of trivial disjoint triplets over N by Tø (N ). However, independence statements of principal importance are elementary statements, which correspond to elementary triplets, that is, disjoint triplets
A, B|C over N where both A and B are singletons (cf. [3, 79]). A simplifying convention will be used in this case: braces in singleton notation will be omitted so that a, b|K or a ⊥ ⊥ b | K will be written only. The class of elementary triplets over N will be denoted by T (N ). Lemma 2.2. Suppose that M is a disjoint semi-graphoid over N . Then, for every disjoint triplet A, B|C over N , one has A ⊥⊥ B | C [M] iff the following condition holds ∀a ∈ A ∀b ∈ B
∀ C ⊆ K ⊆ ABC \ {a, b}
a⊥ ⊥ b | K [M].
(2.2)
In particular, every semi-graphoid is determined by its “trace” within the class of elementary triplets, that is, by the intersection with T (N ). Moreover, if M1 , M2 are semi-graphoids over N then M1 ∩ T (N ) ⊆ M2 ∩ T (N ) is equivalent to M1 ⊆ M2 . Proof. (see also [79]) The necessity of the condition (2.2) is easily derivable using the decomposition and the weak union properties combined with the symmetry property. For converse implication suppose (2.2) and that A, B|C is not a trivial triplet over N (otherwise it is evident). Use induction on |AB|; the case |AB| = 2 is evident. Supposing |AB| > 2 either A or B is not a singleton. Owing to the symmetry property one can consider without the loss of generality |B| ≥ 2, choose b ∈ B and put B = B \ {b}. By the induction assumption, (2.2) implies both A ⊥ ⊥ b | B C [M] and A ⊥⊥ B | C [M]. Hence, by application of the contraction property A ⊥ ⊥ B | C [M] is derived.
16
2 Basic Concepts
Sometimes, an elementary statement mode of representing a semi-graphoid, that is, by the list of contained elementary triplets, is more suitable. The characterization of those collections of elementary triplets which represent semi-graphoids is given in Proposition 1 of Mat´ uˇs [79]. Remark 2.6. Another reduction of memory demands for semi-graphoid representation follows from the symmetry property. Instead of keeping a pair of mutually symmetric statements a ⊥⊥ b | K and b ⊥⊥ a | K one can choose only one of them according to a suitable criterion. In particular, to represent a semi-graphoid over N with |N | = n it suffices to have only n · (n − 1) · 2n−3 bits. Note that the idea above is also reflected in Section 4.2.1 where just one elementary imset corresponds to a “symmetric” pair of elementary statements. However, further reduction of the class of considered statements is not possible. The reason is as follows: every elementary triplet a, b|K over N generates a semi-graphoid over N consisting of a, b|K, its symmetric image
b, a|K and trivial triplets over N (cf. Lemmas 4.6 and 4.5). In fact, these are minimal non-trivial semi-graphoids over N and one has to distinguish them from other semi-graphoids over N . These observations influenced the terminology: the adjective “elementary” is used to indicate the respective disjoint triplets and independence statements. 2.2.4 Problem of axiomatic characterization Pearl and Paz [99, 100] formulated a conjecture that semi-graphoids coincide with conditional independence models induced by discrete probability measures. However, this conjecture was refuted in Studen´ y [130] by finding a further formal property of these models, which is not derivable from semigraphoid properties, namely [A ⊥ ⊥ B | CD and C ⊥ ⊥ D | A and C ⊥ ⊥ D | B and A ⊥ ⊥ B |∅] ⇔
⇔
[C ⊥ ⊥ D | AB and A ⊥ ⊥ B | C and A ⊥ ⊥ B | D and C ⊥ ⊥ D | ∅ ].
Another formal property of this sort was later derived in An et al. [3]. Consequently, a natural question occurred. Can conditional independence models arising in a discrete probabilistic setting be characterized in terms of a finite number of formal properties of this type? This question is known as the problem of axiomatic characterization because a result of this kind would have been a substantial step towards a syntactic description of these models in the sense of mathematical logic. Indeed, as explained in § 5 of Studen´ y [132], then it would have been possible to construct a deductive system that is an analog of the notion of a “formal axiomatic theory” from Mendelson [92]. The considered formal properties then would have played the role of syntactic inference rules of an axiomatic theory of this sort. Unfortunately, the answer to the question above is also negative. It was shown in Studen´ y [132] (for a more didactic proof see [144]) that, for every n ∈ N, there exists a formal property
2.3 Classes of probability measures
17
of (discrete) probabilistic conditional independence models which applies to a set of variables N with |N | = n but which cannot be revealed on a set of smaller cardinality. Note that a basic tool for derivation of these properties was the multiinformation function introduced in Section 2.3.4. On the other hand, having fixed N , a finite number of possible probabilistic conditional independence models over N suggests that they can be characterized in terms of a finite number of formal properties of semi-graphoid type. Thus, a related task is, for a small cardinality of N , to characterize them in that way. It is no problem to verify that they coincide with semi-graphoids in the case |N | = 3 (see Figure 5.6 for illustration). Discrete probabilistic conditional independence models over N with |N | = 4 were characterized in a series of papers by Mat´ uˇs [84, 85, 87]; for an overview see Studen´ y and Boˇcek [136] where the respective formal properties of these models are explicitly formulated – one has 18300 different models of this kind and these can be characterized by more than 28 formal properties. Remark 2.7. On the other hand, several results on relative completeness of semi-graphoid properties were achieved. In Geiger et al. [45] and independently in Mat´ uˇs [82] models of “unconditional” stochastic independence (that is, submodels consisting of unconditioned independence statements of the form A⊥ ⊥ B | ∅ ) were characterized by means of properties derivable from the semigraphoid properties. An analogous result for the class of saturated or fixedcontext conditional independence statements – that is, statements A ⊥ ⊥ B |C with ABC = N – was achieved independently by Geiger and Pearl [46] and by Malvestuto [77]. The result from Studen´ y [138] can be interpreted as a specific relative-completeness result, saying that the semi-graphoid generated by a pair of conditional independence statements is always a conditional independence model induced by a discrete probability measure. Note that the problem of axiomatic characterization of CI models mentioned above differs from the problem of axiomatization (in the sense of mathematical logic) of a single CI structure over an infinite set of variables N , which was treated in Kramosil [62].
2.3 Classes of probability measures There is no uniformly accepted conception of the notion of a probability distribution in the literature. In probability theory, authors usually understand by a distribution of a (n-dimensional real) random vector an induced probability measure on the respective sample space (Rn endowed with the Borel σ-algebra), that is, a set function on the sample (measurable) space. On the other hand, authors in artificial intelligence usually identify a distribution of a (finitely valued) random vector with a pointwise function on the respective (finite) sample space, ascribing probability to every configuration of values (= to every element of the sample space i∈N Xi , where Xi are finite sets). In
18
2 Basic Concepts
statistics, either the meaning wavers between these two basic approaches, or authors even avoid the dilemma by describing specific distributions directly by their parameters (e.g., elements of the covariance matrix of a Gaussian distribution). Therefore, no exact meaning is assigned to the phrase “probability distribution” in this book; it is used only in its general sense, mainly in vague motivational parts. Moreover, terminological distinction is made between those two above-mentioned approaches. The concept of a probability measure over N from Section 2.1 more likely reflects the first approach, which is more general. To relate this to the second approach one has to make an additional assumption on a probability measure P so that it can also be described by a pointwise function, called the density of P . Note that many authors simply make an assumption of this type implicitly without mentioning it. All probability measures over N
Marginally continuous measures
'
$
Measures with finite multiinformation
'
$
Discrete measures Positive measures
'
$
Regular Gaussian measures
& & &
%
% %
Fig. 2.1. A comparison of basic classes of probability measures over N .
In this section, basic facts about these special probability measures are recalled and several important subclasses of the class of measures having density, called “marginally continuous measures”, are introduced. One of them, the class of measures with finite multiinformation, is strongly related to the method of structural imsets described in later chapters. The informationtheoretical methods are applicable to measures belonging to this class which, fortunately, involves typical measures used in practice. Inclusion relationships among introduced classes of measures are depicted in Figure 2.1.
2.3 Classes of probability measures
19
2.3.1 Marginally continuous measures A probability measure P over N is marginally continuous if it is absolutely continuouswith respect to the product of its one-dimensional marginals, that is, P i∈N P {i} . The following lemma contains an apparently weaker equivalent definition. Lemma 2.3. A probability measure P on (XN , XN ) is marginally continuous iff there exists a collection of σ-finite measures µi on (Xi , Xi ), i ∈ N such that P i∈N µi . Proof. (see also § 1.2.2 in [37])It was shown in [130], Proposition 1, that in {i} iff there are probability measures λi the case |N | = 2 one has P i∈N P on (Xi , Xi ) with P i∈N λi . One can easily show that for every non-zero σ-finite measure µi on (Xi , Xi ) a probability measure λi on (Xi , Xi ) with µi to the requirement λi µi exists. Hence, the condition above is equivalent for the existence of σ-finite measures µi with P i∈N µi . Finally, one can use the induction on |N | to get the desired conclusion. Thus, the marginal continuity of P is equivalent to the existence of a dominating measure µ for P , that is, the product µ = i∈N µi of some σ-finite measures µi on (Xi , Xi ), i ∈ N such that P µ. In particular, every discrete measure over N is marginally continuous since the counting measure on XN can serve as its dominating measure. Note that nearly all multidimensional measures used in practice are marginally continuous (see Sections 2.3.5, 2.3.6 and 4.1.3 for other examples). However, there are probability measures over N which are not marginally continuous; in particular, some singular Gaussian measures – see Example 2.3 on p. 35. Having fixed a dominating measure µ for a marginally continuous measure P over N by a density of P with respect to µ will be understood (every version of) the Radon-Nikodym derivative of P with respect to µ. Remark 2.8. Let us note without explaining details (see Remark 1 in [130]) that the assumption that a probability measure P over N is marginally continuous also implies that, for every disjoint A, C ⊆ N , there exists a regular version of conditional probability PA|C on XA given XC in the sense of Lo´eve [74]. The regularity of conditional probability is usually derived as a consequence of special topological assumptions on (Xi , Xi ), i ∈ N (see the Appendix, Remark A.1). Thus, the marginal continuity is a non-topological assumption implying the regularity of conditional probabilities. The concept of marginal continuity is closely related to the concept of a dominated experiment in Bayesian statistics – see § 1.2.2 and § 1.2.3 in the book by Florens et al. [37]. The next step is an equivalent definition of conditional independence for marginally continuous measures in terms of densities. To formulate it in an elegant way, let us accept the following (notational) conventions.
20
2 Basic Concepts
Convention 2. Suppose that a marginally continuous probability measure P on (XN , XN ) is given. Let us fix one-dimensional σ-finite measures which define a dominating measure µ for P . More specifically, P µ ≡ i∈N µi where µi is a σ-finite measure on (Xi , Xi ) for every i ∈ N . Then, for every ∅ = A ⊆ N , we put µA = i∈A µi , choose a version fA of the Radon-Nikodym derivative dP A /dµA , and fix it. The function fA will be called a marginal density of P for A. It is an XA -measurable function on the set XA . In order to be also able to understand fA as a function on XN , let us accept the following notation. Given ∅ = A ⊆ B ⊆ N and x ∈ XB , the symbol xA will denote the projection of x onto A, that is, xA = [xi ]i∈A whenever x = [xi ]i∈B . The last formal convention concerns the marginal density f∅ for the empty set. It should be a constant function on (an appended) trivial measurable space (X∅ , X∅ ). Thus, in the formulas below, one can simply put f∅ (x∅ ) ≡ 1 for every ♦ x ∈ XB , ∅ = B ⊆ N . Remark 2.9. This is to explain the way of defining marginal densities in Convention 2. First, let me emphasize that the marginal density is not the RadonNikodym derivative of respective marginals of Pand µ since µA = i∈A µi need not coincide with the marginal µA of µ = i∈N µi unless every µi is a probability measure. Indeed, a marginal of a σ-finite measure may not be a σ-finite measure (e.g., µ∅ in the case µ(XN ) = ∞) so that the Radon-Nikodym derivative dP A /dµA may not exist. Instead, one can take the following point of view. Let us fix a density f = dP/dµ and introduce, for every ∅ = A ⊂ N , its “projection” f ↓A as a function on XA defined µA -almost everywhere (µA -a.e) as follows: ↓A f (y) = f (y, z) dµN \A (z) for y ∈ XA . XN \A
One can easily conclude using the Fubini theorem that f ↓A = dP A /dµA in the sense µA -a.e., so that there is no substantial difference between f ↓A and any version of the marginal density fA . The convention for the empty set saying f (x) dµ(x) = 1 . f ↓∅ () = XN
follows this line.
Lemma 2.4. Let P be a marginally continuous measure over N . Let us accept Convention 2. Given A, B|C ∈ T (N ) one has A ⊥ ⊥ B | C [P ] iff the following equality holds fABC (xABC ) · fC (xC ) = fAC (xAC ) · fBC (xBC )
for µ-a.e. x ∈ XN . (2.3)
2.3 Classes of probability measures
21
Proof. Note that minor omitted details of the proof (e.g. verification of equalities µ-a.e.) can be verified with the aid of basic measure-theoretical facts gathered in Section A.6. I. First, choose and fix a density f : XN → [0, ∞) of P such that f ↓A (xA ) ≡ f (xA , y) dµN \A (y) < ∞ , ∀ ∅ = A ⊂ N ∀ x ∈ XN XN \A
and, moreover, for every disjoint A, C ⊆ N , one has f ↓C (xC ) = 0
∀ x ∈ XN
⇒
f ↓AC (xAC ) = 0 ,
(2.4)
where conventions f ↓N = f and f ↓∅ ≡ 1 are accepted. Indeed, these relationships hold µ-a.e. for every version f of dP/dµ and every version can be overdefined by 0 whenever these relationships do not hold. It is no problem to verify that f ↓A = dP A /dµA for every ∅ = A ⊆ N . II. Second, for every disjoint pair of sets A, C ⊆ N , introduce a function hA|C : XA × XC → [0, ∞) as follows: ↓AC f (xz) if f ↓C (z) > 0, f ↓C (z) hA|C (x|z) = for x ∈ XA , z ∈ XC . 0 if f ↓C (z) = 0, One can verify using the Fubini theorem (for µA × P C ), the Radon-Nikodym theorem (for f ↓C = dP C /dµC ), again the Fubini theorem (for µC × µA ) and the Radon-Nikodym theorem (for f ↓AC = dP AC /dµAC ) that the function hA|C (x|z) dµA (x) where A ∈ XA , z ∈ XC , (A, z) → PA|C (A|z) ≡ A
is (a version of) the conditional probability on XA given XC . III. Realize that (2.3) can be written as follows (see Remark 2.9): f ↓ABC (xABC ) · f ↓C (xC ) = f ↓AC (xAC ) · f ↓BC (xBC )
(2.5)
for µ-a.e. x ∈ XN . Further, this can be rewritten in the form hAB|C (xAB |xC ) · f ↓C (xC ) = hA|C (xA |xC ) · hB|C (xB |xC ) · f ↓C (xC )
(2.6)
for µ-a.e. x ∈ XN . Indeed, owing to (2.4), (2.5) and (2.6) are trivially valid on the set {x ∈ XN ; f ↓C (xC ) = 0} while they are equivalent on its complement. IV. The next step is to observe that (2.6) is equivalent to the requirement that ∀ A ∈ XA , ∀ B ∈ XB , ∀ C ∈ XC it holds hAB|C (xAB |xC ) dµAB (xAB ) dP C (xC ) = C A×B
hA|C (xA |xC ) dµA (xA ) ·
= C
A
hB|C (xB |xC ) dµB (xB ) dP C (xC ) . B
22
2 Basic Concepts
Indeed, as mentioned in Section A.6.1 the equality in (2.6) is equivalent to the requirement that their integrals with respect to µABC over all measurable rectangles A × B × C coincide. This can be rewritten using the Fubini theorem, the Radon-Nikodym theorem and basic properties of the Lebesgue integral in the form above. V. As explained in Step II, the last equation can be understood as follows: PAB|C (A × B|z) dP C (z) = PA|C (A|z) · PB|C (B|z) dP C (z) . (2.7) C
C
Having fixed A ∈ XA and B ∈ XB the equality (2.7) for every C ∈ XC is equivalent to the condition that the integrated functions are equal P C -a.e. Hence, one can conclude that the condition (2.1) from p. 10 holds for every ⊥ B | C [P ]. A ∈ XA and B ∈ XB , that is, A ⊥ Let us observe that, in (2.3), one can write “for µABC -a.e. x ∈ XABC ” instead. Of course, the validity of (2.3) trivially does not depend on the choice of (versions) of densities. The point of Lemma 2.4 is that it does not even depend on the choice of a dominating measure µ since A ⊥⊥ B | C [P ] does depend on it as well. Note that this fact may not be so apparent when one tries to introduce the concept of conditional independence directly by means of marginal densities. 2.3.2 Factorizable measures Let ∅ = D ⊆ P(N ) \ {∅} be a non-empty class of non-empty subsets of N and D = T ∈D . We say that a marginally continuous measure P over N factorizes after D (relative to a dominating measure µ for P D ) if the (respective) marginal density of P for D can be expressed in the form fD (xD ) = gS (xS ) for µ-a.e. x ∈ XN , (2.8) S∈D
where gS : XS → [0, ∞), S ∈ D are XS -measurable functions, called potentials. An equivalent formulation is that there exists a version of fD of dP D /dµ and potentials gS such that (2.8) holds for every x ∈ XN . In fact, the factorization does not depend on the choice of a dominating measure µ. One can show that the validity of (2.8) relative to a general dominating product measure µ = i∈D µi where all µi are σ-finite, is equivalent to the validity of (2.8) relative to i∈D P {i} and with other potentials (this can be verified with the help of Lemma 2.3). Of course, the factorization after D is equivalent to the factorization after Dmax , and potentials are not unique unless |D| = 1. Further equivalent definition of conditional independence for marginally continuous measures is formulated in terms of factorization (see also [70], § 3.1).
2.3 Classes of probability measures
23
Lemma 2.5. Let P be a marginally continuous measure over N , µ a dominating measure for P ABC and A, B|C a disjoint triplet over N . Then A ⊥ ⊥ B | C [P ] if and only if P factorizes after D = {AC, BC} relative to µ. More specifically, if Convention 2 is accepted then A ⊥⊥ B | C [P ] iff there exist an XAC -measurable function g : XAC → [0, ∞) and an XBC -measurable function h : XBC → [0, ∞) such that fABC (xABC ) = g(xAC ) · h(xBC ) for µ-a.e. x ∈ XN .
(2.9)
Proof. One can use Lemma 2.4. Clearly, (2.3) ⇒ (2.9) where g = fAC and fBC (xBC ) if fC (xC ) > 0 , fC (xC ) h(xBC ) = for x ∈ XN , 0 if fC (xC ) = 0 , because for µ-a.e. x ∈ XN one has fC (xC ) = 0 ⇒ fBC (xBC ) = 0. For the proof of (2.9) ⇒ (2.3) one can first repeat Step I in the proof of Lemma 2.4 (see p. 21), that is, to choose a suitable version f of the density. Then (2.9) can be rewritten in the form f ↓ABC (xABC ) = g (xAC ) · h (xBC )
for µ-a.e. x ∈ XN .
(2.10)
Now, using the Fubini theorem and basic properties of the integral mentioned in Section A.6.1, one can derive from (2.10) by integrating ⎫ f ↓AC (xAC ) = g (xAC ) · h↓C (xC ) for µ-a.e. x ∈ XN , ⎪ ⎬ ↓BC ↓C f (xBC ) = g (xC ) · h(xBC ) for µ-a.e. x ∈ XN , (2.11) ⎪ ⎭ ↓C ↓C ↓C f (xC ) = g (xC ) · h (xC ) for µ-a.e. x ∈ XN , where the functions g ↓C (xC ) = h↓C (xC ) =
X A
g(y, xC ) dµA (y), h(z, xC ) dµB (z)
for xC ∈ XC ,
XB
are finite µC -a.e. (according to the Fubini theorem, owing to (2.10) and the fact that f ↓ABC is µABC -integrable). Thus, (2.10) and (2.11) give together f ↓ABC (xABC ) · f ↓C (xC ) = g(xAC ) · h(xBC ) · g ↓C (xC ) · h↓C (xC ) = = f ↓AC (xAC ) · f ↓BC (xBC ) for µ-a.e. x ∈ XN , which is equivalent to (2.3).
As a consequence, one can derive a certain formal property of conditional independence which was already mentioned in the discrete case (see [3, 125] and Proposition 4.1 in [81]).
24
2 Basic Concepts
Corollary 2.1. Suppose that P is a marginally continuous measure over N and A, B, C, D ⊆ N are pairwise disjoint sets. Then C⊥ ⊥ D | AB [P ], A ⊥ ⊥ B | ∅ [P ], A ⊥ ⊥ B | C [P ], A ⊥⊥ B | D [P ] implies A ⊥ ⊥ B | CD [P ] . Proof. It follows from Lemma 2.4 that the assumption C ⊥ ⊥ D | AB can be rewritten in terms of marginal densities as follows (throughout this proof I write f (xS ) instead of fS (xS ) for any S ⊆ N ): f (xABCD ) · f (xAB ) · f (x∅ ) · f (xC ) · f (xD ) = = f (xABC ) · f (xABD ) · f (x∅ ) · f (xC ) · f (xD ) for µ-a.e. x ∈ XN . Now, again using Lemma 2.4, the assumptions A ⊥⊥ B | ∅, A ⊥ ⊥ B | C and A⊥ ⊥ B | D imply that f (xABCD ) · f (xA ) · f (xB ) · f (xC ) · f (xD ) = = f (xAC ) · f (xBC ) · f (xAD ) · f (xBD ) · f (x∅ ) for µ-a.e. x ∈ XN . Since f (xA ) = 0 ⇒ f (xABCD ) = 0 for µ-a.e. x ∈ XN (and similarly for B, C, D) one can accept the convention f −1 (xA ) = 0 whenever f (xA ) = 0 and obtain g(xACD )
f (xABCD ) = f −1 (xA ) · f (xAC ) · f (xAD ) · · f (xBC ) · f (xBD ) · f (x∅ ) · f −1 (xB ) · f −1 (xC ) · f −1 (xD )
for µ-a.e. x ∈ XN .
h(xBCD )
Hence, by Lemma 2.5 one has A ⊥ ⊥ B | CD.
2.3.3 Multiinformation and conditional product Let P be a marginally continuous measure over N . The multiinformation of P is the relative entropy H(P | i∈N P {i} ) of P with respect to the product of its one-dimensional marginals. It is always a value in [0, +∞] (see Lemma A.4 in Section A.6.3). A common formal convention is that the multiinformation of P is +∞ in case P is not marginally continuous. Remark 2.10. The term “multiinformation” was proposed by my PhD supervisor Albert Perez in the late 1980s. Note that miscellaneous other terms were used earlier in the literature (even by Perez himself); for example “total correlation” [154], “dependence tightness” [101] or “entaxy” [76]. The main reason for Perez’s later terminology is that the above concept directly generalizes a widely accepted information-theoretical concept of “mutual information” of two random variables; multiinformation can be applied to the case of any finite number of random variables. Indeed, it can serve as a measure of global
2.3 Classes of probability measures
25
stochastic dependence among a finite collection of random variables (see § 4 in Studen´ y and Vejnarov´ a [144]). Asymptotic behavior of “empirical multiinformation”, which can be used as a statistical estimate of multiinformation on the basis of data, was examined in Studen´ y [129]. To clarify the significance of multiinformation for the study of conditional independence, I need the following lemma: Lemma 2.6. Let P be a marginally continuous measure on (XN , XN ) and
A, B|C ∈ T (N ). Then there exists a unique probability measure Q on (XABC , XABC ) such that QAC = P AC , QBC = P BC and A ⊥⊥ B | C [Q] . (2.12) Moreover, P ABC Q i∈ABC P {i} and the following equality holds true (the symbol H denotes the relative entropy introduced in Section A.6.3): {i} H(P ABC | P {i} ) + H(P C | P )= i∈ABC
H(P
ABC
i∈C
| Q) + H(P
AC
|
i∈AC
P {i} ) + H(P BC |
P {i} ) .
(2.13)
i∈BC
Proof. Note again that omitted technical details can be verified by means of basic measure-theoretical facts from Section A.6. I. First, let us verify the uniqueness of Q. Supposing both Q1 and Q2 satisfy (2.12) one can observe that (Q1 )C = (Q2 )C and Q1A|C ≈ Q2A|C , Q1B|C ≈ Q2B|C , where ≈ indicates the respective equivalence of conditional probabilities (on XA resp. XB ) given C mentioned in Section 2.1. Because of A ⊥⊥ B | C [Qi ], i = 1, 2, one can derive using (2.1) that Q1AB|C ≈ Q2AB|C for measurable rectangles which together with (Q1 )C = (Q2 )C implies Q1 = Q2 . II. For the existence proof assume without loss of generality ABC = N and put µ ≡ i∈N P {i} . As in Step I of the proof of Lemma 2.4 (see p. 21) choose a density f = dP/dµ and respective collection of marginal “projection” densities f ↓A , A ⊆ N satisfying (2.4). For brevity, I write f (xA ) instead of f ↓A (xA ) in the rest of this proof so that (2.4) has the form ∀ x ∈ XN ∀ disjoint A, C ⊆ N
f (xC ) = 0 ⇒ f (xAC ) = 0 .
III. Let us define a function g : XN → [0, ∞) by f (xAC )·f (xBC ) if f (xC ) > 0, f (xC ) for x ∈ XN = XABC , g(x) = 0 if f (xC ) = 0, and introduce a measure Q on (XN , XN ) as follows: Q(D) = g(x) dµ(x) for D ∈ XN = XABC . D
(2.14)
26
2 Basic Concepts
IV. Under the convention f (xAC )/f (xC ) ≡ 0 in the case f (xC ) = 0 one can write for every E ∈ XAC using the Fubini theorem, (2.14), and the RadonNikodym theorem: AC Q (E) = g(x) dµ(x) = E×XB
=
f (xAC ) · f (xC )
E
=
f (xB xC ) dµB (xB ) dµAC (xAC ) = XB
f (xAC ) · f (xC ) dµAC (xAC ) = f (xC )
E
f (xAC ) dµAC (xAC ) = E
= P AC (E). Hence, QAC = P AC and Q is a probability measure. Replace (XA , XA ) by (XB , XB ) in the preceding consideration to obtain QBC = P BC . The way Q has been defined implies Q µ and g = dQ/dµ. This form of g implies that Q is factorizable after {AC, BC} so that A ⊥⊥ B | C [Q] by Lemma 2.5. V. To see P ABC Q observe that (2.14) implies g(x) = 0 ⇒ f (x) = 0 for every x ∈ XN , accept the convention f (x)/g(x) ≡ 0 in the case g(x) = 0, and write for every D ∈ XN using the Radon-Nikodym theorem f (x) f (x) dQ(x) = · g(x) dµ(x) = f (x) dµ(x) = P (D) . g(x) g(x) D
D
D
Thus, P Q and f /g = dP/dQ. VI. To derive (2.13) realize that it follows from the definition of g (under the convention above) that f (x) · f (xC ) =
f (x) · f (xAC ) · f (xBC ) g(x)
for every x ∈ XN .
Hence, of course ∀ x ∈ XN
ln f (x) + ln f (xC ) = ln
f (x) + ln f (xAC ) + ln f (xBC ). g(x)
According to (A.3) and Lemma A.4 in Section A.6.3, each of the five logarithmic terms above is P -quasi-integrable and the integral is a value in [0, ∞] – use the fact that XN h(xD ) dP (x) = XD h(xD ) dP D (xD ) for every D ⊆ N . Thus, (2.13) can be derived. Remark 2.11. The measure Q satisfying (2.12) can be interpreted as a conditional product of P AC and P BC . Indeed, one can define the conditional
2.3 Classes of probability measures
27
product for every pair of consonant probability measures – that is, measures sharing marginals – in this way. However, in general, some obscurities can occur. First, there exists a pair of consonant measures such that no joint measure having them as marginals exists. Second, even if joint measures of this type exist, it may happen that none of them complies with the required conditional independence statement. For both examples see Dawid and Studen´ y [32]. Thus, the assumption of marginal continuity implies the existence of a conditional product. Note that the regularity of conditional probabilities PA|C or PB|C in the sense of Remark A.1 is a more general sufficient condition for the existence of a conditional product (see Proposition 2 in [130]). The value of H(P ABC |Q) in (2.13) is known in information theory as the conditional mutual information of A and B given C (with respect to P ). In the case of C = ∅ just the mutual information H(P AB |P A × P B ) is obtained, so that it can be viewed as a generalization of mutual information (but from a different perspective than multiinformation). Conditional mutual information is known as a good measure of stochastic dependence between A and B conditional on knowledge of C; for an analysis in a discrete case see § 3 in Studen´ y and Vejnarov´ a [144]. 2.3.4 Properties of multiinformation function Supposing P is a probability measure over N the induced multiinformation function mP : P(N ) → [0, ∞] ascribes the multiinformation of the respective marginal P S to every non-empty set S ⊆ N , that is, P {i} ) for every ∅ = S ⊆ N . mP (S) = H(P S | i∈S
Moreover, a natural convention mP (∅) = 0 is accepted. The significance of this concept is evident from the following consequence of Lemma 2.6. Corollary 2.2. Suppose that P is a probability measure over N whose multiinformation is finite. Then the induced multiinformation function mP is a non-negative real function which satisfies mP (S) = 0
whenever S ⊆ N, |S| ≤ 1,
(2.15)
and is supermodular, that is, for every A, B|C ∈ T (N ) mP (ABC) + mP (C) − mP (AC) − mP (BC) ≥ 0 .
(2.16)
These two conditions imply mP (S) ≤ mP (T ) whenever S ⊆ T ⊆ N . Moreover, for every A, B|C ∈ T (N ) one has mP (ABC) + mP (C) − mP (AC) − mP (BC) = 0
iff
A⊥ ⊥ B | C [P ]. (2.17)
28
2 Basic Concepts
Proof. The relation (2.15) is evident. Given a set S ⊆ N , let us substitute
A, B|C = S, N \ S | ∅ in Lemma 2.6. Equation (2.13) gives mP (N ) = mP (N ) + mP (∅) = H(P |Q) + mP (S) + mP (N \ S) . Since all terms here are in [0, +∞] and mP (N ) < ∞ it implies mP (S) < ∞. Therefore (2.13) for general A, B|C can always be written in the form mP (ABC) + mP (C) − mP (AC) − mP (BC) = H(P ABC | Q), where Q is the conditional product of P AC and P BC . Using Lemma A.4 we derive (2.16). It suffices to see mP (S) ≤ mP (T ) whenever |T \ S| = 1, which follows directly from (2.16) with A, B|C = S, T \ S | ∅ and (2.15). The uniqueness of the conditional product Q mentioned in Lemma 2.6 implies that A ⊥ ⊥ B | C [P ] iff P ABC = Q, that is, H(P ABC | Q) = 0 by Lemma A.4. Hence (2.17) follows. The class of probability measures having finite multiinformation is, by definition, a subclass of the class of marginally continuous measures. It will be shown in Section 4.1 that it is quite a wide class of measures, involving several classes of measures used in practice. The relation (2.17) provides a very useful equivalent definition of conditional independence for measures with finite multiinformation, namely by means of an algebraic identity. Note that just the relations (2.16) and (2.17) establish a basic method for handling conditional independence used in this monograph. Because these relations originate from information theory – the expression in (2.16) is nothing but the conditional mutual information mentioned in Remark 2.11 – I dare to call them information-theoretical tools. For example, all formal properties of conditional independence from Section 2.2.2 and the result mentioned at the beginning of Section 2.2.4 were derived using these tools. Corollary 2.2 also implies that the class of measures with finite multiinformation is closed under the operation of taking marginals. Note without further explanation that it is closed under the operation of conditional product as well. The following observation appears to be useful later. Lemma 2.7. Let P be a probability measure on (XN , XN ) and P µ ≡ on (Xi , Xi ) for every i ∈ N . Let ∅ = i∈N µi where µi is a σ-finite measure S S ⊆ N such that −∞ < H(P | i∈S µi ) < ∞ and −∞ < H(P {i} | µi ) < ∞ for every i ∈ S. Then 0 ≤ mP (S) < ∞ and mP (S) = H(P S | µi ) − H(P {i} |µi ) . (2.18) i∈S
i∈S
Proof. This is just a rough sketch (for technical detailssee Section A.6). Suppose without loss of generality S = N and put ν = i∈N P {i} . By Lemma 2.3 one knows P ν. Since P {i} µi for every i ∈ N choose versions of
2.3 Classes of probability measures
29
dP/dν and dP {i} /dµi and observe that dP/dν · i∈N dP {i} /dµi is a version of dP/dµ, defined uniquely P -a.e. (as P ν µ). Hence we derive ln
dP dP {i} dP = ln − ln dν dµ dµi
for P -a.e. x ∈ XN .
i∈N
The assumption of the lemma implies that all logarithmic terms on the righthand side are P -integrable. Hence, by integrating with respect to P , (2.18) is obtained. 2.3.5 Positive measures A marginally continuous measure P over N is positive if there exists a dominating measure µ for P whose density f = dP/dµ is (strictly) positive, that is, f (x) > 0 for µ-a.e. x ∈ XN . Note that the positivity of a density may depend on the choice of a dominating measure. However, whenever a measure µ of this kind exists one has µ P . Since P i∈N P {i} and {i} i∈N µi ≡ µ one can equivalently introduce a positive meai∈N P sure P over N by a simple requirement that P i∈N P {i} P and always take i∈N P {i} in place of µ. A typical example is a positive discrete measure P on XN = i∈N Xi with 1 ≤ |Xi | < ∞, i ∈ N such that P ({x}) > 0 for every x ∈ XN (or, more generally, only for x ∈ i∈N Yi with Yi = { y ∈ Xi ; P {i} ({y}) > 0}). These measures play an important role in (the probabilistic approach to) artificial intelligence. Pearl [100] noticed that conditional independence models induced by these measures further satisfy a special formal property (in addition to the semi-graphoid properties), and introduced the following terminology. A disjoint semi-graphoid M over N is called a (disjoint) graphoid over N if, for every collection of pairwise disjoint sets A, B, C, D ⊆ N , one has 6. intersection
A⊥ ⊥ B | DC [M] and A ⊥ ⊥ D | BC [M] implies A ⊥ ⊥ BD | C [M].
It follows from Lemma 2.1 and the observation below that every conditional independence model induced by a positive measure is a disjoint graphoid. Proposition 2.1. Let P be a marginally continuous measure over N and sets A, B, C, D ⊆ N be pairwise disjoint. If P BCD is a positive measure over BCD then A ⊥⊥ B | DC [P ] and A ⊥ ⊥ D | BC [P ] ⇒ A ⊥⊥ BD | C [P ] . Proof. (see also [70] for an alternative proof under additional restrictive assumption) This is a rough hint only. Let µ be a dominating measure for P such thatf = dP/dµ is a density with fBCD (xBCD ) ≡ f (xBCD ) > 0 for µa.e. x ∈ XN (I am again following the notational convention from the proof of
30
2 Basic Concepts
Corollary 2.1, p. 24). The assumptions A ⊥ ⊥ B | DC [P ] and A ⊥ ⊥ D | BC [P ] imply by Lemma 2.4 (one can assume f (xE ) > 0 for µ-a.e. x ∈ XN whenever E ⊆ BCD) f (xACD ) · f (xBCD ) f (xABC ) · f (xBCD ) = f (xABCD ) = f (xCD ) f (xBC ) for µ-a.e. x ∈ XN . The terms f (xBCD ) can be cancelled, so that one derives by dividing f (xACD ) · f (xBC ) = f (xABC ) · f (xCD )
for µ-a.e. x ∈ XN .
One can take the integral with respect to µB and by the Fubini theorem get f (xACD ) · f (xC ) = f (xAC ) · f (xCD )
for µ-a.e. x ∈ XN ,
that is, A ⊥ ⊥ D | C [P ] by Lemma 2.4. This, together with A ⊥⊥ B | DC [P ] implies the desired conclusion by the contraction property. Let us note that there are discrete probability measures whose induced conditional independence model is not a graphoid, that is, it does not satisfy the intersection property (see Example 2.3 on p. 35). On the other hand, Proposition 2.1 holds also under weaker assumptions on P BCD . 2.3.6 Gaussian measures These measures are usually treated in multivariate statistics, often under the alternative name “normal distributions”. In this book Gaussian measures over N are measures on (XN , XN ) where (Xi , Xi ) = (R, B) is the set of real numbers endowed with the σ-algebra of Borel sets for every i ∈ N . Every vector e ∈ RN and every positive semi-definite N × N -matrix Σ ∈ RN ×N defines a certain measure on (XN , XN ) denoted by N (e, Σ) whose expectation vector is e and whose covariance matrix is Σ. The components of e and Σ are then regarded as parameters of the Gaussian measure. Attention is almost exclusively paid to regular Gaussian measures which are obtained in the case that Σ is positive definite (equivalently regular). In that case N (e, Σ) can be introduced directly by its density with respect to the Lebesgue measure on (XN , XN ) fe,Σ (x) = √
1
(2π)|N | ·det (Σ)
· exp−
(x−e) ·Σ −1 ·(x−e) 2
for x ∈ XN ,
(2.19)
where Σ −1 denotes the inverse of the covariance matrix Σ, called the concentration matrix. Its elements are sometimes considered to be alternative parameters of a regular Gaussian measure. Since the density fe,Σ in (2.19) is positive, regular Gaussian measures are positive in the sense of Section 2.3.5.
2.3 Classes of probability measures
31
On the other hand, if Σ is not regular then the respective singular Gaussian measure N (e, Σ) (for a detailed definition see Section A.8.3) is concentrated on an affine subspace in RN = XN having the Lebesgue measure 0. Thus, singular Gaussian measures are not marginally continuous except for some rare cases (when the subspace has the form {y} × XA , A ⊂ N for y ∈ XN \A ); for illustration, see Example 2.3 on p. 35. Given a Gaussian measure P = N (e, Σ) over N and non-empty disjoint sets A, C ⊆ N a usual implicit convention (used in multivariate analysis and applicable even in case of a singular Gaussian measure) identifies the conditional probability PA|C with its unique “continuous” version − PA|C (| z) = N (eA + Σ A·C · Σ − C·C · (z − eC ), Σ A·A − Σ A·C · Σ C·C · Σ C·A )
for every z ∈ XC , where Σ A·C denotes the respective submatrix of Σ and Σ− C·C denotes the generalized inverse of Σ C·C (see Section A.8.1, p. 237). The point is that, for every z ∈ XC , it is again a Gaussian measure whose covariance matrix Σ A|C = Σ A·A − Σ A·C · Σ − C·C · Σ C·A actually does not depend on the choice of z (see Section A.8.3 for further details on the conditioned Gaussian measure). Therefore, the matrix Σ A|C is called a conditional covariance matrix. Recall that in the case C = ∅ one has Σ A|C = Σ A|∅ = Σ A·A by a convention. Elements of miscellaneous conditional covariance matrices can serve as convenient parameters of Gaussian measures – e.g. Andersson et al. [9]. An important related fact is that the expectation vector of a Gaussian measure is not significant from the point of view of conditional independence. It is implied by the following lemma that the covariance matrix alone contains all information about conditional independence structure. Therefore it is used in practice almost exclusively. Lemma 2.8. Let P = N (e, Σ) be a Gaussian measure over N and A, B|C is a non-trivial disjoint triplet over N . Then A⊥ ⊥ B | C [P ] iff
(Σ AB|C )A·B = 0 .
Proof. The key idea is that topological assumptions (see Remark A.1) imply the existence of a regular version of conditional probability on XAB given C, that is, a version P¯AB|C such that the mapping D → P¯AB|C (D | z) is a probability measure on XAB for every z ∈ XC . Clearly, for every A ∈ XA , the mapping z → P¯AB|C (A × XB | z), z ∈ XC , is a version of conditional probability on XA given C; an analogous claim is true for B ∈ XB . Thus, (2.1) can be rewritten in the form ∀ A ∈ XA , ∀ B ∈ XB , P¯AB|C (A × B| z) = P¯AB|C (A × XB | z) · P¯AB|C (XA × B| z)
(2.20)
for P C -a.e. z ∈ XC . Since all involved versions of conditional probability are probability measures for every z ∈ XC , it is equivalent to the requirement that (2.20) hold for every A ∈ YA , B ∈ YB where YA resp. YB are countable classes
32
2 Basic Concepts
closed under a finite intersection such that σ(YA ) = XA resp. σ(YB ) = XB . This can be shown using Lemma A.3 since, given B ∈ XB and z ∈ XC , the class of sets A ∈ XA satisfying (2.20) is closed under proper set difference and monotone countable union. The classes YA resp. YB exist in case of Borel σ-algebras on RA resp. RB . The set of z ∈ XC for which (2.20) holds for every A ∈ YA and B ∈ YB has P C measure 1 (since YA and YB are countable). For these z ∈ XC then (2.20) holds for every A ∈ XA and B ∈ XB by the above mentioned consideration. Hence, A⊥ ⊥ B | C [P ] ⇔ A ⊥ ⊥ B | ∅ [P¯AB|C (| z)]
for P C -a.e. z ∈ XC .
However, in this special case one can suppose that P¯AB|C (| z) is a Gaussian measure (see Section A.8.3) with the same covariance matrix Σ AB|C for every z ∈ XC (while the expectation does depend on z). It is a well-known fact that – regardless of the expectation vector – one has A ⊥ ⊥ B | ∅ with respect to a Gaussian measure iff the A × B-submatrix of its covariance matrix consists of zeros; see (A.9) in Section A.8.3. The previous lemma involves the following well-known criteria for elementary conditional independence statements (see also Proposition 5.2 in [70], Corollaries 6.3.3 and 6.3.4 in [157] and Exercise 3.8 in [100]). Corollary 2.3. Let P be a Gaussian measure over N with a covariance matrix Σ = (σij )i,j∈N and a correlation matrix Γ = (ij )i,j∈N . Then for distinct a, b ∈ N a⊥ ⊥ b | ∅ [P ] ⇔ σab = 0 ⇔ ab = 0 , and for distinct a, b, c, ∈ N a⊥ ⊥ b | {c} [P ] ⇔ σcc · σab = σac · σcb ⇔ ab = ac · cb . If Σ is regular and Λ = (κij )i,j∈N is the concentration matrix, then for distinct a, b ∈ N a⊥ ⊥ b | N \ {a, b} [P ] ⇔ κab = 0 . Proof. The first part is an immediate consequence of Lemma 2.8 since we implicitly assume σii > 0 for i ∈ N . For the last fact, first observe by elementary computation that a non-diagonal element of a regular 2 × 2-matrix vanishes iff the same element vanishes in its inverse matrix. In particular, a ⊥⊥ b | N \{a, b} [P ] ⇔ (Σ {ab}|N \{a,b} )ab = 0 ⇔ ((Σ {ab}|N \{a,b} )−1 )ab = 0 . The second observation is that (Σ D|N \D )−1 = (Σ −1 )D·D = ΛD·D for every non-empty set D ⊆ N (see Section A.8.1). In particular, one has ((Σ D|N \D )−1 )ab = (ΛD·D )ab = κab for D = {a, b}.
2.3 Classes of probability measures
33
Remark 2.12. The proof of Lemma 2.8 reveals a notable difference between the Gaussian and discrete case. While in the discrete case a conditional independence statement A ⊥ ⊥ B | C [P ] is equivalent to the collection of requirements A ⊥⊥ B | ∅ [PAB|C (|z)]
for every z ∈ XC with P C (z) > 0,
in the Gaussian case it is equivalent to a single requirement A⊥ ⊥ B | ∅ [PAB|C (|z)]
for at least one z ∈ XC ,
which already implies the same fact for all other z ∈ XC (one uses the conventional choice of “continuous” versions of PAB|C in this case). Informally said, the “same” conditional independence statement is, in the Gaussian case, specified by a smaller number of requirements than in the discrete case. The reason behind this phenomenon is that the actual number of free parameters characterizing a Gaussian measure over N is, in fact, smaller than the number of parameters characterizing a discrete measure (if |Xi | ≥ 2 for i ∈ N ). Therefore, discrete measures offer a wider variety of induced conditional independence models than Gaussian measures. This is perhaps a surprising fact for those who anticipate that a continuous framework should be wider than a discrete framework. The point is that the “Gaussianity” is quite a restrictive assumption. Thus, one can expect many special formal properties of conditional independence models arising in a Gaussian framework. For example, the following property of a disjoint semi-graphoid M was recognized by Pearl [100] as a typical property of graphical models (see Chapter 3): 7. composition
A⊥ ⊥ B | C [M] and A ⊥ ⊥ D | C [M] implies A ⊥ ⊥ BD | C [M]
for every collection of pairwise disjoint sets A, B, C, D ⊆ N . It follows easily from Lemma 2.8 that it is also a typical property of Gaussian conditional independence models: Corollary 2.4. Let P be a Gaussian measure over N and A, B, C, D ⊆ N are pairwise disjoint. Then A⊥ ⊥ B | C [P ] and A ⊥ ⊥ D | C [P ] ⇒ A ⊥⊥ BD | C [P ]. Proof. Given a covariance matrix Σ observe that (Σ ABD|C )AB·AB = Σ AB|C and (Σ ABD|C )AD·AD = Σ AD|C (see Section A.8.1 – this holds for a general positive semi-definite matrix Σ since one can fix a pseudoinverse matrix (Σ)− C·C ). The premises of the rule (Σ ABD|C )A·B = 0 and (Σ ABD|C )A·D = 0 imply (Σ ABD|C )A·BD = 0. However, the composition property is not a universally valid property of conditional independence models, as the following example shows.
34
2 Basic Concepts
Example 2.1. There exists a discrete (binary) probability measure P over N with |N | = 3 such that a⊥ ⊥ b | ∅ [P ] and a ⊥ ⊥ b | {c} [P ]
for any distinct a, b, c ∈ N.
Indeed, put Xi = {0, 1} for i ∈ N and ascribe the probability 1/4 to all of the following configurations of values: (0, 0, 0), (0, 1, 1), (1, 0, 1) and (1, 1, 0). An example of a positive measure can be obtained by minor modification: one chooses a parameter 0 < ε < 1/8, ascribes the probability 1/4 − ε to the above-mentioned configurations and ε to the remaining ones. ♦ Another special property of Gaussian conditional independence models is the following one which was also mentioned by Pearl [100] in the context of graphical models: 8. weak transitivity
A⊥ ⊥ B | C [M] and A ⊥⊥ B | Cd [M] implies A ⊥ ⊥ d | C [M] or d ⊥ ⊥ B | C [M]
for pairwise disjoint A, B, C ⊆ N , d ∈ N \ ABC. Corollary 2.5. Let P be a Gaussian measure over N , sets A, B, C ⊆ N are pairwise disjoint and d ∈ N \ ABC. Then A⊥ ⊥ B | C [P ] and A ⊥ ⊥ B | Cd [P ] ⇒ { A ⊥⊥ d | C [P ] or d ⊥ ⊥ B | C [P ] }. Proof. It suffices to assume that A and B are singletons. Indeed, owing to Corollary 2.4 (and semi-graphoid properties) A ⊥⊥ B | C is equivalent to the condition {a ⊥ ⊥ b | C for every a ∈ A, b ∈ B} and a similar observation can be made for the other CI statement involved in the premise. There is no pair a ∈ A, b ∈ B with ¬{ a ⊥ ⊥ d | C } and ¬{ d ⊥ ⊥ b | C } because this contradicts the fact { a ⊥ ⊥ b | C and a ⊥ ⊥ b | Cd } implied by the premise. In other terms, either { ∀ a ∈ A a ⊥⊥ d | C } or { ∀ b ∈ B d ⊥ ⊥ b | C } and one can again use Corollary 2.4 to get the desired conclusion. Lemma 2.8 allows one to reduce the general case to the case C = ∅. Indeed, one can consider Σ N \C|C in place of the covariance matrix Σ which is also a positive semi-definite matrix (see Section A.8.1) and therefore it is a covariance matrix of a Gaussian measure over N \ C (see Section A.8.3). If A = {a}, B = {b} and C = ∅ then two cases can be distinguished. If σii > 0 for i ∈ abd then apply Corollary 2.3 to the correlation matrix Γ = (ij )i,j∈abd of P abd : 0 = ab = ad · db . Hence ad = 0 or db = 0 which yields the desired fact. If σaa = 0 then the fact that the covariance matrix Σ is positive semi-definite implies det(Σ ad·ad ) ≥ 0 (see Section A.8.1) which ⊥ d | ∅ by Lemma 2.8. An analogous consideration can implies σad = 0 and a ⊥ be repeated if σbb = 0 or σdd = 0. The above result makes it possible to construct the following example.
2.3 Classes of probability measures
35
Example 2.2. There exists a pair P, Q of regular Gaussian measures over N with |N | = 3 such that M = MP ∩ MQ is not a CI model induced by any Gaussian measure over N . Indeed, put N = {a, b, c} and define matrices )i,j∈N as follows: σii = σii = 1 for i ∈ N , Σ = (σij )i,j∈N and Σ = (σij σbc = σcb = σac = σca = 1/2 and σij = σij = 0 for remaining i, j ∈ N . Put P = N (0, Σ), Q = N (0, Σ ) and observe that MP is the semi-graphoid closure of a, bc|∅ while MQ is the semi-graphoid closure of b, ac|∅. Thus, / M and c, b|∅ ∈ / M. By
a, b|c, a, b|∅ ∈ M ≡ MP ∩ MQ while a, c|∅ ∈ Corollary 2.5 M is not a Gaussian CI model. ♦ In fact, the above counterexample means that the poset of CI models induced by regular Gaussian measures over N (ordered by inclusion) is not a lattice. Note that in case |N | = 3 this poset coincides with the poset of DAG models (see Section 3.2) which is shown in Figure 7.4. However, if |N | > 3 then these posets differ – see Exercise 3.8b in [100]. An additional important fact is that every regular Gaussian measure has finite multiinformation. This follows from Lemma 2.7. Corollary 2.6. Let P be a regular Gaussian measure with a correlation matrix Γ . Then its multiinformation has the value 1 mP (N ) = − · ln(det(Γ )) . 2
(2.21)
Proof. Take the Lebesgue measure λ on (XN , XN ) in place of µ in Lemma 2.7. Substitution of (A.12) from Section A.8.3 into (2.18) gives − ln (2π) 1 1 |N | 1 |N | · ln(2π) − − · ln(det(Σ)) − − − · ln (σii ) − 2 2 2 2 2 2 i∈N
1 1 det(Σ) 1 1 = ln σii − · ln(det(Σ)) = − · ln = − · ln(det(Γ )) , 2 2 2 σ 2 i∈N ii i∈N
which is the fact that was needed to show.
On the other hand, a singular Gaussian measure need not be marginally continuous as the following example shows. It also demonstrates that the intersection property mentioned in Section 2.3.5 is not universally valid. Example 2.3. There exists a singular Gaussian measure P over N with |N | = 3 such that a⊥ ⊥ b | {c} [P ] and a ⊥ ⊥ b | ∅ [P ]
for any distinct a, b, c ∈ N.
Put P = N (0, Σ) where Σ = (σij )i,j∈N with σij = 1 for every i, j ∈ N and apply Corollary 2.3. It is easy to verify (see Section A.8.3) that P is concentrated on the subspace {(x, x, x) ; x ∈ R} while P {i} = N (0, 1) for
36
2 Basic Concepts
every i ∈ N . Since i∈N P {i} is absolutely continuous with respect to the Lebesgue measure on RN , P is not marginally continuous. Note that the same conditional independence model can be induced by a (binary) discrete measure; put Xi = {0, 1} for i ∈ N and ascribe the probability 1/2 to configurations (0, 0, 0) and (1, 1, 1). ♦ 2.3.7 Basic construction The following lemma provides a basic method for constructing probability measures with prescribed CI structure. Lemma 2.9. Let P, Q be probability measures over N . Then there exists a probability measure R over N such that MR = MP ∩ MQ . Moreover, if P and Q have finite multiinformation then a probability measure R over N with finite multiinformation such that MR = MP ∩ MQ exists. The same statement holds for the class of discrete measures over N , respectively for the class of positive discrete measures over N . XN ) = ( i∈N Xi , i∈N Xi ) and Proof. Let P be a measure on a space (XN , Q be a measure on (YN , YN ) = ( i∈N Yi , i∈N Yi ). Let us put (Zi , Zi ) = (Xi × Yi , Xi × Yi ) for i ∈ N , introduce (ZN , ZN ) = i∈N (Zi , Zi ) which can be understood as (XN × YN , XN × YN ) and define a probability measure R on (ZN , ZN ) as the product of P and Q. The goal is to show that for every
A, B|C ∈ T (N ) A⊥ ⊥ B | C [R] ⇔ { A ⊥⊥ B | C [P ] and A ⊥⊥ B | C [Q] } .
(2.22)
Let us take the unifying perspective indicated in Remark 2.1: (ZN , ZN ) and R are fixed, and respective coordinate σ-algebras X¯A , Y¯A , Z¯A ⊆ ZN are ascribed to every A ⊆ N . Then P corresponds to the restriction of R to X¯N , Q to the restriction of R to Y¯N and (2.22) takes the form (see Section A.7 for related concepts): Z¯A ⊥ ⊥ Z¯B | Z¯C [R] ⇔ X¯A ⊥ ⊥ X¯B | X¯C [R] and Y¯A ⊥ ⊥ Y¯B | Y¯C [R] . (2.23) As XA × YA -measurable rectangles generate ZA for every A ⊆ N by the “weaker” formulation of the definition of conditional independence in terms ⊥ Z¯B | Z¯C [R] is equivalent to the of σ-algebras observe that the fact Z¯A ⊥ x y x ¯ ¯ ¯ requirement: ∀ A ∈ XA , A ∈ YA , B ∈ XB , By ∈ Y¯B R(Ax ∩ Ay ∩ Bx ∩ By | Z¯C )(z) = R(Ax ∩ Ay | Z¯C )(z) · R(Bx ∩ By | Z¯C )(z) (2.24) for R-a.e. z ∈ ZN . On the other hand, X¯A ⊥ ⊥ X¯B | X¯C [R] is equivalent, by a usual definition of conditional independence in terms of σ-algebras, to the requirement: ∀ Ax ∈ X¯A , Bx ∈ X¯B P (Ax ∩ Bx | X¯C )(x) = P (Ax | X¯C )(x) · P (Bx | X¯C )(x)
(2.25)
2.3 Classes of probability measures
37
for R-a.e. z = (x, y) ∈ ZN . I write P ( | X¯C )(x) instead of R( | X¯C )(z) because it is a function of x which only depends on P . Analogously, the fact Y¯A ⊥ ⊥ Y¯B | Y¯C [R] is equivalent to the requirement: ∀ Ay ∈ Y¯A , By ∈ Y¯B Q(Ay ∩ By | Y¯C )(y) = Q(Ay | Y¯C )(y) · Q(By | Y¯C )(y)
(2.26)
for R-a.e. z = (x, y) ∈ ZN . Now, given Ax , Ay , Bx , By , one can show using Lemma A.5 (see Section A.6.4) that, given a version of conditional probability P (Ax ∩ Bx | X¯C ) and a version of Q(Ay ∩ By | Y¯C ), their product is a version of conditional probability R(Ax ∩ Ay ∩ Bx ∩ By | Z¯C ). More specifically, the condition (W) in Lemma A.5 can be used with the class G consisting of sets Cx ∩Cy where Cx ∈ X¯C , Cy ∈ Y¯C , and one uses the assumption R = P ×Q and the Fubini theorem. Hence, the uniqueness of conditional probability implies that R(Ax ∩Ay ∩Bx ∩By | Z¯C )(z) = P (Ax ∩Bx | X¯C )(x)·Q(Ay ∩By | Y¯C )(y) (2.27) for R-a.e. z = (x, y) ∈ ZN . Thus, to evidence (2.24)⇒(2.25) put Ay = By = ZN , use (2.27) and the fact Q(ZN | Y¯C )(y) = 1 for R-a.e. z = (x, y) ∈ ZN ; to evidence (2.24)⇒(2.26) put Ax = Bx = ZN . Conversely, (2.25),(2.26)⇒(2.24) by the repeated use of (2.27), which means that (2.23) was verified. If both P and Q have finite multiinformationthen R{i} =P {i} × Q{i} are marginals of R on (Zi , Zi ) for i ∈ N and R i∈N P {i} × j∈N Q{j} = {k} ×Q{k} . Thus, R is a marginally continuous measure over N . Morek∈N P over, one can also apply Lemma 2.6 to R with “doubled” N = Nx ∪ Ny and
A, B|C = Nx , Ny |∅ to see that H(R | P {i} × Q{i} ) = H(P | P {i} ) + H(Q | Q{j} ) . i∈N
j∈N
i∈N
j∈N
Note for explanation that, in the considered case, R is the conditional product of P and Q and therefore the term H(P ABC |Q) in (2.13) vanishes by Lemma A.4 from Section A.6.3. In particular, the multiinformation of R is the sum of the multiinformations P and Q and, therefore, it is finite. The statement concerning discrete and positive discrete measures easily follows from the given construction. Elementary constructions of probability measures are needed to utilize the method from Lemma 2.9. One of them is the product of one-dimensional probability measures. Proposition 2.2. There exists a discrete (binary) probability measure P over N such that A ⊥⊥ B | C [P ]
for every A, B|C ∈ T (N ).
Proposition 2.3. Suppose that |N | ≥ 2 and A ⊆ N with |A| ≥ 2. Then there exists a discrete (binary) probability measure P over N such that
38
2 Basic Concepts
mP (S) =
ln 2 0
if A ⊆ S, otherwise.
1−|N | to every Proof. Put Xi = {0, 1} for i ∈ N and ascribe the probability 2 configuration of values [xi ]i∈N with even i∈A xi (remaining configurations have zero probability).
Lemma 2.10. Suppose that |N | ≥ 3, 2 ≤ l ≤ |N | and L ⊆ {S ⊆ N ; |S| = l}. Then there exists a discrete probability measure P over N such that ∀ a, b|K ∈ T (N ) with |abK| = l
a⊥ ⊥ b | K [P ] ⇔ abK ∈ L . (2.28)
Proof. If L = ∅ then use Proposition 2.2. If L = ∅ then apply Proposition 2.3 to every A ∈ L to get a binary probability measure P[A] such that ∀ elementary triplet a, b|K with |abK| = l
a ⊥⊥ b | K [P[A] ] ⇔ abK = A.
Note that (2.17) in Corollary 2.2 can be used to verify the above claim. Then Lemma 2.9 can be applied repeatedly to get a discrete probability measure over N satisfying (2.28). This gives a lower estimate of the number of “discrete” probabilistic CI structures. Corollary 2.7. If n = |N | ≥ 3 then the number of distinct CI structures n/2 induced by discrete probability measures over N exceeds the number 22 where n/2 denotes the lower integer part of n/2. Proof. Let us put l = n/2 for even n, respectively l = (n + 1)/2 for odd n. By Lemma 2.10 for every subclass L of {S ⊆ N ; |S| = l} a respective probability measure P[L] exists. By (2.28) these measures induce distinct CI models over N . Therefore, the number of distinct induced CI models exceeds 2s where s is the number of elements of {S ⊆ N ; |S| = l}. Find suitable lower estimates for s. If l = n/2 then write n 1 · 3 · . . . · (2l − 1) 2 · 4 · . . . · 2l 2l 1 · 2 · . . . · 2l s= = · ≥ 2l = 2 2 . = (1 · . . . · l) · (1 · . . . · l) 1 · 2 · ... · l 1 · 2 · ... · l l
Similarly, in the case l = (n + 1)/2 write s=
n 1 · 3 · . . . · (2l − 1) 2 · 4 · . . . · (2l − 2) 2l − 1 · ≥ 2l−1 = 2 2 , = 1 · 2 · ... · l 1 · 2 · . . . · (l − 1) l
which implies the desired conclusion 2s ≥ 22
n/2
in both cases.
2.4 Imsets
39
2.4 Imsets

An imset over N is an integer-valued function on the power set of N, that is, any function u : P(N) → Z or, alternatively, an element of Z^{P(N)}. Basic operations with imsets, namely summation, subtraction and multiplication by an integer, are defined coordinate-wise. Analogously, we write u ≤ v for imsets u, v over N if u(S) ≤ v(S) for every S ⊆ N. A multiset is an imset with non-negative values, that is, any function m : P(N) → Z₊. Any imset u over N can be written as the difference u = u⁺ − u⁻ of two multisets over N where u⁺ is the positive part of u and u⁻ is the negative part of u, defined as follows:

$$u^{+}(S) = \max\{u(S), 0\}, \qquad u^{-}(S) = \max\{-u(S), 0\} \qquad \text{for } S \subseteq N.$$
By the positive domain of an imset u will be understood the class of sets D_u⁺ = {S ⊆ N; u(S) > 0}; the class D_u⁻ = {S ⊆ N; u(S) < 0} will be called the negative domain of u.

Remark 2.13. The word "multiset" is taken from combinatorial theory [1] while the word "imset" is an abbreviation for integer-valued multiset. Later in this book certain special imsets will be used to describe probabilistic conditional independence structures (see Section 4.2.3).

A trivial example of an imset is the zero imset, denoted by 0, which ascribes a zero value to every S ⊆ N. Another simple example is the identifier of a set A ⊆ N, denoted by δ_A and defined as follows:

$$\delta_A(S) = \begin{cases} 1 & \text{if } S = A, \\ 0 & \text{if } S \subseteq N,\ S \neq A. \end{cases}$$

Special notation m_{A↓}, respectively m_{A↑}, will be used for multisets which serve as identifiers of classes of subsets, respectively classes of supersets, of a set A ⊆ N:

$$m_{A\downarrow}(S) = \begin{cases} 1 & \text{if } S \subseteq A, \\ 0 & \text{otherwise,} \end{cases} \qquad m_{A\uparrow}(S) = \begin{cases} 1 & \text{if } S \supseteq A, \\ 0 & \text{otherwise.} \end{cases}$$

It is clear how to represent an imset over N in the memory of a computer, namely by a vector with 2^{|N|} integral components which correspond to subsets of N. However, for a small number of variables, one can also visualize imsets in a more telling way, using special pictures. The power set P(N) is a distributive lattice and can be represented in the form of a Hasse diagram (see Section A.2). Ovals in this diagram correspond to elements of P(N), that is, to subsets of N, and a link is made between two ovals if the symmetric difference of the represented sets is a singleton. A function on P(N) can be visualized by writing the assigned values into the respective ovals. For example, the imset u over N = {a, b, c} defined by the table
S    : ∅   {a}  {b}  {c}  {a,b}  {a,c}  {b,c}  {a,b,c}
u(S) : +1  −3   −1   0    +3     +2     0      −2

can be visualized in the form of the diagram from Figure 2.2. The third possible way of describing an imset (used in this monograph) is to write it as a combination of simpler imsets with integral coefficients. For example, the imset u from Figure 2.2 can be written as follows:

u = −2 · δ_N + 3 · δ_{a,b} + 2 · δ_{a,c} − 3 · δ_{a} − δ_{b} + δ_∅.
Fig. 2.2. Hasse diagram of an imset over N = {a, b, c}.
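The dictionary-based computer representation mentioned above can be sketched as follows (my illustration, not the book's code; the function names are ad hoc). It stores imsets over N = {a, b, c} as dictionaries keyed by frozensets and verifies the displayed decomposition of u into identifiers.

```python
# A sketch (mine, not from the book) of the computer representation of imsets:
# an imset over N as a dictionary keyed by frozensets, with the identifier
# imsets delta_A, and a check of the decomposition of u displayed above.
from itertools import chain, combinations

N = frozenset('abc')
SUBSETS = [frozenset(c) for c in chain.from_iterable(
    combinations(sorted(N), r) for r in range(len(N) + 1))]

def delta(A):
    """Identifier imset of the set A."""
    return {S: int(S == A) for S in SUBSETS}

def combine(*terms):
    """Integer combination sum_k c_k * u_k of imsets (given as pairs (c_k, u_k))."""
    out = {S: 0 for S in SUBSETS}
    for c, u in terms:
        for S in SUBSETS:
            out[S] += c * u[S]
    return out

u = combine((-2, delta(N)), (3, delta(frozenset('ab'))),
            (2, delta(frozenset('ac'))), (-3, delta(frozenset('a'))),
            (-1, delta(frozenset('b'))), (1, delta(frozenset())))
print({''.join(sorted(S)): v for S, v in u.items()})
# {'': 1, 'a': -3, 'b': -1, 'c': 0, 'ab': 3, 'ac': 2, 'bc': 0, 'abc': -2}
```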
In this book, certain special imsets over N will be used. The effective dimension of these imsets, that is, the actual number of free values, is not 2^{|N|} but only 2^{|N|} − |N| − 1. There are several ways to standardize imsets of this kind. I will distinguish three basic ways of standardization (for justification of the terminology see Remark 5.3 in Section 5.1.2). An imset u over N, respectively a real function u on P(N), is o-standardized if

$$\sum_{S \subseteq N} u(S) = 0 \qquad \text{and} \qquad \forall\, i \in N \quad \sum_{S \subseteq N,\; i \in S} u(S) = 0.$$

Alternatively, the second condition in the preceding line can be formulated in the form $\sum_{S \subseteq N \setminus \{j\}} u(S) = 0$ for every j ∈ N. An imset u, respectively a real function u on P(N), is ℓ-standardized if

u(S) = 0   whenever S ⊆ N, |S| ≤ 1,

and u-standardized if

u(S) = 0   whenever S ⊆ N, |S| ≥ |N| − 1.
An imset u over N will be called normalized if the collection of integers {u(S); S ⊆ N} has no common prime divisor. Besides basic operations with imsets, an operation of scalar product of a real function m : P(N) → ℝ and an imset u over N, defined by

$$\langle m, u \rangle \;=\; \sum_{S \subseteq N} m(S) \cdot u(S),$$

will be used. Indeed, it is a scalar product on the Euclidean space ℝ^{P(N)}. Note that the function m can be an imset as well; it will often be a multiset.
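In the same dictionary representation, the scalar product and the o-standardization test take only a few lines; this is again my own self-contained sketch, not the book's code.

```python
# A sketch (mine): scalar product <m, u> and the o-standardization test;
# imsets are dictionaries keyed by frozensets, as in the previous snippet.
def scalar_product(m, u):
    """<m, u> = sum over S of m(S) * u(S)."""
    return sum(m[S] * u[S] for S in m)

def is_o_standardized(u, N):
    zero_total = (sum(u.values()) == 0)
    zero_per_node = all(sum(v for S, v in u.items() if i in S) == 0 for i in N)
    return zero_total and zero_per_node

# Example: v = delta_{a,b} - delta_{a} - delta_{b} + delta_{empty} over N = {a, b}
# is o-standardized and has <v, v> = 4.
v = {frozenset(): 1, frozenset('a'): -1, frozenset('b'): -1, frozenset('ab'): 1}
print(is_o_standardized(v, 'ab'))   # True
print(scalar_product(v, v))         # 4
```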
3 Graphical Methods
Graphs whose nodes correspond to random variables are traditional tools for description of CI structures. One can distinguish three classic approaches: using undirected graphs, using acyclic directed graphs and using chain graphs. This chapter is an overview of graphical methods for describing CI structures with the main emphasis put on theoretical questions mentioned in Section 1.1. Both classic and advanced approaches are included. Note that elementary graphical concepts are introduced in Section A.3.
3.1 Undirected graphs

Graphical models based on undirected graphs are also known as Markov networks [100]. Given an undirected graph G over N, one says that a disjoint triplet ⟨A, B|C⟩ ∈ T(N) is represented in G, and writes A ⊥⊥ B | C [G], if every route (equivalently, every path) in G between a node in A and a node in B contains a node in C, that is, C separates between A and B in G. For illustration see Figure 3.1.

Fig. 3.1. The set C = {e, f} separates between sets A = {a, d} and B = {h}.

Thus, every undirected graph G over N induces a formal independence model over N by means of the separation criterion (for undirected graphs):

M_G = { ⟨A, B|C⟩ ∈ T(N) ; A ⊥⊥ B | C [G] }.

Let us call every independence model obtained in this way a UG model. These models were characterized by Pearl and Paz [99] in terms of a finite number of formal properties:
1. triviality: A ⊥⊥ ∅ | C [G],
2. symmetry: A ⊥⊥ B | C [G] implies B ⊥⊥ A | C [G],
3. decomposition: A ⊥⊥ BD | C [G] implies A ⊥⊥ D | C [G],
4. strong union: A ⊥⊥ B | C [G] implies A ⊥⊥ B | DC [G],
5. intersection: A ⊥⊥ B | DC [G] and A ⊥⊥ D | BC [G] implies A ⊥⊥ BD | C [G],
6. transitivity: A ⊥⊥ B | C [G] implies A ⊥⊥ {d} | C [G] or {d} ⊥⊥ B | C [G] for every d ∈ N \ ABC.
This axiomatic characterization implies that every UG model is a graphoid satisfying the composition property.
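The separation criterion itself is straightforwardly mechanizable; here is a minimal sketch of mine (the toy graph is invented, since Figure 3.1 is not reproduced here): C separates A from B iff a breadth-first search from A with the nodes of C removed never reaches B.

```python
# A minimal sketch (mine) of the separation criterion for undirected graphs.
from collections import deque

def separated(adj, A, B, C):
    """adj: dict node -> set of neighbours; True iff C separates A and B."""
    queue = deque(set(A) - set(C))
    reached = set(queue)
    while queue:
        u = queue.popleft()
        for v in adj[u] - set(C) - reached:
            reached.add(v)
            queue.append(v)
    return reached.isdisjoint(B)

# toy chain a - b - c - d (not the graph of Figure 3.1):
adj = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}}
print(separated(adj, {'a'}, {'d'}, {'c'}))    # True:  a separated from d by c
print(separated(adj, {'a'}, {'d'}, set()))    # False: the path a-b-c-d connects them
```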
Remark 3.1. Please note that the above-mentioned separation criterion was a result of some evolution. Theory of Markov fields stems from statistical physics [95] where undirected graphs were used to model geometric arrangements in space. Several types of Markov conditions were later introduced (see § 3.2.1 of Lauritzen [70] for an overview) in order to associate these graphs and probabilistic CI structures. The original “pairwise Markov property” was strengthened to the “local Markov property” and this was finally strengthened to the “global Markov property”. These Markov conditions differ in general (e.g. [80]) but they coincide if we restrict our attention to positive measures [69]. The authors who contributed to the theory of Markov fields in the 1970s (see the introduction of Speed [120] for references) basically restricted their attention to the class of positive discrete probability measures. In other words, they used undirected graphs to describe structural properties of probability measures taken from this class; that is, they actually kept to a special distribution framework of positive discrete measures (see Section A.9.5 for an explanation of what I mean by a distribution framework). It was already found in the 1970s that the above-mentioned Markov conditions for undirected graphs are equivalent for the measures from the respective class Ψ of positive discrete measures over N . Moreover, it was shown later that the global Markov property, which is clearly the strongest one of those three Markov properties, cannot be strengthened within the framework of Ψ (see Remark 3.2 for more explanation). Thus, the theory of Markov fields was developed under an implicit assumption that a particular distribution framework is considered. Undirected graphs also appeared in the 1970s in statistics (see Wermuth [156]) where they were used to describe so-called “covariance selection models” formerly introduced by Dempster [34]. However, statisticians restricted their attention to another class of probability measures, namely to the class of regular Gaussian measures (see p. 30). That means, they kept to another distribution framework. Nevertheless, it can be shown that the global Markov property is
the strongest possible Markov condition within this framework, too. What I consider to be worth emphasizing is that the authors in the area of graphical models, either in statistics or in probabilistic reasoning, actually have a particular distribution framework in mind, although they may forget to mention this implicit assumption in their papers. Note that a similar story, that is, the evolution of various Markov conditions until the strongest possible Markov condition is reached, was observed in the case of acyclic directed graphs and in the case of chain graphs (for an overview see § 3.2 of Lauritzen [70]). The story has also been repeated recently with advanced graphical models (see Section 3.5). However, in this monograph, attention is only paid to the result of this evolution, that is, to graphical criteria that correspond to the strongest possible Markov condition, that is, to the global Markov property.

A probability measure P over N is Markovian with respect to an undirected graph G over N if

A ⊥⊥ B | C [G] implies A ⊥⊥ B | C [P]   for every ⟨A, B|C⟩ ∈ T(N),

and perfectly Markovian if the converse implication holds as well. It was shown by Geiger and Pearl [46] (Theorem 11) that a perfectly Markovian discrete probability measure exists for every undirected graph over N. In other words, every UG model is a (probabilistic) CI model and the faithfulness (in the sense of Section 1.1) is ensured for the universum of undirected graphs and the discrete distribution framework.

Remark 3.2. This is to explain certain habitual terminology sometimes used in the literature. What is claimed in this remark also holds in the case of acyclic directed graphs and chain graphs (see Sections 3.2, 3.3, 3.5.4, 3.5.5). The existence of a perfectly Markovian measure which belongs to a considered class of probability measures Ψ (= a distribution framework) implies the following weaker result. Whenever a disjoint triplet t = ⟨A, B|C⟩ ∈ T(N) is not represented in a graph G, then there exists a measure P ∈ Ψ which is Markovian with respect to G and t corresponds to a dependence statement with respect to P, that is, ¬{ A ⊥⊥ B | C [P] }. Some authors [39, 60, 54] say then that the class of measures Ψ is perfect with respect to G. Thus, Theorem 2.3 from Frydenberg [40] says that the class of CG measures (see Section 4.1.3) with prescribed layout of discrete and continuous variables is perfect with respect to every undirected graph. This result implies that the global Markov property (see Remark 3.1) is the strongest possible Markov condition both within the framework of positive discrete measures and within the framework of regular Gaussian measures. However, the claim about the perfectness of a class Ψ is also referred to in the literature [44, 141, 73] as the completeness (of the respective graphical criterion relative to Ψ) since it says that the criterion from the respective global Markov property cannot be strengthened within Ψ any more (contrary
to the case of pairwise and local Markov properties if we only consider positive measures – see Remark 3.1). The existence of a perfectly Markovian measure over N which has a prescribed non-trivial sample space X_N is then called the strong completeness [90, 73].

One can say that two undirected graphs G and H over N are Markov equivalent if the classes of Markovian measures with respect to G and H coincide. The result about the existence of a perfectly Markovian measure implies that G and H are Markov equivalent iff M_G = M_H, that is, if they induce the same formal independence model, in which case we say that they are independence equivalent. The observation that a – b in G iff ¬( a ⊥⊥ b | N \ {a, b} [G] ) implies that M_G = M_H iff G = H. Thus, the equivalence question (in the sense of Section 1.1) has a simple solution for the universum of undirected graphs: two undirected graphs are Markov, respectively independence, equivalent iff they coincide.

Remark 3.3. A marginally continuous probability measure over N is said to factorize with respect to an undirected graph G over N if it factorizes after the class (see p. 22) of its cliques. It is known that every factorizable measure is Markovian [69]; the converse is true for positive measures [59] but not for all (discrete) measures [80]. One can say that two graphs are factorization equivalent if the corresponding classes of factorizable measures coincide. However, this notion is not very sensible within the universum of undirected graphs since it is reduced to an identity of graphs (to show this one can use the same arguments as in the case of Markov equivalence).

The restriction of a UG model to a set ∅ ≠ T ⊆ N is a UG model [140]. However, the corresponding marginal graph differs from the usual induced subgraph G_T. For a, b ∈ T, one has a – b in the marginal graph iff there exists a path in G between a and b consisting of nodes of {a, b} ∪ (N \ T).
3.2 Acyclic directed graphs

These graphical models are also known under the name of Bayesian networks [100]. Note that the majority of authors became accustomed to the phrase "directed acyclic graphs": hence the abbreviation DAG is commonly used. However, this phrase can be misinterpreted: one can understand it as a phrase which indicates forests (= acyclic undirected graphs) whose lines are directed. Some authors [7] pointed this inaccuracy out and proposed a more appropriate term "acyclic digraphs". I myself decided to use the phrase acyclic directed graphs. Two basic criteria to determine whether a triplet ⟨A, B|C⟩ ∈ T(N) is represented in an acyclic directed graph G were developed. Lauritzen et al.
[69] proposed the moralization criterion while the group around Pearl [43] used the d-separation criterion (d means “directional”).
Fig. 3.2. Testing ⟨a, f | {c, d}⟩ according to the moralization criterion (panels: an original graph, an induced subgraph, a moral graph).
The moralization criterion has three stages. First, one takes the set T = an_G(ABC) and considers the induced subgraph G_T. Second, G_T is changed into its moral graph H, that is, the underlying graph of a graph K (with mixed edges) over T which is obtained from the graph G_T by adding a line a – b in K whenever there exists c ∈ T having both a and b as parents in G_T. The name "moral graph" was motivated by the fact that the nodes having a common child are "married". The third step is to decide whether C separates between A and B in H. If yes, one says that ⟨A, B|C⟩ is represented in G according to the moralization criterion. For an illustration see Figure 3.2, where the tested triplet is not represented in the original graph.
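The three stages translate directly into code; here is a compact sketch of mine (the parent sets at the end are a hypothetical stand-in, since the exact graph of Figure 3.2 is not reproduced here):

```python
# A compact sketch (mine) of the moralization criterion for an acyclic
# directed graph given by parent sets `pa` (node -> set of parents).
from collections import deque
from itertools import combinations

def ancestral_set(pa, T):
    """anG(T): the smallest superset of T closed under taking parents."""
    closure, stack = set(T), list(T)
    while stack:
        for p in pa.get(stack.pop(), set()):
            if p not in closure:
                closure.add(p)
                stack.append(p)
    return closure

def moral_graph(pa, T):
    """Underlying graph of the induced subgraph for T plus 'marrying' lines."""
    adj = {v: set() for v in T}
    for v in T:
        parents = pa.get(v, set()) & T
        for p in parents:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(sorted(parents), 2):
            adj[p].add(q); adj[q].add(p)
    return adj

def represented(pa, A, B, C):
    """Moralization criterion: is <A, B | C> represented in G?"""
    adj = moral_graph(pa, ancestral_set(pa, set(A) | set(B) | set(C)))
    queue = deque(set(A) - set(C))
    reached = set(queue)
    while queue:
        for v in adj[queue.popleft()] - set(C) - reached:
            reached.add(v); queue.append(v)
    return reached.isdisjoint(B)

# hypothetical parent sets, loosely in the spirit of Figure 3.2:
pa = {'b': {'a', 'd'}, 'c': {'b'}, 'e': {'b'}, 'f': {'e', 'g'}}
print(represented(pa, {'a'}, {'f'}, {'c', 'd'}))
# False: the path a - b - e - f avoids {c, d} in the moral graph
```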
Fig. 3.3. The path a → b ← e → f is active with respect to C = {c, d}.
To formulate the d-separation criterion one needs some auxiliary concepts as well. Let ω : c_1, …, c_n, n ≥ 1, be a route in a directed graph G. By a collider node with respect to ω is understood every node c_i, 1 < i < n, such that c_{i−1} → c_i ← c_{i+1} in ω. One says that ω is active with respect to a set C ⊆ N if
• every collider node with respect to ω belongs to an_G(C),
• every other node of ω is outside C.
A route which is not active with respect to C is blocked by C. A triplet ⟨A, B|C⟩ is represented in G according to the d-separation criterion if every route (equivalently, every path) in G from A to B is blocked by C. For illustration of the d-separation criterion see Figure 3.3. It was shown by Lauritzen et al. [69] that the moralization and the d-separation criteria for acyclic directed graphs are equivalent. Note that the moralization criterion is effective if ⟨A, B|C⟩ is represented in G while the d-separation criterion is suitable in the opposite case. The third possible equivalent criterion (a compromise between those two criteria) appeared in Massey [78]. One writes A ⊥⊥ B | C [G] whenever ⟨A, B|C⟩ ∈ T(N) is represented in an acyclic directed graph G according to one of the criteria. Thus, every acyclic directed graph G induces a formal independence model

M_G = { ⟨A, B|C⟩ ∈ T(N) ; A ⊥⊥ B | C [G] }.

Following common practice let me call every independence model obtained in this way a DAG model. These models cannot be characterized in terms of formal properties like UG models (see Remark 3.5). Nevertheless, several formal properties of DAG models were given in Pearl [100]. These properties imply that every DAG model is a graphoid satisfying the composition property.

The definition of Markovian and perfectly Markovian measure with respect to an acyclic directed graph is analogous to the case of undirected graphs. It was shown by Geiger and Pearl [44] that a perfectly Markovian discrete probability measure exists for every acyclic directed graph. The existence of a perfectly Markovian measure with prescribed non-trivial discrete sample spaces was derived later from that result by Meek [90]. Thus, DAG models are also probabilistic CI models. Two acyclic directed graphs are Markov equivalent if their classes of Markovian measures coincide. An alternative phrase "faithfully indistinguishable graphs" is used in § 4.2 of Spirtes et al. [122]. Note that some authors (see § 3.3 of Lauritzen [70]) introduce Markov equivalent graphs as the graphs which induce the same formal independence model; however, to name that condition I prefer to use the phrase that the graphs are independence equivalent – for an explanation see Section 6.1. The classic graphical characterization of independence equivalent acyclic directed graphs was given by Verma and Pearl [151]; but the same result can also be found in later publications [122, 6] and it alternatively follows from the graphical characterization of Markov equivalent chain graphs [39]. Let us call an immorality in an acyclic directed graph G every induced subgraph of G for a set T = {a, b, c} such that a → c in G, b → c in G
and [a, b] is not an edge in G. Two acyclic directed graphs are independence equivalent iff they have the same underlying graph and the same immoralities. Note that the word "immorality" has the same justification as the phrase "moralization criterion"; other authors used various alternative names like "unshielded colliders" [122], "v-structures" [23] and "uncoupled head-to-head nodes" [151].

An alternative (graphical) transformational characterization of independence equivalent acyclic directed graphs was presented by Chickering [22]. By a legal arrow reversal is understood the change of an acyclic directed graph G into a directed graph H: one replaces an arrow a → b in G by the arrow b → a in H, provided that the condition pa_G(b) = pa_G(a) ∪ {a} is fulfilled. The condition ensures that the resulting graph H is again an acyclic directed graph equivalent to G. This observation motivated the terminology here; note that an alternative phrase "a → b is a covered arc in G" is used in the literature [22, 91]. It was shown in Chickering [22] (see also Lemma 3.2 in Kočka et al. [58]) that acyclic directed graphs G and H over N are equivalent iff there exists a sequence G_1, …, G_m, m ≥ 1, of acyclic directed graphs over N such that G_1 = G, G_m = H and G_{i+1} is obtained from G_i by a legal arrow reversal for i = 1, …, m−1.

However, the question of choosing a suitable representative of an equivalence class has no natural solution in the universum of acyclic directed graphs. There is no rule which allows one to choose a distinguished representative in any given equivalence class of acyclic directed graphs. Thus, hybrid graphs like essential graphs [6] or patterns [151] were used in the literature to represent these equivalence classes. The topic of learning DAG models, more exactly the identification of the essential graph on the basis of the induced independence model (which could be obtained as a result of statistical tests based on data), was addressed in Verma and Pearl [152], Meek [89] and Chickering [22].

Remark 3.4. It is a speciality of the case of acyclic directed graphs that the respective concept of (recursive) factorization for marginally continuous probability measures (see p. 164 for a definition in the discrete case) coincides with the concept of Markovness [69]. Another special feature of this case is that an analog of the local Markov property is equivalent to the global Markov property [69]. This fact can also be derived from the result by Verma and Pearl in [150] saying that the smallest semi-graphoid containing the following collection of independence statements

a_i ⊥⊥ {a_1, …, a_{i−1}} \ pa_G(a_i) | pa_G(a_i)   for i = 1, …, n,

where a_1, …, a_n, n ≥ 1, is a total ordering of nodes of G consonant with the direction of arrows (see p. 220), is nothing but the induced model M_G. The above collection of independence statements is often called a causal (input) list [150, 14].

Contrary to the case of UG models, the restriction of a DAG model need not be a DAG model as the following example shows.
Fig. 3.4. An acyclic directed graph with hidden variable e.
Example 3.1. There exists a DAG model over N = {a, b, c, d, e} whose restriction to T = {a, b, c, d} is not a DAG model over T. Consider the independence model induced by the graph in Figure 3.4. It was shown in Dawid and Studený [32] (Lemma 5.1) that its restriction to T is not a DAG model. This unpleasant property of DAG models probably motivated attempts to extend the study to DAG models with hidden variables, that is, restrictions of DAG models (to non-empty subsets of the set of variables) – see Section 3.5.7. ♦

Remark 3.5. The observation that DAG models are not closed under restriction has an important consequence which may not be evident at first sight. The consequence is that DAG models cannot be characterized in a quasi-axiomatic way, more exactly, in the way UG models were characterized and discrete probabilistic CI models could possibly be characterized. What follows is an attempt to formulate the basic argument for this statement rather than a real rigorous proof of it. An exact proof would require a lot of technical auxiliary concepts. These concepts are analogous to the concepts from mathematical logic [92] and are needed to differentiate thoroughly the syntactic and semantic aspects of the considered formal properties (of independence models). The reader can find some details about these technical concepts in § 5 of Studený and Vejnarová [144].

First, let me explain informally what kind of characterization I have in mind. It is a characterization by means of a (possibly infinite) system of inference rules which may involve both conditional independence and conditional dependence statements. The general form of these inference rules is as follows:

[ a_1 and … and a_n ]  implies  [ c_1 or … or c_m ],   (3.1)

where n ≥ 0, m ≥ 1; that is, the conjunction of a (possibly empty) set of statements, which are called antecedents, implies the disjunction of a non-empty set of (other) statements, which are called consequents. More specifically, every statement involved in (3.1) is composed of sets of variables taken from an associated collection of pairwise disjoint "atomic" sets A_1, …, A_k, k ≥ 3. The only acceptable way of composing (components of a statement) is the union of these atomic sets. For example, the contraction property on p. 13 can be viewed as an inference rule of this type: it has two antecedents, namely A ⊥⊥ B | DC and A ⊥⊥ D | C, and only one consequent A ⊥⊥ BD | C. These (independence) statements are composed of sets taken
from a collection of four atomic sets A, B, C, D. The symbols for atomic sets used in a syntactic record of an inference rule of this type are interpreted as "free predicates". This means that, given a particular non-empty set of variables N, one can substitute for those symbols any particular collection of pairwise disjoint subsets of N. The free predicate interpretation implicitly means that the choice of any particular atomic set A_i, 1 ≤ i ≤ k, cannot be influenced or restricted by the choice of the remaining atomic sets except for the requirement of overall disjointness. For example, in the above example of the contraction property, one is not allowed to modify this inference rule by an additional specific condition D ≡ N \ ABC, where N denotes the considered set of variables. This is because this condition brings an extra restriction to the choice of atomic sets. Indeed, if N = {a, b, c, d} then it would make it impossible to substitute A = {a}, B = {b}, C = ∅ and D = {d}, which contradicts the above-mentioned principle of free choice. The only acceptable modification of this general rule is that which results from the use of special atomic set predicates. These are the symbols for the empty set and singletons: one has to substitute sets of the corresponding type for them and comply simultaneously with the requirement of overall disjointness. For example, the symbol for the empty set is used in the triviality property on p. 13 and a symbol for a singleton is used in the weak transitivity property on p. 34.

Now, given a particular non-empty finite set of variables N and a formal independence model M over N, one is able to resolve whether or not M satisfies the formal property expressed by (3.1). More specifically, every disjoint triplet over N is interpreted as a statement: triplets from M are interpreted as conditional independence statements and the triplets from T(N) \ M as conditional dependence statements. Then one says that M satisfies the formal property (3.1) if, for every feasible choice of particular atomic sets, the condition

[ c_1 or … or c_m or ¬a_1 or … or ¬a_n ]   (3.2)

holds. Note that, in (3.2), the negation of an independence statement A ⊥⊥ B | C is the corresponding dependence statement and vice versa. The above-mentioned free predicate interpretation of symbols for atomic sets implies that if M ⊆ T(N) satisfies (3.1) and ∅ ≠ T ⊆ N, then its restriction M_T ⊆ T(T) satisfies it as well. Indeed, it suffices to realize that every feasible choice of (a collection of) subsets of T for atomic sets in (3.1) can be viewed as a feasible choice of subsets of N.

By a quasi-axiomatic characterization of an abstract family of independence models, say of the class of DAG models, is meant a (possibly infinite) system of inference rules of type (3.1) such that, for every non-empty finite set of variables N, an independence model M ⊆ T(N) falls within the family iff it satisfies all the formal properties from the system. There is no such system for the class of DAG models, as otherwise DAG models would be closed under restriction. This, however, is not true by Example 3.1. On the other hand, I conjecture that discrete probabilistic CI models can be characterized by means of an infinite (but countable) collection of inference
rules of the above type. This conjecture is based on Proposition 2 in Studený [132] and the fact that the class of discrete CI structures is closed under restriction and permutation of variables – see Direction 1 in Chapter 9.
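To illustrate what "satisfying" a rule of type (3.1) means operationally, here is a toy sketch of mine that tests a formal independence model against the contraction property by ranging over all feasible choices of the atomic sets (the function name and encoding are ad hoc, and statements are stored as ordered triples for simplicity):

```python
# A toy sketch (mine) of testing a formal independence model against an
# inference rule of type (3.1), here the contraction property:
# A ⊥⊥ B | DC and A ⊥⊥ D | C imply A ⊥⊥ BD | C.
# Atomic sets are "free predicates": all pairwise disjoint choices are tried.
from itertools import product

def satisfies_contraction(model, N):
    """model: set of triples (A, B, C) of frozensets, read as A ⊥⊥ B | C."""
    N = list(N)
    for labels in product(range(5), repeat=len(N)):   # label 4 = unused node
        A, B, C, D = (frozenset(v for v, k in zip(N, labels) if k == i)
                      for i in range(4))
        if (A, B, D | C) in model and (A, D, C) in model \
                and (A, B | D, C) not in model:
            return False
    return True

m = {(frozenset('a'), frozenset('b'), frozenset('d')),   # a ⊥⊥ b | d
     (frozenset('a'), frozenset('d'), frozenset())}      # a ⊥⊥ d | {}
print(satisfies_contraction(m, 'abd'))                   # False
m.add((frozenset('a'), frozenset('bd'), frozenset()))    # add a ⊥⊥ bd | {}
print(satisfies_contraction(m, 'abd'))                   # True
```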
3.3 Classic chain graphs

A chain graph is a hybrid graph without directed cycles or, equivalently, a hybrid graph which admits a chain (see Section A.3, p. 221). This class of graphs was introduced by Lauritzen and Wermuth in the mid 1980s in a research report [64] which later became the basis of a journal paper [67]. Classic interpretation of chain graphs is based on the moralization criterion for chain graphs established by Lauritzen [68] and Frydenberg [39]. The main distinction between the moralization criterion for chain graphs and the one for acyclic directed graphs (see p. 47) is a more general definition of the moral graph in the case of chain graphs. Supposing G_T is a hybrid graph over ∅ ≠ T ⊆ N, one defines a graph K with mixed edges over T by adding a line a – b in K whenever there exist c, d ∈ T belonging to the same connectivity component of G_T (possibly c = d) such that a → c in G_T and b → d in G_T. The moral graph H of G_T is then the underlying graph of K. A triplet ⟨A, B|C⟩ ∈ T(N) is represented in a chain graph G over N according to the moralization criterion if C separates between A and B in the moral graph of G_T where T = an_G(ABC). For illustration see Figure 3.5.
Fig. 3.5. Testing ⟨a, d | {b, e, g}⟩ according to the moralization criterion for chain graphs – the triplet is not represented (panels: original and induced graph, moral graph).
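A sketch of this more general moral-graph construction follows (mine, with an invented tiny example graph; lines are given as an undirected adjacency and arrows as parent sets):

```python
# A sketch (mine) of the moral graph for chain graphs: besides dropping edge
# directions, a line a -- b is added whenever a and b send arrows into the
# same connectivity component of the line-part of the graph.
def line_components(lines, T):
    comps, seen = [], set()
    for v in T:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            for w in lines.get(stack.pop(), set()) & T:
                if w not in comp:
                    comp.add(w); stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def cg_moral_graph(lines, pa, T):
    adj = {v: set(lines.get(v, set()) & T) for v in T}
    for v in T:                              # underlying graph: drop directions
        for p in pa.get(v, set()) & T:
            adj[v].add(p); adj[p].add(v)
    for comp in line_components(lines, T):   # marry parents of each component
        parents = set()
        for c in comp:
            parents |= pa.get(c, set()) & T
        for p in parents:
            for q in parents - {p}:
                adj[p].add(q)
    return adj

# invented example: arrows a -> b, e -> c, c -> d and a line b -- c
lines = {'b': {'c'}, 'c': {'b'}}
pa = {'b': {'a'}, 'c': {'e'}, 'd': {'c'}}
print(cg_moral_graph(lines, pa, {'a', 'b', 'c', 'd', 'e'}))
# a -- e is added since both send arrows into the component {b, c}
```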
An equivalent c-separation criterion (c stands for "chain"), which generalizes the d-separation criterion for acyclic directed graphs, was introduced in Bouckaert and Studený [15]. This criterion was later simplified as follows [142]. By a section of a route ω : c_1, …, c_n, n ≥ 1, in a hybrid graph G is understood a maximal undirected subroute c_i – … – c_j of ω (that is, either i = 1 or [c_{i−1}, c_i] is not a line, and analogously for j). By a collider
section of ω is understood a section c_i, …, c_j, 1 < i ≤ j < n, such that c_{i−1} → c_i – … – c_j ← c_{j+1} in ω. A route ω is superactive with respect to a set C ⊆ N if
• every collider section of ω contains a node of C,
• every other section of ω is outside C.
A route which is not superactive with respect to C is intercepted by C. A triplet
A, B|C ∈ T (N ) is represented in G according to the c-separation criterion if every route in G from A to B is intercepted by C. The equivalence of the cseparation criterion and the moralization criterion was shown in Studen´ y and Bouckaert [141] (Consequence 4.1). One writes A ⊥ ⊥ B | C [G] if A, B|C is represented in a chain graph G according to one of these criteria. The induced formal independence model is then ⊥ B | C [G] } . MG = { A, B|C ∈ T (N ) ; A ⊥ Thus, the class of CG models was introduced. Since the c-separation criterion generalizes both the separation criterion for undirected graphs and the dseparation criterion for acyclic directed graphs, every UG model and every DAG model is a CG model (for illustration see Figure 3.6 on p. 56). Every CG model is a graphoid satisfying the composition property [141]. Note that Example 3.1 can also serve as an example of the fact that the restriction of a CG model need not be a CG model. Therefore, one can repeat the arguments from Remark 3.5 showing that CG models cannot be characterized by means of formal properties of “semi-graphoid” type. Remark 3.6. In contrast to an analogous situation in the case of undirected and acyclic directed graphs, intercepting of all routes required in the cseparation criterion is not equivalent to intercepting of all paths. Consider the chain graph G in the left-hand picture of Figure 3.5. The only path bec → d and this path is intercepted by tween A = {a} and B = {d} is a → b c e ← f → g ← c → d is superC = {b, e, g}. However, the route a → b active with respect to C. For this reason one has ¬{ a ⊥⊥ d | {b, e, g} [G] }. Despite the fact that the class of all routes between two sets could be infinite, the c-separation criterion is finitely implementable for another reason – see § 5.2 in Studen´ y [142]. Note that the above-mentioned phenomenon was the main reason why the original version of the c-separation criterion [15] looked awkward. It was formulated for a special finite class of routes called “trails” and complicated by subsequent inevitable intricacies. A probability measure P over N is Markovian with respect to a chain graph G over N if A⊥ ⊥ B | C [G] implies A ⊥ ⊥ B | C [P ]
for every A, B|C ∈ T (N ) ,
and perfectly Markovian if the converse implication holds as well. The main result of Studen´ y and Bouckaert [141] says that a perfectly Markovian positive
discrete probability measure exists for every chain graph. In particular, the faithfulness (in the sense of Section 1.1) is ensured in the case of CG models as well.

Two chain graphs over N are Markov equivalent if their classes of Markovian measures coincide. These graphs were characterized in graphical terms by Frydenberg [39]. By a complex in a hybrid graph G over N is understood every induced subgraph of G for a set T = {d_1, …, d_k}, k ≥ 3, such that d_1 → d_2, d_i – d_{i+1} for i = 2, …, k−2, d_{k−1} ← d_k in G, and no additional edge between (distinct) nodes of {d_1, …, d_k} exists in G. Two chain graphs over N are Markov equivalent iff they have the same underlying graph and the same complexes. However, contrary to the case of acyclic directed graphs, the advanced question of representation of Markov equivalence classes has an elegant solution. Every class of Markov equivalent chain graphs contains a naturally distinguished member! Given two chain graphs G and H over N having the same underlying graph, one says that G is larger than H if every arrow in G is an arrow in H with the same direction. Frydenberg [39] showed that every class of Markov equivalent chain graphs contains a graph which is larger than every other chain graph within the class, that is, it has the greatest number of lines. This distinguished graph is named the largest chain graph of the equivalence class. A graphical characterization of those graphs which are the largest chain graphs was presented in Volf and Studený [153]. That paper also describes an algorithm for transforming every chain graph into the respective largest chain graph. An alternative algorithm was presented in Studený [139], where the problem of finding the largest chain graph on the basis of the induced formal independence model was solved. Roverato [110] has recently proposed an essentially equivalent algorithm, but his formal description of the respective elementary operation with chain graphs is much more elegant. These procedures could be utilized for learning CG models.

Remark 3.7. Lauritzen [70], in § 3.2.3, defined the concept of factorization of a (marginally continuous) measure with respect to a chain graph. As in the case of undirected graphs, every factorizable measure is Markovian and the converse is true for positive measures [39] but not for general discrete measures. Let us say that two chain graphs over N are factorization equivalent if the corresponding classes of discrete factorizable measures coincide. Note that the hypothesis that factorization equivalence and Markov equivalence for chain graphs coincide has recently been confirmed in Štěpánová [127].
3.4 Within classic graphical models

This section deals with some methods for describing probabilistic structures which, in fact, fall within the scope of classic graphical models.
3.4.1 Decomposable models

A very important class of undirected graphs is that of triangulated graphs. An undirected graph G is called triangulated or chordal if every cycle of length at least four, a_1, …, a_n = a_1, n ≥ 5, in G has a "chord", that is, a line between nodes of {a_1, …, a_{n−1}} different from the lines of the cycle. There are several equivalent definitions of a chordal graph; one of them (see Lauritzen [70], Proposition 2.5) says that the graph can be decomposed in a certain way (see p. 204) into its cliques, which motivated another alternative name, decomposable graph. For this reason UG models induced by triangulated graphs are named decomposable models by Pearl [100]. Another equivalent definition (see Lauritzen [70], Proposition 2.17) is that all cliques of the graph can be ordered into a sequence C_1, …, C_m, m ≥ 1, satisfying the running intersection property

$$\forall\, 2 \le i \le m \;\; \exists\, 1 \le k < i : \quad S_i \equiv C_i \cap \Bigl(\,\bigcup_{j<i} C_j\Bigr) \subseteq C_k. \qquad (3.3)$$
Note that the phrase acyclic hypergraph is sometimes used in the literature to name a class of sets admitting an ordering of this type. The sets $S_i$ are then called separators, since $S_i$ separates the "history" $H_i = \bigcup_{j<i} C_j$, more precisely $H_i \setminus S_i$, from the residual $C_i \setminus S_i$ in the graph.
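Checking the running intersection property for a given clique ordering is mechanical; a short sketch of mine:

```python
# A short sketch (mine) of testing the running intersection property (3.3)
# for an ordering C_1, ..., C_m of the cliques of an undirected graph.
def has_rip(cliques):
    for i in range(1, len(cliques)):
        history = set().union(*cliques[:i])
        S_i = cliques[i] & history
        if not any(S_i <= cliques[k] for k in range(i)):
            return False
    return True

# cliques of the path a - b - c - d:
print(has_rip([{'a', 'b'}, {'b', 'c'}, {'c', 'd'}]))   # True
print(has_rip([{'a', 'b'}, {'c', 'd'}, {'b', 'c'}]))   # False: {b, c} fits in no earlier clique
```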
Fig. 3.6. Relationships between classic graphical models.
Decomposable models are exactly the independence models which are simultaneously UG models and DAG models. For illustration see Figure 3.6. A characterization of decomposable models in terms of a finite number of formal properties of independence models was given by de Campos [19]. It implies that decomposable models are closed under restriction.

3.4.2 Recursive causal graphs

The concept of a recursive causal graph from Kiiveri et al. [56] seems to precede the concept of a chain graph. Such a graph can be equivalently defined as a chain graph which admits a chain such that all lines of the graph belong to the first block. Thus, both undirected and acyclic directed graphs are special cases of recursive causal graphs. The way to ascribe an independence model to a recursive causal graph is consonant with the way used in the case of classic chain graphs.

3.4.3 Lattice conditional independence models

Andersson and Perlman [5] came up with an idea to describe probabilistic CI structures by finite lattices (of subsets of N). Given a ring R of subsets of N, one says that a probability measure over N satisfies the lattice conditional independence model (= LCI model) induced by R if

(E \ F) ⊥⊥ (F \ E) | E ∩ F [P]   for every E, F ∈ R.
However, it was found later in [8] that LCI models coincide with DAG models induced by transitive acyclic directed graphs, in which a → b and b → c implies a → c. Thus, LCI models also fall within the scope of classic graphical models. Note that these models are advantageous from the point of view of learning. It was shown by Perlman and Wu [102] that an explicit formula for the maximum likelihood estimate exists even in the case of a "non-monotone" pattern of missing data.

3.4.4 Bubble graphs

Shafer [115], in § 2.3, defined bubble graphs, which are not graphs in the standard sense mentioned in Section A.3. A bubble graph over N is specified by an ordered partition B_1, …, B_n, n ≥ 1, of N into non-empty subsets called bubbles and by a collection of directed links which point to bubbles, although they originate from single nodes taken from the preceding bubbles. Every graph of this type describes a class of probability measures over N which satisfy a certain factorization condition. One can associate a chain graph with every bubble graph as follows: join nodes in each bubble by lines and replace any directed link from a node a ∈ N to a bubble B ⊂ N by the collection of arrows from a to every node of B. Then one can easily show that a probability measure over N satisfies the factorization condition corresponding to the bubble graph iff it factorizes with respect to the ascribed chain graph in the sense mentioned in Remark 3.7. In particular, every bubble graph can be interpreted as a classic chain graph. On the other hand, every DAG model can be described by a bubble graph.
3.5 Advanced graphical models

Various types of graphs have been recently proposed in the literature in order to describe probabilistic structures (possibly expressed in terms of structural equations for random variables). Some of these graphs can be viewed as tools for the description of CI structures (although this may not be the original aim of their respective authors). This section gives an overview of these graphical approaches. Note that most of the formal independence models that are ascribed to these graphs are semi-graphoids satisfying the composition property.

3.5.1 General directed graphs

A natural idea of how to generalize Bayesian networks is to allow directed cycles. Spirtes et al. (see [122], Chapter 12) mentioned the possible use of general directed graphs for describing models allowing feedback. They proposed to use the d-separation criterion (see p. 48) to ascribe a formal independence model to a directed graph (which admits multiple edges). It was shown by Spirtes [123] that even in the case of general directed graphs, the
d-separation criterion is equivalent to the moralization criterion and the criteria are complete (in the sense of Remark 3.2) relative to the class of regular Gaussian measures. Richardson [105] published a graphical characterization of Markov equivalent directed graphs. It is more complex in comparison with the case of acyclic directed graphs (six independent conditions are involved in the characterization).

3.5.2 Reciprocal graphs

Koster [60] introduced a very general class of reciprocal graphs. A reciprocal graph G over N is a graph with mixed edges over N (multiple edges are allowed) such that there is no arrow in G between nodes belonging to the same connectivity component of G. Thus, every classic chain graph is a reciprocal graph and every (general) directed graph is a reciprocal graph as well. The moralization criterion for chain graphs (see p. 52) can be used to ascribe a formal independence model to every reciprocal graph. Note that in the case of general directed graphs it reduces to the moralization criterion treated by Spirtes [123]. Thus, the consistency assumption (see p. 3) is fulfilled for the universum of reciprocal graphs. The question of their faithfulness remains open but a related question of the existence of a perfect class of measures (see Remark 3.2) was answered positively. Koster's aim was to apply these graphs to simultaneous equation systems (LISREL models [53]). A certain reciprocal graph can be ascribed to every LISREL model so that the class of regular Gaussian measures complying with the LISREL model is perfect with respect to the assigned reciprocal graph (in the sense of Remark 3.2).

3.5.3 Joint-response chain graphs

Cox and Wermuth [28] generalized the concept of a chain graph by introducing two additional types of edges. A joint-response chain graph G is a chain graph (in the sense of Section A.3) in which, however, every arrow is either a solid arrow or a dashed arrow and every line is either a solid line or a dashed line. Thus, even four types of edges are allowed in a graph of this type. Moreover, two technical conditions are required for every connectivity component C of a joint-response chain graph, namely
• all lines within C are of the same type (i.e., either solid or dashed),
• all arrows directed to nodes in C are of the same type.
The interpretation of these graphs (see [28], § 2.3) is more likely in terms of what is known as the pairwise Markov property (see Remark 3.1). Namely, the absence of an edge between nodes a and b is interpreted as a CI statement a ⊥⊥ b | C where the set C ⊆ N \ ab depends on the type of "absent" edge. Note that the technical conditions above allow one to deduce implicitly what is the type of the "absent" edge.
The resulting interpretation of joint-response chain graphs which have only solid lines and arrows is then in agreement with the original interpretation of chain graphs (see Section 3.3), so that they generalize classic chain graphs. An analog of the global Markov property was established in two other special cases (see Sections 3.5.4 and 3.5.5).

Remark 3.8. Following an analogy with the evolution of classic graphical models (see Remarks 3.1 and 3.2), observe that in order to determine the strongest possible Markov condition (on the basis of a pairwise Markov condition) one also needs to know the distribution framework that is implicitly assumed (see Section A.9.5 for this vague concept). This comprehensive set of probability measures has traditionally been closely related to the respective class of graphs. In my view, it was the class of positive discrete measures over N, respectively the class of regular Gaussian measures over N, in the case of UG models (see Remark 3.1), the class of discrete measures over N in the case of DAG models (see Section 3.2, a reference to Geiger and Pearl [44] on p. 48) and the class of positive discrete measures over N in the case of CG models (see Section 3.3, a reference to Studený and Bouckaert [141] on p. 53). Moreover, it seems to be the class of regular Gaussian measures over N in the cases reported in Sections 3.5.4 and 3.5.5. Since Cox and Wermuth did not explicate the distribution framework which should correspond to general joint-response chain graphs, one cannot derive "automatically" the respective global Markov condition. Well, I can only speculate that they probably have in mind the distribution framework of regular Gaussian measures. In particular, the global Markov property for a general joint-response chain graph has not been introduced so far (see the note at the end of § 2.4.5 of Cox and Wermuth [28]) and the task to establish their consistency (see Section 1.1) is pressing. Thus, other theoretical questions mentioned in Section 1.1 do not make sense for the universum of joint-response chain graphs until the consistency is established for them.

3.5.4 Covariance graphs

However, the consistency has been ensured in a special case of joint-response chain graphs, namely undirected graphs made of dashed lines and named covariance graphs. Note that in order to distinguish two types of undirected graphs, Cox and Wermuth [28] used the term concentration graphs to name undirected graphs made of solid lines, which have the traditional interpretation described in Section 3.1. Kauermann [54] formulated a global Markov property for covariance graphs which is equivalent to the above-mentioned condition of Cox and Wermuth for every probability measure whose induced independence model satisfies the composition property. A triplet ⟨A, B|C⟩ ∈ T(N) is represented in
a covariance graph G if N \ ABC separates between A and B. Thus, every covariance graph induces a graphoid satisfying the composition property. Kauermann [54] also showed that the class of Gaussian measures is perfect (in the sense of Remark 3.2) with respect to every covariance graph. In particular, his criterion is the strongest possible one for the considered class of measures.

3.5.5 Alternative chain graphs

Another class of joint-response chain graphs for which the global Markov property was established are chain graphs which consist of only solid lines and dashed arrows. Led by a specific way of parameterization of regular Gaussian measures, Andersson et al. [9] introduced an "alternative Markov property" (AMP) for chain graphs. Their alternative chain graphs are chain graphs in the sense of Section A.3 but their interpretation is different from the interpretation of classic chain graphs (see Section 3.3), so that they correspond to the above-mentioned joint-response chain graphs (see [9], § 1 for details). The corresponding augmentation criterion is analogous to the moralization criterion for classic chain graphs but it is more complex. Testing whether a triplet ⟨A, B|C⟩ ∈ T(N) is represented in an alternative chain graph G over N consists of 3 steps. The first step is a special restriction of G to an "extended graph" over a set T ⊆ N involving ABC (= an analog of the induced graph G_T with T = an_G(ABC) in the moralization criterion). The second step is the transformation of the extended graph into an undirected "augmented graph". This is done by adding some edges and taking the underlying graph (= an analog of the moralization procedure). The third step is testing whether C separates between A and B in the augmented graph. As in the case of classic chain graphs, an equivalent p-separation criterion (p stands for "path") was introduced [73]. The main result of Levitz et al. [73] is the existence of a perfectly Markovian regular Gaussian measure for every alternative chain graph. Thus, the faithfulness is ensured in the case of these models. Moreover, Markov equivalent alternative chain graphs (relative to the positive Gaussian distribution framework) were also characterized in graphical terms by Andersson et al. [9]. Every Markov equivalence class can be represented by the respective essential graph (for details see [9], § 7).

3.5.6 Annotated graphs

Paz [98] proposed a special fast implementation (that is, modification) of the moralization criterion for acyclic directed graphs. In the preparatory stage of that procedure the original directed graph G is changed into its moral graph and each immorality a → c ← b in G is recorded by annotation of the edge a – b in the moral graph by the set C of all descendants of c in G. Thus, the original graph over N is changed into an annotated graph over N, that is, an undirected graph supplemented by a collection of "elements" [ {a, b} | C ], where a, b ∈ N, a ≠ b and ∅ ≠ C ⊆ N \ {a, b}, which represent annotated
edges. Testing whether a triplet ⟨A, B|C⟩ ∈ T(N) is represented in G is based on the application of a special membership algorithm for annotated graphs. This algorithm consists of the following steps: successive restriction of the graph, removal of (respective) annotated edges and, finally, checking to see whether C separates between A and B in the resulting graph. This procedure is equivalent to the moralization algorithm [98]. The point is that this approach has much wider applicability. In Paz et al. [97] the class of regular annotated graphs was introduced together with the corresponding general membership algorithm. A formal independence model induced in this way by a regular annotated graph has been shown to be a graphoid. A regular annotated graph can serve as a condensed record for the smallest graphoid containing the unions of UG models (= their graphoid closure). Given a sequence of undirected graphs G_i = (N_i, L_i), i = 1, …, k (k ≥ 1), such that N_i ⊆ N_{i+1} and L_i ⊆ L_{i+1} for i = 1, …, k−1, a special annotation algorithm described in [97] allows us to construct a regular annotated graph over N = N_k such that the independence model induced by it is just the graphoid closure of all UG models induced by G_i, i = 1, …, k. Since every (classic) CG model can be obtained in this way, regular annotated graphs generalize classic graphical models. The idea of the annotation of edges was also applied in the case of acyclic directed graphs. Bouckaert [13] annotated arrows by pairs of disjoint sets of nodes and introduced a special separation criterion for the respective graphs.

3.5.7 Hidden variables

Example 3.1 shows that the restriction of a DAG model need not be a DAG model. This may have led to an idea to describe restrictions of DAG models by means of graphical diagrams. These models are usually named the DAG models with hidden variables because besides "observed" variables in N, one anticipates other "unobserved" hidden variables L and a DAG model over NL. Another common name for hidden variables is "latent variables". Geiger et al. [47] introduced the concept of an embedded Bayesian network. It is a graph over (observed variables) N allowing both directed and bidirected edges (without multiple edges) such that purely directed cycles, that is, directed cycles consisting exclusively of arrows, are not present in the graph. A generalized d-separation criterion was used to ascribe a formal independence model over N to a graph of this type. It is mentioned in [47] that one can always find a DAG model over a set M ⊇ N whose restriction to N is the ascribed independence model. Moreover, according to Pearl's oral communication, Verma showed that the restriction of every DAG model can be described in this way. Note that the faithfulness (see p. 3) in the case of embedded Bayesian networks is an easy consequence of the faithfulness in the case of DAG models and the above-mentioned claims. However, there are other graphical methods for describing models with hidden variables. For example, Richardson [108] introduced acyclic directed
mixed graphs and showed that independence models induced by those graphs (through the respective separation criterion) are exactly the DAG models with hidden variables. Other options are summary graphs from § 8.5 of Cox and Wermuth [28] and ancestral graphs mentioned below.

3.5.8 Ancestral graphs

Motivated by the need to describe classes of Markov equivalent (general) directed graphs, Richardson [106] proposed to use special graphical objects called partial ancestral graphs (PAGs) for this purpose. A PAG is a graph whose every edge has one of three possible endings for both its end-nodes and where the endings of different edges near a common end-node may be connected by two possible "connections". Every mark of this type in a PAG expresses a certain graphical property shared by all graphs within the Markov equivalence class, for example, that a node is not an ancestor of another node in all equivalent graphs. The idea of graphical representation of shared features in classes of Markov equivalent graphs was later substantially simplified. In a recent paper, Richardson and Spirtes [107] introduced ancestral graphs. These graphs admit three types of edges, namely lines, arrows and bi-directed edges (e.g. a ↔ b), and satisfy some additional requirements. These requirements imply that multiple edges and loops are not present in ancestral graphs. A formal independence model over N is ascribed to an ancestral graph over N by means of the m-separation criterion, which generalizes the d-separation criterion for acyclic directed graphs. Additional standardization of ancestral graphs is desirable. A maximal ancestral graph (MAG) is an ancestral graph G such that [a, b] is an edge in G iff ¬{ a ⊥⊥ b | K [G] } for every K ⊆ N \ {a, b}. MAGs exhibit some elegant mathematical properties. One can define graphical operations of marginalizing and conditioning of MAGs which correspond to the respective operations with induced formal independence models (cf. Section 9.2.1). Edges of a MAG G correspond to single real parameters in a certain parameterization of the class of regular Gaussian measures which are Markovian with respect to G. Moreover, there exists a perfectly Markovian Gaussian measure with respect to G. Thus, the question of faithfulness (see Section 1.1) has a positive solution in this case. Note that MAG models involve both UG models and DAG models and coincide with the class of models induced by summary graphs – see § 9.3.1 in Richardson and Spirtes [107]. According to Richardson's oral communication, MAGs without lines induce just DAG models with hidden variables.

3.5.9 MC graphs

Koster [61] introduced a certain class of graphs which admit the same three types of edges as ancestral graphs. However, in these graphs, called MC graphs,
multiple edges and some loops are allowed. The abbreviation MC means that graphical operations of “marginalizing” and “conditioning” can be applied to these graphs (as in the case of MAGs). However, unlike the m-separation criterion, the respective separation criterion for MC graphs requires “blocking” of all routes (as in the c-separation criterion for classic chain graphs – cf. Remark 3.6). As mentioned in § 9.2 of Richardson and Spirtes [107] the separation criterion for MC graphs generalizes the m-separation criterion. Thus, the class of formal independence models induced by MC graphs involves MAG models. On the other hand, although MC graphs include chain graphs, the respective separation criterion in the case of chain graphs differs both from the c-separation criterion and from the p-separation criterion.
3.6 Incompleteness of graphical approaches

Let me raise the question of how many probabilistic CI models can be described by graphs (cf. the question of completeness in Section 1.1, p. 3). The expressiveness of graphical methods varies. For example, for |N| = 3 there are 8 UG models and 11 DAG models (= CG models). But for |N| = 4 there are 64 UG models, 185 DAG models and 200 CG models, while for |N| = 5 there exist 1024 UG models, 8782 DAG models [6] and 11519 CG models [153]. However, this is not enough to describe CI structures induced by discrete probability measures. In the case |N| = 3 one has 22 discrete CI models, but in the case |N| = 4 there are as many as 18300 CI models [136]! So, there is a tremendous gap between the number of classic graphical models and the number of discrete probabilistic CI structures in the case |N| = 4, and this gap increases with |N|. In particular, classic graphical models cannot describe all CI structures.

The reader may object that a sufficiently wide class of graphs could possibly cure the problem. Let me give an argument against this. Having fixed a class of graphs over N in which only finitely many types of edges are allowed, the number of these graphs is bounded by the cardinality of the power set of the set of possible edges. The number of these possible edges grows polynomially with n = |N|. This means the upper estimate for the number of graphs has the form $2^{p(n)}$ where p is a polynomial. On the contrary, as shown in Corollary 2.7 on p. 38, the number of discrete probabilistic CI structures grows with n at least as rapidly as the power of a power of n, that is, as rapidly as $2^{2^{\lfloor n/2 \rfloor}}$. Thus, in my opinion, one can hardly achieve the completeness of a graphical approach (see p. 3) relative to the class of discrete measures, and this may result in serious methodological errors (see Section 1.1, p. 5). Well, perhaps one can think about a class of advanced complex graphs which admit exponentially many "(hyper)edges" (e.g., annotated graphs) and which have a chance to achieve completeness. But complex graphs of this sort may lose their easy interpretability for humans.
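For a feel of the gap, the two bounds can be compared numerically (a throwaway computation of mine; here n(n−1)/2 is the number of possible edges of a single type between distinct nodes):

```python
# Throwaway comparison (mine) of the two growth rates discussed above:
# at most 2^e(n) graphs with e(n) = n(n-1)/2 possible edges of one type,
# versus at least 2^(2^(n//2)) discrete CI structures by Corollary 2.7.
for n in range(3, 17):
    e = n * (n - 1) // 2     # exponent of the graph-count bound
    c = 2 ** (n // 2)        # exponent of the CI-structure bound
    print(f"n={n:2d}: graphs <= 2^{e:3d}, CI structures > 2^{c:3d}, CI wins: {c > e}")
```

For this single-edge-type bound the double-exponential lower bound overtakes the graph count from n = 14 on, and the margin then grows extremely fast.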
The conclusion above was the reason for an attempt to develop a non-graphical approach to the description of probabilistic CI structures. The approach described in subsequent chapters achieves completeness relative to the discrete distribution framework and has some chance of being acceptable to humans. On the other hand, the mathematical objects which are used for this purpose describe more than is necessary in the sense that some “described” formal independence models are not probabilistic CI models. The loss of faithfulness is a natural price for the possibility of interpretation and a relatively good solution to the equivalence question. Nevertheless, I consider these two gains more valuable than faithfulness.
4 Structural Imsets: Fundamentals
The moral of the preceding chapter is that the main drawback of graphical models is their inability to describe all probabilistic conditional independence structures. This motivated an attempt to develop an alternative method for their description which overcomes that drawback and keeps some assets of graphical methods. The central notion of this method is the concept of a structural imset introduced in this chapter. Note that basic ideas of the theory were presented earlier [137] but (later recognized) superfluous details affected understanding of the message of the original series of papers. This monograph brings (in the next four chapters) a much simpler presentation, supplemented by new facts and perspectives.
4.1 Basic class of distributions The class of probability measures for which this approach is applicable, that is, those whose induced conditional independence models can be described by structural imsets, is relatively wide. It is the class of measures over N with finite multiinformation mentioned in Section 2.3.4. The aim of this section is to show that this class involves three basic classes of measures used in practice in artificial intelligence and multivariate statistics. 4.1.1 Discrete measures These simple probability measures (see Remark 2.2, p. 11) are mainly used in probabilistic reasoning [100] which is an area of artificial intelligence. Positive discrete probability measures are at the core of the models used in the analysis of contingency tables (see [70], Chapter 4), which is an area of statistics. The fact that every discrete probability measure over N has finite multiinformation is evident.
4.1.2 Regular Gaussian measures

These measures (see Section 2.3.6 for their basic properties) are widely used in mathematical statistics, in particular in multivariate statistics [28]. Corollary 2.6 says that every regular Gaussian measure over N has finite multiinformation.

4.1.3 Conditional Gaussian measures

This class of measures was proposed by Lauritzen and Wermuth [67] with the aim to unify discrete and continuous graphical models. In this book, their original class of measures is slightly extended. A conditional Gaussian measure P over N, called a CG-measure over N, will be specified as follows. The set N is partitioned into the set ∆ of discrete variables and the set Γ of continuous variables. For every i ∈ ∆, X_i is a finite non-empty set and X_i = P(X_i). For every i ∈ Γ, X_i = ℝ and X_i is the class of Borel sets in ℝ. A (discrete) probability measure P_∆ on (X_∆, X_∆) is given. Moreover, provided that Γ ≠ ∅, a vector e(x) ∈ ℝ^Γ and a positive definite Γ × Γ matrix Σ(x) ∈ ℝ^{Γ×Γ} are ascribed to every x ∈ X_∆ with P_∆(x) > 0. Then P is simply determined by its marginal for ∆ and by the conditional probability on X_Γ given ∆:

P^∆ ≡ P_∆,   P_{Γ|∆}(·|x) ≡ N(e(x), Σ(x))   for every x ∈ X_∆ with P_∆(x) > 0.
Of course, these two requirements determine a unique probability measure on (X_N, X_N). The above definition collapses in the case Γ = ∅ to a discrete measure over N and in the case ∆ = ∅ to a regular Gaussian measure over N.

Remark 4.1. Note that positive CG-measures (i.e., such that P_∆(x) > 0 for every x ∈ X_∆) are mainly used in practice. A CG-measure of this type can be defined (see Lauritzen [70], § 6.1.1) by its density f with respect to the product of the counting measure on X_∆ and the Lebesgue measure on X_Γ:

f(x, y) = exp{ g(x) + h(x)ᵀ·y − ½·yᵀ·Γ(x)·y }   for x ∈ X_∆, y ∈ X_Γ,

where g(x) ∈ ℝ, h(x) ∈ ℝ^Γ and the positive definite matrices Γ(x) ∈ ℝ^{Γ×Γ} are named canonical characteristics of P. One can compute them directly from the parameters P_∆(x), e(x), Σ(x), which are named moment characteristics of the CG-measure, as follows (see Lauritzen [70], p. 159):

Γ(x) = Σ(x)⁻¹,   h(x) = Σ(x)⁻¹·e(x),
g(x) = ln P_∆(x) − (|Γ|/2)·ln(2π) − ½·ln(det(Σ(x))) − ½·e(x)ᵀ·Σ(x)⁻¹·e(x).

These measures are positive in the sense of Section 2.3.5 but they do not involve all discrete measures. For this reason, the original class of CG-measures from Lauritzen and Wermuth [67] has been slightly extended in this book.
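To make the conversion concrete, here is a small numerical sketch in Python (mine, not from the book; the parameter values are made up) implementing the moment-to-canonical transformation with numpy.

import numpy as np

def canonical_characteristics(p_x, e, sigma):
    """p_x = P_Delta(x) > 0, e = e(x), sigma = Sigma(x) positive definite."""
    gamma = np.linalg.inv(sigma)                 # Gamma(x) = Sigma(x)^{-1}
    h = gamma @ e                                # h(x) = Sigma(x)^{-1} e(x)
    g = (np.log(p_x)
         - 0.5 * len(e) * np.log(2 * np.pi)      # -(|Gamma|/2) ln(2 pi)
         - 0.5 * np.log(np.linalg.det(sigma))    # -(1/2) ln det Sigma(x)
         - 0.5 * e @ gamma @ e)                  # -(1/2) e(x)^T Sigma(x)^{-1} e(x)
    return g, h, gamma

# hypothetical example: one discrete configuration, two continuous variables
g0, h0, G0 = canonical_characteristics(0.3, np.array([1.0, -2.0]),
                                       np.array([[2.0, 0.5], [0.5, 1.0]]))
print(g0, h0, G0)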
To evidence that every CG-measure has finite multiinformation (and thus, that it is marginally continuous) I use auxiliary estimates with relative entropies modified in a certain way. Supposing (X, X) is a measurable space, P and Q are probability measures and µ is a σ-finite measure on (X, X) such that P, Q ≪ µ, by the Q-perturbated relative entropy of P with respect to µ the integral

H(P | µ : Q) = ∫_X ln (dP/dµ)(x) dQ(x) ≡ ∫_X (dQ/dµ)(x) · ln (dP/dµ)(x) dµ(x)

will be understood, provided that the function ln(dP/dµ) is Q-quasi-integrable. Of course, the value does not depend on the choice of versions of the Radon–Nikodym derivatives dP/dµ or dQ/dµ. In case Q = P it coincides with H(P | µ) mentioned in Section A.6.3. A discrete version of this concept is known in information theory as Kerridge’s inaccuracy [148], pp. 322–323.

Lemma 4.1. Let (X, X) be a measurable space and µ a σ-finite measure on (X, X). Suppose that P₁, …, P_r, r ≥ 1, is a finite collection of probability measures on (X, X) such that −∞ < H(P_k | µ : P_l) < +∞ for every pair of indices k, l ∈ {1, …, r}. Then every convex combination of P₁, …, P_r has finite relative entropy with respect to µ, that is,

−∞ < H( ∑_{k=1}^{r} α_k · P_k | µ ) < +∞   whenever α₁, …, α_r ≥ 0, ∑_{k=1}^{r} α_k = 1.
Proof. Put P = ∑_{k=1}^{r} α_k · P_k, choose and fix a version of dP_k/dµ for every k and fix the version of dP/dµ = ∑_{l=1}^{r} α_l · (dP_l/dµ). The assumption says

∀ k, l ∈ {1, …, r}   ∫_X (dP_l/dµ)(x) · | ln (dP_k/dµ)(x) | dµ(x) < ∞.

We are expected to show that

∫_X | ln (dP/dµ)(x) | dP(x) = ∫_X ( ln (dP/dµ)(x) )⁺ dP(x) + ∫_X ( ln (dP/dµ)(x) )⁻ dP(x) < ∞.

To estimate the first term above we use the Radon–Nikodym theorem, the observation that the function y ↦ (y · ln y)⁺ is convex on [0, ∞) and the inequality y⁺ ≤ |y|:

∫_X ( ln (dP/dµ)(x) )⁺ dP(x) = ∫_X ( (dP/dµ)(x) · ln (dP/dµ)(x) )⁺ dµ(x) ≤
 ≤ ∑_{k=1}^{r} α_k · ∫_X ( (dP_k/dµ)(x) · ln (dP_k/dµ)(x) )⁺ dµ(x) ≤
 ≤ ∑_{k=1}^{r} α_k · ∫_X (dP_k/dµ)(x) · | ln (dP_k/dµ)(x) | dµ(x) < ∞.
To estimate the second term we use the fact that the function y ↦ (ln y)⁻ is convex on [0, ∞), the inequality y⁻ ≤ |y|, the Radon–Nikodym theorem and the definition of dP/dµ:

∫_X ( ln (dP/dµ)(x) )⁻ dP(x) ≤ ∑_{k=1}^{r} α_k · ∫_X ( ln (dP_k/dµ)(x) )⁻ dP(x) ≤
 ≤ ∑_{k=1}^{r} α_k · ∫_X | ln (dP_k/dµ)(x) | dP(x) =
 = ∑_{k=1}^{r} α_k · ∫_X ( ∑_{l=1}^{r} α_l · (dP_l/dµ)(x) ) · | ln (dP_k/dµ)(x) | dµ(x) =
 = ∑_{k,l=1}^{r} α_k · α_l · ∫_X (dP_l/dµ)(x) · | ln (dP_k/dµ)(x) | dµ(x) < ∞.
This concludes the proof.

Lemma 4.2. Let P be a CG-measure over N = ∆ ∪ Γ and µ = ∏_{i∈N} µ_i where µ_i = υ for i ∈ ∆ and µ_i = λ for i ∈ Γ. Then

−∞ < H(P | µ) < ∞   and   −∞ < H(P^{{i}} | µ_i) < ∞   for every i ∈ N.
Proof. A direct formula for H(P|µ) is easy to derive. Indeed, write

(dP/dµ)(x, y) = P_∆(x) · f_{e(x),Σ(x)}(y)   for x ∈ X_∆ with P_∆(x) > 0 and y ∈ X_Γ,

where f_{e(x),Σ(x)} is given by (2.19), apply logarithm, integrate with respect to P and obtain, using standard properties of the integral,

H(P|µ) = ∫_{X_N} ln P_∆(x) dP(x, y) + ∫_{X_N} ln f_{e(x),Σ(x)}(y) dP(x, y) =
 = H(P_∆ | µ_∆) + ∑_{x∈X_∆, P_∆(x)>0} P_∆(x) · H( P_{Γ|∆}(·|x) | µ_Γ ).

Note that one can use (A.12) to see that the terms in the latter sum are finite. The fact −∞ < H(P^{{i}} | µ_i) < ∞ for i ∈ ∆ is evident. For fixed i ∈ Γ first realize that P^{∆∪{i}} is again a CG-measure where

P_{{i}|∆}(·|x) = N( e(x)_i, Σ(x)_{i·i} )   for x ∈ X_∆ with P_∆(x) > 0.

Therefore, the marginal P^{{i}} for i ∈ Γ is nothing but a convex combination of regular Gaussian measures. To verify −∞ < H(P^{{i}}|λ) < +∞ one can use Lemma 4.1. Indeed, suppose P_k = N(e, β), P_l = N(f, γ) where
e, f ∈ ℝ, β, γ > 0 are the corresponding parameters. Because the expectation and variance of P_l are known, one can easily compute

H(P_k | λ : P_l) = ∫_ℝ [ −½·ln(2πβ) − (x−e)²/(2β) ] dP_l(x) =
 = −½·ln(2πβ) − (1/(2β)) · ∫_ℝ (x−e)² dP_l(x) =
 = −½·ln(2πβ) − (1/(2β)) · ∫_ℝ [ (x−f)² + 2·(f−e)·x + (e²−f²) ] dP_l(x) =
 = −½·ln(2πβ) − (1/(2β)) · [ γ + 2·(f−e)·f + (e²−f²) ] = −½·ln(2πβ) − (γ + (e−f)²)/(2β).

The obtained result is evidently a finite number.
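The closed form just obtained is easy to sanity-check numerically. A small Python sketch (mine, not from the book; the parameter values are arbitrary) integrates the log-density of P_k against P_l and compares with the formula.

import math
from scipy import integrate
from scipy.stats import norm

e, beta, f, gamma = 1.0, 2.0, -0.5, 0.7
integrand = lambda x: norm.logpdf(x, loc=e, scale=math.sqrt(beta)) * \
                      norm.pdf(x, loc=f, scale=math.sqrt(gamma))
numeric, _ = integrate.quad(integrand, -40, 40)
closed_form = -0.5 * math.log(2 * math.pi * beta) - (gamma + (e - f) ** 2) / (2 * beta)
print(numeric, closed_form)   # the two values should agree closely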
Corollary 4.1. Every CG-measure over N has finite multiinformation. Proof. Owing to Lemma 4.2 the assumptions of Lemma 2.7 on p. 28 for S = N are fulfilled. The fact above was verified by finding finite lower and upper estimates for the multiinformation. The question of whether there exists a suitable exact formula for values of the multiinformation function in terms of parameters of a CG-measure remains open (see Theme 1 in Chapter 9). Remark 4.2. The class of CG-measures is not closed under marginalizing, which may lead to problems when one tries to study CI within this framework. However, it was shown above that this class can be embedded into a wider class of measures with finite multiinformation that is already closed under marginalizing (see Corollary 2.2).
4.2 Classes of structural imsets

Definitions and elementary facts concerning structural imsets are gathered in this section.

4.2.1 Elementary imsets

An elementary imset over a non-empty set of variables N is an imset of a special form. Given an (elementary) triplet ⟨a,b|K⟩ where K ⊆ N and a, b ∈ N ∖ K are distinct (a ≠ b), the corresponding elementary imset u⟨a,b|K⟩ over N is defined by the formula

u⟨a,b|K⟩ = δ_{{a,b}∪K} + δ_K − δ_{{a}∪K} − δ_{{b}∪K}.
[Fig. 4.1. Elementary imsets over N = {a, b, c}: six Hasse diagrams of P(N), one for each elementary imset, marking every subset of N with its value +1, −1 or 0.]
The class of elementary imsets over N will be denoted by E(N). Of course, it is empty for |N| = 1. Let us assume that |N| = n ≥ 2. Then, by the level of an elementary imset u⟨a,b|K⟩ over N the number |K| will be understood. For every l = 0, …, |N|−2, the class of elementary imsets of level l will be denoted by E_l(N). It is easy to see that |E_l(N)| = \binom{n}{2}·\binom{n−2}{l} and |E(N)| = \binom{n}{2}·2^{n−2}. Thus, in the case N = {a, b, c} one has 6 elementary imsets of 2 possible levels. They are shown in Figure 4.1. The following observation is a basis of later results.

Proposition 4.1. Supposing n = |N| ≥ 2 and l ∈ {0, …, n−2}, let us introduce a multiset m_l over N by means of the formula

m_l(S) = max{ |S| − l − 1, 0 }   for every S ⊆ N,

and a multiset m* over N by means of the formula

m*(S) = ½·|S|·(|S|−1)   for every S ⊆ N.
Then one can observe the following facts:

(a) ⟨m_l, u⟩ = 1 for every u ∈ E_l(N),
(b) ⟨m_l, u⟩ = 0 for every u ∈ E(N) ∖ E_l(N),
(c) ⟨m*, u⟩ = 1 for every u ∈ E(N).

Proof. First, the identity m* = ∑_{l=0}^{n−2} m_l can be verified for S ⊆ N, |S| ≥ 2, as follows:

∑_{l=0}^{n−2} m_l(S) = ∑_{l=0}^{|S|−2} m_l(S) = ∑_{l=0}^{|S|−2} |S| − ∑_{l=0}^{|S|−2} l − ∑_{l=0}^{|S|−2} 1 =
 = |S|·(|S|−1) − ½·(|S|−1)·(|S|−2) − (|S|−1) = |S|·(|S|−1) − ½·(|S|−1)·|S| = m*(S).

The facts (a) and (b) are easy to see; the fact (c) follows from (a), (b) and the above identity.

4.2.2 Semi-elementary and combinatorial imsets

Given a disjoint triplet ⟨A,B|C⟩, the corresponding semi-elementary imset u⟨A,B|C⟩ is defined by the formula

u⟨A,B|C⟩ = δ_{ABC} + δ_C − δ_{AC} − δ_{BC}.

Evidently, the zero imset is semi-elementary, as u⟨A,B|C⟩ = 0 for any trivial triplet ⟨A,B|C⟩ ∈ Tø(N). Every elementary imset is semi-elementary as well. An example of a non-zero semi-elementary imset which is not elementary is the imset u⟨a,bc|∅⟩ shown in the left-hand picture of Figure 4.2. Accepting the convention that the zero imset is a combination of the empty set of imsets, we can observe the following fact.

Proposition 4.2. Every semi-elementary imset is a combination of elementary imsets with non-negative integral coefficients.

Proof. A non-zero semi-elementary imset necessarily has the form u⟨A,B|C⟩ where ⟨A,B|C⟩ ∈ T(N) ∖ Tø(N). The formulas u⟨A,BD|C⟩ = u⟨A,B|DC⟩ + u⟨A,D|C⟩ and u⟨AD,B|C⟩ = u⟨A,B|DC⟩ + u⟨D,B|C⟩ can be applied repeatedly.

Every imset u over N which is a combination of elementary imsets with non-negative integral coefficients, that is,

u = ∑_{v∈E(N)} k_v · v   where k_v ∈ ℤ⁺,   (4.1)

will be called a combinatorial imset over N.
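The detector property of Proposition 4.1 is easy to check by machine. The following Python sketch (my illustration; the dictionary representation of imsets is my own convention, not the book's) constructs all elementary imsets over N = {a, b, c, d} and verifies the facts (a)–(c).

from itertools import combinations

N = frozenset('abcd')

def delta(S):
    return {frozenset(S): 1}

def add(u, v, coef=1):
    w = dict(u)
    for S, k in v.items():
        w[S] = w.get(S, 0) + coef * k
    return w

def elementary(a, b, K):
    """u_<a,b|K> = d_{abK} + d_K - d_{aK} - d_{bK}"""
    K = frozenset(K)
    u = add(delta(K | {a, b}), delta(K))
    return add(add(u, delta(K | {a}), -1), delta(K | {b}), -1)

def scalar(m, u):          # <m, u> = sum_S m(S) u(S)
    return sum(m(S) * k for S, k in u.items())

m_l = lambda l: (lambda S: max(len(S) - l - 1, 0))
m_star = lambda S: len(S) * (len(S) - 1) // 2

E = [(a, b, frozenset(K)) for a, b in combinations(sorted(N), 2)
     for r in range(len(N) - 1)
     for K in combinations(sorted(N - {a, b}), r)]
assert len(E) == 6 * 2 ** 2          # |E(N)| = C(4,2) * 2^{4-2} = 24

for a, b, K in E:
    u = elementary(a, b, K)
    assert scalar(m_star, u) == 1                              # fact (c)
    for l in range(len(N) - 1):
        assert scalar(m_l(l), u) == (1 if len(K) == l else 0)  # facts (a), (b)
print("Proposition 4.1 verified on", len(E), "elementary imsets")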
[Fig. 4.2. Two combinatorial imsets over N = {a, b, c}, shown as Hasse diagrams of P(N) with the values of the imset at the subsets: the semi-elementary imset u⟨a,bc|∅⟩ (left) and the imset u⟨a,b|c⟩ + 2·u⟨a,c|∅⟩ (right).]
The class of combinatorial imsets over N will be denoted by C(N). By Proposition 4.2, every semi-elementary imset is combinatorial. The converse is not true: the imset u⟨a,b|c⟩ + 2·u⟨a,c|∅⟩ in the right-hand picture of Figure 4.2 is not semi-elementary. Clearly, a combination of combinatorial imsets with coefficients from ℤ⁺ is itself a combinatorial imset. In particular, the class of combinatorial imsets could be equivalently introduced as the class of combinations of semi-elementary imsets with non-negative integral coefficients. Of course, a particular combinatorial imset can sometimes be expressed in several different ways. For example, the imset u from the left-hand picture of Figure 4.2 can be written either as u⟨a,b|c⟩ + u⟨a,c|∅⟩ or as u⟨a,c|b⟩ + u⟨a,b|∅⟩. On the other hand, there are characteristics which do not depend on a particular way of combination. Supposing that (4.1) is true, we can introduce the degree of a combinatorial imset u, denoted by deg(u), as follows:

deg(u) = ∑_{v∈E(N)} k_v.

Similarly, for |N| ≥ 2 we introduce the level-degree of u for each l = 0, …, |N|−2, denoted by deg(u, l), as the number

deg(u, l) = ∑_{v∈E_l(N)} k_v.

The following observation implies that these numbers do not depend on the choice of coefficients k_v for v ∈ E(N).

Proposition 4.3. For u ∈ C(N) and l ∈ {0, …, |N|−2} with |N| ≥ 2, the equalities

deg(u, l) = ⟨m_l, u⟩,   deg(u) = ⟨m*, u⟩
are true, where the multisets m_l, m* are those introduced in Proposition 4.1 on p. 70.

Proof. Substitute (4.1) into ⟨m_l, u⟩ and ⟨m*, u⟩ and use Proposition 4.1.

The formulas in Proposition 4.3 give an interpretation to the multisets m* and m_l, 0 ≤ l ≤ |N|−2, from Proposition 4.1. The multiset m* can serve as a degree detector for a combinatorial imset and m_l as its level-degree detector. An easy consequence of Proposition 4.3 is the following observation.

Corollary 4.2. Given u ∈ C(N), one has u = 0 ⇔ deg(u) = 0, and u ∈ E(N) ⇔ deg(u) = 1.

4.2.3 Structural imsets

An imset u over N will be called structural if there exists n ∈ ℕ such that the multiple n·u is a combinatorial imset, that is,

n·u = ∑_{v∈E(N)} k_v · v   for some n ∈ ℕ, k_v ∈ ℤ⁺.   (4.2)
The formulas in Proposition 4.3 give interpretation to multisets m∗ and ml , 0 ≤ l ≤ |N | − 2 from Proposition 4.1. The multiset m∗ can serve as a degree detector for a combinatorial imset and ml as its level-degree detector. An easy consequence of Proposition 4.3 is the following observation. Corollary 4.2. Given u ∈ C(N ), one has u = 0 ⇔ deg(u) = 0, and u ∈ E(N ) ⇔ deg(u) = 1. 4.2.3 Structural imsets An imset u over N will be called structural if there exists n ∈ N such that the multiple n · u is a combinatorial imset, that is, kv · v for some n ∈ N, kv ∈ Z+ . (4.2) n·u= v∈E(N )
In other words, an imset is structural if it is a combination of elementary imsets, alternatively a combination of semi-elementary imsets, with nonnegative rational coefficients. The class of structural imsets over N will be denoted by S(N ). By this definition, every combinatorial imset is structural. In the case |N | ≤ 4, the converse is true [131]. However, the question of whether this is true in general remains open (see Question 7 on p. 207). Proposition 4.4. Every structural imset u over N is o-standardized. The inequalities mA↑ , u ≥ 0 and mA↓ , u ≥ 0 hold for every A ⊆ N (see p. 39). The only imset w ∈ S(N ) with −w ∈ S(N ) is the zero imset w = 0. Proof. All three properties of u hold for the zero imset and elementary imsets; they can be extended to combinatorial imsets and then to structural imsets. Given w ∈ ZP(N ) with mA↓ , w = 0 for every A ⊆ N the condition w(S) = 0 for S ⊆ N can be verified by induction on |S|. Given a structural imset u, let us introduce the lower class of u, denoted by Lu , as the descending class induced by the negative domain of u, that is, ↓
Lu = { T ⊆ N ; ∃ S ⊆ N such that T ⊆ S and u(S) < 0 } ≡ (Du− ) . Similarly, one can introduce the upper class of u, denoted by Uu , as the descending class induced by the positive domain of u ↓
Uu = { T ⊆ N ; ∃ S ⊆ N such that T ⊆ S and u(S) > 0 } ≡ (Du+ ) . This terminology is motivated by the following fact and later results (Corollary 4.4 on p. 82).
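The lower and upper classes are easy to compute. A small Python sketch (mine, not from the book; imsets are again dictionaries from frozensets to integers):

from itertools import combinations

def descending_class(sets):
    """The descending class induced by `sets`: all subsets of its members."""
    out = set()
    for S in sets:
        out |= {frozenset(T) for r in range(len(S) + 1)
                for T in combinations(sorted(S), r)}
    return out

def lower_class(u):
    return descending_class(S for S, k in u.items() if k < 0)

def upper_class(u):
    return descending_class(S for S, k in u.items() if k > 0)

# Example: the semi-elementary imset u_<a,bc|empty> (left picture of Fig. 4.2)
u = {frozenset('abc'): 1, frozenset(): 1,
     frozenset('a'): -1, frozenset('bc'): -1}
print(sorted(map(sorted, lower_class(u))))   # subsets of {a} and of {b,c}
print(lower_class(u) <= upper_class(u))      # True, cf. Proposition 4.5 below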
Proposition 4.5. Whenever u is a structural imset, one has L_u ⊆ U_u. Moreover,

∪_{S∈L_u} S = ∪_{S∈U_u} S.   (4.3)

Proof. Supposing T ∈ L_u, find T ⊆ S ⊆ N with u(S) < 0. The fact ⟨m_{S↑}, u⟩ ≥ 0 from Proposition 4.4 implies the existence of S ⊆ K ⊆ N with u(K) > 0. The fact that u is o-standardized says ⟨m_{{i}↑}, u⟩ = 0 for every i ∈ N, which then implies (4.3).

Given a structural imset u over N, by the range of u, denoted by R_u, will be understood the set union from (4.3). The following lemma is a basis of a later result.

Lemma 4.3. Supposing u is a non-zero combinatorial imset over N, let us consider a fixed particular combination

u = ∑_{v∈E(N)} k_v · v   where k_v ∈ ℤ⁺, u ≠ 0.
Then there exists v ∈ E(N) such that k_v > 0 and L_v ⊆ L_u.

Proof. Since u ≠ 0, necessarily L_u ∪ U_u ≠ ∅. Because u is structural, by Proposition 4.5 L_u ⊆ U_u, and therefore U_u ≠ ∅. Take a maximal M ∈ U_u and, again using L_u ⊆ U_u, observe that u(M) > 0 and u(L) = 0 for every L ⊃ M. Introduce

S_u = P(N) ∖ L_u = { T ⊆ N ; u(S) ≥ 0 for every S such that T ⊆ S ⊆ N }.

Clearly, S_u is an ascending class and M ∈ S_u; let us consider the multiset s = ∑_{S∈S_u} δ_S. It follows from the definition of S_u that ⟨s, u⟩ ≥ u(M) > 0. Thus, one can write

0 < ⟨s, u⟩ = ∑_{v∈E(N)} k_v · ⟨s, v⟩ ≤ ∑_{v∈E(N), ⟨s,v⟩>0} k_v · ⟨s, v⟩,

which implies the existence of v ∈ E(N) with k_v > 0 and ⟨s, v⟩ > 0. Well, since S_u is ascending, an elementary imset v = u⟨a,b|K⟩ satisfies ⟨s, v⟩ > 0 iff {a,b}∪K ∈ S_u and {a}∪K, {b}∪K ∉ S_u. However, this implies L_v ∩ S_u = ∅, which means L_v ⊆ L_u.
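Whether a given imset is structural is a rational cone-membership question, so it can be tested by linear programming. The following Python sketch (my illustration, not an algorithm from the book; the usual floating-point tolerances of the LP solver apply) checks membership in the conical closure of E(N).

import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def elementary_imsets(N):
    N = sorted(N)
    for a, b in combinations(N, 2):
        rest = [i for i in N if i not in (a, b)]
        for r in range(len(rest) + 1):
            for K in combinations(rest, r):
                K = frozenset(K)
                u = {K | {a, b}: 1, K: 1}
                u[K | {a}] = u.get(K | {a}, 0) - 1
                u[K | {b}] = u.get(K | {b}, 0) - 1
                yield u

def is_structural(u, N):
    subsets = [frozenset(S) for r in range(len(N) + 1)
               for S in combinations(sorted(N), r)]
    cols = [[v.get(S, 0) for S in subsets] for v in elementary_imsets(N)]
    A = np.array(cols, dtype=float).T
    b = np.array([u.get(S, 0) for S in subsets], dtype=float)
    # feasibility: does k >= 0 with A k = b exist? minimize 0 subject to that
    res = linprog(c=np.zeros(A.shape[1]), A_eq=A, b_eq=b, bounds=(0, None))
    return res.success

N = frozenset('abc')
u = {frozenset('abc'): 1, frozenset(): 1, frozenset('a'): -1, frozenset('bc'): -1}
print(is_structural(u, N))   # True: u = u_<a,b|c> + u_<a,c|empty>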
4.3 Product formula induced by a structural imset

This formula provides a direct way of associating a structural imset with a probability measure. It can be viewed as a generalized concept of factorization
into marginal densities. To give a sensible definition I need the following auxiliary concept, whose sense will become clear later (in Section 4.5). Suppose that P is a probability measure on (X_N, X_N) which has finite multiinformation. By a reference system of measures for P we will understand any collection {µ_i ; i ∈ N} of σ-finite measures on (X_i, X_i), i ∈ N, such that

P^{{i}} ≪ µ_i   and   −∞ < H(P^{{i}} | µ_i) < +∞   for every i ∈ N.

Having fixed a reference system {µ_i ; i ∈ N} one can put µ = ∏_{i∈N} µ_i and observe P ≪ µ, that is, µ is a dominating measure for P. Thus, one can repeat what is done in Convention 2 (p. 20), that is, choose a marginal density f_S (= a version of dP^S/dµ_S) for every S ⊆ N. Given a structural imset u over N, we say that P satisfies the product formula induced by u if

∏_{S⊆N} f_S(x_S)^{u⁺(S)} = ∏_{S⊆N} f_S(x_S)^{u⁻(S)}   for µ-a.e. x ∈ X_N.   (4.4)
Of course, the validity of this formula does not depend on the choice (of versions) of marginal densities. The choice of a reference system of measures only seemingly has an effect on the validity of the formula (see Section 4.5 for the proof). On the other hand, the flexibility in its choice is advantageous since miscellaneous special cases can be described in more detail.

4.3.1 Examples of reference systems of measures

Let me illustrate this concept with four basic examples. The first one shows that one can always find a reference system for a probability measure with finite multiinformation. The other examples correspond to important special cases mentioned already in Section 4.1.

Universal reference system

Given a probability measure P over N with H(P | ∏_{i∈N} P^{{i}}) < ∞, one can simply put µ_i = P^{{i}} for every i ∈ N. It is evidently a reference system of measures since H(P^{{i}} | µ_i) = 0 for every i ∈ N. Let us call it the universal reference system because it can be established for any measure with finite multiinformation.

Discrete case

Supposing P is a discrete measure on (X_N, X_N) with 1 ≤ |X_i| < ∞, i ∈ N, one can consider the counting measure υ on X_i in place of µ_i for every i ∈ N. This is evidently a reference system for P leading to the following system of marginal densities:

f_S(x_S) = P^S({x_S})   for every S ⊆ N, x ∈ X_N.
Remark 4.3. An alternative choice of a reference system in this case is possible. One can take the uniformly distributed probability measure µ̂_i = υ/|X_i| on X_i for every i ∈ N. This leads to the alternative marginal densities

f̂_S(x_S) = |X_S| · P^S({x_S})   for every S ⊆ N, x ∈ X_N,

with the convention |X_∅| = 1.
Gaussian case

If P = N(e, Σ) with Σ = (σ_{ij})_{i,j∈N} is a regular Gaussian measure over N, we can consider the Lebesgue measure λ on ℝ in place of µ_i for every i ∈ N. It is a reference system for P because

H(P^{{i}} | λ) = −½ − ½·ln(2πσ_{ii})   for every i ∈ N

by (A.12) in Section A.8.3. Owing to the fact that a marginal of a Gaussian measure is again Gaussian, and by (2.19), one can choose the marginal densities f_S for ∅ ≠ S ⊆ N in the form

f_S(y) = (2π)^{−|S|/2} · det(Σ_{S·S})^{−1/2} · exp{ −½·(y−e_S)ᵀ·(Σ_{S·S})⁻¹·(y−e_S) }   for y ∈ ℝ^S.
CG-measures

Let P be a CG-measure over N partitioned into the set ∆ of discrete variables and the set Γ of continuous variables. By a standard reference system for P we will understand the system {µ_i ; i ∈ N} where µ_i = υ is the counting measure on the finite set X_i for i ∈ ∆ and µ_i = λ is the Lebesgue measure on X_i = ℝ for i ∈ Γ. By Lemma 4.2 it is indeed a reference system of measures for P. In a purely discrete case or in a purely Gaussian case it coincides with one of the two above-mentioned reference systems, which I recalled explicitly to emphasize the importance of these two classic cases. One can choose the following versions of marginal densities f_S, ∅ ≠ S ⊆ N:

f_S(x, y) = ∑_{z ∈ X_{∆∖S}, P_∆(x,z)>0} P_∆(x, z) · f_{e(x,z)_{S∩Γ}, Σ(x,z)_{S∩Γ·S∩Γ}}(y)

for x ∈ X_{S∩∆}, y ∈ X_{S∩Γ} – a detailed formula for f_{e(·),Σ(·)}(·) is in (2.19).

4.3.2 Topological assumptions

The reader can object that the product formula (4.4) is not elegant enough since it is dimmed by the non-uniqueness of marginal densities and the equality is understood in the “almost everywhere” sense. However, under certain
topological assumptions – usually valid in practice – and an additional natural convention it turns into a true equality “everywhere”. A reference system of measures {µ_i ; i ∈ N} for a probability measure P on (X_N, X_N) with finite multiinformation will be called continuous if the following three conditions are fulfilled:

(a) X_i is a separable metric space and X_i is the class of Borel sets in X_i for every i ∈ N;
(b) every open ball in X_i has positive measure µ_i for every i ∈ N, that is,

∀ i ∈ N  ∀ x ∈ X_i  ∀ ε > 0   µ_i(U(x, ε)) > 0;

(c) for every ∅ ≠ S ⊆ N there exists a version f_S of dP^S/dµ_S (where µ_S = ∏_{i∈S} µ_i) which is continuous with respect to the product topology on X_S = ∏_{i∈S} X_i.

The following observation is easy to prove (see the Appendix, Sections A.4, A.6 and A.8.3 for relevant facts).

Proposition 4.6. The standard reference system for a CG-measure over N is continuous.

In the case of a continuous reference system, Convention 2 (see p. 20) can be explicated as follows.

Convention 3. Suppose that P is a probability measure on (X_N, X_N) with finite multiinformation and {µ_i ; i ∈ N} is a continuous reference system for P. Then (a) implies that X_S is a separable metric space and X_S is the Borel σ-algebra on X_S for every ∅ ≠ S ⊆ N. Put µ_S = ∏_{i∈S} µ_i, choose a version f_S of the Radon–Nikodym derivative dP^S/dµ_S which is continuous with respect to the respective topology on X_S and fix it. Note that this is possible owing to (c). Let us call it the continuous marginal density of P for S. Note that from (b) it follows that it is determined uniquely (one can apply the arguments used in the proof of Lemma 4.4 below). Other notational arrangements from Convention 2 remain valid. In particular, every f_S can be viewed as a continuous function on the joint sample space X_N endowed with the product topology. ♦

Lemma 4.4. Let P be a probability measure over N with finite multiinformation and {µ_i ; i ∈ N} a continuous reference system of measures for P. Let us accept Convention 3. Then (4.4) is equivalent to the requirement

∏_{S⊆N} f_S(x_S)^{u⁺(S)} = ∏_{S⊆N} f_S(x_S)^{u⁻(S)}   for every x ∈ X_N.   (4.5)
Proof. By (a), assume that (X_i, ρ_i) is a separable metric space for every i ∈ N. Observe that X_N endowed with the distance

ρ(x, y) = max_{i∈N} ρ_i(x_i, y_i)   for x, y ∈ X_N

is a separable metric space inducing the product topology which generates X_S (see e.g. Štěpán [126], Theorem I.2.3). This definition implies that open balls in X_N are Cartesian products of open balls in X_i and therefore one derives from (b)

∀ x ∈ X_N  ∀ ε > 0   µ_N(U_ρ(x, ε)) = µ_N( ∏_{i∈N} U_{ρ_i}(x_i, ε) ) > 0.

Now, both the left-hand side and the right-hand side of (4.4) are continuous functions on X_N by (c) (see Convention 3) and (4.4) says that their difference g (which is also a continuous function on X_N) vanishes µ_N-a.e. Hence,

∫_{X_N} |g(y)| dµ_N(y) = 0.

Suppose for contradiction that g(x) ≠ 0 for some x ∈ X_N. Then there exists ε > 0 such that for every y ∈ U_ρ(x, ε) one has |g(y)| ≥ |g(x)|/2 and, therefore,

∫_{X_N} |g(y)| dµ_N(y) ≥ ∫_{U_ρ(x,ε)} |g(y)| dµ_N(y) ≥ (|g(x)|/2) · µ_N(U_ρ(x, ε)) > 0,

which contradicts the fact above. Therefore g(x) = 0 for every x ∈ X_N.
Thus, by Proposition 4.6 one can interpret the product formula induced by a structural imset as a real identity of uniquely determined marginal densities in three basic cases used in practice: for discrete measures, for regular Gaussian measures and for CG-measures. Of course, this need not hold for arbitrary measures with finite multiinformation and the respective universal reference system of measures.
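For the discrete case, the product formula can be checked directly. The following Python sketch (my illustration; the toy distribution is randomly generated, not from the book) verifies (4.4) for the elementary imset u⟨a,b|{c}⟩, which here reads f_{abc}·f_c = f_{ac}·f_{bc} pointwise, using a measure constructed so that a ⊥⊥ b | c holds.

import itertools, numpy as np

rng = np.random.default_rng(0)
# build P with a _||_ b | c by mixing kernels: P(a,b,c) = P(c) P(a|c) P(b|c)
pc = rng.dirichlet(np.ones(2))
pa_c = rng.dirichlet(np.ones(2), size=2)   # P(a|c)
pb_c = rng.dirichlet(np.ones(2), size=2)   # P(b|c)
P = {(a, b, c): pc[c] * pa_c[c, a] * pb_c[c, b]
     for a, b, c in itertools.product(range(2), repeat=3)}

def marg(P, keep):   # marginal density w.r.t. counting measure = probability
    out = {}
    for (a, b, c), p in P.items():
        key = tuple(v for v, k in zip((a, b, c), 'abc') if k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

f_ac, f_bc, f_c = marg(P, 'ac'), marg(P, 'bc'), marg(P, 'c')
for a, b, c in itertools.product(range(2), repeat=3):
    lhs = P[(a, b, c)] * f_c[(c,)]
    rhs = f_ac[(a, c)] * f_bc[(b, c)]
    assert abs(lhs - rhs) < 1e-12
print("product formula (4.4) holds for u_<a,b|{c}>")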
4.4 Markov condition

There is an analogy between the Markov condition used in graphical models and the second basic way of associating a structural imset with a probability measure. More specifically, one requires that some conditional independence statements, determined by an imset by means of a certain criterion, are valid conditional independence statements with respect to the measure.

4.4.1 Semi-graphoid induced by a structural imset

One says that a disjoint triplet ⟨A,B|C⟩ ∈ T(N) is represented in a structural imset u over N and writes A ⊥⊥ B | C [u] if there exists k ∈ ℕ such that k·u − u⟨A,B|C⟩ is a structural imset over N as well. An equivalent requirement is that there exists l ∈ ℕ such that l·u − u⟨A,B|C⟩ is a combinatorial imset over N.
The class of represented triplets then defines the (conditional independence) model induced by u:

M_u = { ⟨A,B|C⟩ ∈ T(N) ; A ⊥⊥ B | C [u] }.

A trivial example is the model induced by the zero imset.

Proposition 4.7. M_u = Tø(N) for u = 0.

Proof. The inclusion Tø(N) ⊆ M_u is evident. To show M_u ⊆ Tø(N), suppose for contradiction ⟨A,B|C⟩ ∈ M_u ∖ Tø(N), which means that (both u⟨A,B|C⟩ and) −u⟨A,B|C⟩ is a structural imset. This contradicts Proposition 4.4.

Another example is the model induced by an elementary imset.

Lemma 4.5. Supposing v = u⟨a,b|K⟩ ∈ E(N), one has

M_v = { ⟨a,b|K⟩, ⟨b,a|K⟩ } ∪ Tø(N).

Proof. The facts ⟨a,b|K⟩, ⟨b,a|K⟩ ∈ M_v and Tø(N) ⊆ M_v are evident. Suppose that ⟨A,B|C⟩ ∈ M_v ∖ Tø(N) and k·v = u⟨A,B|C⟩ + w for k ∈ ℕ and a structural imset w. To prove ABC ⊆ abK use Proposition 4.4 to derive

k·⟨m_{ABC↑}, v⟩ = ⟨m_{ABC↑}, k·v⟩ = ⟨m_{ABC↑}, u⟨A,B|C⟩⟩ + ⟨m_{ABC↑}, w⟩ > 0.   (4.6)

The fact ⟨m_{ABC↑}, u⟨a,b|K⟩⟩ > 0 then implies ABC ⊆ abK. Analogously, to evidence C ⊇ K use also Proposition 4.4, with m_{C↓} in (4.6) instead of m_{ABC↑}. The fact that ⟨A,B|C⟩ is a disjoint triplet and K ⊆ C ⊂ ABC ⊆ abK then implies that ⟨A,B|C⟩ coincides either with ⟨a,b|K⟩ or with ⟨b,a|K⟩.

A basic fact is this:

Lemma 4.6. Every structural imset over N induces a disjoint semi-graphoid over N.

Proof. The semi-graphoid properties (see Section 2.2.2) easily follow from the definition above and the fact that the sum of structural imsets is a structural imset. Indeed, realize that u⟨A,∅|C⟩ = 0 for the triviality property, u⟨A,B|C⟩ = u⟨B,A|C⟩ for the symmetry property, and u⟨A,BD|C⟩ = u⟨A,B|DC⟩ + u⟨A,D|C⟩ for the other properties.

For the proof of the equivalence result in Section 4.5 a technical lemma is needed. Within its proof the following simple observation concerning upper classes (see p. 73) is used.

Proposition 4.8. Supposing u = w + v where w, v ∈ S(N), one has U_u = U_w ∪ U_v.
Proof. The inclusion U_u ⊆ U_w ∪ U_v is trivial. To show U_w ⊆ U_u, take S ∈ U_w^{max}. By Proposition 4.5, w(S) > 0 and w(T) = 0 whenever S ⊂ T ⊆ N. Hence, by Proposition 4.4,

0 < ⟨m_{S↑}, w⟩ + ⟨m_{S↑}, v⟩ = ⟨m_{S↑}, u⟩ = ∑_{T: S⊆T⊆N} u(T),

which implies that S ∈ U_u. The inclusion U_v ⊆ U_u is analogous.
Lemma 4.7. Suppose that u is a structural imset over N. Then there exists a sequence U_u = D_0, …, D_r = L_u, r ≥ 0, of descending classes of subsets of N and a sequence ⟨a_1,b_1|K_1⟩, …, ⟨a_r,b_r|K_r⟩ of elementary triplets over N (which is empty in the case r = 0) such that for every i = 1, …, r:

(a) a_i ⊥⊥ b_i | K_i [u],
(b) a_iK_i, b_iK_i ∈ D_i and D_{i−1} = D_i ∪ {S ; S ⊆ a_ib_iK_i}.

Proof. Observe that for every structural imset u and every n ∈ ℕ one has U_{n·u} = U_u, L_{n·u} = L_u and A ⊥⊥ B | C [n·u] iff A ⊥⊥ B | C [u] for each ⟨A,B|C⟩ ∈ T(N). Therefore, it suffices to assume that u is a combinatorial imset and prove the statement by induction on deg(u). In the case deg(u) = 0, necessarily u = 0 by Corollary 4.2 and one can put r = 0 and D_0 = U_u = L_u = ∅. In the case deg(u) ≥ 1 one has u ≠ 0 and can apply Lemma 4.3 to find v = u⟨a,b|K⟩ ∈ E(N) with L_v ⊆ L_u such that w = u − v is a combinatorial imset. Of course, {aK, bK} ⊆ L_u, a ⊥⊥ b | K [u], and one can observe that L_w ⊆ L_u ∪ {S ; S ⊆ abK}. Moreover, by Propositions 4.3 and 4.1,

deg(w) = ⟨m*, w⟩ = ⟨m*, u⟩ − ⟨m*, v⟩ = deg(u) − 1.

In particular, one can apply the induction hypothesis to w and conclude that there exists a sequence U_w = F_0, …, F_{r−1} = L_w, r − 1 ≥ 0, of descending classes and a sequence ⟨a_i,b_i|K_i⟩, i = 1, …, r−1, of elementary triplets with a_i ⊥⊥ b_i | K_i [w] and

a_iK_i, b_iK_i ∈ F_i,   F_{i−1} = F_i ∪ {S ; S ⊆ a_ib_iK_i}.

Let us put D_i = F_i ∪ U_v ∪ L_u for i = 0, …, r−1 and, for i = r, define D_r = L_u and ⟨a_r,b_r|K_r⟩ = ⟨a,b|K⟩. By Propositions 4.8 and 4.5,

D_0 = F_0 ∪ U_v ∪ L_u = (U_w ∪ U_v) ∪ L_u = U_u.

It is no problem to evidence that D_0, …, D_r satisfies the required conditions. Indeed, a_i ⊥⊥ b_i | K_i [w] implies a_i ⊥⊥ b_i | K_i [u] for i ≤ r − 1, and since L_w ⊆ L_u ∪ {S ; S ⊆ abK}, by u = w + v one has

D_{r−1} = L_w ∪ U_v ∪ L_u = L_u ∪ {S ; S ⊆ abK} = D_r ∪ {S ; S ⊆ a_rb_rK_r}.

The significance of the preceding lemma (summarized in the consequence below) is that one can always “reach” the upper class of a structural imset from its lower class with the aid of its induced conditional independence statements. Note that the “reverse order” in the formulation of Lemma 4.7 (going from the upper class to the lower class) is used because it is more suitable from the point of view of the proof(s).
Corollary 4.3. Let u be a structural imset over N. Then every descending class of sets E ⊆ P(N) containing L_u and satisfying

∀ ⟨A,B|C⟩ ∈ T(N)   if A ⊥⊥ B | C [u] and AC, BC ∈ E, then ABC ∈ E,   (4.7)

necessarily contains U_u.

Proof. Apply Lemma 4.7 and prove D_i ⊆ E by reverse induction on i = r, …, 0.

4.4.2 Markovian measures

Suppose that u is a structural imset over N and P is a probability measure over N. One says that P is Markovian with respect to u if

A ⊥⊥ B | C [u]   implies   A ⊥⊥ B | C [P]   whenever ⟨A,B|C⟩ ∈ T(N).
Thus, the statistical meaning of an “imsetal” model is completely analogous to the statistical meaning of a graphical model. Every structural imset u over N represents a class of probability measures over N within the respective distribution framework, namely the class of measures which are Markovian with respect to u. In fact, “imsetal” models generalize graphical models: given a classic graph there exists a structural imset having the same class of Markovian distributions (for DAG models see Lemma 7.1; in general, one can combine the respective faithfulness result with the later Theorem 5.2 to show that for every graph G over N a structural imset u over N exists such that M_G = M_u). We say that P is perfectly Markovian with respect to a structural imset u over N if u exactly induces the conditional independence model induced by P, that is, for every ⟨A,B|C⟩ ∈ T(N) one has

A ⊥⊥ B | C [u]   if and only if   A ⊥⊥ B | C [P].

One of the results of this monograph (Theorem 5.2) is that every probability measure with finite multiinformation is perfectly Markovian with respect to a certain structural imset. On the other hand, there are “superfluous” structural imsets whose induced semi-graphoid is not a model induced by any probability measure with finite multiinformation.

Example 4.1. There exists a structural imset u over N = {a, b, c, d} such that no marginally continuous measure over N is perfectly Markovian with respect to u. Put

u = u⟨c,d|{a,b}⟩ + u⟨a,b|∅⟩ + u⟨a,b|{c}⟩ + u⟨a,b|{d}⟩.

Evidently c ⊥⊥ d | {a,b} [u], a ⊥⊥ b | ∅ [u], a ⊥⊥ b | {c} [u] and a ⊥⊥ b | {d} [u]. To show that a ⊥⊥ b | {c,d} [u] does not hold, consider the multiset m† in Figure 4.3 and observe that ⟨m†, v⟩ ≥ 0 for every v ∈ E(N). Hence, ⟨m†, w⟩ ≥ 0 for every structural imset w over N. Because ⟨m†, k·u − u⟨a,b|{c,d}⟩⟩ = −1 for every
[Fig. 4.3. The multiset m† from Example 4.1, displayed as a Hasse diagram of P(N): m†({a,b,c,d}) = +4, m†(S) = +2 for every three-element set S, m†({a,b}) = 0, m†(S) = +1 for the remaining two-element sets, and m†(S) = 0 for |S| ≤ 1.]
k ∈ ℕ, the imset k·u − u⟨a,b|{c,d}⟩ is not structural. However, by Corollary 2.1 there is no marginally continuous probability measure over N which is perfectly Markovian with respect to u. ♦

Another important consequence of Lemma 4.7 is that the marginals of a measure Markovian with respect to a structural imset u for sets in L_u uniquely determine its marginals for sets in U_u. This motivated the “lower and upper class” terminology introduced in Section 4.2.3. Note that one often has U_u = P(N), in which case a Markovian measure is determined by its marginals on the lower class.

Corollary 4.4. Suppose that both P and Q are probability measures on (X_N, X_N) which are Markovian with respect to a structural imset u. Then

[ P^S = Q^S for every S ∈ L_u ] ⇒ [ P^S = Q^S for every S ∈ U_u ].

Proof. We can repeat the arguments used in Step I of the proof of Lemma 2.6 (p. 25) to verify the following “uniqueness principle” for every ⟨A,B|C⟩ ∈ T(N):

A ⊥⊥ B | C [P],  A ⊥⊥ B | C [Q],  P^{AC} = Q^{AC},  P^{BC} = Q^{BC}   ⟹   P^{ABC} = Q^{ABC}.

Then, owing to the fact that S ⊆ T and P^T = Q^T imply P^S = Q^S, one can apply Lemma 4.7 and show by reverse induction on i = r, …, 0 that one has P^S = Q^S for every S ∈ D_i.
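The computations behind Example 4.1 can be replayed mechanically. In the Python sketch below (mine, not the book's; imsets and multisets are encoded as dictionaries) m stands for the multiset m† of Figure 4.3; its three key properties are checked directly.

from itertools import combinations

def powerset(N):
    N = sorted(N)
    return [frozenset(S) for r in range(len(N) + 1) for S in combinations(N, r)]

def semi_elem(A, B, C):   # u_<A,B|C> = d_{ABC} + d_C - d_{AC} - d_{BC}
    A, B, C = map(frozenset, (A, B, C))
    u = {}
    for S, k in ((A | B | C, 1), (C, 1), (A | C, -1), (B | C, -1)):
        u[S] = u.get(S, 0) + k
    return u

def add(u, v):
    w = dict(u)
    for S, k in v.items():
        w[S] = w.get(S, 0) + k
    return w

def scalar(m, u):
    return sum(m[S] * k for S, k in u.items())

N = 'abcd'
m = {S: {4: 4, 3: 2}.get(len(S), 0) for S in powerset(N)}   # m-dagger
m[frozenset('ab')] = 0
for p in ('ac', 'ad', 'bc', 'bd', 'cd'):
    m[frozenset(p)] = 1

u = semi_elem('c', 'd', 'ab')
for t in (('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'd')):
    u = add(u, semi_elem(*t))

elem = [semi_elem(a, b, K) for a, b in combinations(N, 2)
        for r in range(3) for K in combinations(sorted(set(N) - {a, b}), r)]
assert all(scalar(m, v) >= 0 for v in elem)   # <m-dagger, v> >= 0 on E(N)
assert scalar(m, u) == 0
assert scalar(m, semi_elem('a', 'b', 'cd')) == 1
print("Example 4.1 verified")   # so <m, k*u - u_<a,b|{c,d}>> = -1 for all k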
4.5 Equivalence result

The third way of associating a structural imset with a probability measure is an algebraic identity in which the measure is represented by its multiinformation function. One says that a probability measure P over N with finite multiinformation complies with a structural imset u over N if ⟨m_P, u⟩ = 0, where m_P denotes the multiinformation function defined in Section 2.3.4.

Remark 4.4. The above concept can be introduced alternatively in the following way. Suppose that P is a probability measure over N which has a dominating measure (see p. 19) µ = ∏_{i∈N} µ_i with −∞ < H(P^S | ∏_{i∈S} µ_i) < +∞ for each ∅ ≠ S ⊆ N. Note that P then has finite multiinformation by Lemma 2.7. Actually, using the idea from the proof of Lemma 2.7, one can show that the requirement that {µ_i ; i ∈ N} is a reference system of measures for a measure P with finite multiinformation (see p. 75) is equivalent to the assumption formulated here. Thus, one can introduce the entropy function of P relative to µ as follows:

h_{P,µ}(S) = −H(P^S | ∏_{i∈S} µ_i)   for ∅ ≠ S ⊆ N,

and h_{P,µ}(∅) = 0 by convention. Then P complies with a structural imset u over N iff ⟨h_{P,µ}, u⟩ = 0. Indeed, (2.18) together with the fact that u is o-standardized implies

∑_{S⊆N} m_P(S)·u(S) = −∑_{S⊆N} h_{P,µ}(S)·u(S) + ∑_{j∈N} h_{P,µ}({j}) · ∑_{S⊆N, j∈S} u(S),

where each inner sum ∑_{S⊆N, j∈S} u(S) vanishes; that is, ⟨m_P, u⟩ = −⟨h_{P,µ}, u⟩. Note that in the case of a discrete probability measure over N one can always take the counting measure υ in place of µ. The corresponding entropy function is then non-negative and has very pleasant properties which make it possible to characterize functional dependence statements (p. 12) with respect to P [81] (in addition to pure conditional independence statements). More specifically, h_{P,υ}(A) ≤ h_{P,υ}(AC) for A, C ⊆ N, while equality occurs iff A ⊥⊥ A | C [P] (cf. Remark 2.3). However, this pleasant phenomenon seems to be more or less limited to the discrete case. It is not clear which dominating measures give rise to entropy functions with this type of behavior towards functional dependence in general (except for the case that all involved measures are concentrated on a countable set). For example, in the Gaussian case the entropy function relative to the Lebesgue measure need not be non-negative or even monotone. This is the main reason why I prefer the multiinformation function to the entropy function. The second reason is that the entropy function – unlike the multiinformation function – does depend on the choice of a suitable dominating measure.
The main result of this chapter is that all three ways of associating a structural imset with a probability measure are equivalent. In other words, a probability measure complies with a structural imset iff it is Markovian with respect to it, and this holds iff the product formula induced by the imset holds. In the proof below, the following simple observation is used.

Proposition 4.9. Suppose the situation from Convention 2 (p. 20). Then

∀ S ⊆ T ⊆ N   f_S(x_S) = 0 ⇒ f_T(x_T) = 0   for µ-a.e. x ∈ X_N.   (4.8)

Proof. By the arguments used in Remark 2.9 it suffices to prove (4.8) for special density versions f^{↓S} and f^{↓T}. This can be verified using the Fubini theorem.

Theorem 4.1. Let u be a structural imset over N and P a probability measure on (X_N, X_N) with finite multiinformation. Suppose that {µ_i ; i ∈ N} is a reference system of measures for P (p. 75); let us accept Convention 2 on p. 20. Then the following four conditions are equivalent:

(i) ∏_{S⊆N} f_S(x_S)^{u⁺(S)} = ∏_{S⊆N} f_S(x_S)^{u⁻(S)} for µ-a.e. x ∈ X_N,
(ii) ∏_{S⊆N} f_S(x_S)^{u(S)} = 1 for P-a.e. x ∈ X_N,
(iii) ⟨m_P, u⟩ = 0,
(iv) A ⊥⊥ B | C [u] implies A ⊥⊥ B | C [P] for every ⟨A,B|C⟩ ∈ T(N).

Proof. Implication (i) ⇒ (ii) is trivial since P ≪ µ and f_S(x_S) > 0 for P-a.e. x ∈ X_N and every S ⊆ N. To show (ii) ⇒ (iii), apply logarithm to the assumed equality first and get

∑_{S⊆N} u(S) · ln f_S(x_S) = ln ( ∏_{S⊆N} f_S(x_S)^{u(S)} ) = 0   for P-a.e. x ∈ X_N.

Then, by integrating with respect to P (in the notation of Convention 2),

∑_{S⊆N} u(S) · H(P^S | µ_S) = ∑_{S⊆N} u(S) · ∫_{X_N} ln f_S(x_S) dP(x) = 0.

As explained in preceding Remark 4.4, this is equivalent to ⟨m_P, u⟩ = 0. To see (iii) ⇒ (iv), consider a structural imset w = k·u − u⟨A,B|C⟩ with k ∈ ℕ and write

0 = ⟨m_P, k·u⟩ = ⟨m_P, u⟨A,B|C⟩⟩ + ⟨m_P, w⟩.

By the inequality (2.16) in Corollary 2.2, ⟨m_P, v⟩ ≥ 0 for every semi-elementary imset v. The linearity of the scalar product allows one to extend this to every structural imset. Thus, both terms on the right-hand side of the above
equality are non-negative and therefore they vanish. Thus A ⊥⊥ B | C [P] is implied by (2.17) in Corollary 2.2.

Supposing (iv), we already know that f_S(x_S) ≥ 0 for every x ∈ X_N, S ⊆ N. Thus, the condition (i) can be proved separately on the set

Y = { y ∈ X_N ; ∏_{S⊆N} f_S(y_S)^{u⁻(S)} = 0 }

and on the set

Z = { z ∈ X_N ; ∏_{S⊆N} f_S(z_S)^{u⁻(S)} > 0 }.

Because of L_u ⊆ U_u (Proposition 4.5) it follows from (4.8) in Proposition 4.9 that

∏_{S⊆N} f_S(y_S)^{u⁺(S)} = 0   for µ-a.e. y ∈ Y,

and both sides of the expression in (i) vanish µ-a.e. on Y. Suppose now z ∈ Z and put E_z = {S ⊆ N ; f_S(z_S) > 0}. Observe that L_u^{max} ⊆ {S ⊆ N ; u(S) < 0} ⊆ E_z for every z ∈ Z and that E_z is a descending class for µ-a.e. z ∈ Z by (4.8). Hence, L_u ⊆ E_z for µ-a.e. z ∈ Z. Having fixed ⟨A,B|C⟩ ∈ T(N), the assumption A ⊥⊥ B | C [u] implies by (iv) A ⊥⊥ B | C [P], and hence by Lemma 2.4 one derives that

f_{AC}(x_{AC}) · f_{BC}(x_{BC}) > 0 ⇒ f_{ABC}(x_{ABC}) > 0   for µ-a.e. x ∈ X_N.

In particular, for µ-a.e. z ∈ Z, the fact AC, BC ∈ E_z implies ABC ∈ E_z. Altogether, for µ-a.e. z ∈ Z the assumptions of Corollary 4.3 with E = E_z are fulfilled and therefore U_u ⊆ E_z, that is,

∀ S ⊆ N   u(S) > 0 ⇒ f_S(z_S) > 0   for µ-a.e. z ∈ Z.   (4.9)

Since u is a structural imset one has n·u = ∑_{v∈E(N)} k_v · v for n ∈ ℕ and k_v ∈ ℤ⁺ (see Section 4.2.3). For every v = u⟨a,b|K⟩ ∈ E(N) with k_v > 0 one has a ⊥⊥ b | K [u] and therefore by (iv) and (2.3) on p. 20 one derives

∏_{S⊆N} f_S(x_S)^{v⁺(S)} = ∏_{S⊆N} f_S(x_S)^{v⁻(S)}   for µ-a.e. x ∈ X_N.

These equalities can be multiplied by each other (each raised to the power k_v) so that one gets

∏_{S⊆N} f_S(z_S)^{∑_{v∈E(N)} k_v·v⁺(S)} = ∏_{S⊆N} f_S(z_S)^{∑_{v∈E(N)} k_v·v⁻(S)}   (4.10)

for µ-a.e. z ∈ Z. Let us introduce the multiset

w = ∑_{v∈E(N)} k_v · v⁺ − n·u⁺ = ∑_{v∈E(N)} k_v · v⁻ − n·u⁻.

For every S ⊆ N, the fact w(S) > 0 implies v(S) > 0 for some v ∈ E(N) with k_v > 0. Hence, S ∈ U_v ⊆ U_{n·u} = U_u by Proposition
4.8. By application of (4.9) to some T ⊇ S and (4.8) one derives f_S(z_S) > 0 for µ-a.e. z ∈ Z. This consideration implies

∏_{S⊆N} f_S(z_S)^{w(S)} > 0   for µ-a.e. z ∈ Z.

We can thus divide (4.10) by this non-zero expression for µ-a.e. z ∈ Z and conclude that

∏_{S⊆N} f_S(z_S)^{n·u⁺(S)} = ∏_{S⊆N} f_S(z_S)^{n·u⁻(S)}   for µ-a.e. z ∈ Z.

Take the n-th root of it and obtain what is desired.

Note that one can always take the universal reference system (p. 75) in Theorem 4.1, which implies that the conditions (iii) and (iv) – which do not depend on the choice of a reference system – are always equivalent (for a probability measure with finite multiinformation). Moreover, it also follows from Theorem 4.1 that the validity of the product formula (4.4) does not depend on the choice of a reference system of measures. Another comment is that one more equivalent definition of conditional independence can be derived from Theorem 4.1. Suppose that P is a probability measure over N with finite multiinformation, ⟨A,B|C⟩ ∈ T(N), and accept Convention 2. It suffices to put u = u⟨A,B|C⟩ and use (ii)⇔(iv) in Theorem 4.1 to see that A ⊥⊥ B | C [P] iff

f_{ABC}(x_{ABC}) = f_{AC}(x_{AC}) · f_{BC}(x_{BC}) / f_C(x_C)   for P-a.e. x ∈ X_N.   (4.11)

Note that f_S(x_S) > 0 for P-a.e. x ∈ X_N.
5 Description of Probabilistic Models
Two basic approaches to description of probabilistic CI structures are dealt with in this chapter. The first one, which uses structural imsets, was already mentioned in Section 4.2. The second one, which uses supermodular functions, is closely related to the first one. It can also use imsets over N to describe CI models over N but the respective class of imsets and their interpretation are completely different. However, despite the formal difference, the approaches are equivalent. In fact, there exists a certain duality relation between these two methods: one approach is complementary to the other (see Section 5.4). The main result of the chapter says that every CI model induced by a probability measure with finite multiinformation can be described both by a structural imset and by a supermodular function.
5.1 Supermodular set functions

A real set function m : P(N) → ℝ is called a supermodular function over N if

m(U ∪ V) + m(U ∩ V) ≥ m(U) + m(V)   for every U, V ⊆ N.   (5.1)

The class of all supermodular functions on P(N) will be denoted by K(N). The definition can be formulated in several equivalent ways.

Proposition 5.1. A set function m : P(N) → ℝ is supermodular iff any of the following three conditions holds:

(i) ⟨m, u⟩ ≥ 0 for every structural imset u over N,
(ii) ⟨m, u⟩ ≥ 0 for every semi-elementary imset u over N,
(iii) ⟨m, u⟩ ≥ 0 for every elementary imset u ∈ E(N).

Proof. Evidently (i) ⇒ (ii) ⇒ (iii). The implication (iii) ⇒ (i) follows from the definition of a structural imset (see Section 4.2.3, p. 73) and the linearity of the scalar product. The condition (5.1) is equivalent to the requirement ⟨m, u⟨A,B|C⟩⟩ ≥ 0 for every ⟨A,B|C⟩ ∈ T(N), which is nothing but (ii).
5 Description of Probabilistic Models
Further evident observation is as follows. Proposition 5.2. The class of supermodular functions K(N ) is a cone: ∀ m1 , m2 ∈ K(N )
∀ α, β ≥ 0 α · m1 + β · m2 ∈ K(N ) .
(5.2)
Remark 5.1. This is to warn the reader that a different terminology is used in game theory, where supermodular set functions are named either “convex set functions” [109] or even “convex games” [116]. I followed the terminology from game theory in some of my former publications [131, 137]. However, in order to avoid confusion with the usual meaning of the adjective “convex” in mathematics another common term “supermodular” is used in this book. As mentioned in [17] supermodular functions are also named “2-monotone Choquet capacities”. Two kinds of “scalar product” equivalence for supermodular functions are introduced and distinguished below. The weaker one will be called qualitative and the stronger one quantitative. 5.1.1 Semi-graphoid produced by a supermodular function One says that a disjoint triplet A, B|C ∈ T (N ) is represented in a supermodular function m over N and writes A ⊥ ⊥ B | C [m] if m, u A,B|C = 0. The class of represented triplets then defines the model produced by m Mm = { A, B|C ∈ T (N ) ;
A⊥ ⊥ B | C [m] } .
Two supermodular functions over N are qualitatively equivalent if they represent the same class of disjoint triplets over N . Remark 5.2. This is to explain terminology. I usually say that a model is induced by a mathematical object over N (see Section 2.2.1); for example, by a probability measure over N or by a graph over N (see Chapter 3). However, in this and subsequent chapters I need to distinguish between two different ways of ascribing formal independence models to imsets. Both ways appear to be equivalent as concerns the overall class of ascribed models (see Corollary 5.3). The problem is that some imsets (e.g. the zero imset or u a,b|∅
if N = {a, b}) may be ascribed different models depending on the way of ascribing. To prevent misunderstanding I decided to emphasize the difference both in terminology (induced model versus produced model) and in notation (Mu versus Mm ). I regret to admit that the adjective “induced” was also used in a former research report [145] in connection with supermodular functions. Note that the difference between both ways of ascribing formal independence models is also reflected in names of respective equivalences of mathematical objects. The corresponding equivalence of supermodular functions
5.1 Supermodular set functions
89
and imsets (= the coincidence of “produced” models) is the above-mentioned qualitative equivalence while the corresponding equivalence of structural imsets (= the coincidence of “induced” models) is the independence equivalence introduced in Section 6.1.1. A basic fact is this: Lemma 5.1. A supermodular function over N produces a disjoint semigraphoid over N . Proof. This follows easily from respective formulas for semi-elementary imsets and the linearity of scalar product. Let m be a supermodular function over N . For the triviality property, realize m, u A,∅|C = m, 0 = 0, for the symmetry property m, u A,B|C = m, u B,A|C . The formula
m, u A,BD|C = m, u A,B|DC + m, u A,D|C
(5.3)
implies directly the contraction property. To verify the decomposition property and the weak union property, use Proposition 5.1 which says that both terms on the right-hand side of (5.3) are non-negative. A typical example of a supermodular set function is the multiinformation function introduced in Section 2.3.4. In fact, Corollary 2.2 on p. 27 says that, given a probability measure P over N with finite multiinformation, the multiinformation function mP is an -standardized supermodular function over N . One can conclude even more. Proposition 5.3. Given a probability measure P over N with finite multiinformation there exists an -standardized supermodular function m such that MP = Mm . Proof. Let us put m = mP . The relation (2.17) from Corollary 2.2 says A ⊥⊥ B | C [P ] ⇔ A ⊥ ⊥ B | C [mP ] which implies the desired fact.
for every A, B|C ∈ T (N ) ,
Note that the value mP , u A,B|C for a probability measure P and a disjoint triplet A, B|C is nothing but the relative entropy of P ABC with respect to the conditional product of P AC and P BC (see the proof of Corollary 2.2 on p. 27). In the discrete case, this number can be interpreted as a numerical evaluation of the degree of stochastic conditional dependence between A and B given C with respect to P [144]. Thus, given a supermodular function m over N and A, B|C ∈ T (N ) the non-negative value m, u A,B|C could be interpreted as a generalized degree of dependence between A and B given C with respect to m. Having in mind this point of view, there is no reason to distinguish between two supermodular functions for which scalar products with semi-elementary imsets coincide. This motivates the next definition.
Note that the value ⟨m_P, u⟨A,B|C⟩⟩ for a probability measure P and a disjoint triplet ⟨A,B|C⟩ is nothing but the relative entropy of P^{ABC} with respect to the conditional product of P^{AC} and P^{BC} (see the proof of Corollary 2.2 on p. 27). In the discrete case, this number can be interpreted as a numerical evaluation of the degree of stochastic conditional dependence between A and B given C with respect to P [144]. Thus, given a supermodular function m over N and ⟨A,B|C⟩ ∈ T(N), the non-negative value ⟨m, u⟨A,B|C⟩⟩ could be interpreted as a generalized degree of dependence between A and B given C with respect to m. Having in mind this point of view, there is no reason to distinguish between two supermodular functions for which the scalar products with semi-elementary imsets coincide. This motivates the next definition.

5.1.2 Quantitative equivalence of supermodular functions

We say that two supermodular functions m₁ and m₂ over N are quantitatively equivalent if

⟨m₁, u⟨A,B|C⟩⟩ = ⟨m₂, u⟨A,B|C⟩⟩   for every ⟨A,B|C⟩ ∈ T(N).   (5.4)

Obviously, this condition implies that m₁ and m₂ are also qualitatively equivalent. Quantitative equivalence can be described with the aid of a special class of modular functions. A function l : P(N) → ℝ is called modular if

l(U ∪ V) + l(U ∩ V) = l(U) + l(V)   for every U, V ⊆ N.   (5.5)

The class of modular functions over N will be denoted by L(N). Evidently L(N) ⊆ K(N). Note that L(N) appears to be the class of functions producing the maximal independence model T(N).

Proposition 5.4. The only ℓ-standardized modular function is the zero function.

Proof. Indeed, supposing that l : P(N) → ℝ is an ℓ-standardized modular function, one can show by induction on |S| that l(S) = 0 for every S ⊆ N. This is evident in the case |S| ≤ 1. If |S| ≥ 2 then take u⟨a,b|K⟩ ∈ E(N) such that S = abK. The fact ⟨l, u⟨a,b|K⟩⟩ = 0 says l(S) = l(aK) + l(bK) − l(K), and the right-hand side of this equality vanishes by the induction hypothesis.

Lemma 5.2. A supermodular function m over N produces the model T(N) iff m ∈ L(N). Two supermodular functions m₁, m₂ over N are quantitatively equivalent iff m₁ − m₂ ∈ L(N). Every supermodular function is quantitatively equivalent to an ℓ-standardized supermodular function. The class L(N) is a linear subspace of dimension |N| + 1. The functions m_{∅↑} and m_{{i}↑} for i ∈ N (see p. 39) form its linear basis.

Proof. Clearly, m : P(N) → ℝ is modular iff both m and −m are supermodular. Hence, by Proposition 5.1(ii) one has m ∈ L(N) iff ⟨m, u⟩ = 0 for every semi-elementary imset u, which means M_m = T(N). On the other hand, due to the linearity of the scalar product, two supermodular functions m₁ and m₂ are quantitatively equivalent iff ⟨m₁ − m₂, u⟩ = 0 for every semi-elementary imset u over N. Let m be a supermodular function over N. The function

m̃ ≡ m − m(∅) · m_{∅↑} − ∑_{i∈N} { m({i}) − m(∅) } · m_{{i}↑}   (5.6)

is evidently ℓ-standardized and supermodular because m_{∅↑}, m_{{i}↑} ∈ L(N) for i ∈ N and, of course, L(N) is a linear subspace. Observe that m_{∅↑} and m_{{i}↑} for i ∈ N are linearly independent. To show that they linearly generate L(N), take m ∈ L(N) and introduce m̃ by means of the formula (5.6). By Proposition 5.4, m̃ = 0.
Remark 5.3. To have a clear view on quantitative equivalence classes of supermodular functions, one should choose one representative from every equivalence class in a systematic way. The choice should follow relevant mathematical principles: to have geometric insight one should make the choice “linearly”. This can be done as follows. Take a linear subspace S(N) ⊆ ℝ^{P(N)} such that S(N) ⊕ L(N) = ℝ^{P(N)}. Then every m ∈ K(N) can be uniquely decomposed: m = s + l where s ∈ S(N), l ∈ L(N). The fact −L(N) ⊆ K(N) and Proposition 5.2 imply s ∈ K(N). Moreover, s is quantitatively equivalent to m by Lemma 5.2 and the function s ∈ K(N) ∩ S(N) coincides for quantitatively equivalent functions m ∈ K(N). However, there is flexibility in the choice of S(N). Fixing on a space S(N) satisfying the above requirements means that one restricts attention to this linear subspace and represents the class of supermodular functions K(N) by the respective class of standardized supermodular functions K(N) ∩ S(N). In this book, three ways of standardization are mentioned (there are theoretical reasons for them). The preferred standardization using the linear subspace

S_ℓ(N) = { m ∈ ℝ^{P(N)} ; m(S) = 0 whenever |S| ≤ 1 }

is in accordance with the property (2.15) of multiinformation functions (see p. 27). Functions m ∈ K(N) ∩ S_ℓ(N) are non-decreasing: m(S) ≤ m(T) whenever S ⊆ T (see Corollary 2.2). In particular, they are non-negative. From a purely mathematical point of view, however, another standardization, which uses the subspace

S_u(N) = { m ∈ ℝ^{P(N)} ; m(S) = 0 whenever |S| ≥ |N| − 1 },

is equally entitled. This standardization can be viewed as a “reflection” of the former one since the composition m∘ι of m ∈ ℝ^{P(N)} with the self-transformation ι : S ↦ N∖S, S ⊆ N, of P(N) transforms K(N) ∩ S_ℓ(N) onto K(N) ∩ S_u(N). Thus, this standardization leads to non-increasing standardized supermodular functions, which are non-negative as well. The third natural option is to take the orthogonal complement of L(N):

S_o(N) = { m ∈ ℝ^{P(N)} ; ∑_{S⊆N} m(S) = 0 and ∑_{S⊆N, i∈S} m(S) = 0 for i ∈ N }.

Note that every independence model produced by a supermodular function is even produced by a supermodular imset (see Corollary 5.3 in Section 5.3). Thus, the ways of imset standardization mentioned in Section 2.4 are those used for supermodular functions. The letters ℓ, u, o distinguish the types of standardization: ℓ-standardization means that the lower part of the respective diagram of an imset is “vanished”, u-standardization means that the upper part is “vanished” and o-standardization means that the respective linear space is the orthogonal complement of L(N).
5.2 Skeletal supermodular functions

A supermodular function m over N will be called skeletal if M_m ⊂ T(N), but there is no supermodular function r over N such that M_m ⊂ M_r ⊂ T(N). Thus, a supermodular function is skeletal iff it produces a “submaximal” independence model. The above definition implies that a supermodular function which is qualitatively equivalent to a skeletal function is also skeletal. In particular, quantitative equivalence has the same property. Of course, qualitative equivalence decomposes the collection of skeletal functions into finitely many equivalence classes. The aim of this section is to characterize these equivalence classes. To have a clear geometric view on the problem, it is appropriate to simplify the situation with the aid of the ℓ-standardization as mentioned in Remark 5.3. Let K_ℓ(N) = K(N) ∩ S_ℓ(N) be the class of ℓ-standardized supermodular functions. A basic observation is that K(N) is the direct sum of K_ℓ(N) and L(N), in notation K(N) = K_ℓ(N) ⊕ L(N).

Proposition 5.5. Every m ∈ K(N) has a unique decomposition m = m̃ + l where m̃ ∈ K_ℓ(N) and l ∈ L(N). In particular, K_ℓ(N) ∩ L(N) = {0}.

Proof. Put l = m(∅) · m_{∅↑} + ∑_{i∈N} { m({i}) − m(∅) } · m_{{i}↑}. By Lemma 5.2, l ∈ L(N). As (−l) ∈ L(N) ⊆ K(N), by Proposition 5.2, m̃ ≡ m + (−l) ∈ K(N). The facts l(∅) = m(∅) and l({i}) = m({i}) for i ∈ N imply that m̃ is ℓ-standardized. The uniqueness of the decomposition follows from Proposition 5.4 since L(N) ∩ K_ℓ(N) = L(N) ∩ S_ℓ(N) = {0}.

The following lemma summarizes substantial facts concerning K_ℓ(N) (for related concepts see Section A.5.2).

Lemma 5.3. Every ℓ-standardized supermodular function m ∈ K_ℓ(N) is non-negative. The set K_ℓ(N) is a pointed rational polyhedral cone in ℝ^{P(N)}. In particular, it has finitely many extreme rays and every extreme ray of K_ℓ(N) contains exactly one non-zero normalized imset (see p. 41). The set K_ℓ(N) is the conical closure of this collection of normalized imsets.

Proof. The fact that m ∈ K_ℓ(N) is non-negative was already mentioned in Remark 5.3, including a reference to Corollary 2.2. To evidence that the set K_ℓ(N) is a rational polyhedral cone, observe that it is the dual cone F* = {m ∈ ℝ^{P(N)} ; ⟨m, u⟩ ≥ 0 for u ∈ F} to a finite set F ⊆ ℚ^{P(N)}, namely to

F = E(N) ∪ {δ_∅, −δ_∅} ∪ ∪_{i∈N} {δ_{{i}}, −δ_{{i}}}.

The fact that it is pointed, that is, K_ℓ(N) ∩ (−K_ℓ(N)) = L(N) ∩ S_ℓ(N) = {0}, follows from Proposition 5.4. All remaining statements of Lemma 5.3 follow from well-known properties of pointed rational polyhedral cones gathered in
5.2 Skeletal supermodular functions
93
Section A.5.2. Observe that every (extreme) ray of K (N ) which contains a non-zero element of QP(N ) must contain a non-zero element of ZP(N ) , that is, a non-zero imset. But just one non-zero imset within the ray is normalized. 5.2.1 Skeleton Let us denote by K (N ) the collection of non-zero normalized imsets belonging to extremal rays of K (N ) and call this set the -skeleton over N . It is empty in the case |N | = 1, otherwise it is a non-empty set. The first important observation concerning K (N ) is the following one. Lemma 5.4. An imset u over N is structural iff it is o-standardized and
m, u ≥ 0 for every m ∈ K (N ). Proof. The necessity of the conditions follows from Propositions 4.4 and 5.1(i). For the sufficiency, suppose that u ∈ ZP(N ) is o-standardized and m, u ≥ 0 for any m ∈ K (N ). The fact that u is o-standardized means that m∅↑ , u = 0 and m{i}↑ , u = 0 for i ∈ N . Hence, by Lemma 5.2 derive that l, u = 0 for every l ∈ L(N ). The fact K (N ) = con(K (N )) (see Lemma 5.3) implies that
m, u ≥ 0 for every m ∈ K (N ). Hence, by Proposition 5.5 get m, u ≥ 0 for every m ∈ K(N ), i.e., u belongs to the dual cone K(N )∗ . However, K(N ) was introduced as the dual cone E(N )∗ in RP(N ) – see Proposition 5.1(iii). This says u ∈ E(N )∗∗ , and E(N )∗∗ is nothing else than the conical closure con(E(N )) – see Section A.5.2. Hence u ∈ con(E(N )) ∩ ZP(N ) and, by Lemma A.2, u is a combination of elementary imsets with non-negative rational coefficients. Therefore, it is a structural imset – see Section 4.2.3. The following consequence of Lemma 5.4 will be utilized later. Corollary 5.1. Let u be a structural imset over N and A, B|C ∈ T (N ). Then A ⊥ ⊥ B | C [u] iff ∀ r ∈ K (N ) r, u A,B|C > 0 implies r, u > 0. Proof. Since both u and v ≡ u A,B|C are o-standardized, wk ≡ k · u − v is ostandardized for every k ∈ N. By Lemma 5.4, wk is structural iff r, k·u−v ≥ 0 for every r ∈ K (N ). Thus, by the definition of Mu , A ⊥⊥ B | C [u] iff ∃ k ∈ N ∀ r ∈ K (N )
k · r, u ≥ r, v .
(5.7)
r, v > 0 ⇒ r, u > 0 .
(5.8)
This clearly implies that ∀ r ∈ K (N )
Conversely, supposing (5.8) observe that ∀ r ∈ K (N ) there exists kr ∈ N such that k · r, u ≥ r, v for any k ∈ N, k ≥ kr . Indeed, owing to Proposition 5.1 kr = 1 in the case r, v = 0 and kr is the least integer greater than r, v/ r, u in the case r, v > 0. As K (N ) is finite one can put k = max {kr ; r ∈ K (N )} to evidence (5.7).
94
5 Description of Probabilistic Models
An important auxiliary result is the following “separation” lemma. Lemma 5.5. For every m ∈ K (N ) there exists a structural imset u ∈ S(N ) such that m, u = 0 and r, u > 0 for any other r ∈ K (N ) \ {m}. Moreover, for every pair m, r ∈ K (N ), m = r, there exists an elementary imset v ∈ E(N ) such that m, v = 0 and r, v > 0. Consequently, Mm \ Mr = ∅ = Mr \ Mm for distinct m, r ∈ K (N ). Proof. By Lemma 5.3 K (N ) is a pointed rational polyhedral cone. It can be viewed as a cone in RP∗ (N ) where P∗ (N ) = {T ⊆ N, |T | ≥ 2}. Observe that this change of standpoint does not influence the concept of an extreme ray. One can apply Lemma A.1 from Section A.5.2 to the extreme ray generated by (the restriction of) m. The respective q ∈ QP∗ (N ) can be multiplied by a natural number to get u ∈ ZP∗ (N ) . This integer-valued function on P∗ (N ) can be extended to an o-standardized imset over N by means of the formulas u(S) for i ∈ N, u(∅) = − u(S). u({i}) = − S,{i}⊂S
S,S=∅
As every element of K (N ) is -standardized, the obtained imset u satisfies the required conditions: it is a structural imset by Lemma 5.4. Theexistence of v ∈ E(N ) is a clear consequence of the existence of u since n·u = v∈E(N ) kv ·v for some kv ∈ Z+ , n ∈ N. Indeed, the linearity of scalar product and the fact
r, u > 0 imply that kv > 0 and r, v > 0 for some v ∈ E(N ). Moreover, 0 = m, u = w∈E(N ) kw · m, w implies m, v = 0 by Proposition 5.1. However, the main lemma of this section is the following one. Lemma 5.6. A function m ∈ K (N ) is skeletal iff it is a non-zero function belonging to an extreme ray of K (N ). Proof. To prove the necessity, suppose that m is skeletal. Then m = 0 because Mm = T (N ). By Lemma 5.3 write αr · r for some αr ≥ 0 . (5.9) m= r∈K (N )
Since m = 0 there exists r ∈ K (N ) such that αr > 0. Linearity of the scalar product says, by Proposition 5.1, that m, u = 0 implies r, u = 0 for every semi-elementary imset u over N . Thus, Mm ⊆ Mr . The fact r ∈ K (N ) implies Mr = T (N ) as otherwise r ∈ L(N ) by Lemma 5.2 and this contradicts Proposition 5.4. The assumption that m is skeletal forces Mm = Mr . By Lemma 5.5, at most one r ∈ K (N ) with Mr = Mm exists. Thus, (5.9) says m = αr · r for some r ∈ K (N ) and αr > 0. To prove the sufficiency, suppose that m = 0 belongs to an extreme ray R of K (N ). The fact m = 0 implies, by Lemma 5.2 and Proposition 5.4, that
5.2 Skeletal supermodular functions
95
Mm = T (N ). Suppose that r is a supermodular function with Mm ⊆ Mr . The aim is to show that either Mr = Mm or Mr = T (N ). By Lemma 5.2, r is quantitatively equivalent to an -standardized supermodular function. Therefore assume without loss of generality r ∈ K (N ). The assumption Mm ⊆ Mr says m, u = 0 ⇒ r, u = 0 for every semi-elementary imset u over N . Thus, by Proposition 5.1, r, u > 0 implies m, u > 0. This means that there exists ku ∈ N with ku · m, u ≥ r, u. Since the class of semielementary imsets is finite, there exists k ∈ N such that k · m, u ≥ r, u for every semi-elementary imset u over N . By linearity of the scalar product and Proposition 5.1(ii) we conclude that k · m − r is supermodular. Since both m and r are -standardized, we get k · m − r ∈ K (N ). The assumption that R is an extreme ray of K (N ) and the decomposition k · m = (k · m − r) + r imply that r ∈ R. Thus, r = α · m for α ≥ 0. If α = 0 then r = 0 says Mr = T (N ), if α > 0 then necessarily Mr = Mm . Hence, the desired characterization of qualitative equivalence classes of skeletal imsets is obtained. Corollary 5.2. Every class of qualitative equivalence of skeletal supermodular functions over N is characterized by the unique element of -skeleton K (N ) belonging to the class. Given m ∈ K (N ) the respective equivalence class consists of functions m =α·m+l
where α > 0, l ∈ L(N ) .
In particular, every skeletal function is qualitatively equivalent to a skeletal imset and K (N ) characterizes all skeletal functions. Proof. Given a skeletal function r ∈ K(N ), by Proposition 5.5 and Lemma 5.2 we find a unique quantitatively equivalent skeletal function r˜ ∈ K (N ) and apply Lemma 5.6 to r˜ to find m ∈ K (N ) and α > 0 with r˜ = α · m. The fact that m is the unique qualitatively equivalent element of the -skeleton follows from Lemma 5.5. 5.2.2 Significance of skeletal imsets One of the main results of this chapter is the following theorem, which explains the significance of the concept of -skeleton. Theorem 5.1. There exists the least finite set of -standardized normalized imsets N (N ) such that for every imset u over N u is structural
⇐⇒
u is o-standardized and
m, u ≥ 0 for every m ∈ N (N ).
Moreover, N (N ) is nothing else than K (N ).
(5.10)
96
5 Description of Probabilistic Models
Proof. Lemma 5.4 says that K (N ) is a finite set of normalized -standardized imsets satisfying (5.10). Let N (N ) be any finite set of this type; the aim is to show that K (N ) ⊆ N (N ). Suppose for contradiction that m ∈ K (N )\N (N ). By Lemma 5.5 there exists a structural imset u over N such that m, u = 0 and r, u > 0 for any other r ∈ K (N ). A basic observation is that s, u > 0 for every s ∈ N (N ), s = 0. Indeed, (5.10) and Proposition 5.1 imply that s is supermodular and, therefore, s ∈ K (N ). By Lemma 5.3 write s = r∈K (N ) αr · r for αr ≥ 0. Observe that αr¯ > 0 for some r¯ ∈ K (N ) \ {m} because otherwise s = αm · m for αm > 0 (as s = 0) and the fact that both s and m are normalized imsets implies s = m, which contradicts m ∈ N (N ). Hence, by Proposition 5.1 αr · r, u ≥ αr¯ · ¯ r, u > 0 .
s, u = r∈K (N )
Now, an o-standardized imset w over N such that m, w < 0 can be found easily. The next step is to put vk = k · u + w for every k ∈ N. The inequality
m, vk = m, w < 0 implies by Lemma 5.4 that vk is not a structural imset over N . On the other hand, for every 0 = s ∈ N (N ) one has s, u > 0 and therefore there exists ks ∈ N with s, vks = ks · s, u+ s, w ≥ 0. Since N (N ) is finite there exists k ∈ N such that s, vk ≥ 0 for every s ∈ N (N ). By the fact that vk is o-standardized and (5.10) derive that vk is a structural imset, which contradicts the conclusion above. Thus, there is no m ∈ K (N ) \ N (N ) and the desired inclusion K (N ) ⊆ N (N ) was verified. Remark 5.4. The number of elements of the -skeleton K (N ) depends on |N |. If |N | = 1 then one has |K (N )| = 0. The simplest non-trivial case |N | = 2 is not interesting because |K (N )| = 1 then: the cone K (N ) consists of a single ray generated by δN in that case. However, in the case N = {a, b, c} the -skeleton already has 5 imsets (see Example 1 in Studen´ y [131]). Figure 5.1 gives their list. Thus, in the case |N | = 3 one needs to check only 5 inequalities to find out whether an o-standardized imset is structural or not. In the case |N | = 4, the -skeleton has 37 imsets; the Hasse diagrams of ten basic types of skeletal imsets are shown in the Appendix of Studen´ y et al. [145] – for the proof see [131]. However, the -skeleton in the case |N | = 5 was found by means of a computer; it has 117978 imsets – see [145] and http://www.utia.cas.cz/user data/studeny/ImsetViewM.html for a related virtual catalog. Note that the problem of suitable characterization of skeletal imsets remains open – see Theme 5 in Section 9.1.2. Remark 5.5. The term “skeleton” was inspired by the idea that the collection of extreme rays of K (N ) can be viewed as the outer skeleton of the cone K (N ). I used this word in [137] to name the smallest finite set of normalized imsets defining a pointed rational polyhedral cone as its dual cone; that is, the
5.2 Skeletal supermodular functions
+1
+2
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b}
{a, b}
0
0
Q
{a, c}
0
{b, c}
+1
+1
+1
{a, c}
{b, c}
Q Q Q Q Q Q Q Q Q Q Q
Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
0
0
0
{b} {c} Q Q Q Q Q
Q
Q
0
0
+1
0
{b} {c} Q Q Q Q Q
Q
0
∅
0
∅
97
+1
+1
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
{a}
+1
0
0
0
0
0
{b} {c} Q Q Q Q Q
Q
0
∅
+1
0
0
0
0
0
{b} {c} Q Q Q Q Q
Q
0
∅
0
+1
0
0
0
0
{b} {c} Q Q Q Q Q
Q
0
∅
Fig. 5.1. The -skeleton over N = {a, b, c}.
-skeleton in the case of the conical closure E(N ). Another possible justification is that every independence model produced by a supermodular function is the intersection of submaximal independence models of this type (see Theorem 5.3) so that “skeletal” formal independence models form a certain generator of the whole lattice of models produced by supermodular functions. Note for explanation that some authors interested in graphical models [6, 9] use the word “skeleton” to name the underlying graph of a chain graph (see Section A.3, p. 220). Remark 5.6. As explained in Remark 5.3, -standardization is not the only possible way to standardize supermodular functions. An interesting fact is that all results gathered in Section 5.2 can also be achieved for u-standardization and o-standardization. Thus, one can introduce the u-skeleton Ku (N ) as a uniquely determined finite collection of non-zero normalized imsets belonging to extreme rays of the cone of u-standardized supermodular functions Ku (N ). Analogously, the o-skeleton Ko (N ) consists of non-zero normalized imsets taken from extreme rays of the cone of o-standardized supermodular functions
98
5 Description of Probabilistic Models
0
0
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b}
{a, c}
{b, c}
{a, b}
+1
+1
+1
0
Q
0
0
0
0
0
{a, c}
Q Q Q Q Q Q Q Q Q Q Q
Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
{b} {c} Q Q Q Q Q
Q
Q
{b, c}
0
0
0
{b} {c} Q Q Q Q Q
Q
+2
∅
0
+1
∅
0
0
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
{a}
0
0
0
0
+1
0
{b} {c} Q Q Q Q Q
Q
+1
∅
0
0
0
+1
0
0
{b} {c} Q Q Q Q Q
Q
+1
∅
0
0
+1
0
0
0
{b} {c} Q Q Q Q Q
Q
+1
∅
Fig. 5.2. The u-skeleton over N = {a, b, c}.
Ko (N ). The point is that extreme rays of K (N ), Ku (N ) and Ko (N ) correspond to each other; they describe the respective (qualitative equivalence) classes of skeletal imsets. It follows from Corollary 5.2 that two skeletal supermodular functions m1 , m2 are qualitatively equivalent iff m2 = α · m1 + l for α > 0 and l ∈ L(N ). In particular, the respective standardized representative can be computed easily on the basis of any skeletal supermodular function. More specifically, given a skeletal imset m, the corresponding element of the -skeleton m can be obtained as follows. Put m = m − m(∅) · m∅↑ + {m(∅) − m({i})} · m{i}↑ (5.11) i∈N
and normalize m then, i.e., define m = k −1 · m where k is the greatest common prime divisor of {m (S); S ⊆ N } (see Figure 6.4 for an example of m and the respective element of the -skeleton). Given a skeletal imset m over N put
5.3 Description of models by structural imsets
ν(i) = m(N ) − m(N \ {i})
for i ∈ N ,
x = −m(N ) +
99
ν(i) ,
i∈N
introduce
m u = m + x · m∅↑ −
ν(i) · m{i}↑
(5.12)
i∈N
and normalize m u to get the respective element of Ku (N ). Figure 5.2 shows the u-skeleton for N = {a, b, c}. Finally, given a skeletal imset m over N put µ(i) = 2 · m(S) − 4 · m(S) for i ∈ N , S⊆N
and y =2·
S,i∈S
|S| · m(S) − (|N | + 1) ·
S⊆N
m(S).
S⊆N
Then the formula m o = 2|N | · m + y · m∅↑ +
µ(i) · m{i}↑
(5.13)
i∈N
defines an o-standardized imset, which after normalization yields the respective element of the o-skeleton Ko (N ). Figure 5.3 consists of the Hasse diagrams of o-skeletal imsets over N = {a, b, c}. Note that the proofs of the results shown in Section 5.2 for an alternative standardization are analogous. The only noteworthy modification is needed in the proof of Lemma 5.5 in the case of o-standardization. The cone Ko (N ) is viewed as a cone in RP(N ) and after application of Lemma A.1 the respective q ∈ QP(N ) is multiplied to get u ∈ ZP(N ) . Then the formula (5.13) with u in place of m defines the desired o-standardized imset over N . The remaining arguments are analogous.
5.3 Description of models by structural imsets The concept of a semi-graphoid induced by a structural imset was already introduced in Section 4.4.1 on p. 78. The aim of this section is to relate those semi-graphoids to semi-graphoids produced by supermodular functions introduced in Section 5.1.1. The first observation is this: Proposition 5.6. Let m be a supermodular function over N and u a structural imset over N . Then m, u = 0 iff Mu ⊆ Mm . Proof. Supposing m, u = 0 and A, B|C ∈ Mu there exists k ∈ N such that k · u − u A,B|C ∈ S(N ). Write 0 = k · m, u = m, k · u = m, k · u − u A,B|C + m, u A,B|C .
100
5 Description of Probabilistic Models
+2
+1
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b}
{a, b}
{a, c}
{b, c}
−1
−1
−1
−1
−1
Q
{a, c}
−1
{b, c}
0
0
Q Q Q Q Q Q Q Q Q Q Q
Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
0
0
0
{b} {c} Q Q Q Q Q
Q
Q
0
+1
∅
+1
{b} {c} Q Q Q Q Q
Q
+2
∅
+1
+1
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b, c} Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q Q Q Q Q
{a}
{a}
{a}
−1
+1
−1
−1
−1
+1
{b} {c} Q Q Q Q Q
Q
+1
∅
−1
−1
−1
+1
−1
+1
{b} {c} Q Q Q Q Q
Q
−1
+1
∅
+1
−1
+1
−1
−1
{b} {c} Q Q Q Q Q
Q
+1
∅
Fig. 5.3. The o-skeleton over N = {a, b, c}.
By Proposition 5.1 both terms on the right-hand side of this equality are nonnegative and must therefore vanish. Thus, m, u A,B|C = 0 which means
A, B|C ∈ Mm . Conversely, supposing Mu ⊆ Mm write by (4.2) n · u = + v∈E(N ) kv · v for n ∈ N, kv ∈ Z . For every v = u a,b|K ∈ E(N ) such that kv > 0 observe a, b|K ∈ Mu and deduce m, v = 0 by Mu ⊆ Mm . In particular, n · m, u = kv · m, v = 0, v∈E(N )
which implies m, u = 0.
The following is an important auxiliary result. Lemma 5.7. Given a structural imset u over N one has Mu = Mm where r. R = { r ∈ K (N ); Mu ⊆ Mr } and m = r∈R
r∈R
Mr =
5.3 Description of models by structural imsets
101
Proof. The fact Mu ⊆ r∈R Mr is evident. For converse inclusion use Corollary 5.1: if A, B|C ∈ T (N ) \ Mu then there exists r ∈ K (N ) such that
r, u A,B|C > 0 and r, u = 0. By Proposition 5.6 Mu ⊆ Mr . Thus, r ∈ R r and A, B|C ∈ M . The inclusion r∈R Mr ⊆ Mm follows from the fact m = r∈R r by linearity of the scalar product; the converse inclusion can be derived similarly with the aid of Proposition 5.1. The following proposition is a substantial fact. Corollary 5.3. Given a formal independence model M ⊆ T (N ) the following four conditions are equivalent: (i) M = Mm for a supermodular function m over N , (ii) M = Mu for a combinatorial imset u over N , (iii) M = Mu for a structural imset u over N , (iv) M = Mm for a supermodular -standardized imset m over N . Proof. For (i)⇒(ii) put u = A,B|C ∈M u A,B|C . Being a combination of semi-elementary imsets, u is a combinatorial imset. For every A, B|C ∈ M ⊥ B | C [u]. observe that u−u A,B|C is a combinatorial imset and therefore A ⊥ Thus M ⊆ Mu . For converse inclusion observe m, u = 0 and use Proposition 5.6. The implication (iii)⇒(iv) is an easy consequence of Lemma 5.7, (ii)⇒(iii) and (iv)⇒(i) are evident. Now the main result of this chapter can easily be derived. Theorem 5.2. Let P be a probability measure over N with finite multiinformation. Then there exists a structural imset u over N such that P is perfectly Markovian with respect to u, that is, MP = Mu . Proof. By Proposition 5.3 on p. 89, there exists a supermodular function m over N such that MP = Mm . By Corollary 5.3, Mm = Mu for a structural imset u over N . Remark 5.7. Going back to the motivation considerations in Section 1.1, Theorem 5.2 means that structural imsets solve satisfactorily the theoretical question of completeness. The answer is affirmative: every CI structure induced by a probability measure with finite multiinformation can be described by a structural imset. On the other hand, the natural price for this achievement is that structural imsets describe some “superfluous” semi-graphoids. That means there are semi-graphoids induced by structural imsets which are not induced by discrete probability measures as Example 4.1 on p. 81 shows (the
102
5 Description of Probabilistic Models
left-hand picture of Figure 6.1 depicts the respective structural imset). In particular, another theoretical question of faithfulness from Section 1.1 has a negative answer. However, mathematical objects which “answer” affirmatively both the faithfulness and completeness questions are not advisable because they cannot satisfactorily solve the practical question of implementation (see Section 1.1, p. 5). These objects must be difficult to handle on a computer as the lattice of probabilistic CI models is quite complicated. For example, in the case of 4 variables there exist infimum-irreducible models which are not coatoms (= submaximal models) [136]; this makes the implementation complicated. An advantage of structural imsets is that the lattice of models induced by them is fairly elegant and gives a chance for efficient computer implementation.
5.4 Galois connection The relationship between both methods for describing CI models mentioned in this chapter can be lucidly explained using the view of the formal concept analysis theory. This approach, developed by Ganter and Wille [42], is a specific application of the theory of complete lattices on (conceptual) data analysis and knowledge processing. Because of its philosophical roots, formal concept analysis is very near to human conceptual thinking. The most important mathematical notion in the core of this approach is a well-known notion of Galois connection. This view helps one to interpret the relation of structural imsets and supermodular imsets (functions) as a duality relation. I hope that the presentation using the Galois connection will make the theory of structural CI models easily understandable for readers. 5.4.1 Formal concept analysis Let me summarize the basic ideas of Chapter 1 of Ganter and Wille [42]. Formal context consists of the following items: • the set of objects Œ, • the set of attributes Æ, • a binary incidence relation % ⊆ Œ × Æ between objects and attributes. If (x, y) ∈ % for x ∈ Œ, y ∈ Æ then one writes x % y and says that the object x has the attribute y. In general, the Galois connection is defined for a pair of posets ([10], § V.8). However, in the special case under consideration, the Galois connection can be introduced as a pair of mappings between the power sets of Œ and Æ (which are posets with respect to inclusion): X ⊆ Œ −→ X = {y ∈ Æ ; x % y for every x ∈ X} , Y ⊆ Æ −→ Y = {x ∈ Œ ; x % y for every y ∈ Y } .
5.4 Galois connection
103
Thus, X is the set of attributes common to objects in X while Y is the set of objects which have all attributes in Y . Clearly, X1 ⊆ X2 implies X1 ⊇ X2
and Y1 ⊇ Y2 implies Y1 ⊆ Y2 . The consequence is that the mapping X → X is a closure operation on subsets of Œ and the mapping Y → Y is a closure operation on subsets of Æ (see p. 218 for this concept; cf. [10], § V.7).
s ⎧ s ⎪ ⎪ ⎨ X s Y ⎪ ⎪ ⎩ s s
s
s
s
s
Y = X
s s
s
s
s Æ attributes s
s
s
s
s
s
s
s
s
s
s
s
s
Œ objects
s
s s
s % incidence relation
Fig. 5.4. The Galois connection – an informal illustration.
By a formal concept of the context (Œ, Æ, %) is understood a pair (X, Y ) with X ⊆ Œ, Y ⊆ Æ, X = Y and Y = X. The set X is called the extent and the set Y the intent of the concept. Observe that the concept is uniquely determined either by its extent, that is, the list of objects forming the concept, or by its intent, which is the list of attributes (= properties) characterizing the concept. It reflects two different philosophical-methodological ways of defining concepts: constructive and descriptive definitions. One says that a concept (X1 , Y1 ) is a subconcept of a concept (X2 , Y2 ) and writes (X1 , Y1 ) & (X2 , Y2 ) if X1 ⊆ X2 . Basic properties of the Galois connection and the definition of the notion of a formal concept implies that X1 ⊆ X2 iff Y1 ⊇ Y2 . Thus, the class of all concepts of a given context (Œ, Æ, %) is a poset ordered by the relation &. In fact, it is a complete lattice (see Theorem 3 in Chapter 1 of [42]) where supremum and infimum (of two concepts) are defined as follows: (X1 , Y1 ) ∨ (X2 , Y2 ) = ( (X1 ∪ X2 ) , Y1 ∩ Y2 ), (X1 , Y1 ) ∧ (X2 , Y2 ) = ( X1 ∩ X2 , (Y1 ∪ Y2 ) ). The lattice is called a concept lattice of the context (Œ, Æ, %).
104
5 Description of Probabilistic Models
Remark 5.8. Note that it follows from the properties of the Galois connection that the above-mentioned concept lattice is order-isomorphic to the poset {X ⊆ Œ; X = X } ordered by inclusion ⊆. Thus, the lattice can be described only in terms of objects with the aid of the closure operation X → X on subsets of Œ (see Section A.2 for the concept of a closure operation). However, for analogous reasons, the same concept lattice is order-isomorphic to the poset {Y ⊆ Æ; Y = Y } ordered by inverse inclusion ⊇. This means that the lattice can be described dually in terms of attributes and the respective closure operation Y → Y as well: this closure operation induces “ordinary” inclusion ordering ⊆ on {Y ⊆ Æ; Y = Y }, which needs to be “reversed” then. Thus, the same mathematical structure can be described from two different points of view, in terms of objects or in terms of attributes. This again corresponds to two different methodological attitudes to describing concepts. The message of Section 5.4 is that the relationship between the description of CI models in terms of structural imsets and the description in terms of supermodular functions is just a relationship of this kind. On the other hand, the role of objects and attributes in a formal context is evidently interchangeable – see Figure 5.4 for illustration. 5.4.2 Lattice of structural models Let us introduce the class U(N ) of structural independence models as the class of formal independence models induced by structural imsets: U(N ) = {M ⊆ T (N ) ; M = Mu for u ∈ S(N ) } .
(5.14)
Corollary 5.3 implies that it coincides with the class of formal independence models produced by supermodular functions: U(N ) = {M ⊆ T (N ) ; M = Mm for m ∈ K(N ) } .
(5.15)
The class U(N ) is naturally ordered by inclusion ⊆. The main result of this section is that U(N ) is a finite concept lattice. Indeed, the respective formal context (Œ, Æ, %) can be constructed as follows: Œ = E(N ), Æ = K (N ) and u % m iff m, u = 0 for u ∈ E(N ), m ∈ K (N ).
(5.16)
Figure 5.5 gives an example of this formal context in the case N = {a, b, c}. The following theorem summarizes the results. Theorem 5.3. The poset (U(N ), ⊆) is a finite concept lattice which is, moreover, both atomistic and coatomistic. The null element of U(N ) is Tø (N ), that is, the model induced by the zero structural imset. The atoms of U(N ) are just the models induced by elementary imsets Mv , v ∈ E(N ). The coatoms of U(N ) are just the models produced by skeletal supermodular functions Mm , m ∈ K (N ). The unit element of U(N ) is T (N ), that is, the model produced by any modular set function l ∈ L(N ).
5.4 Galois connection
105
Æ = K (N ) 2 · δN + δab + δac + δbc
δN + δab
δN + δac
u b,c|a
t
t
t
u a,c|b
t
t
u a,b|c
t
δN
u a,b|∅
t
u a,c|∅
t
t
u b,c|∅
t
t
δN + δbc
t t
t
t
t t
t
Œ = E(N )
Fig. 5.5. The formal context (5.16) in the case N = {a, b, c}.
Proof. The first observation is that U(N ) is a complete lattice. Indeed, it suffices to show that every subset of U(N ) has an infimum (see Section A.2). Let us use (5.15) for this purpose. Given supermodular functions n m = m defines a supermodular funcm1 , . . . , mn , n ≥ 1, the function i i=1 n tion such that Mm = i=1 Mmi . This follows from Proposition 5.1. The infimum of the empty subset of U(N ) is T (N ) (see Lemma 5.2). The second observation is that {Mv ; v ∈ E(N )} is supremum-dense in U(N ): ∀ M ∈ U(N )
M = sup {Mu : u ∈ Q}
(5.17)
where Q = {u a,b|K ∈ E(N ); a, b|K ∈ M} . Indeed, evidently Mv ⊆ M for every v ∈ Q (cf. Lemma 4.5). On the other hand, supposing K ∈ U(N ) satisfies Mv ⊆ K for every v ∈ Q, every elementary independence statement from M belongs to K. We conclude, by Lemma 2.2, that M ⊆ K, which implies (5.17). The third observation is that {Mm ; m ∈ K (N )} is infimum-dense in U(N ): ∀ M ∈ U(N )
M = inf {Mm ; m ∈ R} where R = {r ∈
K (N ); M
(5.18) ⊆ M }. r
Indeed, by (5.14) one can apply Lemma 5.7 to M and observe that M = r r∈R M . This clearly gives (5.18). To show that (U(N ), ⊆) is even a concept lattice one can use Theorem 3 in Chapter 1 of Ganter and Wille [42]. It says in the theorem that to show that U(N ) is order-isomorphic to the concept lattice of a formal context (Œ, Æ, %) it suffices to show that there exist mappings γ : Œ → U(N ) and δ : Æ → U(N ) such that (a) γ(Œ) is supremum-dense in U(N ), (b) δ(Æ) is infimum-dense in U(N ),
106
5 Description of Probabilistic Models
(c) u % m ⇔ γ(u) ⊆ δ(m) for every u ∈ Œ, m ∈ Æ. Let us introduce the formal context by means of (5.16) and define γ and δ as follows: γ ascribes Mv to every v ∈ E(N ) (see p. 78) and δ ascribes Mm to every m ∈ K (N ) (see p. 88). The condition (a) follows from (5.17), the condition (b) from (5.18) and the condition (c) follows directly from Proposition 5.6. Thus, U(N ) is a concept lattice. The next observation is that Tø (N ) is the null element of U(N ). By Proposition 4.7 Tø (N ) ∈ U(N ). By Lemma 4.6, any M ∈ U(N ) is a semi-graphoid and therefore Tø (N ) ⊆ M. To show that every Mv , v ∈ E(N ) is an atom of U(N ), we observe by (5.14) that Mv ∈ U(N ) and assume M ∈ U(N ), M ⊆ Mv . If v = u a,b|K then by Lemma 4.5 obtain Mv = { a, b|K, b, a|K}∪Tø (N ). As M is a semi-graphoid Tø (N ) ⊆ M. If M = Tø (N ) then either a, b|K ∈ M or b, a|K ∈ M, which implies by the symmetry property Mv ⊆ M. The above-mentioned fact implies, with the aid of (5.17), that U(N ) is an atomistic lattice. The fact that T (N ) is the unit element of U(N ) is evident: T (N ) ∈ U(N ) by Lemma 5.2. To show that every Ms , s ∈ K (N ) is a coatom of U(N ), we observe Ms ∈ U(N ) by (5.15)and assume M ∈ U(N ), Ms ⊆ M. By (5.14) and Lemma 5.7, write M = r∈R Mr where R = {r ∈ K (N ); M ⊆ Mr }. If R \ {s} = ∅ then Ms ⊆ M ⊆ Mr for some r ∈ K (N ), r = s which contradicts the fact Ms \ Mr = ∅ implied by Lemma 5.5. Therefore R ⊆ {s}: if R = ∅ then M = T (N ), if R = {s} then M = Ms . The above fact, together with (5.18), implies that U(N ) is a coatomistic lattice. To see that {Mv ; v ∈ E(N )} are indeed all atoms of U(N ), let us realize that every atom is supremum-irreducible and use a well-known fact that every supremum-dense set must contain all supremum-irreducible elements (see Proposition 2 in § 0.2 of [42]). Indeed, {Mv , v ∈ E(N )} is supremumdense in U(N ) by (5.17). Analogously, the fact that {Mm ; m ∈ K (N )} are all coatoms of U(N ) follows from (5.18) and the fact that every infimum-dense subset must contain all infimum-irreducible elements, in particular, it contains all coatoms.
Remark 5.9. However, the formal context (5.16) is not the only option. For example, one can alternatively take combinatorial imsets in place of Œ and combinations of -skeletal imsets with non-negative integral coefficients in place of Æ (but the incidence relation is defined in the same way as in (5.16) ). The second option is to put Œ = S(N ) and Æ = K (N ) ∩ ZP(N ) (see Figure 7.10 for illustration). The third option is Œ = con(E(N )) and Æ = K (N ). Moreover, one can consider an alternative standardization instead of the standardization (see Remark 5.3). A special combined option is Œ = E(N ) and Æ = K(N ). On the other hand, the formal context (5.16) is a distinguished one in a certain sense. Theorem 5.3 implies that every object of (5.16) defines
5.4 Galois connection
107
a supremum-irreducible concept and every attribute of (5.16) defines an infimum-irreducible concept. Thus, the context (5.16) is reduced in the sense of Definition 24, Chapter 1 in Ganter and Wille [42]. A formal context of this type is unique for a given finite concept lattice – provided one takes into account the respective isomorphism of formal contexts (see Proposition 12 in Chapter 1 of [42]). Another point of view on the lattice (U(N ), ⊆) is the following one. It is order-isomorphic to the poset of equivalence classes of structural imsets S(N ) where the corresponding independence equivalence of structural imsets is considered (see Section 6.1.1 for this concept). This is suitable from the computational point of view since the operation of supremum in the lattice corresponds to the sum of structural imsets; see Corollary 6.1 in Section 6.2.1. A dual point of view is also possible: the equivalence classes of elements of K (N ) ∩ ZP(N ) with respect to qualitative equivalence of supermodular functions can be taken into consideration. The following observation says that the infimum is implementable by means of summing supermodular functions (imsets). Proposition 5.7. Let R be a finite set of supermodular functions over N . Then Mm = Mr where r = m. (5.19) inf {Mm ; m ∈ R} = m∈R
m∈R
Proof. To show M ⊆ M for m ∈ R, we take A, B|C ∈ Mr , write 0 = r, u A,B|C = m∈R m, u A,B|C and use Proposition 5.1(ii). The same equality can be used to show that, for every supermodular function s over N , the requirement Ms ⊆ Mm for m ∈ R implies Ms ⊆ Mr . r
m
Remark 5.10. The lattice U(N ) is also order-isomorphic to a face lattice (see § 0.3 in [42]), namely the lattice of faces of a certain polyhedral cone. For example, one can consider the cone con(E(N )) ⊆ RP(N ) . Another option is the class of faces of the cone K (N ) endowed with inverse inclusion ⊇; one can also consider the cone Ku (N ) and the cone Ko (N ). In fact, the original terminology from Studen´ y [137] was motivated by that point of view (see Remark 6.2 on p. 114). Example 5.1. The lattice U(N ) has only 1 element if |N | = 1, namely T (N ) = Tø (N ) and only 2 elements if |N | = 2, namely Tø (N ) and T (N ). However, it has 22 elements in the case N = {a, b, c}: the respective Hasse diagram is shown in Figure 5.6. Every node of the diagram contains a schematic description of the respective independence model in terms of elementary independence statements. Note that Figure 5.6 also shows the lattice of semigraphoids over {a, b, c} because they coincide with structural independence models over {a, b, c}. In fact, every structural model over {a, b, c} is even a CI model (see the next proposition). The number of structural models over 4 variables is 22108 [135]. ♦
% &
&
% &
$ '
% &
&
$ '
% &
$ '
% &
$ '
%
&
% &
$ '
% &
& $
KEY:
$ '
%
$
'
% &
$ '
% &
'
%
$
% &
$ '
% &
&
$ '
%
$
a⊥ ⊥ b|c
a⊥ ⊥ b|∅
&
% &
$ '
% &
$ '
%
$
% &
$ '
%
$
a⊥ ⊥ c|b
'
b⊥ ⊥ c|a
b⊥ ⊥ c|∅ a⊥ ⊥ c|∅
%
$
5 Description of Probabilistic Models
'
$ '
%
&
'
$
'
$ '
'
&
'
108
Fig. 5.6. The lattice of CI models over N = {a, b, c} (rotated).
5.4 Galois connection
109
Proposition 5.8. Let u be a structural imset over N with |N | = 3. Then there exists a discrete probability measure P over N such that Mu = MP . Proof. The proposition is an easy consequence of the fact that (U(N ), ⊆) is coatomistic (see Theorem 5.3) and of Lemma 2.9 from Section 2.3.7. Six respective constructions of perfectly Markovian measures for the unit element and coatoms of U(N ) (see Figure 5.6) were already given: see Proposition 2.2, Proposition 2.3, Example 2.1 and Example 2.3. Remark 5.11. This is to explain the relation of the presented approach to the polymatroidal description of CI models used by Mat´ uˇs [86]. A polymatroid is defined as a non-decreasing submodular function h : P(N ) → R such that h(∅) = 0. Recall that a function h is called submodular if −h is supermodular. Some polymatroids can be obtained by multiplying, by suitable factors, entropy functions of discrete measures relative to the counting measure (for the concept of entropy function see Remark 4.4) [81]. The formal independence model induced by a polymatroid h consists of A, B|C ∈ T (N ) such that
h, u A,B|C = 0. Since −h is a supermodular function, there is no difference between the model induced by a polymatroid h and the model produced by the supermodular function −h. Thus, the models induced by polymatroids are just the structural independence models. There is a one-to-one correspondence between certain polymatroids and u-standardized supermodular function – see § 5.2 in Studen´ y et al. [145].
6 Equivalence and Implication
This chapter deals with the equivalence and implication problems for structural imsets. First, the question of how to understand the concept of equivalence (and implication) is discussed and two basic types of equivalence are compared. The rest of the chapter is devoted to the stronger type of equivalence, called the independence equivalence, and to the respective implication between structural imsets. Two characterizations of independence implication, which are analogous to graphical characterizations of independence equivalence of graphs mentioned in Chapter 3, are given, and related implementation tasks are discussed.
6.1 Two concepts of equivalence Basically, there are two different ways of defining the concept of equivalence for graphs in connection with classic graphical models; these ways appear to be equivalent in standard situations. The first option is independence equivalence which is the requirement that the induced formal independence models coincide. This type of equivalence is not related to a distribution framework. The second option is distribution equivalence. Actually, one can introduce various kinds of distribution equivalence, that is, the conditions that the respective statistical models coincide. This type of equivalence of graphs is always understood relative to a distribution framework, that is, relative to a fixed comprehensive class Ψ of probability measures over N such that the considered statistical models are introduced as certain subsets of Ψ – see Section A.9.5 for more details. A basic form of a distribution equivalence is Markov equivalence which is the requirement that the classes of Markovian measures within Ψ coincide. However, one can take into consideration other forms of distribution equivalence. One of them could be factorization equivalence, that is, the condition that classes of factorizable measures with respect to considered graphs coincide - see Remarks 3.3 and 3.7. Another form of distribution equivalence is parameterization equivalence mentioned in Remark 6.1 below.
112
6 Equivalence and Implication
Clearly, because of the definition of a Markovian measure, independence equivalence implies Markov equivalence regardless of what is the considered distribution framework. The converse is true in the case of faithfulness (see Section 1.1, p. 3). It is easy to see that if a perfectly Markovian measure within a distribution framework Ψ exists for every graph (from the respective universum of graphs) then Markov equivalence relative to Ψ implies independence equivalence. This is the case of classic chain graphs relative to the class of discrete measures (see Section 3.3) and the case of alternative chain graphs relative to the class of regular Gaussian measures (see Section 3.5.5). Nevertheless, Markov and independence equivalence coincide even under a weaker assumption that the considered class of measures is perfect for every graph (see Remark 3.2 on p. 45 for this concept). On the other hand, if the considered distribution framework is somehow limited then it may happen that independence and Markov equivalence differ as the next example shows. Example 6.1. One can design a special distribution framework Ψ in such a way that Markov equivalence of undirected graphs relative to Ψ does not imply their independence equivalence. Consider a discrete distribution framework with prescribed one-dimensional marginals Pi on fixed measurable spaces (Xi , Xi ), i ∈ N (see Section A.9.5). Assume that at least two prescribed marginals Pk are collapsed measures, by which is meant that Pk (A) ∈ {0, 1} for every A ∈ Xk . Supposing Pk is collapsed for every k ∈ M , M ⊆ N one can verify using Lemma A.6 that a ⊥ ⊥ b | K [P ] for every P ∈ Ψ and
a, b|K ∈ E(N ) such that a ∈ M . Consequently, every two undirected graphs over N which have the same induced subgraph for N \ M can be shown to be Markov equivalent relative to Ψ . However, if the graphs differ they are not independence equivalent (see p. 46). ♦ Remark 6.1. The third type of distribution equivalence is parameterization equivalence. This approach is based on the following interpretation of some types of graphs, e.g., ancestral graphs [107] and joint-response chain graphs [28]. A specific distribution framework Ψ , usually the class of regular Gaussian measures over N , is considered. Every edge in a graph of the above-mentioned type represents a real parameter and every collection (= a vector) of edgeparameters determines a unique probability measure from Ψ factorized in a particular way. Every graph of considered type is then identified with the class of parameterized measures which often coincides with the class of Markovian measures within Ψ (e.g., in the case of maximal ancestral graphs [107] mentioned in Section 3.5.8). Two graphs can be called parameterization equivalent if their classes of parameterized measures coincide. Of course, parameterization equivalence substantially depends on the considered distribution framework and may not coincide with Markov equivalence; for example, in the case of general ancestral graphs [107]. This particular point of view motivates a general question of whether some structural imsets may lead to a specific way of parameterizing the corresponding class of Markovian distribution (see Direction 6 in Chapter 9).
6.1 Two concepts of equivalence
113
In typical situations Markov and independence equivalence coincide, which means that the concept of equivalence of graphs is unambiguously defined. This is maybe the reason why Lauritzen ([70], § 3.3) introduced Markov equivalence of chain graphs directly as their independence equivalence. Then the task to characterize Markov equivalence in graphical terms is correctly set. Several solutions to this general equivalence question (in the sense of Section 1.1) were exemplified in Chapter 3. The aim of this chapter is to examine the same equivalence question for structural imsets. The problem is that in the case of structural imsets one has to distinguish two abovementioned types of equivalence and choose one of them as a basis for further study. 6.1.1 Independence and Markov equivalence Two structural imsets u, v over N are independence equivalent if they induce the same CI model, that is, Mu = Mv . Then one writes u v. Let Ψ be a class of probability measures over N and Ψ (u) denote the class of Markovian measures with respect to u in Ψ : Ψ (u) = {P ∈ Ψ ; A ⊥ ⊥ B | C [P ] whenever A, B|C ∈ Mu } .
(6.1)
Two structural imsets u and v over N are Markov equivalent relative to Ψ if Ψ (u) = Ψ (v). The following observation is evident. Proposition 6.1. If two structural imsets are independence equivalent then they are Markov equivalent relative to any class Ψ of probability measures over N . Clearly, Markov equivalence relative to Ψ implies Markov equivalence relative to any subclass Ψ ⊆ Ψ . A natural question is whether the converse of Proposition 6.1 holds for reasonable classes Ψ . The answer is negative even for the class of marginally continuous measures, which involves the class of measures with finite multiinformation (see Section 4.1) – the widest class of measures to which the method of structural imsets is safely applicable. This is illustrated by the following example. Example 6.2. There exist two structural imsets over N = {a, b, c, d} which are Markov equivalent relative to the class of marginally continuous probability measures over N but which are not independence equivalent. Consider the imsets (see Figure 6.1) u = u c,d|{a,b} + u a,b|∅ + u a,b|{c} + u a,b|{d}
and v = u + u a,b|{c,d} .
Clearly, Mu ⊆ Mv but a, b|{c, d} ∈ Mv \ Mu as shown in Example 4.1. On the other hand, by Corollary 2.1, every marginally continuous measure P over N which is Markovian with respect to u satisfies a ⊥⊥ b | {c, d} [P ]. Hence, one can show that a marginally continuous measure P is Markovian with respect to u iff it is Markovian with respect to v. ♦
114
6 Equivalence and Implication
+1
{a, b, c, d} Q Q A Q Q A Q A Q
0
0
0
0
+2
0
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
+2
−1
−1
−1
−1
−1
+1
+1
0
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P PPP Q PP P P P Q P P P
−1
{a}
{b} {c} {d} A Q Q A A Q Q Q +1
Q
∅
{a, b, c, d} Q Q A Q Q A Q A Q
−1
−1
−1
−1
−1
−1
+1
+1
0
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
+2
−1
+1
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P PPP Q PP P P P Q P P P
−1
{a}
Q
{b} {c} {d} A Q Q A A Q Q Q +1 ∅
Fig. 6.1. Two Markov equivalent imsets that are not independence equivalent.
In fact, the above-mentioned phenomenon is a consequence of the fact that structural imsets do not satisfy the faithfulness condition from Section 1.1; cf. Remark 5.7 on p. 101. Note that, in the case |N | ≤ 3, every structural imset has a perfectly Markovian discrete measure over N – see Proposition 5.8. In particular, if |N | ≤ 3 then the independence equivalence coincides both with the Markov equivalence relative to the class of discrete measures and with the Markov equivalence relative to the class of measures with finite multiinformation (use Proposition 6.1 to see it). In the rest of this chapter, attention is restricted to independence equivalence and related independence implication. One reason for this is that independence implication is not adulterated by considering a specific class of distributions Ψ . Therefore, one has a better chance that the respective deductive mechanism can be implemented on a computer. Moreover, in my opinion, the independence equivalence forms a pure theoretical basis of Markov equivalence. Indeed, it will be shown later (see Lemma 6.6) that for a reasonable distribution framework Ψ every Markov equivalence class relative to Ψ decomposes into independence equivalence classes and just one of these classes consists of “Ψ -representable” structural imsets, that is, imsets having perfectly Markovian measures in Ψ . Thus, to describe CI structures arising within the framework of Ψ one can limit oneself to structural imsets of this type and the independence equivalence on the considered subclass of structural imsets coincides with the Markov equivalence relative to Ψ .
6.2 Independence implication Let u, v be structural imsets over N . Let us say that u i-implies v and write u v if Mv ⊆ Mu . Here, i- stands for “independence”. Observe that u is independence equivalent to v iff u v and v u. Remark 6.2. This is to explain the motivation of former terminology in Studen´ y [146]. I already used the adjective “facial” in [137] to name the respective
6.2 Independence implication
115
implication for structural imsets. This was motivated by an analogy with the theory of convex polytopes where the concept of a face has a central role [16]. Indeed, one can consider the collection of all faces of the cone con(E(N )) and introduce the following implication of structural imsets: u “implies” v if every face of con(E(N )) which contains u also contains v. The original definition of the independence implication of structural imsets used in [137] was nothing but a modification of this requirement. It has appeared to be equivalent to the condition Mv ⊆ Mu : one can show this using the results from the series of papers [137], although it is not explicitly stated there. 6.2.1 Direct characterization of independence implication Lemma 6.1. Let u, v be structural imsets over N . Then u v iff ∃l ∈ N
l · u − v is a structural imset ,
(6.2)
which is under the assumption that v is a combinatorial imset equivalent to the condition ∃ k ∈ N k · u − v is a combinatorial imset . (6.3) Proof. Suppose u v and write n·v = w∈E(N ) kw ·w where n ∈ N, kw ∈ Z+ . If kw > 0 and w = u a,b|K then a, b|K ∈ Mv ⊆ Mu . That means there exists lw ∈ N such that lw · u − w ∈ S(N ) and one can fix lw for every w of this kind. We put l = w∈E(N ),kw >0 kw · lw and observe that l·u−v = (l·u−n·v)+(n−1)·v =
kw ·(lw ·u−w)+(n−1)·v ∈ S(N )
w∈E(N ),kw >0
since S(N ) is closed under summing. Thus, (6.2) has been verified. Conversely, suppose (6.2) and consider A, B|C ∈ Mv . Hence, one can find n ∈ N such that n · v − u A,B|C ∈ S(N ). As S(N ) is closed under summing, write (n · l) · u − u A,B|C = n · (l · u − v) + (n · v − u A,B|C ) ∈ S(N ) , which implies A, B|C ∈ Mu . Evidently (6.3) implies (6.2). Conversely, suppose (6.2) and that v is a combinatorial imset. Take n ∈ N such that n · (l · u − v) is combinatorial and put k = n · l. As v ∈ C(N ) and C(N ) is closed under summing, k · u − v = n · (l · u − v) + (n − 1) · v is a combinatorial imset. Remark 6.3. The main difference between (6.2) and (6.3) is that testing whether an o-standardized imset is combinatorial can be performed in a finite number of steps and the number of these steps is known! Indeed, if an ostandardized imset w = k · u − v is combinatorial, then the degree deg (w) can
116
6 Equivalence and Implication
be directly computed using Proposition 4.3. This is the number of elementary imsets which have to be summed to obtain w. The only combinatorial imset of degree 0 is the zero imset (see Corollary 4.2) and an imset w with
m∗ , w = n ∈ N is a combinatorial imset (of degree n) iff there exists an elementary imset u a,b|K such that w − u a,b|K is a combinatorial imset of degree n − 1. Since the class of elementary imsets is known the testing can be done recursively. Note that the above observation does not mean that testing combinatorial imsets is an easy task from a practical point of view. It may be time-consuming if the expected degree is very high. On the other hand, if the expected degree m∗ , w is low then testing whether w ∈ C(N ) is quite simple, even if |N | is very high. The point of this comment is that no similar effective estimate of the number of necessary steps is known in the case of testing structural imsets – see Section 6.3.1. One can also modify the proof of Lemma 6.1 to show that (6.3) is equivalent to u v even in the case that v is a structural imset such that ∃ k, n ∈ N
n · v and (n · k − 1) · v are combinatorial imsets .
(6.4)
The condition (6.4) is formally weaker than the requirement that v is a combinatorial imset. However, their difference may be illusory. So far, I do not know an example of a structural imset which is not a combinatorial imset – see Question 7. A natural question is how big the number l ∈ N from (6.2) could be. The following example shows that it may happen that l > 1.
+2
{a, b, c, d} Q Q A Q Q A Q A Q
−1
−1
−1
+1
+3
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P PP PP Q P P P P
+1
+1
+1
{a, b}
{a, c}
{a, d}
−1
{b, c}
−1
{b, d}
−1
{c, d}
PP PP PP Q PP PPP Q PPP P P PPP Q PP P P P Q P P P
−2
{a}
0
0
0
{b} {c} {d} A Q Q A A Q Q Q
Q
+2
∅
{a, b, c, d} Q Q A Q Q A Q A Q
−2
−2
−2
+3
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P PP PP Q P P P P
+2
+2
+2
{a, b}
{a, c}
{a, d}
−2
{b, c}
−2
{b, d}
−2
{c, d}
PP PP PP Q PP PPP Q PPP P P PPP Q PP P P P Q P P P
−3
{a}
Q
0
0
0
{b} {c} {d} A Q Q A A Q Q Q
+3
∅
Fig. 6.2. Structural imsets u and 2 · u − v from Example 6.3.
Example 6.3. There exists a combinatorial imset u over N = {a, b, c, d} and a semi-elementary imset v such that 2 · u − v is a structural imset (and therefore u v) but u − v is not a structural imset. Put u = u a,b|∅ + u a,c|∅ + u c,d|b + u b,d|c + u a,d|bc + u b,c|ad
v = u a,bcd|∅
(6.5)
6.2 Independence implication
117
and observe that 2 · u − v = 2 · u − u a,bcd|∅ = u a,b|∅ + u a,c|∅ + u a,d|∅
+ u c,d|b + u b,d|c + u b,c|d + u b,c|ad + u b,d|ac + u c,d|ab
is a combinatorial imset (see Figure 6.2 for illustration). Thus, u v. To see that u − v is not a structural imset (see the left-hand picture of Figure 6.3 for illustration), consider the multiset m◦ shown in the right-hand picture of Figure 6.3. It is a supermodular multiset by Proposition 5.1(iii). Because of
m◦ , u − v = −1 the imset u − v is not structural by Proposition 5.1(i). On the other hand, u w for an elementary imset w = u a,d|∅ . Indeed, one has u − w = u a,bc|∅ + u b,c|d + u b,d|ac + u c,d|ab
which means that the constant l from (6.2) can be smaller in another case. ♦
+1
Q Q A Q Q A Q A Q
+2
−1
−1
−1
+2
+1
+1
−1
−1
0
0
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
+1
−1
+1
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P P P Q P P P P PP Q P P P
−1
{a}
0
{b} {c} {d} A Q Q A A Q Q Q +1
Q
∅
{a, b, c, d} Q Q A Q Q A Q A Q
{a, b, c, d}
+1
+1
0
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
0
0
0
0
0
0
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P P P Q P P P P PP Q P P P
0
{a}
0
Q
0
0
{b} {c} {d} A Q Q A A Q Q Q 0 ∅
Fig. 6.3. The imsets u − v and the supermodular multiset m◦ from Example 6.3.
Remark 6.4. Example 6.3 shows that the verification of whether a semielementary imset is i-implied by a structural imset u requires the multiplication of u by 2 at least. Note that later Corollary 6.3 in Section 6.3.2 can be modified by replacing the class of elementary imsets by the class of semielementary imsets to get an upper estimate of the constant in (6.2) in this case. Using the results of Studen´ y [131], one can show that for |N | = 4 max { r, w ; r ∈ K (N ) , w is a semi-elementary imset over N } = 3 , which implies, using the arguments similar to those from Section 6.3.2, that 3 is the minimal integer l∗ satisfying ∀ u ∈ S(N ) ∀ v semi-elementary imset over N
u v iff l∗ · u − v ∈ S(N ) .
Note that one has l∗ = 1 in the case |N | ≤ 3 for the same reasons.
118
6 Equivalence and Implication
The following consequence of Lemma 6.1 has already been reported in Section 5.4.2, p. 107. Corollary 6.1. Let Q be a finite set of structural imsets over N . Then sup {Mu ; u ∈ Q} = Mv with v = u, (6.6) u∈Q
where the supremum is understood in the lattice (U(N ), ⊆). Proof. To show Mu ⊆ Mv for u ∈ Q take A, B|C ∈ Mu , find k ∈ N such that k · u − u A,B|C ∈ S(N ) and write k · v − u A,B|C = k · w + (k · u − u A,B|C ) ∈ S(N ) . w∈Q\{u}
To show, for every structural imset w over N , that the assumption Mu ⊆ Mw for u ∈ Q implies Mv ⊆ Mw , use Lemma 6.1. Indeed, the assumption means u ∈ S(N ) exists for every u ∈ Q. Put l = that lu ∈ N with lu · w − u∈Q lu and observe l · w − v = u∈Q (lu · w − u) ∈ S(N ). Remark 6.5. The definition of independence implication can be extended as follows. A (finite) set of structural imsets Q i-implies a structural imset w (writeQ w) if Mw ⊆ M for every structural independence model M such is equivalent that u∈Q Mu ⊆ M. However, by Corollary 6.1 this condition to the requirement Mw ⊆ supu∈Q Mu ≡ Mv where v = u∈Q u. Thus, the extension of independence implication of this type is not needed because it is covered by the current definition of independence implication. 6.2.2 Skeletal characterization of independence implication Lemma 6.2. Let u, v be structural imsets over N . Then u v iff ∀ m ∈ K (N )
m, v > 0 ⇒ m, u > 0,
(6.7)
which is equivalent to the condition
m, v > 0 ⇒ m, u > 0
for every m ∈ K(N ) .
(6.8)
Moreover, the condition (6.7) is also equivalent to the requirement l∗ · u − v ∈ S(N )
whenever l∗ ∈ N is such that l∗ ≥ r, v for every r ∈ K (N ).
(6.9)
Proof. Evidently (6.8) ⇒ (6.7). Conversely, if (6.7) then observe by Lemma 5.3 and Proposition 5.1(i) that m, v > 0 implies m, u > 0 for every standardized supermodular function m ∈ K (N ). However, every supermodular function is quantitatively equivalent to a function of this type by Lemma 5.2, which means that (6.8) holds.
6.2 Independence implication
119
By Lemma 6.1 u v iff the condition (6.2) holds. However, by Lemma 5.4 this is equivalent to the condition ∃ l ∈ N ∀ m ∈ K (N )
l · m, u ≥ m, v ,
which implies (6.7). The next step is to show that (6.7) implies ∀ m ∈ K (N )
l∗ · m, u ≥ m, v
for l∗ ∈ N from (6.9) .
(6.10)
Indeed, if m ∈ K (N ) such that m, v ≤ 0 then l∗ · m, u ≥ 0 ≥ m, v by Proposition 5.1(iii). If m ∈ K (N ) such that m, v > 0, then (6.7) implies
m, u > 0. However, as both m and u are imsets, m, u ∈ Z for which
m, u ≥ 1 and the assumption about l∗ implies l∗ · m, u ≥ l∗ ≥ m, v. The condition (6.10) then implies l∗ · u − v ∈ S(N ) by Lemma 5.4. Thus, (6.7)⇒(6.9). Since K (N ) is finite, there exists l∗ ∈ N satisfying the requirement in (6.9), which means that (6.9) implies u v by Lemma 6.1. The way to standardize supermodular functions is not substantial in the above result. One can easily derive an analogous result with the u-skeleton, respectively with the o-skeleton, in place of the -skeleton by a similar procedure (see Remark 5.6). It follows from Lemma 6.1 that one has u u A,B|C for a structural imset u over N and A, B|C ∈ T (N ) iff A, B|C ∈ Mu . Therefore, Lemma 6.2 can be viewed as an alternative criterion for testing whether a disjoint triplet over N is represented in a structural imset over N . Note that Lemma 6.1 is suitable in the situation in which one wants to confirm the hypothesis that u v while Lemma 6.2, namely the conditions (6.7) and (6.8), is suitable in the situation in which one wants to disprove u v. This is illustrated by Example 6.4 below. In a way, the relation of these two criteria of independence implication is analogous to the relation of the moralization criterion and the d-separation criterion in the case of DAG models (see Section 3.2). Example 6.4. Suppose N = {a, b, c, d}, consider the combinatorial imsets u and v from (6.5) in Example 6.3 and a semi-elementary imset w = u b,acd|∅ . The fact u v was verified in Example 6.3 using the direct characterization of independence implication, namely by means of the condition (6.3) in Lemma 6.1 with k = 2. To disprove u w, consider the supermodular imset m{a}↓ (see p. 39) which is shown in the left-hand picture of Figure 6.4. Observe that m{a}↓ , w = 1 and m{a}↓ , u = 0. Then apply the condition (6.8) from Lemma 6.2 to see that ¬(u w). Note that m{a}↓ belongs to the u-skeleton Ku (N ) and the corresponding element of the -skeleton (see Remark 5.6) is in the right-hand picture of Figure 6.4. They are quantitatively equivalent. ♦ An easy consequence of Lemma 6.2 is the following criterion for independence equivalence of structural imsets.
Fig. 6.4. Quantitatively equivalent elements of the u-skeleton and the ℓ-skeleton. [Diagrams of the two imsets over the subsets of {a, b, c, d} omitted.]
Corollary 6.2. Let u, v be structural imsets over N. Then u and v are independence equivalent iff

  ∀ m ∈ K_ℓ(N):  ⟨m, u⟩ > 0 ⇔ ⟨m, v⟩ > 0,   (6.11)

which is equivalent to the condition that ⟨m, u⟩ > 0 ⇔ ⟨m, v⟩ > 0 for every supermodular function m over N.

Note that the skeletal criteria for testing independence implication and equivalence described in this section are effective in particular in the case |N| ≤ 4 since |K_ℓ(N)| is small in that case – see Remark 5.4. They are still implementable in the case |N| = 5; a computer program which realizes the test of independence implication of elementary imsets over a five-element set can be found at http://www.utia.cas.cz/user_data/studeny/VerifyView.html. As the ℓ-skeleton is not at our disposal in the case |N| ≥ 6, the only available criterion in that case is the criterion given in Lemma 6.1.
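The skeletal criteria translate directly into a few lines of code. The following is a minimal illustrative sketch, not the program mentioned above: it assumes imsets and skeletal imsets are represented as dictionaries mapping frozensets of variables to integers (missing keys meaning value zero), and that an explicit list of the skeletal imsets, which is known for |N| ≤ 5, is supplied by the caller; the function names are hypothetical.

```python
# Hypothetical helpers illustrating Lemma 6.2 and Corollary 6.2; `skeleton`
# is assumed to be a list of skeletal imsets in the same representation.

def scal(m, u):
    """Scalar product <m,u> = sum over S of m(S)*u(S)."""
    return sum(m.get(S, 0) * c for S, c in u.items())

def i_implies(u, v, skeleton):
    """Condition (6.7): every skeletal m with <m,v> > 0 has <m,u> > 0."""
    return all(scal(m, u) > 0 for m in skeleton if scal(m, v) > 0)

def independence_equivalent(u, v, skeleton):
    """Condition (6.11): <m,u> and <m,v> are positive for the same skeletal m."""
    return all((scal(m, u) > 0) == (scal(m, v) > 0) for m in skeleton)
```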
6.3 Testing independence implication

This section deals with implementation tasks connected with the direct characterization of independence implication.

6.3.1 Testing structural imsets

The first natural question is how to recognize a structural imset. One possible method is given by Theorem 5.1 but, as explained in Remark 5.4, that method is not feasible in the case |N| ≥ 6. Thus, only the direct definition of a structural imset is available in general. Therefore one needs to know whether the corresponding procedure is decidable. As explained in Remark 6.3, testing of combinatorial imsets is quite clear. One needs to know whether or not the natural number by which a structural imset must be multiplied to get a combinatorial imset is somehow limited.

Lemma 6.3. There exists n ∈ ℕ such that, for every imset u over N,

  u ∈ S(N) iff n·u ∈ C(N).   (6.12)
Proof. The proof uses the concepts and facts gathered in Section A.5.2. One can apply Theorem 16.4 from Schrijver [113], which says that every pointed rational polyhedral cone K ⊆ ℝ^n, n ≥ 1, has a (unique) minimal integral Hilbert basis generating K, that is, the least finite set B ⊆ ℤ^n such that

  ∀ x ∈ K ∩ ℤ^n:  x = ∑_{y∈B} k_y·y for some k_y ∈ ℤ_+,

and con(B) = K (which implies B ⊆ K). One can apply this result to the rational polyhedral cone con(E(N)) ⊆ ℝ^{P(N)}, which is pointed by Proposition 4.1 since ⟨m*, t⟩ > 0 for every non-zero t ∈ con(E(N)). Moreover, by Lemma A.2, an imset u over N belongs to con(E(N)) iff it is structural. Thus, by the above-mentioned theorem [113], a finite set of structural imsets H(N) exists such that

  ∀ u ∈ S(N):  u = ∑_{v∈H(N)} k_v·v for some k_v ∈ ℤ_+.   (6.13)

One can find n(v) ∈ ℕ for every v ∈ H(N) such that n(v)·v is a combinatorial imset and put n = ∏_{v∈H(N)} n(v). Clearly, n·v ∈ C(N) for every v ∈ H(N) and (6.13) implies that n·u ∈ C(N) for every u ∈ S(N).

A natural question arises: what is the minimal n ∈ ℕ satisfying (6.12)? I do not know the answer for |N| ≥ 5 (see Theme 11 on p. 207). But if |N| ≤ 4 then n = 1; the main result of Studený [131] can be formulated as follows.

Proposition 6.2. If |N| ≤ 4 then the class of structural imsets over N coincides with the class of combinatorial imsets over N.

Remark 6.6. It may be the case that the smallest n ∈ ℕ satisfying (6.12) appears to be too high. An alternative approach to direct testing of structural imsets could be based on the concept of a minimal integral Hilbert basis H(N) mentioned in the proof of Lemma 6.3 (see Theme 10). It follows from the proof of Theorem 16.4 in Schrijver [113] that H(N) has the form

  {v ∈ S(N); v ≠ 0 & ¬[v = v₁ + v₂ where v₁, v₂ ∈ S(N), v₁ ≠ 0 ≠ v₂]}.

The fact that every elementary imset generates an extreme ray of con(E(N)) allows us to derive E(N) ⊆ H(N). Of course, H(N) = E(N) if |N| ≤ 4 by
Proposition 6.2. The idea of the alternative approach is to give a general characterization of H(N). If a characterization of this kind is available, then one can possibly modify the procedure described in Remark 6.3 and test effectively whether a given imset can be written as a combination of imsets from H(N) with non-negative integral coefficients.

6.3.2 Grade

Another natural question arising in connection with Lemma 6.1 is whether there exists l ∈ ℕ such that

  ∀ u ∈ S(N) ∀ v ∈ E(N):  u ⇝ v iff l·u − v ∈ S(N).   (6.14)

The answer is yes. Evidently, if l ∈ ℕ satisfies (6.14) then every l′ ∈ ℕ, l′ ≥ l, satisfies it as well. Therefore one is interested in the minimal l ∈ ℕ satisfying (6.14), which appears to depend on N. Actually, it only depends on |N| because of an inherent one-to-one correspondence between E(N) and E(M), respectively between S(N) and S(M), for sets of variables N and M of the same cardinality. The following number is a good candidate for the minimal l ∈ ℕ satisfying (6.14). Supposing |N| ≥ 2, let us call the natural number

  gra(N) = max { ⟨r, w⟩ ; r ∈ K_ℓ(N), w ∈ E(N) }   (6.15)
the grade and use the notation gra(N). Evidently, the value of gra(N) only depends on |N|. Lemma 6.2, namely the condition (6.9), implies this:

Corollary 6.3. If |N| ≥ 2 then l = gra(N) satisfies (6.14).

Corollary 6.3 leads to an effective criterion for testing independence implication of elementary imsets in the case |N| ≤ 4, which utilizes the fact that structural and combinatorial imsets coincide in this case.

Corollary 6.4. Suppose that 2 ≤ |N| ≤ 4, u is a structural imset over N and v an elementary imset over N. Then u ⇝ v iff u − v is a combinatorial imset.

Proof. The first observation is that if |N| ≤ 4 then ⟨m, v⟩ ∈ {0, 1} for every m ∈ K_ℓ(N) and v ∈ E(N) – see [131]. Thus, gra(N) = 1 and by Corollary 6.3 one has u ⇝ v iff u − v ∈ S(N), which is equivalent to u − v ∈ C(N) by Proposition 6.2.

However, as shown in Studený et al. [145], gra(N) = 7 in the case |N| = 5. In fact, the example from § 4.3 of [145] allows one to show that the minimal natural number l for which (6.14) holds is just 7 in the case |N| = 5. The question of what is the minimal l ∈ ℕ satisfying (6.14) is partially answered by the following lemma.
Lemma 6.4. Suppose that |N| ≥ 2. Then the minimal l* ∈ ℕ satisfying

  ∀ u ∈ C(N) ∀ v ∈ E(N):  u ⇝ v iff l*·u − v ∈ S(N)   (6.16)

is the upper integer part of

  gra*(N) = max_{m∈K_ℓ(N)}  max { ⟨m, w⟩ ; w ∈ E(N) } / min { ⟨m, w⟩ ; w ∈ E(N), ⟨m, w⟩ ≠ 0 }.   (6.17)
Proof. To show that every l* ∈ ℕ with l* ≥ gra*(N) satisfies (6.16), the arguments from the proof of (6.7)⇒(6.9) in Lemma 6.2 can be used. The only modification is that in the case m ∈ K_ℓ(N) with ⟨m, u⟩ > 0, the fact u ∈ C(N) implies ⟨m, u⟩ ≥ min { ⟨m, w⟩ ; w ∈ E(N), ⟨m, w⟩ ≠ 0 }, which allows one to write

  l*·⟨m, u⟩ ≥ gra*(N) · min_{w∈E(N), ⟨m,w⟩≠0} ⟨m, w⟩ ≥ max_{w∈E(N)} ⟨m, w⟩ ≥ ⟨m, v⟩.

To show that for every l ∈ ℕ with l < gra*(N) there exists a combinatorial imset u and an elementary imset v such that u ⇝ v and l·u − v ∉ S(N), choose and fix m ∈ K_ℓ(N) for which the maximum in (6.17) is achieved. Then choose w̃ = u_⟨a,b|K⟩ ∈ E(N) minimizing the non-zero value ⟨m, w⟩, w ∈ E(N), and v ∈ E(N) maximizing ⟨m, w⟩ for w ∈ E(N). By Corollary 5.3, an imset ũ ∈ C(N) with M_m = M_ũ exists. Put u = ũ + w̃. By Lemma 6.1, M_m = M_ũ ⊆ M_u. As ⟨a, b|K⟩ ∈ M_u \ M_m, the fact that m is a skeletal imset (see Lemma 5.6 in Section 5.2) implies M_u = T(N) ⊇ M_v, which means u ⇝ v. On the other hand, by Proposition 5.6, ⟨m, ũ⟩ = 0 and therefore ⟨m, u⟩ = ⟨m, w̃⟩, so that

  l < gra*(N) = ⟨m, v⟩ / ⟨m, w̃⟩ = ⟨m, v⟩ / ⟨m, u⟩  implies  ⟨m, l·u − v⟩ < 0,

which means l·u − v ∉ S(N) by Lemma 5.4.
Remark 6.7. Note that the type of the skeleton is not material in the above result. In fact, the ℓ-skeleton can be replaced either by the u-skeleton or by the o-skeleton and the respective constant gra*(N) will have the same value. Indeed, it follows from Corollary 5.2 that for every skeletal supermodular function m there exists α > 0 such that ⟨m, u⟩ = α·⟨m_ℓ, u⟩ for every u ∈ E(N), where m_ℓ ∈ K_ℓ(N) is the unique qualitatively equivalent element of the ℓ-skeleton. Thus, the ratios maximized in (6.17) are invariants of classes of qualitative equivalence of skeletal supermodular functions. Note that, provided that |N| ≤ 5, one has

  ∀ m ∈ K_ℓ(N):  min { ⟨m, u⟩ ; u ∈ E(N), ⟨m, u⟩ ≠ 0 } = 1,   (6.18)

which implies that gra(N) = gra*(N) in this case. Thus, if the hypothesis (6.18) holds in general (see Question 8) then gra(N) is the smallest l ∈ ℕ
satisfying (6.14) by Corollary 6.3 and Lemma 6.4. Note that an analog of (6.18) holds for |N| ≤ 5 and the u-skeleton; because of the operation of reflection mentioned on p. 197, one can show that K_u(N) can replace K_ℓ(N) in (6.18) – see § 5.1.3 of Studený et al. [145]. However, this is not true for the o-skeleton; for a counterexample in the case N = {a, b, c} see Figure 5.3 – the respective minimal non-zero value is 4 for every m ∈ K_o(N). This may be the main difference between the o-standardization and the ℓ-standardization.
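For small N the tests discussed in this section can be spelled out explicitly. The following sketch (hypothetical function names, dictionary representation of imsets as before) enumerates E(N) and tests membership in C(N) by exhaustive search; the search terminates because a combinatorial imset of degree d is a sum of exactly d elementary imsets and the linear functional S ↦ |S|(|S|−1)/2 takes the value 1 on every elementary imset. By Corollary 6.4, for 2 ≤ |N| ≤ 4 this also yields a test of u ⇝ v for elementary v.

```python
from itertools import combinations

def elementary_imsets(N):
    """All u_<a,b|K> = d(abK) + d(K) - d(aK) - d(bK) for distinct a, b in N."""
    out = []
    for a, b in combinations(sorted(N), 2):
        rest = sorted(set(N) - {a, b})
        for r in range(len(rest) + 1):
            for K in combinations(rest, r):
                K = frozenset(K)
                out.append({K | {a, b}: 1, K: 1,
                            K | {a}: -1, K | {b}: -1})
    return out

def add(u, v, sign=1):
    """The imset u + sign*v, dropping zero values."""
    w = dict(u)
    for S, c in v.items():
        w[S] = w.get(S, 0) + sign * c
        if w[S] == 0:
            del w[S]
    return w

def degree(u):
    """For u in C(N): each elementary summand contributes exactly 1."""
    return sum(c * len(S) * (len(S) - 1) // 2 for S, c in u.items())

def is_combinatorial(u, elems):
    """Exhaustive (exponential) test whether u is a non-negative integral
    combination of elementary imsets; recursion depth is at most degree(u)."""
    if not u:
        return True
    if degree(u) <= 0:
        return False
    return any(is_combinatorial(add(u, w, -1), elems) for w in elems)

def implies_elementary(u, v, elems):
    """u i-implies elementary v, for 2 <= |N| <= 4 (Corollary 6.4)."""
    return is_combinatorial(add(u, v, -1), elems)
```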
6.4 Invariants of independence equivalence

This section deals with some of those attributes of structural imsets which are either invariable with respect to independence equivalence or characterize classes of independence equivalence. Let u be a structural imset over N. By the effective domain of u, denoted by D*_u, the class of sets T ⊆ N is understood such that S′ ⊆ T ⊆ S″ for some S′, S″ ⊆ N with u(S′) ≠ 0 ≠ u(S″). Note that the sets S′, S″ can be chosen to satisfy u(S′), u(S″) > 0, that is, D*_u = (D+_u)↓ ∩ (D+_u)↑ – see Lemma 6.5. Recall that U_u = (D+_u)↓ is nothing else than the upper class of u from Section 4.2.3. It also appears to be an invariant of independence equivalence. The region of u, denoted by R_u, is the class of subsets of N obtained as follows:

  R_u ≡ ∪_{⟨a,b|K⟩∈Mu∩T(N)} {K, aK, bK, abK} = ∪_{⟨A,B|C⟩∈Mu\T_ø(N)} {C, AC, BC, ABC}.   (6.19)

Note that the equality of the unions over the class of elementary triplets and over the class of all non-trivial triplets can easily be derived from the fact that Mu is a semi-graphoid (Lemma 4.6) by means of Lemma 2.2 on p. 15.

Lemma 6.5. Let u be a structural imset over N. If T is either a maximal set or a minimal set of its effective domain D*_u then u(T) > 0. In particular, D*_u = (D+_u)↓ ∩ (D+_u)↑. The region of u is a subclass of its effective domain: R_u ⊆ D*_u. If u, v ∈ S(N) are independence equivalent then U_u = U_v, D*_u = D*_v and R_u = R_v. Moreover, given S ⊆ N, S ∉ R_u iff w(S) = 0 for every w ∈ S(N) which is independence equivalent to u.

Proof. I. Recall that by Proposition 4.4 ⟨m_{A↑}, u⟩ ≥ 0 and ⟨m_{A↓}, u⟩ ≥ 0 for every A ⊆ N and a structural imset u over N. Therefore, if T ⊆ N is a maximal set of D*_u then u(T) = ⟨m_{T↑}, u⟩ ≥ 0, and the fact u(T) ≠ 0 implies u(T) > 0. An analogous argument with m_{T↓} can be used if T is a minimal set of D*_u. The equality D*_u = (D+_u)↓ ∩ (D+_u)↑ is now evident.
II. In particular, if u(S) < 0 for S ⊆ N then there exists both K ⊂ S with
u(K) > 0 and L ⊃ S, L ⊆ N, with u(L) > 0. This is because u(S) ≠ 0 implies S ∈ D*_u.
III. To show R_u ⊆ D*_u, consider an elementary triplet ⟨a, b|K⟩ ∈ M_u and find k ∈ ℕ with k·u − u_⟨a,b|K⟩ ∈ S(N). The inequality ⟨m_{abK↑}, u_⟨a,b|K⟩⟩ > 0 implies k·⟨m_{abK↑}, u⟩ > 0 by the above-mentioned fact from Proposition 4.4. Hence, u(S) > 0 for some abK ⊆ S ⊆ N, which means K, aK, bK, abK ∈ (D+_u)↓. Analogously, ⟨m_{K↓}, u_⟨a,b|K⟩⟩ > 0 implies ⟨m_{K↓}, u⟩ > 0 and u(S′) > 0 for some S′ ⊆ K, which means K, aK, bK, abK ∈ (D+_u)↑.
IV. The next observation is that S ∈ ((D+_u)↓)_max iff ⟨m_{S↑}, u⟩ > 0 and ⟨m_{T↑}, u⟩ = 0 for every T ⊃ S. Indeed, S ∈ ((D+_u)↓)_max means u(S) > 0 and u(T) = 0 for T ⊃ S by Step II. Moreover, given S ⊆ N, one can show by reverse induction on |S| that [u(T) = 0 for every S ⊂ T ⊆ N] iff [⟨m_{T↑}, u⟩ = 0 for every S ⊂ T ⊆ N].
V. Analogous arguments can be used to show that S ∈ ((D+_u)↑)_min iff ⟨m_{S↓}, u⟩ > 0 and ⟨m_{T↓}, u⟩ = 0 for every T ⊂ S (replace ⊆ by ⊇). However, as m_{A↑}, m_{A↓} for A ⊆ N are supermodular functions (by Proposition 5.1), Corollary 6.2 implies that the conditions ⟨m_{A↑}, u⟩ > 0, ⟨m_{A↑}, u⟩ = 0, ⟨m_{A↓}, u⟩ > 0, ⟨m_{A↓}, u⟩ = 0 for A ⊆ N are invariable with respect to independence equivalence. Therefore, U_u and D*_u are invariable as well. An analogous claim about R_u is evident because of the definition of R_u in terms of M_u.
VI. To show that S ∉ R_u implies u(S) = 0, write n·u = ∑_{v∈E(N)} k_v·v where n ∈ ℕ, k_v ∈ ℤ_+, and observe ⟨a, b|K⟩ ∈ M_u whenever k_v > 0 for v = u_⟨a,b|K⟩. As S ∉ R_u, one has v(S) = 0 for every v ∈ E(N) of this kind, which implies n·u(S) = 0. The consideration holds for any w ∈ S(N) independence equivalent to u in place of u. Conversely, suppose S ∈ R_u, take an elementary triplet ⟨a, b|K⟩ ∈ M_u with S ∈ {K, aK, bK, abK} and observe that w = u + k·u_⟨a,b|K⟩ is independence equivalent to u for every k ∈ ℕ by Lemma 6.1. One can find k ∈ ℤ_+ such that w(S) ≠ 0.

However, the effective domain and the region of a structural imset may differ, as the following example shows.

Example 6.5. There exist two structural imsets u, v over N = {a, b, c, d} with the same effective domain but different regions. Consider the imset u = u_⟨b,c|a⟩ + u_⟨a,d|c⟩ shown in the left-hand picture of Figure 6.5 and the imset v = u_⟨c,d|a⟩ + u_⟨a,b|c⟩ shown in the right-hand picture. The set {a, d} belongs to the effective domain D*_u = D*_v and to the region R_v but it does not belong to the region R_u. ♦

On the other hand, the regions and the effective domains of structural imsets over N coincide for |N| ≤ 3.

Remark 6.8. The significance of the concept of an effective domain is that it allows one to restrict the considered class of elementary imsets when one is to test whether an o-standardized imset is combinatorial – see Remark 6.3.
Fig. 6.5. Two structural imsets with the same effective domain but different regions. [Diagrams of u (left) and v (right) over the subsets of {a, b, c, d} omitted.]
Indeed, if u = ∑_{v∈E(N)} k_v·v with k_v ∈ ℤ_+ then for every v = u_⟨a,b|K⟩ with k_v > 0 one has ⟨a, b|K⟩ ∈ M_u and therefore, by Lemma 6.5, K, aK, bK, abK ∈ R_u ⊆ D*_u. Thus, a reduced class of elementary imsets v = u_⟨a,b|K⟩ satisfying K, aK, bK, abK ∈ D*_u can be considered. Observe that the effective domain D*_u can be identified directly on the basis of u. This is the main difference from the region R_u, which formally gives an even stronger restriction on the class of considered elementary imsets, but the region cannot be recognized solely on the basis of u. The region is a characteristic of the respective class of independence equivalent structural imsets and can only partially be identified on the basis of the imset – see Lemma 6.5. However, by Proposition 4.3 one can compute the corresponding level-degrees of u for l = 0, ..., |N| − 2, which may result in an additional restriction on the class of considered elementary imsets; this is specifically the case if some of the level-degrees vanish.

Effective domains are attributes of structural imsets which allow one to immediately recognize imsets which are not independence equivalent. A natural question is whether there exists a complete collection of invariant properties of similar type in the sense that, for every pair of structural imsets u and v over N that are not independence equivalent, at least one property of this type exists in which they differ. Corollary 6.2 gives a positive answer to this question. Indeed, every skeletal imset m ∈ K_ℓ(N) is associated with an invariant attribute of a structural imset u over N, namely, whether the scalar product ⟨m, u⟩ is zero or not. The collection of these attributes is complete in the above-mentioned sense. However, as explained at the end of Section 6.2.2, this criterion does not seem feasible in the case |N| ≥ 6.
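In the dictionary representation the effective domain can be read off directly, and Remark 6.8 then prunes the class of elementary imsets entering a test of combinatoriality. A minimal sketch, assuming the helpers from the sketch in Section 6.3:

```python
from itertools import combinations

def effective_domain(u):
    """D*_u: sets T with S' <= T <= S'' for some S', S'' on which u is
    positive, i.e. the intersection of the descending and ascending
    closures of D+_u (Lemma 6.5)."""
    pos = [set(S) for S, c in u.items() if c > 0]
    universe = sorted(set().union(*pos)) if pos else []
    dom = set()
    for r in range(len(universe) + 1):
        for T in combinations(universe, r):
            T = set(T)
            if any(T <= S for S in pos) and any(S <= T for S in pos):
                dom.add(frozenset(T))
    return dom

def relevant_elementary(u, elems):
    """Elementary u_<a,b|K> with K, aK, bK, abK inside D*_u (Remark 6.8);
    only these can occur in a decomposition of u."""
    dom = effective_domain(u)
    return [w for w in elems if all(S in dom for S in w)]
```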
6.5 Adaptation to a distribution framework

Let us consider a class Ψ of probability measures over N (a distribution framework) which satisfies the following two conditions:

  ∀ P ∈ Ψ  ∃ a structural imset u over N such that M_u = M_P,   (6.20)

  ∀ pairs P, Q ∈ Ψ  ∃ R ∈ Ψ such that M_R = M_P ∩ M_Q.   (6.21)
There are at least three examples of distribution frameworks satisfying these two natural conditions: the class of measures with finite multiinformation, the class of discrete measures and the class of positive discrete measures (see Theorem 5.2 and Lemma 2.9). On the other hand, the class of regular Gaussian measures over N considered in Section 2.3.6 does not satisfy the condition (6.21), as Example 2.2 shows. The main reason for this is the limitation to fixed sample spaces X_i = ℝ for i ∈ N made there. The goal of this section is to show that, with a suitable restriction put on the class of structural imsets, independence equivalence coincides with Markov equivalence relative to Ψ. A structural imset u over N is representable in Ψ, shortly Ψ-representable, if there exists P ∈ Ψ which is perfectly Markovian with respect to u, that is, M_u = M_P. Evidently, every structural imset which is independence equivalent to a Ψ-representable structural imset is Ψ-representable as well. The class of Ψ-representable structural imsets over N will be denoted by S_Ψ(N).

Lemma 6.6. Let Ψ be a class of probability measures over N satisfying (6.20) and (6.21) and u ∈ S(N). Then the class of structural imsets which are Markov equivalent to u relative to Ψ is the union of a finite collection Ω of independence equivalence classes, partially ordered by the relation ⇝. Moreover, the poset (Ω, ⇝) has a greatest element, which is the only independence equivalence class ℘ ∈ Ω consisting of Ψ-representable imsets.

Proof. The first claim of the lemma follows easily from Proposition 6.1. Let us put M = ∩_{P∈Ψ(u)} M_P, where Ψ(u) is defined by (6.1), and

  Φ = { P ∈ Ψ ;  A ⊥⊥ B | C [P] whenever ⟨A, B|C⟩ ∈ M }.
The inclusion Ψ(u) ⊆ Φ follows directly from the definition of M. The fact M_u ⊆ M implies Φ ⊆ Ψ(u) and therefore Φ = Ψ(u). As T(N) is finite, the set {M_P ; P ∈ Ψ(u)} is also finite and one can show by repetitive application of the assumption (6.21) that R ∈ Ψ with M_R = M exists. Of course, R ∈ Φ = Ψ(u). By (6.20) a structural imset v with M_v = M_R = M exists, which means Φ = Ψ(v). Thus, u and v are Markov equivalent relative to Ψ. As R ∈ Ψ is perfectly Markovian with respect to v, the imset v is Ψ-representable. Suppose that w ∈ S(N) is such that Ψ(w) = Ψ(u) and observe that

  M_w ⊆ ∩_{P∈Ψ(w)} M_P = ∩_{P∈Ψ(u)} M_P = M = M_v.

Thus, v ⇝ w and the class ℘ of imsets that are independence equivalent to v is the greatest element of (Ω, ⇝). If w is Ψ-representable then Q ∈ Ψ with M_w = M_Q exists and Q ∈ Ψ(w) = Ψ(u), which implies M ⊆ M_Q = M_w ⊆ M. Hence, M_w = M = M_v, which says that w and v are independence equivalent.
Remark 6.9. Note that (Ω, ⇝) is even a join semi-lattice. Indeed, v ⇝ w̃ ⇝ w and Ψ(w) = Ψ(v) implies Ψ(w̃) = Ψ(v) for v, w̃, w ∈ S(N). Hence, Ψ(u + w) = Ψ(v) for structural imsets u, w with Ψ(u) = Ψ(w) = Ψ(v), where v ∈ S(N) belongs to the greatest element ℘ of Ω mentioned in Lemma 6.6. By Corollary 6.1, M_{u+w} = M_u ∨ M_w, which means that u + w represents the join of u and w in (Ω, ⇝). On the other hand, the set Ω, viewed as a subset in the lattice of structural imsets, need not be closed under the operation of meet in that lattice (an example is omitted). It may happen that Ω only consists of one independence equivalence class. This means that the class of structural imsets Markov equivalent to u coincides with the class of imsets independence equivalent to u. For example, this phenomenon is quite common in the case |N| = 4 for the class of discrete measures over N: one then has 18300 Markov equivalence classes and 22108 independence equivalence classes [135, 136]. The following fact immediately follows from Proposition 6.1 and Lemma 6.6.

Corollary 6.5. Let Ψ be a class of probability measures over N satisfying (6.20) and (6.21). Consider the collection S_Ψ(N) of Ψ-representable structural imsets over N. Then independence equivalence and Markov equivalence relative to Ψ coincide for imsets from S_Ψ(N).

In the considered case the class S_Ψ(N) satisfies both the requirement of faithfulness and the requirement of completeness relative to the class Ψ (see Section 1.1). Thus, a theoretical solution to those problems is at our disposal, but the practical question of how to recognize imsets from S_Ψ(N) still remains to be solved. Moreover, the condition (6.21) also implies that the class of CI models induced by the imsets from S_Ψ(N) is a lattice. This assumption appears to be important in connection with certain learning procedures – for details see Section 8.1, pp. 158–161.

Remark 6.10. The idea of implementing the respective implication for structural imsets on a computer is as follows. In addition to the usual algebraic operations with structural imsets, one needs to implement one more operation, which ascribes the respective Ψ-representable structural imset v ∈ S_Ψ(N) to every structural imset u ∈ S(N). Suppose that u_1, ..., u_n, n ≥ 1, are structural imsets which represent input pieces of information about the CI structure induced by an unknown distribution P which is, however, known to belong to a given distribution framework Ψ; this is a subclass of the class of measures with finite multiinformation which satisfies (6.21). The sum u = ∑_{i=1}^n u_i then represents the aggregated information about the CI structure of P. But within the considered distribution framework Ψ even more can be deduced: one should find the respective v ∈ S_Ψ(N) which represents the necessary conclusions of the input pieces of information about the CI structure and of the assumption P ∈ Ψ.
Nevertheless, the possible inherent complexity of the lattice of CI structures arising within Ψ cannot be avoided. Indeed, the implementation of the operation ascribing the respective Ψ -representable imset v ∈ SΨ (N ) to a structural imset u ∈ S(N ) may be complicated (see Remark 5.7 for an analogous consideration). Hopefully, the presented approach helps to decompose the original problem properly.
7 The Problem of Representative Choice
This chapter deals with the problem of choice of a suitable representative within a class of independence equivalent structural imsets. It is an advanced subquestion of the general equivalence question mentioned in Section 1.1 studied in the universum of structural imsets; an analogous question has already been treated in graphical universa – see Chapter 3 (the concept of an essential graph and the concept of the largest chain graph). A few principles of representative choice are introduced and discussed in this chapter. Special attention is devoted to the representation of graphical models by structural imsets. Two auxiliary criteria for the choice of a representative are introduced and discussed. A dual way of describing structural independence models is mentioned in the last section of the chapter.
7.1 Baricentral imsets

An imset u over N is called baricentral if it has the form

  u = ∑_{w∈E(N), u⇝w} w,  or equivalently,  u = (1/2) · ∑_{⟨a,b|K⟩∈Mu∩T(N)} u_⟨a,b|K⟩.   (7.1)
Evidently, every elementary imset is baricentral and every baricentral imset u is a combinatorial imset with the degree |{w ∈ E(N); u ⇝ w}|. Moreover, the definition implies that every independence equivalence class of structural imsets contains exactly one baricentral imset. Nevertheless, a semi-elementary imset need not be baricentral. Given a semi-elementary imset u_⟨A,B|C⟩ for ⟨A, B|C⟩ ∈ T(N), the respective independence equivalent baricentral imset need not even be its multiple, despite the fact that the formulas

  deg(u_⟨A,B|C⟩) = |A|·|B|,
  |{w ∈ E(N) ; u_⟨A,B|C⟩ ⇝ w}| = |A|·|B|·2^{|A|−1}·2^{|B|−1}

suggest that it might be the case.
Example 7.1. There exists a semi-elementary imset v over N = {a, b, c, d} such that no multiple k·v, k ∈ ℕ, is a baricentral imset. Put v = u_⟨a,bcd|∅⟩ – see the left-hand picture of Figure 7.1. Then v ⇝ w ∈ E(N) iff w = u_⟨a,e|K⟩ where e ∈ {b, c, d}, K ⊆ {b, c, d} \ {e} (cf. Lemma 2.2). The respective baricentral imset u is shown in the right-hand picture of Figure 7.1. Observe that 12 = deg(u) = 4·deg(v) but u ≠ 4·v since the level-degrees of u and v are not proportional: deg(v, l) = 1 for l = 0, 1, 2, while deg(u, 0) = deg(u, 2) = 3 and deg(u, 1) = 6. On the other hand, u = 3·v + u_⟨a,b|c⟩ + u_⟨a,c|d⟩ + u_⟨a,d|b⟩. ♦
Fig. 7.1. Non-proportional respective semi-elementary and baricentral imsets. [Diagrams of v = u_⟨a,bcd|∅⟩ (left) and the respective baricentral imset u (right) over the subsets of {a, b, c, d} omitted.]
The significance of baricentral imsets consists in the fact that testing independence implication between them is very simple.

Proposition 7.1. Let u, v be baricentral imsets over N. Then u ⇝ v iff u − v ∈ C(N).

Proof. If u ⇝ v then, for every w ∈ E(N), v ⇝ w implies u ⇝ w, and (7.1) gives u − v ∈ C(N). The converse follows from Lemma 6.1.

Note that testing combinatorial imsets is straightforward (see Remark 6.3). An analogous result holds if v is replaced by a semi-elementary imset (see Corollary 7.4 below). In particular, the induced model M_u can easily be identified on the basis of a baricentral imset u; that means no multiplication of u is needed as in the case of a general structural imset (see Section 4.4.1).

Remark 7.1. The terminology “baricentral imset” was inspired by a geometric idea: the class of those structural imsets that are i-implied by v ∈ S(N) is just the class of imsets belonging to the cone con({w ∈ E(N); v ⇝ w}) (cf. Remark 6.2 on p. 114). Thus, a (minimal) balanced combination of all extreme imsets of this cone forms its “baricenter”.
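A sketch of how (7.1) and Proposition 7.1 might be used in code, assuming the helpers elementary_imsets(), add() and is_combinatorial() from the sketches in Chapter 6 and some sound implication test implies(u, w), for example the skeletal criterion of Lemma 6.2 or, for small N, the grade-based test of Corollary 6.4:

```python
def baricentral_representative(u, elems, implies):
    """Assemble the baricentral imset of the class of u via (7.1):
    the sum of all elementary imsets i-implied by u."""
    bar = {}
    for w in elems:
        if implies(u, w):
            bar = add(bar, w)
    return bar

def baricentral_implies(u, v, elems):
    """Proposition 7.1: for baricentral u, v, u i-implies v iff u - v is
    combinatorial (one subtraction, one membership test)."""
    return is_combinatorial(add(u, v, -1), elems)
```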
A natural question arises: what is the number of baricentral imsets over N? A very rough upper bound can be obtained as follows. Suppose n = |N| ≥ 2 and S ⊆ N, |S| = k. Then only a limited number of w ∈ E(N) take a non-zero value w(S) ∈ {−1, +1}. In particular, by (7.1) every baricentral imset u over N takes the value u(S) in a finite set L(S) ⊆ ℤ which depends on k = |S| only. In fact, L(S) = {−k(n−k), ..., k(k−1)/2 + (n−k)(n−k−1)/2} for 2 ≤ k ≤ n−2, L(S) = {−(n−1), ..., (n−1)(n−2)/2} for k ∈ {1, n−1} and L(S) = {0, ..., n(n−1)/2} for k ∈ {0, n}. Thus, |L(S)| = n(n−1)/2 + 1 for every S ⊆ N. Since baricentral imsets are o-standardized, it suffices to know their values for only 2^n − n − 1 sets. Therefore, every baricentral imset can be represented as a function on {S ⊆ N; |S| ≥ 2} taking values in L(S) for every S. The number of these functions is

  β_n = { n(n−1)/2 + 1 }^{2^n − n − 1},

which serves as an upper estimate of the number of baricentral imsets over N and, therefore, of the number of structural models over N.

The following consideration gives an indirect comparison of memory demands when a structural model over N is represented either in the form of a baricentral imset or “directly”. By Lemma 2.2, every semi-graphoid M over N is determined by M ∩ T(N). Thus, owing to the symmetry property (see Section 2.2.2), it can be represented as a function on E(N) taking its values in a two-element set. As |E(N)| = (n(n−1)/2)·2^{n−2}, the number of these functions is

  γ_n = 2^{n·(n−1)·2^{n−3}}.

One has β₂ = 2 = γ₂, β₃ = 2⁸ > 2⁶ = γ₃, β₄ = 7¹¹ > 2²⁴ = γ₄ and β₅ = 11²⁶ > 2⁸⁰ = γ₅. On the other hand, one can show n(n−1)/2 + 1 ≤ 2^{n−2} and (n−2)·(2^n − n − 1) < (n(n−1)/2)·2^{n−2} for n ≥ 6. This implies β_n ≤ 2^{(n−2)·(2^n−n−1)} < 2^{(n(n−1)/2)·2^{n−2}} = γ_n for n ≥ 6, so that “asymptotically” the number of considered integral functions on {S ⊆ N; |S| ≥ 2} is smaller than the number of binary functions on E(N)! Thus, only (n−2) bits suffice to represent elements of L(S) for S ⊆ N for n ≥ 6. This means that the overall memory demands, (n−2)·(2^n − n − 1) bits, are slightly smaller in the case of representation by baricentral imsets in comparison with the elementary statement mode of representing a semi-graphoid, which requires (n(n−1)/2)·2^{n−2} bits. On the other hand, the actual number of baricentral imsets, that is, the number of structural models, for n = 3, 4 is much smaller than the estimates above – see Example 5.1. The lattice of baricentral imsets over {a, b, c} (ordered by ⇝) is shown in Figure 7.2.

Baricentral imsets provide quite a good solution to the problem of representative choice from the computational point of view. However, the question of getting the respective baricentral imset from any given structural imset remains to be solved satisfactorily. For example, formulas for the respective baricentral imsets of graphical models are needed (see Theme 4 in Chapter 9).
Fig. 7.2. Baricentral imsets over N = {a, b, c} (rotated). [The lattice diagram, including the key showing the placement of the subsets of {a, b, c}, is omitted.]
A relative disadvantage of baricentral imsets is that they do not seem to offer easy interpretation in comparison with “standard” imsets for DAG models, mentioned below.
7.2 Standard imsets

Some classic graphical models can be represented by certain standard structural imsets which seem to exhibit important characteristics of the models. These standard representatives of graphical models may differ from the baricentral representatives and seem to be more suitable from the point of view of interpretation. They are introduced in this section together with relevant basic facts. Note that the motive of later Sections 7.3 and 7.4 is to find out whether these exceptional representatives reflect some deeper principles, so that the concept of a standard imset could be extended even beyond the framework of graphical models.

7.2.1 Translation of DAG models

Let G be an acyclic directed graph over N. By a standard imset for G will be understood the imset u_G over N given by the formula

  u_G = δ_N − δ_∅ + ∑_{c∈N} { δ_{pa_G(c)} − δ_{c∪pa_G(c)} }.   (7.2)
The following example shows that different acyclic directed graphs may share a standard imset. In fact, they share it iff they are independence equivalent – see Corollary 7.1.
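Formula (7.2) is straightforward to evaluate. The sketch below uses the same hypothetical dictionary representation as the earlier sketches, with a DAG given as a map from each node to its set of parents; the check at the end anticipates the situation of Example 7.2 and Figure 7.3.

```python
def standard_imset_dag(parents):
    """u_G = d(N) - d(0) + sum over c of { d(pa(c)) - d(c u pa(c)) }  (7.2)."""
    u = {}
    def bump(S, c):
        S = frozenset(S)
        u[S] = u.get(S, 0) + c
        if u[S] == 0:
            del u[S]
    bump(parents, 1)          # delta_N (the keys are the nodes)
    bump((), -1)              # -delta_emptyset
    for c, pa in parents.items():
        bump(pa, 1)
        bump(set(pa) | {c}, -1)
    return u

# Two independence equivalent chains share the standard imset u_<a,c|b>:
G = {'a': set(), 'b': {'a'}, 'c': {'b'}}   # the chain a -> b -> c
H = {'c': set(), 'b': {'c'}, 'a': {'b'}}   # the reversed chain c -> b -> a
assert standard_imset_dag(G) == standard_imset_dag(H)
```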
Fig. 7.3. Two different acyclic directed graphs and their shared standard imset. [G is the chain a → b → c and H the reversed chain c → b → a; the shared standard imset is the elementary imset u_⟨a,c|b⟩, cf. Example 7.2.]
Example 7.2. There exist distinct acyclic directed graphs which have the same standard imset. Let us put N = {a, b, c} and consider the graphs G and H shown in Figure 7.3. The formula (7.2) implies

  u_G = δ_{a,b,c} − δ_∅ + {δ_∅ − δ_{a}} + {δ_{a} − δ_{a,b}} + {δ_{b} − δ_{b,c}} = δ_{a,b,c} − δ_{a,b} + δ_{b} − δ_{b,c} = u_⟨a,c|b⟩.

The reader can verify that the same result is obtained for H – it is the elementary imset shown in Figure 7.3. ♦

Lemma 7.1. Let G be an acyclic directed graph over N. Then the standard imset u = u_G is a combinatorial imset and M_u = M_G. Moreover, deg(u_G) = (1/2)·|N|·(|N|−1) − |A(G)|, where |A(G)| is the number of arrows in G.

Proof. Consider a fixed total ordering a_1, ..., a_n, n ≥ 1, of the nodes of G consonant with the direction of arrows and the corresponding causal input list (see Remark 3.4)

  ⟨ a_j, {a_1, ..., a_{j−1}} \ pa_G(a_j) | pa_G(a_j) ⟩  for j = 1, ..., n.   (7.3)

Introduce u_j as the semi-elementary imset corresponding to the j-th triplet from (7.3) for j = 1, ..., n and write

  u ≡ ∑_{j=1}^n u_j = ∑_{j=1}^n { δ_{{a_1,...,a_j}} − δ_{{a_1,...,a_{j−1}}} − δ_{a_j∪pa_G(a_j)} + δ_{pa_G(a_j)} } = u_G,

since almost all terms δ_{{a_1,...,a_j}} are cancelled. Thus, u_G is a combinatorial imset, and the substitution of deg(u_j) = j − 1 − |pa_G(a_j)| into deg(u) = ∑_{j=1}^n deg(u_j) gives the desired formula for deg(u_G). Note that the formula above implies that ∑_{j=1}^n u_j actually does not depend on the choice of a causal input list. Since M_u is a semi-graphoid containing (7.3), the result from Verma and Pearl [150], saying that M_G is the least semi-graphoid containing (7.3), implies M_G ⊆ M_u. For the converse inclusion use the result from Geiger and Pearl [44], implying that a discrete probability measure P over N with M_G = M_P exists, and Theorem 5.2, saying that v ∈ S(N) with M_P = M_v exists. Since the list (7.3) belongs to M_G, one has v ⇝ u_j for j = 1, ..., n and, therefore, v ⇝ ∑_{j=1}^n u_j = u by Lemma 6.1. This means M_u ⊆ M_v = M_G.

Remark 7.2. In fact, it was shown in the proof of Lemma 7.1 that u_G ∈ S_Ψ(N), where Ψ is the class of discrete measures over N (cf. Section 6.5). Standard imsets appear to be a suitable tool for testing independence equivalence of acyclic directed graphs.

Corollary 7.1. Let G, H be acyclic directed graphs over N. Then M_G = M_H if and only if u_G = u_H.
Proof. By Lemma 7.1, u_G = u_H ⇒ M_G = M_H. The converse implication can be verified with the aid of a transformational characterization of equivalent acyclic directed graphs – see p. 49. Owing to that result, it suffices to verify u_G = u_H if H is obtained from G by a legal reversal of an arrow a → b in G. In that case pa_G(c) = pa_H(c) for any c ∈ N \ {a, b}. Thus, to show u_G = u_H one has to evidence

  ∑_{c∈{a,b}} { δ_{pa_G(c)} − δ_{c∪pa_G(c)} } = ∑_{c∈{a,b}} { δ_{pa_H(c)} − δ_{c∪pa_H(c)} }.
This is easy because by the condition enabling the legal reversal of a → b one has paG (a) = paH (b) = C, paG (b) = C ∪ a and paH (a) = C ∪ b for a certain set C ⊆ N .
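The degree formula of Lemma 7.1 can be verified numerically. The sketch below assumes standard_imset_dag() from the earlier sketch and evaluates the degree via the functional S ↦ |S|(|S|−1)/2, one convenient choice of a linear functional taking the value 1 on every elementary imset.

```python
def degree(u):
    """Degree of a combinatorial imset in the dictionary representation."""
    return sum(c * len(S) * (len(S) - 1) // 2 for S, c in u.items())

def check_degree_formula(parents):
    """deg(u_G) = |N|(|N|-1)/2 - number of arrows of G (Lemma 7.1)."""
    n = len(parents)
    arrows = sum(len(pa) for pa in parents.values())
    return degree(standard_imset_dag(parents)) == n * (n - 1) // 2 - arrows
```

By Corollary 7.1, comparing the dictionaries standard_imset_dag(G) == standard_imset_dag(H) is then a complete test of independence equivalence of two acyclic directed graphs.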
Remark 7.3. Every semi-elementary imset over N is a standard imset for an acyclic directed graph over N. Indeed, given ⟨A, B|C⟩ ∈ T(N), consider a total ordering of the nodes of N in which the nodes of C precede the nodes of A, which precede the nodes of B, and these precede the nodes of N \ ABC. Consider an undirected graph over N in which every pair of distinct nodes is a line, except for those pairs [a, b] for which a ∈ A, b ∈ B. Construct a directed graph G that has the above undirected graph as its underlying graph and has the direction of arrows consonant with the total ordering above. Then it is no problem to see, by the procedure used in the proof of Lemma 7.1, that u_G = u_⟨A,B|C⟩.

Note that standard imset representation of DAG models appears to be advantageous from the point of view of learning DAG models – see Section 8.4 for details. The poset of standard imsets over {a, b, c} is shown in Figure 7.4.

7.2.2 Translation of decomposable models

Decomposable models, that is, independence models induced by triangulated undirected graphs, form an important class of graphical models – see Section 3.4.1. Let H be a triangulated undirected graph over N and C be the class of all its cliques. By a standard imset for H will be understood the imset u_H over N given by

  u_H = δ_N + ∑_{∅≠B⊆C} (−1)^{|B|} · δ_{⋂B}.   (7.4)
It is shown below that uH is a combinatorial imset (Corollary 7.2); the next lemma helps to compute uH efficiently.
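Formula (7.4) can be evaluated directly from the list of cliques by inclusion–exclusion; for larger graphs the clique–separator form (7.5) of the next lemma is much cheaper, but the brute-force computation below suffices as an illustrative sketch (hypothetical names, dictionary representation as before).

```python
from itertools import combinations

def standard_imset_decomposable(cliques, N):
    """u_H = d(N) + sum over nonempty B <= C of (-1)^|B| d(intersection of B)."""
    u = {frozenset(N): 1}
    for r in range(1, len(cliques) + 1):
        for B in combinations(cliques, r):
            S = frozenset.intersection(*[frozenset(C) for C in B])
            u[S] = u.get(S, 0) + (-1) ** r
    return {S: c for S, c in u.items() if c}

# For H with cliques {a,b} and {b,c} this gives u_<a,c|b>, the standard
# imset of the Markov equivalent DAG a -> b -> c (cf. Corollary 7.2).
uH = standard_imset_decomposable([{'a', 'b'}, {'b', 'c'}], {'a', 'b', 'c'})
assert uH == {frozenset('abc'): 1, frozenset('ab'): -1,
              frozenset('bc'): -1, frozenset('b'): 1}
```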
Fig. 7.4. Standard imsets for DAG models over N = {a, b, c} (rotated). [The poset diagram, labelled by induced independence statements such as a ⊥⊥ b | ∅, a ⊥⊥ b | c, b ⊥⊥ ac | ∅ and a ⊥⊥ bc | ∅, together with the key showing the placement of the subsets of {a, b, c}, is omitted.]
Lemma 7.2. Let H be a triangulated undirected graph over N and C_1, ..., C_m, m ≥ 1, be a sequence of (all) its cliques satisfying the running intersection property (see (3.3) on p. 55). Then

  u_H = δ_N − ∑_{i=1}^m δ_{C_i} + ∑_{i=2}^m δ_{S_i},   (7.5)

where S_i = C_i ∩ (⋃_{j<i} C_j) for i = 2, ..., m. In particular,

  u_H = δ_N − ∑_{C∈C} δ_C + ∑_{S∈S} w(S)·δ_S,   (7.6)

where C is the class of cliques of H, S is the class of separators and w(S) denotes the multiplicity of a separator S ∈ S.

Proof. The idea is to verify (7.5) by induction on m = |C|. It is evident for m ≤ 2. If m ≥ 3 then put C′ = C \ {C_m}, T = ⋃C′ and H′ = H_T. Observe that C_1, ..., C_{m−1} is a sequence of all cliques of H′ satisfying the running intersection property. Write by (7.4)

  u_H = δ_N − δ_T + u_{H′} + ∑_{C_m∈B⊆C} (−1)^{|B|} · δ_{⋂B}.   (7.7)

The running intersection property says S_m = C_m ∩ (⋃_{j<m} C_j) ⊆ C_k for some k < m. This allows us to write

  ∑_{C_m∈B⊆C} (−1)^{|B|} · δ_{⋂B} = ∑_{C_m∈A⊆C\{C_k}} { (−1)^{|A|} · δ_{⋂A} − (−1)^{|A|} · δ_{⋂A ∩ C_k} } = −δ_{C_m} + δ_{C_m∩C_k},

where the last equality holds because every term in braces vanishes whenever |A| ≥ 2: the inclusion ⋂A ⊆ C_m ∩ (⋃_{j<m} C_j) ⊆ C_k says ⋂A ∩ C_k = ⋂A. Hence, by (7.7) and the induction hypothesis applied to H′ (over T) we get

  u_H = δ_N − δ_T + ( δ_T − ∑_{i=1}^{m−1} δ_{C_i} + ∑_{i=2}^{m−1} δ_{S_i} ) − δ_{C_m} + δ_{S_m},

which gives (7.5).
140
7 The Problem of Representative Choice
Decomposable models can be viewed as DAG models (see Figure 3.6). The reader may ask if the “standard” translation of DAG models and decomposable models leads to the same imset. The positive answer is given by the following lemma. Lemma 7.3. Let H be a triangulated graph over N and C1 , . . . , Cm , m ≥ 1 be a sequence of its cliques satisfying the running intersection property. Put Ri = Ci \ j
∀ a, b ∈ paG (c)
a = b ⇒ [a, b] is an edge in G.
(7.8)
Indeed, c ∈ Rl for uniquely determined l ≤ m. If a ∈ paG (c) then a ∈ R j≤l j = j≤l Cj and {a, c} belongs to a clique of H. Let Ci be the first . , Cm containing {a, clique in the sequence C1 , . . c}. Necessarily i ≤ l, because otherwise a, c ∈ Ci ∩ ( j≤l Cj ) ⊆ Ci ∩ ( j
u G = δN − δ∅ +
† m
i=1 d=di∗
{δpaG (d) − δd ∪ paG (d) } = δN − δ∅ +
m
{ δ Si − δ C i }
i=1
(where S1 = ∅) as paG (di∗ ) = Si and di† ∪ paG (di† ) = Ci for i = 1, . . . , m and all remaining terms within the inside sum are cancelled. Corollary 7.2. Let H be a triangulated undirected graph over N . Then u = uH is a combinatorial imset, MH = Mu and u coincides with the standard imset for any acyclic directed graph G over N for which MG = MH . Proof. This follows from Lemma 7.3, Lemma 7.1 and Corollary 7.1.
Remark 7.5. Because of Remark 7.2 the preceding consequence implies that uH ∈ SΨ (N ) where Ψ is the class of discrete measures over N .
7.3 Imsets of the smallest degree
141
7.3 Imsets of the smallest degree One of the possible approaches to representing an independence equivalence class ℘ of structural imsets is to choose a combinatorial imset of the smallest degree (for the definition of degree see p. 72). Note that ℘ contains combinatorial imsets by Corollary 5.3. The definition of degree implies that only finitely many combinatorial imsets over a fixed set N with a prescribed degree exists. In particular, the set of combinatorial imsets in ℘ of the smallest degree is finite. By an imset of the smallest degree a combinatorial imset u will be understood which has the smallest degree within the class of combinatorial imsets v with Mu = Mv . It should be noted that the class ℘ may contain more than one imset of the smallest degree. Example 7.3. There exists a class of independence equivalent structural imsets over N = {a, b, c} which has two different imsets of the smallest degree. Consider the class ℘ of w ∈ S(N ) with Mw = T (N ). Then both u = u b,c|a + u a,b|∅ + u a,c|∅ and v = u a,b|c + u a,c|b + u b,c|∅ , which are shown in Figure 7.5, have the smallest degree 3 within the class of combinatorial imsets from ℘. Observe that Lu = {a, b, c}↓ while Lv = {ab, ac, bc}↓ . Note that the fact that u and v are all imsets of this kind can be verified using the procedure described later in Section 7.3.2. ♦
+1 {a, b, c} Q Q Q Q 0 0 0 {a, b} {a, c} {b, c} Q Q Q Q Q Q Q Q −1 −1 −1 {a} {b} {c} Q Q Q Q +2 ∅
Fig. 7.5. Two distinct equivalent imsets of the smallest degree. [Diagrams omitted: u (left) has the value +1 on {a,b,c}, −1 on the singletons and +2 on ∅; v (right) has +2 on {a,b,c}, −1 on the two-element sets and +1 on ∅.]
142
7 The Problem of Representative Choice
(see Lemma 7.1), write v = w∈E(N ) kw · w for kw ∈ Z+ and show that, for every unordered pair of nodes a, b ∈ N , a = b such that [a, b] is not an edge in G, there exists w = u a,b|K ∈ E(N ) with kw > 0 for some K ⊆ N \ ab. Indeed, otherwise put m = mab↑ and the fact that m, w > 0 for w ∈ E(N ) iff w = u a,b|K for some set K ⊆ N \ ab implies m, v = 0. Hence, by Proposition 5.6, Mv ⊆ Mm . But the moralization criterion (see Section 3.2) says that a, b | paG (a)paG (b) ∈ MG \ Mm = Mu \ Mm (use Lemma 7.1), which implies a contradictory conclusion Mv = Mu . The previous lemma and the fact mentioned in Remark 7.3 imply the following. Corollary 7.3. Each semi-elementary imset is an imset of the smallest degree. The method of finding all imsets of the smallest degree within a given equivalence class announced in Example 7.3 is based on the fact that every imset of this type determines a certain minimal generator of the respective induced independence model. The method uses a computer program and its theoretical justification is given in the rest of this section. 7.3.1 Decomposition implication Let u, v be combinatorial imsets over N . Let us say that u decomposes into v and write u ; v if u − v is a combinatorial imset. Clearly, u ; v implies u v by Lemma 6.1. The relation ; is a partial ordering on C(N ) (for antisymmetry use Proposition 4.4). Its advantage in contrast to is that it can be easily tested (see Remark 6.3). Proposition 7.2. Every imset u of the smallest degree is minimal with respect to ; within the class {v ∈ C(N ) ; v u }. Proof. If u = v are combinatorial imsets and u ; v then deg(u) − deg(v) = deg(u − v) > 0 as the only combinatorial imset of degree 0 is the zero imset by Corollary 4.2. The above observation contradicts the assumption on u. However, the question of whether or not the converse implication holds remains open (see Question 5 on p. 195). 7.3.2 Minimal generators The point is that imsets satisfying the condition stated in Proposition 7.2 correspond to some minimal generators with respect to a special closure operation on subsets of X = T (N ) (see Section A.2, p. 218 for related definitions). Indeed, the class U(N ) is a closure system of subsets of X = T (N ) by Proposition 5.7 and one can introduce the respective closure operation cl U (N ) on
7.3 Imsets of the smallest degree
143
subsets of T (N ). The structural closure of G ⊆ T (N ) is the least structural model containing G defined by M for G ⊆ T (N ). cl U (N ) (G) = G⊆M∈ U(N )
A set G ⊆ T (N ) is called a structural generator of M ∈ U(N ) if M = cl U (N ) (G); if, moreover, G consists of elementary triplets G ⊆ T (N ) then it is called an elementary generator of M. A structural (elementary) generator of M is called minimal if none of its proper subsets is a structural generator of M. As every structural model M over N is a semi-graphoid by Lemma 2.2, M ∩ T (N ) is an elementary generator of M. This implies the following observation. Proposition 7.3. Every structural model M over N has a minimal elementary generator. Remark 7.6. Note that the concept of a (minimal) generator can be understood as a concept which is defined relative to any closure operation on subsets of T (N ). For example, one can also introduce the concept of generator with respect to the semi-graphoid closure operation or with respect to any closure operation defined in terms of syntactic inference rules of semi-graphoid type. The concept of complexity of a model (with respect to a closure operation), which can be introduced as the smallest cardinality of its generator, appears to be an interesting characteristic of the model [143]. The following lemma provides a basis of a method for finding all imsets of the smallest degree. Lemma 7.5. Let M be a structural model over N , which is endowed with a total ordering &, and ℘¯ = {v ∈ C(N ); Mv = M}. Then every element u of ℘¯ which is minimal with respect to ; has the form kw · w where kw ∈ {0, 1} (7.9) u= w∈E(N )
and G = { a, b|K ∈ T (N ) ; kua,b|K = 1 , a ≺ b} is a minimal elementary generator of M. Proof. Write u = w∈E(N ) kw · w where kw ∈ Z+ . If kw ≥ 2 for some w ∈ E(N ) then u − w ∈ C(N ) is independence equivalent to u and u ; (u − w). Therefore, (7.9) necessarily holds and G is in a one-to-one correspondence with the elements of E(N ) having non-zero coefficients in (7.9). To show that G is a structural generator of M, consider M ∈ U(N ) with G ⊆ M . Because it is a semi-graphoid, by Lemma 4.5, Mw ⊆ M for w = u a,b|K , a, b|K ∈ G and, by Corollary 6.1, Mu ⊆ M . Thus Mu ⊆ cl U (N ) (G) and the converse inclusion follows from the fact that Mu ∈ U(N ) and G ⊆ Mu . To show that no proper
144
7 The Problem of Representative Choice
subset F ⊂ G is a generator of M, introduce v = a,b|K ∈F u a,b|K ∈ C(N ). Observe that cl U (N ) (F) = Mv (by an analogous procedure). If cl U (N ) (F) = M = Mu then u v and u ; v = u which contradicts the assumption. In the case |N | ≤ 4, all minimal elementary generators of a structural model M over N can be found by a computer program written by my colleague, P. Boˇcek [12]. Thus, owing to Proposition 7.2 and the preceding lemma, given M ∈ U(N ) the list of imsets of the smallest degree inducing M can be obtained by reducing the list of imsets u satisfying (7.9). The reduction is sometimes necessary as the following example shows. Example 7.4. There exists a structural model M over {a,b, c, d} and a minimal elementary generator G ⊆ M ∩ T (N ) such that v = e,f |K ∈G u e,f |K is not an imset of the smallest degree. Let us consider the independence model from Example 3.1 on p. 50, which is a restriction of a DAG model. Then both imsets u = u a,d|∅ + u a,c|d + u b,d|a ,
v = u a,c|∅ + u b,d|∅ + u a,d|b + u a,d|c
are defined by means of minimal elementary generators of M while 4 = deg(v) > deg(u) = 3 (the imsets are shown in Figure 7.6). Note that v = u + u a,d|∅ and u + v is a baricentral imset. Moreover, u is the unique imset of the smallest degree among independence equivalent imsets while v is an imset with the least lower class (see Section 7.4.2) among independence equivalent imsets. ♦
0
{a, b, c, d} Q Q A Q Q A Q A Q
+1
0
{a, b, c}
+1
0
{a, c, d}
0
PP Q P P P P PP PPP PPP Q PPQ PP P PP PP Q P P P P
−1
{a, b, d}
0
−1
0
0
0
−1
{a, b, c}
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P PPP P Q P P P PP Q P P P
0
{a}
0
{b} {c} {d} A Q Q A A Q Q Q +1
Q
∅
+1
0
{b, c, d}
0
{a, b, c, d} Q Q A Q Q A Q A Q
+1
0
{a, c, d}
0
PP Q P P P P PP PPP PPP Q PPQ PP P PP PP Q P P P P
−1
{a, b, d}
0
0
{b, c, d}
0
−1
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P PPP P Q P P P PP Q P P P
−1
{a}
Q
0
0
−1
{b} {c} {d} A Q Q A A Q Q Q +2 ∅
Fig. 7.6. Equivalent imsets with the smallest degree and the least lower class.
The first claim of the following consequence easily follows from the definition of a baricentral imset, Proposition 7.2 and Lemma 7.5; its second part follows from Corollary 7.3.
Corollary 7.4. If u is a baricentral imset over N and v is an imset of the smallest degree such that u ⇝ v, then u − v is a combinatorial imset. In particular, for every ⟨A, B|C⟩ ∈ T(N), ⟨A, B|C⟩ ∈ M_u iff u − u_⟨A,B|C⟩ is a combinatorial imset.
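In code, the membership test of Corollary 7.4 again reduces to one subtraction and one combinatoriality test; the sketch assumes add() and is_combinatorial() from the sketches in Chapter 6, with hypothetical names as before.

```python
def semi_elementary(A, B, C):
    """u_<A,B|C> = d(ABC) + d(C) - d(AC) - d(BC)."""
    A, B, C = set(A), set(B), set(C)
    u = {}
    for S, c in ((A | B | C, 1), (C, 1), (A | C, -1), (B | C, -1)):
        S = frozenset(S)
        u[S] = u.get(S, 0) + c
        if u[S] == 0:
            del u[S]
    return u

def represented(u, A, B, C, elems):
    """<A,B|C> in M_u for a baricentral imset u (Corollary 7.4)."""
    return is_combinatorial(add(u, semi_elementary(A, B, C), -1), elems)
```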
7.4 Span

Recall that the lower class L_u of a structural imset u is contained in the upper class U_u (see Section 4.2.3, p. 73) and the inclusion is sharp (i.e., the classes are not equal) for a non-zero imset. Moreover, by Corollary 4.4, the marginals of Markovian measures for sets in L_u determine the marginals for sets in U_u. The upper class is an invariant of independence equivalence (see Section 6.4) but the lower class is not, which is demonstrated by Example 7.3. In the considered example, the imset shown in the left-hand picture of Figure 7.5 tells more about which marginals determine the whole Markovian measure in comparison with the imset shown in the right-hand picture. Thus, independence equivalent imsets need not be equiinformative from this point of view. This consideration motivates an informal concept of the span of a structural imset u, which is the class U_u \ L_u.

7.4.1 Determining and unimarginal classes

Let us take a more general view on some results of Section 4.4. Suppose that M is a structural model over N. The upper class U ≡ U_u and the class of probability measures which are Markovian with respect to u do not depend on the choice of u ∈ S(N) with M_u = M; they are only determined by M. A descending class D ⊆ U of subsets of N will be called determining for M if the only descending class E with D ⊆ E ⊆ U such that AC, BC ∈ E ⇒ ABC ∈ E for every ⟨A, B|C⟩ ∈ M is the class E = U. A descending class D ⊆ U will be called unimarginal for M if every pair of Markovian measures over N whose marginals coincide for sets in D also has coinciding marginals for sets in U. Note that no restriction to a special distribution framework is made in this definition; that is, any pair of probability measures over N in the sense of Section 2.1 can be considered here. Evidently, if D ⊆ U is determining then every descending class D′ with D ⊆ D′ ⊆ U is determining too, and the same principle holds for unimarginal classes. Therefore, one is interested in minimal determining classes for M, that is, determining classes D ⊆ U for M such that no proper descending subclass D′ ⊂ D is a determining class for M. A related question is for which M ∈ U(N) just one minimal determining, respectively unimarginal, class exists. In other words, the question is for which structural models M the least determining class (respectively the least unimarginal class) for M exists, that is, a determining (respectively unimarginal) class D₀ such that one has D₀ ⊆ D for every determining (respectively unimarginal) class D ⊆ U. DAG models appear to be examples of structural models of this type (see Section 7.4.3).

Proposition 7.4. Every determining class is unimarginal.

Proof. If D ⊆ U is a determining class and P, Q are Markovian measures then put E = {S ∈ U; P^S = Q^S} and observe that D ⊆ E and AC, BC ∈ E ⇒ ABC ∈ E for every ⟨A, B|C⟩ ∈ M (use the “uniqueness principle” mentioned in the proof of Corollary 4.4).

Recall that Corollary 4.3 says that the lower class L_u is a determining class for M_u whenever u ∈ S(N). Thus, one can summarize the above implications as follows:

  lower class ⇒ determining class ⇒ unimarginal class.   (7.10)
Note that a determining class need not be a lower class (see Example 7.5 below), and the question of whether every unimarginal class is determining remains open – see Question 10 on p. 210. On the other hand, it is known that the concepts of determining and unimarginal classes essentially coincide for DAG models (see Corollary 7.5 in Section 7.4.3).

Remark 7.7. The concept of a unimarginal class can alternatively be introduced as a concept relative to a distribution framework Ψ. Then every unimarginal class relative to Ψ is unimarginal relative to any subframework Ψ′ ⊆ Ψ. One can expect that unimarginal classes may differ for different distribution frameworks. Given a distribution framework Ψ and a structural model M, we can ask what the minimal unimarginal classes are for M relative to Ψ (see Theme 15, Chapter 9).

7.4.2 Imsets with the least lower class

A structural imset u ∈ S(N) is called an imset with the least lower class if L_u ⊆ L_v for every v ∈ S(N) independence equivalent to u. Some classes of independence equivalence contain imsets of this type, for example, the imset u from Example 7.3 on p. 141. On the other hand, the subsequent example shows that there are classes of independence equivalence which do not have these imsets but several imsets with a minimal lower class, that is, imsets u ∈ S(N) such that no independence equivalent v ∈ S(N) exists with L_v ⊂ L_u.

Example 7.5. There exists a structural model M over N = {a, b, c, d} such that
• the collection ℘ = {u ∈ S(N); M_u = M} has three imsets with distinct minimal lower classes,
• 16 distinct minimal determining classes for M exist and none of them is a lower class for any u ∈ ℘.

Introduce M ⊆ T(N) and a class of elementary imsets K as follows:

  M = { ⟨A, B|C⟩ ∈ T(N); |C| ≥ 1 } ∪ T_ø(N),   K = E_1(N) ∪ E_2(N).

Observe that M = M_m for the supermodular imset m = m_l for l = 0 shown in the left-hand picture of Figure 7.7 and K = {w ∈ E(N); M_w ⊆ M}. Introduce a combinatorial imset u = ∑_{w∈K} k_w·w where

  k_w = 4 if w = u_⟨a,b|K⟩ ∈ K or w = u_⟨c,d|K⟩ ∈ K,  and k_w = 1 for the remaining w ∈ K

(u is shown in the right-hand picture of Figure 7.7). By Propositions 4.1(b) and 5.6 we get M_u ⊆ M_m. Using Lemma 2.2 derive M_m ⊆ M_u. Hence, M_u = M_m = M.

The first step to show that u is an imset with a minimal lower class is the observation that, for every v ∈ S(N) with M_v = M, one has v(S) < 0 for some S ∈ S ≡ {abc, ab, ac}. Indeed, put K′ = {u_⟨a,d|K⟩ ∈ E(N) ; |K| ≥ 1}, write n·v = ∑_{w∈K} l_w·w where n ∈ ℕ, l_w ∈ ℤ_+, and observe that l_w > 0 for some w ∈ K′. This is because ⟨m_{ad↑}, w⟩ = 0 for w ∈ K \ K′ while ⟨m_{ad↑}, w⟩ > 0 for w ∈ K′. Therefore, the assumption l_w = 0 for w ∈ K′ implies ⟨m_{ad↑}, v⟩ = 0, which means, by Proposition 5.6, that M = M_v is a subset of the model produced by m_{ad↑}. This contradicts the fact ⟨a, d|bc⟩ ∈ M and ⟨m_{ad↑}, u_⟨a,d|bc⟩⟩ = 1. Thus, one can put s = ∑_{S∈S} δ_S and observe that ⟨s, w⟩ ≤ 0 for w ∈ K and ⟨s, w⟩ = −1 for w ∈ K′. That implies ⟨s, n·v⟩ < 0 and the desired conclusion.
{a, b, c, d} Q Q A Q Q A Q A Q
+2
+2
+2
+2
+12
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
+1
+1
+1
+1
+1
+1
0
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P PPP Q PP P P P Q P P P
0
{a}
0
0
0
{b} {c} {d} A Q Q A A Q Q Q 0
Q
∅
{a, b, c, d} Q Q A Q Q A Q A Q
0
0
0
{a, b, c} {a, b, d} {a, c, d} {b, c, d} PP P P P P Q PP PPP PPP Q PPQ PP P P P Q P P P P P P
0
−9
−9
−9
−9
+6
+6
+6
0
{a, b} {a, c} {a, d} {b, c} {b, d} {c, d} PP PP P PP Q PP PP Q PPP P P PPP Q PP P P P Q P P P
+6
{a}
Q
{b} {c} {d} A Q Q A A Q Q Q 0 ∅
Fig. 7.7. A multiset producing a model and a structural imset inducing it.
An analogous observation can be made for any class S consisting of a three-element subset of N and a pair of its two-element subsets. If v ∈ S(N ) satisfies Mv = M and Lv ⊆ Lu then the observation that v(S) < 0 for some
148
7 The Problem of Representative Choice
S ∈ S necessitates v(ac) < 0. We can similarly derive v(ad), v(bc), v(bd) < 0, which implies Lv = Lu . Therefore, u is an imset with a minimal lower class and a permutation of variables gives two other examples of independence equivalent imsets with distinct minimal lower classes. On the other hand, the class D1 = {ab, bc, cd}↓ is a determining class for M as a, c|b, b, d|c, a, d|bc ∈ M. One can show that D1 is a minimal determining class and an analogous conclusion can be made for any class obtained by a permutation of variables either from D1 or from D2 = {ab, ac, ad}↓ . Note that this list of minimal determining classes for M can be shown to be complete. ♦ 7.4.3 Exclusivity of standard imsets The standard imset uG for an acyclic directed graph G appears to be an exclusive imset within the class of structural imsets u with Mu = MG . The first step to show that it is an imset with the least lower class is the following lemma. Lemma 7.6. Let M be a structural model over N . Introduce an independence equivalence class ℘ = {u ∈ S(N ) ; Mu = M} and put U = Uu for u ∈ ℘ (by Lemma 6.5 Uu does not depend on the choice of u). If S ∈ U has the form S = cT where c ∈ N and M ∩ { e, c|K ∈ T (N ); e ∈ T } = ∅ , M ∩ { e, f |K ∈ T (N ); e, f ∈ T, c ∈ K} = ∅ ,
(7.11)
then every (descending) unimarginal class D for M contains S. Proof. I. The first observation is that any probability measure of the form P = Q × i∈N \S Pi , where Q is a probability measure over S with QT = i∈T Qi and Pi , Qi are arbitrary one-dimensional probability measures, is Markovian with respect to u ∈ ℘. Indeed, by Lemma 2.2 it suffices to verify M ∩ T (N ) ⊆ MP . Suppose a, b|K ∈ M ∩ T (N ). If a ∈ S then the fact a ⊥ ⊥ N \ a | ∅ [P ] implies a ⊥ ⊥ b | K [P ]; analogously for b ∈ S. If a, b ∈ S then (7.11) implies a, b ∈ T and c ∈ K so that a ⊥ ⊥ N \ ac | ∅ [P ] implies a⊥ ⊥ b | K [P ] as well. II. The second step is a construction:put Xi = {0, 1} and define a pair of probability measures Q1 , Q2 on XS = i∈S Xi : −|S| 2 + ε if i∈S xi is even, Q2 ([xi ]i∈S ) = Q1 ([xi ]i∈S ) = 2−|S| 2−|S| − ε if i∈S xi is odd, where 0 < ε < 2−|S| . Then put Pj = Qj × i∈N \S Pi for j = 1, 2 where Pi are some fixed probability measures on Xi , i ∈ N \ S. Observe that both P1 and P2 is Markovian with respect to u ∈ ℘, P1S = P2S and P1L = P2L whenever L ⊆ N , S \ L = ∅.
7.5 Dual description
149
III. Finally, suppose for contradiction that S ∈ D where D ⊆ U is a unimarginal descending class for M. This implies P1L = P2L for L ∈ D and therefore P1L = P2L for L ∈ U, which contradicts the assumption S ∈ U. Corollary 7.5. Given an acyclic directed graph G over N , the lower class Lu for u = uG is the least unimarginal and the least determining class for MG . In particular, uG is an imset with the least lower class. Proof. By (7.10) and Lemma 7.1, Lu is both a determining class and a unimarginal class for MG . If D is a lower class for v ∈ S(N ) with Mv = MG , respectively a determining class for MG , then it is a unimarginal class for MG by (7.10). Lemma 7.6 can then be used to verify Lu ⊆ D. Indeed, if then S ∈ Uu by Proposition 4.5 and S = c paG (c) for some c ∈ N S ∈ Lmax u by (7.2). The moralization criterion (see Section 3.2) allows one to verify that the condition (7.11) for T = paG (c) is fulfilled for M = MG .
Remark 7.8. Thus, the standard imset uG for an acyclic directed graph G over N is both an imset of the smallest degree (see Lemma 7.4) and an imset with the least lower class. A computer program [12] helped me to show that in the case |N | ≤ 4 the converse holds; that is, given an acyclic directed graph G, the only imset satisfying these two conditions is the standard imset uG . The question of whether these two requirements determine standard imsets for acyclic directed graphs remains open – see Question 9 on p. 209. Another interesting feature of the standard imset u for an acyclic directed graph is that it has a lot of zero values, namely, that it vanishes outside Lu ∪ Uumax .
7.5 Dual description Two approaches to the description of independence models by imsets were distinguished in Chapter 5. Every structural model is induced by a structural imset and produced by a supermodular imset (see Corollary 5.3) and both methods can be viewed as mutually dual approaches. 7.5.1 Coportraits This is to explain the dual perspective in more detail with the help of the concept of Galois connection from Section 5.4. It was explained there (p. 104) that the poset of structural models (U(N ), ⊆) can be viewed as a concept lattice given by the formal context (5.16). More specifically, it follows from Lemma 2.2 that every structural model M over N is in a one-to-one correspondence with a set of elementary imsets over N , namely with
150
7 The Problem of Representative Choice
{v ∈ E(N ) ; v = u a,b|K where a, b|K ∈ M ∩ T (N ) }. In particular, every u ∈ S(N ), respectively m ∈ K (N ) ∩ ZP(N ) , corresponds through Mu , respectively through Mm , to a subset of Œ = E(N ): Eu ≡ {v = {v m E ≡ {v = {v
∈ E(N ) ; ∈ E(N ) ; ∈ E(N ) ; ∈ E(N ) ;
v = u a,b|K , a, b|K ∈ Mu } = u v}, v = u a,b|K , a, b|K ∈ Mm } =
m, v = 0}.
(7.12)
Thus, every structural model can be identified with a set of objects of the formal context (5.16). In fact, it is an extent of a formal concept so that structural models correspond more or less to the description in terms of objects. However, as explained in Remark 5.8, every formal concept can also be described by means of its intent, that is, in terms of attributes. In this case the set of attributes is the -skeleton Æ = K (N ) which motivates the following definition. By a coportrait of a structural imset u over N will be understood the set of skeletal imsets Hu given by Hu = {r ∈ K (N ) ; r, u = 0}.
(7.13)
Indeed, Hu = {r ∈ K (N ) ; r, v = 0 for every v ∈ Eu }, which means that Hu is nothing but Eu . As Eu = Eu by Lemma 6.2, the pair (Eu , Hu ) is a formal concept in the sense of Section 5.4.1. By Corollary 6.2, two structural imsets are independence equivalent iff they have the same coportrait. Thus, every class of independence equivalence is uniquely represented by the respective coportrait. The lattice of all coportraits over three variables is shown in Figure 7.8. Remark 7.9. This is to explain the terminology. The idea of dual description of structural models was presented already in Studen´ y [137] where the concept of a portrait of u ∈ S(N ) was introduced as the set of skeletal imsets {r ∈ K (N ) ; r, u > 0}.
(7.14)
Thus, the coportrait Hu is nothing else but the relative complement of (7.14) in K (N ) and this motivated the terminology used here. Provided that the -skeleton is known, (7.14) and (7.13) are equiinformative but the concept of coportrait seems to be more natural from the theoretical point of view (in light of the Galois connection). Despite this fact, I decided to keep the former terminology and not to rename things. The reason I preferred (7.14) to (7.13) in [137] was my anticipation that for |N | > 2 the relative occurrence of zeros in { m, u; m ∈ K (N ), u ∈ E(N )} exceeds the relative occurrence of non-zero values (which seem to be true in the explored cases). A practical consequence should be that portraits have on average less cardinality than coportraits.
% &
&
% &
$ '
% &
&
$ '
'
$ '
%
&
'
$
'
% &
% &
$
%
&
% &
$ '
% &
$ '
%
&
KEY:
% &
$ '
% &
$ '
&
'
δN
2 · δN + δab + δac + δbc
$
'
% &
$ '
% &
$ '
%
$
'
%
$
% & $ '
$ '
% &
& $ '
$ '
'
&
'
%
$
% &
$ '
%
$
δN + δab
δN + δac
δN + δbc
%
$
7.5 Dual description
Fig. 7.8. Coportraits of structural imsets over N = {a, b, c} (rotated).
151
152
7 The Problem of Representative Choice
Nevertheless, the method of dual description of structural models is limited to the situation when the skeleton is known. Of course, as explained in Remark 5.6, the type of the skeleton is not substantial since the use of the u-skeleton, respectively the o-skeleton, instead of the -skeleton leads to an “isomorphic” concept of portrait and coportrait. 7.5.2 Dual baricentral imsets and global view As mentioned before Proposition 5.7 (p. 107) we can take a dual point of view and describe structural models as independence models produced by standardized supermodular imsets. These imsets are actually multisets – they are non-negative by Lemma 5.3. An analog of independence equivalence is qualitative equivalence from Section 5.1.1 and the respective implication for these multisets can be introduced as follows: m ∈ K (N ) ∩ ZP(N ) implies r ∈ K (N ) ∩ ZP(N ) if Mm ⊆ Mr . Moreover, every multiset of this kind is a non-negative rational combination of -skeletal imsets (see Lemma 5.3 and Lemma A.2) so that these imsets play the role which is analogous to the role of elementary imsets within the class of structural imsets (cf. Theorem 5.3). Following this analogy, an -standardized supermodular multiset m over N will be called a dual baricentral imset if it has the form r. (7.15) m= r∈K (N ), Mm ⊆Mr
The corresponding poset of dual baricentral imsets is shown in Figure 7.9. Global view Of course, owing to Corollary 5.3 and (7.12), every coportrait of a structural imset can also be written in the form Hm = (E m ) where m ∈ K (N ) ∩ ZP(N ) . Note that one can show, by a procedure which is analogous to the proof of Lemma 6.1, that Hm = {r ∈ K (N ) ; k · m − r ∈ K (N ) for some k ∈ N } . Therefore, the mutual relation of a structural imset u and the corresponding set of elementary imsets Eu given by (7.12) is completely analogous to the mutual relation of an -standardized supermodular imset m and the corresponding set of -skeletal imsets Hm . The global view on all four abovementioned approaches to the description of a structural model is indicated by Figure 7.10. To describe a structural model one can use 1. 2. 3. 4.
a set of elementary imsets, a structural imset, a set of -skeletal imsets, an -standardized supermodular imset.
+2
0
+1
Q
0
0
Q Q
Q
0
&
0
Q Q Q Q Q Q Q Q
0
Q Q Q Q
'
+1
0
0
0
0
0
0 &
Q Q Q
Q
0
0
0
0
+1
0
0
Q Q Q Q
0
Q Q Q Q Q Q Q Q
+1
+3
Q Q Q Q
0
+2
+1
+1
0
0
0
Q Q Q
Q
0
Q Q Q Q Q Q Q Q
0
+1
0
0
0
0
Q Q Q
Q
Q Q Q Q Q Q Q Q
+1
+3
Q Q Q Q
0
Q Q Q Q
0 % &
0
Q Q Q Q
0 &
0
0
0
0
+1
0
0
0
Q Q Q
Q
0
Q Q Q Q Q Q Q Q
+1
+2
Q Q Q Q
Fig. 7.9. Dual baricentral imsets over N = {a, b, c} (rotated). 0 &
Q Q Q Q
0
0
+1
0
0 Q
Q
0
+2
+2 0
%
0
Q Q Q Q
0 &
0
Q Q Q Q Q Q Q Q
+1
+1
+2
0
Q
0 0
Q Q Q
0
Q Q Q Q Q Q Q Q
+1
+1
0
Q
+2 0
Q Q Q
0
+1
0
Q
0
+3
+2
+1 0
Q
0 0
Q Q Q
0
Q Q Q Q Q Q Q Q
Q Q Q Q
+1
0
Q Q Q
0 &
+2
+2
0
Q
+1 0
+1
+1 0
Q Q
0 0
Q Q
0
Q Q Q Q Q Q Q Q
+2
+3
Q Q Q Q
%
$
% &
{c}
%
$
{b, c}
$ '
Q Q Q
0
%
Q Q Q Q Q Q Q Q
{b}
$ ∅
Q Q Q Q
+4
{a, c}
Q Q Q Q
{a}
0 % &
Q Q Q Q Q Q Q Q
{a, b}
$ '
% &
Q Q Q Q Q Q Q Q
+2
+4
+1
Q Q Q Q Q Q Q Q
+1
+2
Q Q Q Q
'
{a, b, c}
Q Q Q Q
KEY:
$ '
Q Q Q Q
0 % &
$ '
% &
Q Q Q Q
+3
Q Q Q Q
$ '
%
$
' +4
0
Q Q
0
Q Q Q Q Q Q Q Q
+1
+2
$
0
Q Q Q Q
% &
0
Q Q Q Q
$ '
0
0
Q Q Q Q Q Q Q Q
+1
0 % &
0
0
+2
+2
+1
Q Q Q Q
$ '
Q Q Q Q Q Q Q Q
+2
Q Q Q Q
' +6
% &
0
Q Q Q Q
$ '
%
+1
Q Q Q Q Q Q Q Q
0
0 % &
Q Q Q Q
+1
$ '
$
% &
+1
$ '
0
Q Q Q Q Q Q Q Q
0
+1
Q Q Q Q
'
$ '
0
Q Q Q Q
0
0 % &
0
+1
+1
0
Q Q Q Q Q Q Q Q
+1
% &
Q Q Q Q Q Q Q Q
0
Q Q Q Q
+3
'
+2
Q Q Q Q
$ '
$ '
0
Q Q
Q Q
0
Q Q Q Q Q Q Q Q
0
% &
Q Q Q Q
+2
$ '
0 &
%
0
Q Q Q Q
0
0
0
Q Q Q Q Q Q Q Q
0
0
$
Q Q Q Q
+1
'
&
%
0
0
Q Q Q
Q
0
0
0
0
Q Q Q Q Q Q Q Q
0
$
Q Q Q Q
0
'
7.5 Dual description 153
154
7 The Problem of Representative Choice
Recall that the set of elementary imsets can be viewed as a direct translation of the considered structural model, the structural imset is obtained by a nonnegative rational combination of elementary imsets, the set of skeletal imsets is obtained by the transformation given by the Galois connection and the supermodular imset by a non-negative rational combination of skeletal imsets. Let me emphasize that, in comparison with a general case of the Galois connection described in Section 5.4.1, additional operations of summing elementary and summing skeletal imsets are at our disposal. This fact allows us to describe the respective relationships among formal concepts (namely the relation “be a subconcept”) with the help of algebraic operations, more precisely by means of arithmetic of integers! This is an additional advantage of the described approach.
6 Non-negative +j rational combinations
3
mq Æ = K (N ) ∪ ZP(N )
q q q
Æ = K (N )
+j ⎧ ⎨q
u
q
⎩ q
Œ = E(N )
q
q q q q q q q q q
Incidence relation m, u = 0 can be extended to a wider context
Œ = S(N ) Fig. 7.10. Extension of Galois connection for structural models (an illustration).
Remark 7.10. However the dual approach exhibits some different mathematical properties. One can introduce an analog of the concept of a combinatorial imset, that is, an imset which is a non-negative integral combination of skeletal imsets. But there is no analog of the concept of degree for imsets of this type: the sum of two -skeletal imsets from the upper line in Figure 5.1 equals to the sum of three -skeletal imsets in the lower line of the figure.
8 Learning
This chapter is devoted to the methods for learning CI structures on the basis of data. However, it is not the aim of this text to provide an overview of numerous methods for learning graphical models from the literature. Rather, the purpose of this chapter is to show how to apply the method of structural imsets to learning DAG models and to indicate that this approach can be extended to learning more general classes of models of CI structure. Some thoughts about the learning methods which are based on statistical CI tests are mentioned in Section 8.1. The next two sections of the chapter contain an analysis of certain DAG model learning algorithms based on the maximization of a quality criterion, and also deal with related questions. It is argued in Section 8.4 that the method of structural imsets can be applied in this area as well.
8.1 Two approaches to learning There is plenty of literature about learning graphical models of CI structure – both in the area of statistics and in the area of artificial intelligence. For an overview of some of these methods see § 1.4 and § 4 of Bouckaert [14] and § 3 of Castelo [20]. The literature concerns learning UG models, DAG models and DAG models with hidden variables; most attention is devoted to learning DAG models. In my view, the algorithms for learning graphical models can be divided into two groups on the basis of the fundamental methodological approach. • Some of these algorithms are based on significance tests between two statistical models (of graphical CI structure). These significance tests often correspond to statistical tests for the validity of some CI statements. • Other algorithms are based on the maximization of a suitable quality criterion designed by a statistician. On the basis of the way to derive the criterion, algorithms of this kind could be further divided into those based
156
8 Learning
on the Bayesian approach and those which stem from a frequentist point of view. Nevertheless, some algorithms may be included in both groups because they can be interpreted in both ways and there is a simulation method applicable to learning graphical models which does not belong to either of these two groups (see Remark 8.3 below). Data faithfulness assumption Typical examples of algorithms based on significance tests are the SGS algorithm for learning DAG models and its more effective modifications known as the PC algorithm and the PC* algorithm described in Spirtes et al. [122]. These procedures stem from the premise called the data faithfulness assumption which can be paraphrased as follows (cf. Section 1.1): data are “generated” by a probability measure P which is perfectly Markovian with respect to an object o within the considered universum of objects of discrete mathematics. In the case of the above algorithms from [122] the universum of the objects is the collection of acyclic directed graphs over N . The algorithms are based on statistical CI tests, that is, tests which – on the basis of given data – either reject or do not reject the hypothesis that a certain elementary CI statement a ⊥ ⊥ b | C [P ] is true. Tests of this kind are usually derived from statistics which are known to be measures of stochastic conditional dependence (see Section A.9.1 for the concept of a statistic). For example, in § 5.5 of Spirtes et al. [122] two statistics of this kind are mentioned in the discrete case: the X 2 -statistic and the G2 -statistic. The goal of the algorithms is to determine the equivalence class of graphs consisting of those acyclic directed graphs with respect to which P is perfectly Markovian. The basic step is an observation that if P is perfectly Markovian with respect to an acyclic directed graph G, then the following characterization of edges and immoralities in G holds: [a, b] is an edge in G
⇔
∀ C ⊆ N \ {a, b}
a ⊥ ⊥ b | C [P ] ,
and if [a, b], [b, c] are edges in G while [a, b] is not an edge in G then a → c ← b in G
⇔
∀ C c ∈ C ⊆ N \ {a, b}
a ⊥⊥ b | C [P ]
– see Verma and Pearl [151] or Koˇcka et al. [58]. On the condition that data are “generated” from P , the above mentioned statistical CI tests give a criterion for the composite conditional dependence statements on the right-hand side. This allows one to determine the underlying graph of G and all immoralities in G. Thus, a hybrid graph H, called a pattern, can be obtained, which has the same underlying graph as G and just those arrows which belong to the immoralities in G; the other edges in H are lines. The final step is a purely
8.1 Two approaches to learning
157
graphical procedure whose aim is to direct some other edges of the pattern so as to get a direction that is shared by all acyclic directed graphs equivalent to G. Nevertheless, the hybrid graph H‡ obtained from H by the algorithm from Spirtes et al. [122] need not be complete in the sense that every essential arrow, that is, an arrow which has the same direction in all acyclic directed graphs equivalent to G, is an arrow with this direction in H‡ . However, the final step of the SGS algorithm may be replaced by the procedure described in Meek [89]. The resulting hybrid graph is then the essential graph, named also the completed pattern, which can serve as a unique representative of the equivalence class of acyclic directed graphs with respect to which P is perfectly Markovian. Analogous procedures, sometimes called recovery algorithms, were proposed for other classes of graphs. An algorithm of this kind for (classic) chain graphs whose result is the largest chain graph (see p. 54) was presented in Studen´ y [139]. In § 5.6 of Koˇcka [57] a reference is given to an algorithm for learning decomposable models which is based on statistical tests and on the idea of gradual decomposition of cliques of a triangulated undirected graph. A modification of the PC algorithm for learning acyclic directed graphs with hidden variables is described in § 6.2 of Spirtes et al. [122]. All the above-mentioned algorithms are ultimately based on the data faithfulness assumption and the validity of this assumption can hardly be ensured. Indeed, its validity for arbitrary data actually means that, for every probability measure P in the considered distribution framework, there exists an object o of discrete mathematics in the considered universum of objects such that P is perfectly Markovian with respect to o. This is nothing else than the completeness requirement mentioned in Section 1.1. However, it is explained in Section 3.6 that such a condition cannot be fulfilled for an arbitrary class of graphical models and the discrete distribution framework. The point of this consideration is as follows: learning methods based on statistical CI tests are safely applicable only if the data faithfulness assumption is guaranteed. Remark 8.1. A boon of the book by Spirtes et al. [122] is that the data faithfulness assumption was made explicit there and the limitation of learning methods based on statistical CI tests was indicated in that way. Unfortunately, some researchers in the area of artificial intelligence do not seem to be aware of this limitation, which leads to the methodological error mentioned at the end of Section 1.1. These researchers use the procedures for learning Bayesian networks based on CI tests for “single edge removals” which have already been criticized by Xiang et al. [159]. Examples of similar procedures are the Construct-TAN algorithm for “learning restricted Bayesian networks” presented in Friedman et al. [38] and its modifications from Cheng and Greiner [21]. The argument used on p. 8 of [38] to justify the restriction to a class of TAN models is that the model learned by the usual method
158
8 Learning
of maximization of a quality criterion may result in a poor classification. However, the Construct-TAN algorithm is ultimately based on the values of the G2 -statistic for triplets of the form a, b|c, a, b, c ∈ N which correspond to the CI test for the respective elementary statement a ⊥⊥ b | c. Owing to the restriction to a special class of TAN models, the “learned” model always involves a fixed-context CI statement a ⊥⊥ b | N \ {a, b}. But no evidence for the validity of this CI statement was provided by data and the procedure is, therefore, methodologically incorrect, unless the data faithfulness assumption is somehow ensured. Another issue is that researchers in the area of artificial intelligence often support the applicability of their various heuristic algorithms for learning Bayesian networks of the above type by computer simulations. The trouble is that, in simulations of this type, data are “generated” artificially by a probability measure which is perfectly Markovian with respect to an acyclic directed graph. Thus, the data faithfulness assumption is “fulfilled” as a result of the way in which artificial data were generated. The problem is that these heuristic algorithms are only tested on “favorable” artificial data. Their behavior is not checked on those data for which the data faithfulness assumption is not “fulfilled” – I use quotation marks around “fulfilled” to avoid dubious vague question of what it actually means that data (understood in the sense of Section A.9.1) are “generated” by a probability measure. Thus, I think that these algorithms may appear to be worthless if they are applied to real data for which the data faithfulness assumption is not “fulfilled”. In my view, real data of this sort are quite common. Perhaps Fret’s heads data analyzed by Whittaker (see [157], Exercise 1 in § 8.6) can serve as a simple example of real data for which the data faithfulness assumption relative to the universum of undirected graphs is not “fulfilled”. Whittaker ([157], § 8.4) compares several simple procedures to learn UG models. Each of them leads to another UG model and Whittaker concludes ([157], p. 260) that a model based on hidden variables seems to give a better explanation of the occurrence of data rather than any of the considered UG models. Need for lattice structure One of the ways to behave in a situation in which data do not fit one of the considered models is the change of the objective of the learning procedure. Instead of seeking “the only true model” the aim may become to find the “best approximation” in the universum of considered statistical models or an approximation which is “good enough”. Then, of course, one has to introduce a suitable measure of approximation. An example of an algorithm whose aim is to find an approximation in the above sense and which can also be interpreted as an algorithm based on significance tests is the simple two step procedure for learning UG models described in § 8.3 of Whittaker [157]. The measure he uses is the deviance of a statistical model MG determined by an undirected graph G (see Section A.9.2).
8.1 Two approaches to learning
159
It is defined as a multiple by 2 of the difference between the unconstrained maximum and the constrained maximum of the logarithm of the likelihood function over MG (see Section A.9.3). Whittaker showed that in the cases he deals with, the deviance has information-theoretical interpretation in terms of relative entropy (see [157], Proposition 6.8.1 in the Gaussian case and Proposition 7.4.2 in the discrete case). The procedure described by Whittaker is a series of mutual comparisons of two statistical models M and M which are nested each time. Moreover, during the algorithm only “close” models are compared, that is, models determined by undirected graphs which differ in the presence of one edge. To compare statistical models, a significance test based on the deviance difference is used. Whittaker [157] explicitly wrote on p. 228 that, if possible, he would like to give the meaning to the deviance difference (between two close models), namely the meaning of a statistic for testing (elementary) CI statements. Indeed, this interpretation is possible in most of the situations Whittaker deals with; for example, for a pair of decomposable models in the discrete distribution framework – see Remark 8.14 on p. 187. The “simple two step procedure” ([157], p. 252) could be described as follows: (i) The algorithm starts with the saturated model (see Section A.9.2), that is, the model determined by an undirected graph H 0 in which every pair of distinct nodes in N is a line. (ii) Significance tests for the presence or the absence of each particular edge in H 0 are performed. In the discrete case, these are the statistical tests for CI statements of the form d ⊥ ⊥ e | N \ {d, e} based on the G2 -statistics (see Section A.9.1). b in H 0 for which these tests do not reject the hypothesis (iii) The lines a about the validity of a ⊥ ⊥ b | N \ {a, b} are removed from H 0 and a graph 1 H is obtained. This is the end of the so-called backward phase of the procedure. Since the deviance of the model MH 1 may be too high, the algorithm may continue by its forward phase: (iv) For every pair [a, b] which represents a missing line in H 1 , one computes the deviance differences between the model determined by a graph which b to H 1 and the model determined by H 1 . Note is created by adding a that the result is often the value of the G2 -statistics for testing a certain CI statement a ⊥⊥ b | C where C ⊆ N \ {a, b} is a certain set separating a and b in H 1 . b which correspond to significant deviance differ(v) Finally, those lines a ences are added to H 1 and the resulting undirected graph H 2 is obtained. Whittaker [157], who also offered similar procedures for other universa of statistical models, mentioned on p. 242 that the universa of models he considers often have the lattice structure. In my view, this is the crucial requirement
160
8 Learning
on the universum of statistical models so that a learning method based on significance tests could be applied. Actually, it is a requirement on the respective class of formal independence models. Indeed, one can describe the backward phase of the above procedure from the point of view of CI interpretation as follows. In (ii) certain acceptable UG models were generated and in (iii) one moves to their supremum (provided one has in mind the order MG ⊆ MH between UG models which corresponds to inverse inclusion MG ⊇ MH of the respective statistical models – see Remark 8.10 for terminological explanation). Moreover, in the forward phase, in (iv) some suitable UG models are generated, that is, models which seem better than the “current” UG model MH 1 . Then in (v) one moves to their infimum. Thus, the assumption that the class of induced formal independence models has the lattice structure allows one to combine the knowledge obtained as a result of various particular CI tests. Actually, an analogous situation occurs in the case of learning CI structure on the basis of the knowledge provided by human experts (see Section 1.1). Imagine a situation in which two different experts provide the information about CI structure in the form of two objects of discrete mathematics o1 and o2 , for example graphs over N . The knowledge can be expressed in the form of the induced formal independence models Mo1 and Mo2 (see Section 2.2.1). If one relies on both experts then one would like to represent the joint knowledge (within the considered class of models) by means of an object o such that Mo involves both Mo1 and Mo2 and Mo is the least possible model of this kind. A natural requirement for the uniqueness of this model is just the requirement of the supremum existence. On the other hand, in the case of a suspicious attitude one would only like to accept the pieces of information which are confirmed by both experts and the natural requirement is the requirement of the infimum existence. The goal of the above thoughts is the following conclusion: if the data faithfulness assumption is not fulfilled and the class of formal independence models induced by objects from the considered universum of objects of discrete mathematics is not a lattice, then a method based on statistical tests of particular CI statements may appear not to be applicable. Unfortunately, some classes of graphical models do not meet the above condition. For example, the class of DAG models for |N | ≥ 3 is not a lattice; see Figure 7.4 where the model which corresponds to a ⊥⊥ b | ∅ and the model which corresponds to a ⊥ ⊥ b | c have no join but two minimal upper bounds, namely the models which describe a ⊥ ⊥ bc | ∅ and b ⊥⊥ ac | ∅. Another example of a class of models which is not a lattice is the class of decomposable models for |N | ≥ 4. Remark 8.2. Whittaker gave examples showing that his simple two step procedure both may and may not lead to the choice of an adequate statistical model, that is, a model with sufficiently low deviance ([157], Example 8.5.1
8.2 Quality criteria
161
in § 8.4). In my view, failing in the aim of the procedure is caused by the fact that the data faithfulness assumption is not valid for UG models in that case. The above-mentioned procedures and other procedures mentioned in § 8.3 of [157] can alternatively be understood as modifications of methods based on the maximization of a quality criterion from Section 8.2. Indeed, the deviance of a statistical model MH for an undirected graph H has the form k − 2 · MLL (MH , D) where k is a constant and MLL (MH , D) is the value of maximized log-likelihood criterion for the model MH and data D (see Section A.9.3). Thus, the task of finding a model which has the deviance as low as possible is equivalent to the task of finding a model which has the value of MLL (MH , D) as high as possible. In Example 8.2.2 Whittaker [157] mentioned an interesting phenomenon: the deviance difference for different pairs of UG models may coincide! Moreover, the deviance difference is then the value of the G2 -statistic for testing a certain CI statement. This feature could be nicely explained as a special case of Proposition 8.4 – see Remark 8.14. Universum of structural imsets The goal of the considerations in this section, which are to a certain extent an extension of the motivation thoughts from Section 1.1, is the moral that in the case of learning methods based on statistical CI tests, the universum of considered objects of discrete mathematics should satisfy two natural conditions. The first condition is the completeness with respect to the respective distribution framework and the second condition is the requirement that the class of induced formal independence models has the lattice structure. The universum of structural imsets satisfies both of these conditions. The completeness with respect to the discrete and positive Gaussian distribution framework follows from Theorem 5.2; the fact that the class of structural models is a lattice follows from Theorem 5.3. If one dreams about a learning procedure for structural imsets based on statistical CI tests, then baricentral imsets considered in Section 7.1 are perhaps the most suitable representatives of classes of (independence) equivalent structural imsets.
8.2 Quality criteria Most of the algorithms for learning graphical models are based on the idea of maximizing a quality criterion which is a function ascribing to a graph and data a real number which “evaluates” how the statistical model described by the graph fits the data. Alternative names for a quality criterion are quality measure [14], score criterion [26] and score metric [23]. Maximization of a quality criterion is often used in the area of artificial intelligence and seems to be more suitable for learning DAG models and decomposable models than methods based on significance tests. One can distinguish at least
162
8 Learning
three methodological approaches to derivation of quality criteria (for details see Bouckaert [14], § 4.2). • Classic statistical interpretation of a graphical model in terms of a parameterized class of probability measures leads to information criteria; for example, the Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) – see Section A.9.3. • The Bayesian approach is based on an additional assumption that a “prior” probability measure on the respective parameter space is given for each considered graphical model. This approach, initiated by the paper by Cooper and Herskovits [24], leads to a variety of Bayesian criteria – see Remark 8.5. • The minimum description length (MDL) principle [63] stems from coding theory. Roughly said, the basic idea is to evaluate a graphical model by the overall number of bits needed to encode data within the model. However, the resulting MDL criterion appears to coincide with the classic Bayesian information criterion (see [14], § 4.2.3). The weak point of this approach is that the class of considered graphs is often too large, which makes the direct maximization of a quality criterion infeasible. To avoid this problem, various heuristic local search methods were developed (see [14], § 4.3). Every method of this kind has its specific search space, which consists of the collection of states and the collection of moves between the states. The states are typically graphs over N , either acyclic directed graphs or essential graphs, and the moves are some predefined “local” graphical operations with them. Thus, in order to apply a local search method one needs to introduce a neighborhood structure in the class of considered graphs. Every graph G is assigned a relatively small set of neighboring graphs nei (G), which are typically graphs differing from the considered graph in the presence of one edge. The point is that most quality criteria used in practice have a pleasant property that the difference of their values for neighboring graphs is a single term which is easy to compute – see Section 8.2.3. Thus, instead of the global maximum of a quality criterion, various algorithms allow us to find a local maximum relative to the neighborhood structure in consideration. On the other hand, some algorithms of this kind are guaranteed to achieve the global maximum provided that certain quite strong conditions are fulfilled (these conditions include the data faithfulness assumption) – see Meek [91]. Note that the incremental search procedure mentioned already by Whittaker ([157], pp. 251–252) can perhaps also be viewed as an example of a local search method. Remark 8.3. Some of the Bayesian approaches come even with a further assumption that a “prior” probability measure on the finite class of considered graphs is given. This assumption is then reflected by an additional term in the respective Bayesian criterion. Note that this additional assumption does not seem to prevent one from using local search methods.
8.2 Quality criteria
163
Nevertheless, the whole collection of assumptions – that is, every graphical model is understood as a parameterized class of probability measures, a certain reasonable system of “priors” on parameter spaces is given and a “prior” on the class of models is specified – allows one to apply a special Bayesian simulation method known as the Markov chain Monte Carlo (MCMC) method. From a certain perspective, this method can be viewed as a particular method for learning graphical models which is based neither on significance tests nor on the maximization of a quality criterion. The idea is that the above assumptions allow one to define, on the basis of data, a “posterior” probability measure on the class of considered graphs and a system of “posteriors” on parameter spaces. The posteriors could be interpreted as stochastic estimates of both the graphical models and its parameters. However, their direct computation is again infeasible. To overcome this obstacle, a stochastic simulation method is used, whose aim is to get an approximation of these “posteriors”. Of course, lots of technical assumptions are accepted (which are intentionally omitted here) to make the method feasible. Nevertheless, various versions of the MCMC method are similar to the local search methods mentioned above; they are also based on the idea of a search space and the idea of “travelling” in the class of considered graphs – for details see § 3.6 in Castelo [20]. Some theoretical results ensure the (stochastic) convergence of the MCMC method if mild assumptions are valid. The method was used both for learning decomposable models [48] and DAG models [75]. 8.2.1 Criteria for learning DAG models The aim of this section is to provide details about quality criteria for learning DAG models in the case of a discrete distribution framework with prescribed sample spaces (see Section A.9.5). First, let us introduce the symbol DAGS(N ) to denote the collection of all acyclic directed graphs having N as the set of nodes. Let a discrete joint sample space XN = i∈N Xi over N (see Section A.9.1) be fixed. By a quality criterion for learning DAG models adapted to this sample space any function Q will be understood which ascribes a real number Q(G, D) to a graph G ∈ DAGS(N ) and a database D ∈ DATA(N, d) (see p. 242 for this notation). To derive formulas for basic information criteria details about the parameterization of statistical models described by acyclic directed graphs are needed. The statistical model described by G ∈ DAGS(N ) consists of the class MG of probability measures on XN which are Markovian with respect to G. Recall that a probability measure P on XN is uniquely determined by its density f with respect to the counting measure υ on XN , that is, f (x) = P ({x}) for every x ∈ XN . The respective marginal densities of P (see Convention 2 on p. 20) are then f (y, z) for ∅ = A ⊂ N, y ∈ XA , fA (y) = P A ({y}) = z∈XN \A
164
8 Learning
and fN ≡ f , f∅ ≡ 1 by a convention. The conditional density fA|C of P for disjoint A, C ⊆ N can be defined as follows: fAC ([x,z]) if fC (z) = P C ({z}) > 0, fC (z) for x ∈ XA , z ∈ XC . fA|C (x|z) = 0 if fC (z) = 0 by a convention, A well-known fact (see Theorem 1 in Lauritzen et al. [69]) is that P ∈ MG iff its density recursively factorizes with respect to G, that is, f (x) = fi|paG (i) (xi |xpaG (i) ) for every x ∈ XN . (8.1) i∈N
Note that this fact can also be derived as a consequence of Theorem 4.1: owing to Lemma 7.1, it suffices to apply the theorem to the standard imset uG and the standard reference system (in the discrete case) – see p. 75. The definition of conditional density then allows us to rewrite the condition (i) in Theorem 4.1 in the form of (8.1). Convention 4. Let XN = i∈N Xi be a discrete sample space over N . The letter i will be used as a generic symbol for a variable in N : i ∈ N . r(i) Denote r(i) = |Xi | ≥ 1 and fix an ordering yi1 , . . . , yi of elements of Xi for every i ∈ N . The letter k will be used as a generic symbol for a code of an element of Xi : k ∈ {1, . . . , r(i)}. More specifically, k is the code of the k-th node configuration yik in the fixed ordering. Given i ∈ N and x ∈ XA such that i ∈ A ⊆ N symbol k(i, x) will denote the code of xi , that is, the unique 1 ≤ k ≤ r(i) such that xi = yik . Let G ∈ DAGS(N ). Denote by q(i, G) ≡ |XpaG (i) | = l∈paG (i) r(l) ≥ 1 the number of parent configurations for the variable i ∈ N and accept the q(i,G) of convention that q(i, G) = 1 if paG (i) = ∅. Fix an ordering zi1 , . . . , zi parent configurations, that is, elements of XpaG (i) , for every i ∈ N . If paG (i) = ∅ then it consists of the only parent configuration. The letter j will be used as a generic symbol for a code of a parent configuration: j ∈ {1, . . . , q(i, G)}. Thus, j is the code of the j-th configuration zij in the fixed ordering. Given i ∈ N and x ∈ XA such that paG (i) ⊆ A ⊆ N the symbol j(i, x) will denote the code of xpaG (i) , that is, the unique 1 ≤ j ≤ q(i, G) such that xpaG (i) = zij if paG (i) = ∅ and j = 1 otherwise. Moreover, let D ∈ DATA(N, d), d ∈ N be data over N (see p. 242); more specifically D : x1 , . . . , xd . Introduce the following special notation for the numbers of configuration occurrences in the database D: dij = |{1 ≤ ≤ d; x paG (i) = zij }| dijk = |{1 ≤ ≤ d; x {i}∪paG (i) = (yik , zij )}| d[x]
for i ∈ N, j ∈ {1, . . . , q(i, G)}, k ∈ {1, . . . , r(i)}, = |{1 ≤ ≤ d; x A = x}| for ∅ = A ⊆ N, x ∈ XA .
Of course, di1 = d if paG (i) = ∅.
♦
8.2 Quality criteria
165
Given G ∈ DAGS(N ), the respective “standard” parameterization of MG is based on the recursive factorization (8.1). The set of parameters ΘG consists of vectors θ ≡ [θijk ] where θijk ∈ [0, 1] for i ∈ N, j ∈ {1, . . . , q(i, G)}, k ∈ {1, . . . , r(i)}
(8.2)
r(i)
such that
θijk = 1 for every i ∈ N, 1 ≤ j ≤ q(i, G) .
k=1
Actually, every single parameter θijk can be interpreted as the value of the conditional density fi|paG (i) (yik |zij ). Given a vector parameter θ, the respective probability measure on XN is given by its density θ i j(i,x) k(i,x) for x ∈ XN . (8.3) f θ (x) = i∈N
Lemma 8.1. Let us accept Convention 4 and suppose that a vector of parameters θ satisfying (8.2) is given. Then the formula (8.3) defines a density of a probability measure P θ on XN (with respect to the counting measure). Moreover, P θ ∈ MG and ∀ i ∈ N , ∀ j ∈ {1, . . . , q(i, G)}, ∀ k ∈ {1, . . . , r(i)} θ fi|pa (yik |zij ) = θijk G (i)
θ if fpa (zij ) > 0 . G (i)
(8.4)
The mapping θ → P θ is a mapping onto MG . Proof. Given A ⊆ N let us denote the collection of parameters θijk with i ∈ A by θ[A]. Let t ∈ N be a terminal node in G. I. The first observation is that f θ ([y, z]) = ( θi j(i,z) k(i,z) ) · θt j(t,z) k(t,y) (8.5) i∈N \t
for every z ∈ XN \t , y ∈ Xt . Indeed, given x = [y, z] ∈ XN by Convention 4 and the fact that paG (i) ⊆ N \ {t} for i ∈ N one has j(i, x) = j(i, z) for i ∈ N , k(t, x) = k(t, y) and k(i, x) = k(i, z) for i ∈ N \ {t}. It suffices to substitute these equalities into (8.3). II. The second observation is that θ f θ ([y, z]) = θi j(i,z) k(i,z) = f θ[N \t] (z) (8.6) fN \t (z) ≡ y∈Xt
i∈N \t
θ θ for every z ∈ XN \t . To this end substitute (8.5) in fN y∈Xt f ([y, z]) \t (z) = θ and obtain easily fN y∈Xt θt j(t,z) k(t,y) . The \t (z) = ( i∈N \t θi j(i,z) k(i,z) ) · last sum is 1 by (8.2) and one can apply (8.3) to GN \t and θ[N \ t]. III. To verify that f θ is a probability density, choose an ordering t1 , . . . , t|N |
166
8 Learning
of the nodes of G which is consonant with the direction of arrows. Put N (n) = {t1 , . . . , tn } for n = 1, . . . , |N |, observe that tn is a terminal node in GN (n) and prove by induction on n = 1, . . . , |N | that f θ[N (n)] is a probability density on XN (n) . It is evident that f θ[N (n)] (x) ≥ 0 for x ∈ XN (n) . The fact x∈XN (n) f θ[N (n)] (x) = 1 for n = 1 is implied by (8.2) and the induction step follows from (8.6). Thus, f θ defines a probability measure P θ on XN . IV. The next observation is that, provided t is a terminal node in G, one has ftθ| N \t (y|z) = θt j(t,z) k(t,y)
θ for z ∈ XN \t with fN \t (z) > 0, y ∈ Xt .
(8.7)
To this end, substitute (8.5) and (8.6) into the definition of ftθ| N \t (y|z). The term i∈N \t θi j(i,z) k(i,z) is ensured to be non-zero in this case. Thus, it can be cancelled. V. Lemma 2.5 and (8.5) implies that t ⊥ ⊥ N \ t ∪ paG (t) | paG (t) [P θ ]. An analogous argument applied to tn and Gn for n = |N |, . . . , 1 allows one to ⊥ N (n) \ (tn ∪ paG (tn )) | paG (tn ) [P θ ]. This property, known as the show tn ⊥ “local Markov property” (see Remark 3.4) implies by the result from Verma and Pearl [150] that P θ is (globally) Markovian with respect to G, that is, P θ ∈ MG (see also Theorem 1 in Lauritzen et al. [69]). VI. The next observation is that if t is a terminal node in G then ftθ| paG (t) (ytk |ztj ) = θtjk for j ∈ {1, . . . , q(t, G)} θ with fpa (ztj ) > 0, k ∈ {1, . . . , r(t)} . G (t) j θ Indeed, choose z ∈ XN \t such that fN \t (z) > 0 and zpa(t) = zt , that is, j = j(t, z). Then take y ∈ Xt such that k = k(t, y). Thus, one is entitled to apply (8.7) to z. Write ftθ| pa(t) (ytk |ztj ) = ftθ| pa(t) (y|zpa(t) ) = θ θ (fpa(t) (zpa(t) ))−1 · ft∪pa(t) ([y, zpa(t) ]). However, t ⊥⊥ N \ t ∪ pa(t) | pa(t) [P θ ] implies by Lemma 2.4 that the latter term equals (f θ (z))−1 · f θ ([y, z]). By (8.7), it equals θt j(t,z) k(t,y) = θtjk . VII. The same argument applied to tn and GN (n) for n = 1, . . . , |N | gives (8.4). Finally, given P ∈ MG let us put θijk = fi | paG (i) (yik |zij ) for every i ∈ N , j ∈ {1, . . . , q(i, G)}, k ∈ {1, . . . , r(i)} such that fpaG (i) (zij ) > 0. If i, j are such that fpaG (i) (zij ) = 0 then choose any numbers θijk ∈ [0, 1] for k ∈ {1, . . . , r(i)} θ whose sum is 1. Observe that (8.2) is valid, whichθ allows one to define P . By (8.1) and (8.3) get f (x) = i∈N θi j(i,x) k(i,x) = f (x) for every x ∈ XN . Since both f and f θ are densities, this implies f = f θ , that is, P = P θ .
Remark 8.4. The correspondence θ → P θ is not a one-to-one mapping. The reason is that if t is a terminal node in G and PNθ \t (xN \t ) = 0 then P θ (x) = 0 no matter what is the value of θt j(t,x) k(t,x) . However, the correspondence is a one-to-one mapping on the set of parameters θ such that θijk ∈ (0, 1) for every i, j, k. These parameters correspond to strictly positive probability measures on MG .
8.2 Quality criteria
167
To get a maximum likelihood estimate in MG and a formula for the MLL criterion the following lemma is needed. Lemma 8.2. Let dk ≥ 0, k ∈ Kbe a non-empty finite collection of nonnegative numbers such that d = k∈K dk > 0. Then the vector function f with values in [−∞, ∞) given by dk · ln θk , f ([θk ]k∈K ) = k∈K
defined on the domain D(f ) = {[θk ]k∈K ; θk ≥ 0 k∈K θk = 1 } attains θˆk = (dk /d) for k ∈ K. In particular, the its maximum in [θˆk ]k∈K where maximum value of f on D(f ) is k∈K,dk >0 dk · ln(dk /d). Proof. This follows directly from Corollary A.1. It suffices to put ak = θk , bk = dk /d for d ∈ K and multiply the inequality (A.4) by d. Corollary 8.1. Let us accept Convention 4. Then the formulas for parameters d ijk if dij > 0, dij θˆijk = for i ∈ N, 1 ≤ j ≤ q(i, G), 1 ≤ k ≤ r(i), (8.8) 1 if dij = 0, r(i) ˆ
define a maximum likelihood estimate P θ in MG on the basis of data D ∈ DATA(N, d) (see Section A.9.3). Moreover, the maximized log-likelihood criterion has the form MLL (MG , D) =
r(i) q(i,G)
dijk · ln
i∈N j=1 k=1
dijk dij
(8.9)
for G ∈ DAGS(N ), D ∈ DATA(N, d) provided that a convention 0·ln (0/) ≡ 0 is accepted. Proof. Let D : x1 , . . . xd , d ≥ 1. Consider the task of maximizing the logarithm of the likelihood function l(θ) = ln L(θ, D). By the definition of the likelihood function and (8.3) write l(θ) = ln L(θ, D) =
d =1
ln f θ (x ) =
d
ln θi j(i,x ) k(i,x ) .
=1 i∈N
To get a better formula, introduce auxiliary notation for i ∈ N , x ∈ XN , 1 ≤ j ≤ q(i, G) and 1 ≤ k ≤ r(i): 1 if xi∪paG (i) = (yik , zij ), δi (j, k, x) = 0 otherwise. This allows us to write
168
8 Learning
l(θ) =
r(i) d q(i,G)
ln θijk · δi (j, k, x )
=1 i∈N j=1 k=1
=
r(i) q(i,G)
ln θijk ·
i∈N j=1 k=1
d
δi (j, k, x ) .
=1
By Convention 4 we observe that the latter sum is nothing but dijk . Therefore, l(θ) =
r(i) q(i,G)
dijk · ln θijk .
(8.10)
i∈N j=1 k=1
Clearly, the task of maximizing l(θ) is equivalent to the task of maximizing, for r(i) every i ∈ N and j ∈ {1, . . . , q(i, G)}, the function k=1 dijk · ln θijk defined r(i) r(i) r(i) on the set {[θijk ]k=1 ; θijk ≥ 0 k=1 θijk = 1 }. Since k=1 dijk = dij , if dij > 0 then the maximum is attained at θˆijk = (dijk /dij ) by Lemma 8.2. If dij = 0 then the function is constant and its maximum 0 is attained at any −1 point, in particular, at θˆijk = r(i) . This gives (8.8). Finally, substitute this into (8.10) to get (8.9). Corollary 8.2. Let us accept Convention 4. Then the effective dimension of the statistical model MG (see Section A.9.3) is (r(i) − 1) · q(i, G) = (r(i) − 1) · r(l) . (8.11) DIM (MG ) = i∈N
i∈N
l∈paG (i)
In particular, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) are given by the formulas ⎧ ⎫ r(i) ⎨ q(i,G) dijk ⎬ AIC (G, D) = dijk · ln 1 − r(i) + (8.12) ⎩ dij ⎭ i∈N j=1 k=1 ⎧ ⎫ r(i) ⎨ ln d ln d q(i,G) dijk ⎬ − · r(i) + BIC (G, D) = dijk · ln (8.13) ⎩ 2 2 dij ⎭ i∈N j=1
k=1
for every G ∈ DAGS(N ) and every D ∈ DATA(N, d). Proof. It follows from (8.2) that, for fixed i ∈ N , 1 ≤ j ≤ q(i, G), the number of linearly independent parameters is r(i) − 1. This implies (8.11). The other formulas follow from Corollary 8.1 – see Section A.9.3. Remark 8.5. Bayesian criteria for learning DAG models in a discrete distribution framework with prescribed sample spaces could be introduced in the following manner. In general, a prior probability measure π G on the respective parameter space ΘG defined by (8.2) is given for every G ∈ DAGS(N ). This
8.2 Quality criteria
169
allows one to define the marginal likelihood as the integral of the likelihood function with respect to π G . The respective Bayesian criterion is the logarithm of the marginal likelihood given by LML (G, D) = ln L(θ, D) dπ G (θ) . (8.14) ΘG
However, to get a more suitable formula for this criterion, additional assumptions are accepted. More specifically, every π G is assumed to be a product q(i,G) measure i∈N j=1 π (ij) where π (ij) is a “prior” probability measure on the “local” parameter space r(i)
Θ(ij) = { [θijk ]k=1 ; θijk ≥ 0
r(i)
θijk }
k=1
– see the assumptions of “global” and “local” (parameter) independence mentioned in Spiegelhalter and Lauritzen [121]. These assumptions allow one to derive a more explicit formula using (8.10) and the Fubini theorem: LML (G, D) =
q(i,G) i∈N j=1
ln
r(i) d
ijk θijk dπ G (θijk ) .
(8.15)
Θ(ij) k=1
Typically, there are lots of technical assumptions on the priors π (ij) which allow one to obtain even more convenient formulas for the LML criterion. The flexibility in the choice of priors leads to a great variety of particular Bayesian criteria for learning DAG models. However, because it is not the aim of this monograph to explain peculiarities of the Bayesian approach to learning DAG models, the reader is referred to § 3.2 of Castelo [20]. 8.2.2 Score equivalent criteria A quality criterion Q for learning DAG models is score equivalent if for every pair G, H ∈ DAGS(N ) and every D ∈ DATA(N, d) Q(G, D) = Q(H, D)
whenever G and H are Markov equivalent. (8.16)
This requirement is quite natural from the statistical point of view. Indeed, if acyclic directed graphs G and H are Markov equivalent then they represent the same statistical model and data “generated” by P ∈ MG = MH do not distinguish between G and H. Thus, provided that the aim is to learn a statistical model, neither the respective quality criterion should distinguish between Markov equivalent acyclic directed graphs. Of course, by Proposition 6.1, the condition (8.16) implies a weaker condition Q(G, D) = Q(H, D)
if G and H are independence equivalent.
(8.17)
170
8 Learning
Nevertheless, if we consider a non-trivial distribution framework, by which is meant that we assume r(i) ≥ 2 for every i ∈ N in Convention 4, then (8.16) is equivalent to (8.17). Indeed, it was explained in Section 6.1 that it follows from the existence of a perfectly Markovian measure [90] with prescribed nontrivial discrete sample spaces that Markov and independence equivalence of G, H ∈ DAGS(N ) coincide then. Remark 8.6. The concept of score equivalent criterion was pinpointed in § 4.2.5 of Bouckaert [14]. Note that most criteria for learning DAG models are score equivalent – see Proposition 8.2 for examples. However, there are quality criteria which are not score equivalent. A well-known example, which is mentioned on p. 71 of [14], is a particular Bayesian criterion named the K2 metric in Koˇcka [57] after the algorithm in which it was first used by Cooper and Herskovits [24]. A possible argument in favor of quality criteria which are not score equivalent is that if one is interested in a causal interpretation of acyclic directed graphs (see Spirtes et al. [122]) then a criterion of this kind allows one to distinguish between different causal interpretations. However, the point is that the difference in a causal interpretation cannot be identified on the basis of statistical data and therefore it is not a good idea to use such a criterion in the learning phase, which is based on data. 8.2.3 Decomposable criteria Given a database D ∈ DATA(N, d) : x1 , . . . , xd , d ≥ 1 over N and ∅ = A ⊆ N , the database DA ∈ DATA(A, d) : x1A , . . . , xdA over A will be called a projection of D onto A. A quality criterion Q for learning DAG models will be called decomposable if there is a collection of functions qi|B : DATA({i} ∪ B, d) → R, where i ∈ N , B ⊆ N \ {i} and d ∈ N, such that qi|paG (i) (Di∪paG (i) ) (8.18) Q(G, D) = i∈N
for every G ∈ DAGS(N ) and every D ∈ DATA(N, d). An important fact is that the functions qi|B do not depend on G; the graph G is only represented in the right-hand side of (8.18) by the collection of sets paG (i), i ∈ N . The criterion Q will be called strongly decomposable if, moreover, qi|B only depends on Di∪B through the respective marginal contingency table cti∪B (D) (see p. 242 for this concept). More precisely, Q is strongly decomposable if a collection of functions q¯i|B : CONT({i} ∪ B, d) → R, i ∈ N , B ⊆ N \ {i}, where d ∈ N, exists such that Q(G, D) = q¯i|paG (i) (cti∪paG (i) (D)) (8.19) i∈N
for every G ∈ DAGS(N ) and D ∈ DATA(N, d) (see Section A.9.1).
8.2 Quality criteria
171
Remark 8.7. The concept of a decomposable criterion was pinpointed in Chickering [23]. Actually, all criteria for learning DAG models which are used in practice are decomposable owing to the way they are constructed – see Castelo [20]. This is caused by the intended application of a local search method (see p. 162). Indeed, if acyclic directed graphs G and H differ in the presence of one arrow then there exists t ∈ N such that paG (i) = paH (i) for every i ∈ N \ t and by (8.18) Q(G, D) − Q(H, D) = qt|paG (t) (Dt∪paG (t) ) − qt|paH (t) (Dt∪paH (t) ) . Thus, the decomposability of a criterion for learning DAG models is a requirement brought by researchers in the area of artificial intelligence in order to make computations feasible. However, the definitions of the concept of a decomposable criterion I found in the literature are slightly vague. The authors [20, 23] seem to repeat a sketchy phrase from Heckerman [51]: a criterion is decomposable if it can be written as a product of measures, each of which is a function of only one node and its parents. What is not clear is how data affect the value of the criterion and what type of data is supposed to be given. Indeed, one can either consider data in the form of an ordered sequence of elements of the sample space or in the form of a contingency table. The distinction can be explained by means of the following simple example: if y, z ∈ XN , y = z then the database D : x1 = y, x2 = z of the length 2 differ from the database D : x1 = z, x2 = y if one accepts the interpretation of data in terms of an ordered sequence, but they coincide if one accepts the other interpretation! One can certainly imagine quality criteria whose values do depend on the order of items in a database. In particular, it really matters what type of input data one has in mind because the other understanding of data confines the class of quality criteria. A formal definition of a decomposable criterion from Chickering [23] seems to be consonant with both forms of data. However, Chickering [23] restricts his attention to Bayesian quality criteria (see Remark 8.5) which have the property that their value does not depend on the order of items in a database. Moreover, in the formulas for various decomposable criteria [14, 20] data are represented in the form of a contingency table. On the other hand, other authors [57, 20] implicitly understand data in the form of a sequence of elements of joint sample space. Because two different understandings of data lead to two different concepts of a decomposable quality criterion, I decided to include both versions and to distinguish between them by means of terminology: the concept of strong decomposability corresponds to the situation when data are supposed in the form of a contingency table. The distinction is also reflected in the subsequent concepts and results. 8.2.4 Regular criteria Two traditional requirements on a quality criterion can be joined in the following concept. A quality criterion for learning DAG models will be called regular
if there exists a collection of functions $t_A : \mathrm{DATA}(A,d) \to \mathbb{R}$, ∅ ≠ A ⊆ N, and a constant depending on the sample space and d, by a convention denoted by $t_\emptyset(D_\emptyset)$, such that
$$Q(G,D) = \sum_{i\in N}\big\{\, t_{i\cup pa_G(i)}(D_{i\cup pa_G(i)}) - t_{pa_G(i)}(D_{pa_G(i)}) \,\big\} \qquad (8.20)$$
for every G ∈ DAGS(N) and every D ∈ DATA(N, d). A criterion Q will be called strongly regular if $t_A$ depends on $D_A$ only through $ct_A(D)$, that is, if there exists a collection of functions $\bar t_A : \mathrm{CONT}(A,d) \to \mathbb{R}$, ∅ ≠ A ⊆ N, and a constant $\bar t_\emptyset()$ depending on the sample space and d such that
$$Q(G,D) = \sum_{i\in N}\big\{\, \bar t_{i\cup pa_G(i)}(ct_{i\cup pa_G(i)}(D)) - \bar t_{pa_G(i)}(ct_{pa_G(i)}(D)) \,\big\} \qquad (8.21)$$
for every G ∈ DAGS(N) and every D ∈ DATA(N, d) (see Section A.9.1). Observe that it follows immediately from this definition that a linear combination of (strongly) regular criteria is again a (strongly) regular criterion.

Lemma 8.3. Let us accept Convention 4 and assume r(i) ≥ 2 for every i ∈ N. A quality criterion Q for learning DAG models is regular iff it is decomposable and score equivalent. Moreover, it is strongly regular iff it is strongly decomposable and score equivalent.

Proof. I. If Q is regular then put
$$q_{i|B}(D) = t_{i\cup B}(D) - t_B(D_B) \qquad (8.22)$$
for i ∈ N, B ⊆ N \ {i} and D ∈ DATA({i} ∪ B, d). Observe that (8.20) implies (8.18). The transformational characterization of equivalent acyclic directed graphs (see p. 49) simplifies the proof of (8.17): it suffices to verify, for every D ∈ DATA(N, d), that Q(G,D) = Q(H,D) whenever H is obtained from G ∈ DAGS(N) by a legal reversal of an arrow a → b in G. The proof of this fact is analogous to the proof of Corollary 7.1: owing to (8.20) and the fact that $pa_G(i) = pa_H(i)$ for i ∈ N \ {a,b}, one needs to verify
$$\sum_{i\in\{a,b\}}\big\{\, t_{i\cup pa_G(i)}(D_{i\cup pa_G(i)}) - t_{pa_G(i)}(D_{pa_G(i)}) \,\big\} = \sum_{i\in\{a,b\}}\big\{\, t_{i\cup pa_H(i)}(D_{i\cup pa_H(i)}) - t_{pa_H(i)}(D_{pa_H(i)}) \,\big\}\,.$$
This causes no problems since there exists C ⊆ N such that paG (a) = paH (b) = C, paG (b) = C ∪ a and paH (a) = C ∪ b. As explained in Section 8.2.2, since a non-trivial distribution framework is considered, (8.17) implies (8.16). II. To verify the converse implication, suppose that Q is decomposable and
score equivalent. The first observation is that, for every C ⊆ N, every a, b ∈ N \ C with a ≠ b, and every D ∈ DATA(N, d),
$$q_{a|bC}(D_{abC}) + q_{b|C}(D_{bC}) = q_{b|aC}(D_{abC}) + q_{a|C}(D_{aC})\,. \qquad (8.23)$$
Indeed, let us construct G, H ∈ DAGS(N) as follows. Both graphs have all arrows from C to {a, b}. Moreover, b → a is an arrow in G while a → b is an arrow in H. Clearly, G and H are equivalent acyclic directed graphs over N and $pa_G(i) = pa_H(i)$ for every i ∈ N \ {a,b}. It suffices to substitute (8.18) into (8.16); after cancellation of the terms which correspond to i ∈ N \ {a,b}, (8.23) is obtained.
III. The required system of functions $t_A$, A ⊆ N, can be constructed recursively. First, put $t_\emptyset(D_\emptyset) = 0$. For ∅ ≠ A ⊆ N, define $t_A(D_A)$ on the basis of $t_B(D_B)$ with B ⊂ A. Choose a ∈ A and put
$$t_A(D_A) = q_{a|A\setminus a}(D_A) + t_{A\setminus a}(D_{A\setminus a}) \qquad \text{for } D \in \mathrm{DATA}(N,d)\,. \qquad (8.24)$$
IV. However, one has to show that this definition does not depend on the choice of a ∈ A, in other words, that (8.24) holds for any other b ∈ A in place of a. This can be proved by induction on |A|. It is trivial if |A| = 1. If |A| ≥ 2, observe that, for every D ∈ DATA(N, d) and every a, b ∈ A with a ≠ b,
$$q_{a|A\setminus a}(D_A) + t_{A\setminus a}(D_{A\setminus a}) = q_{b|A\setminus b}(D_A) + t_{A\setminus b}(D_{A\setminus b})\,.$$
Indeed, put C = A \ {a,b} in (8.23) and add the term $t_{A\setminus ab}(D_{A\setminus ab})$ to both sides of the equality:
$$q_{a|A\setminus a}(D_A) + \{\, q_{b|A\setminus ab}(D_{A\setminus a}) + t_{A\setminus ab}(D_{A\setminus ab}) \,\} = q_{b|A\setminus b}(D_A) + \{\, q_{a|A\setminus ab}(D_{A\setminus b}) + t_{A\setminus ab}(D_{A\setminus ab}) \,\}\,.$$
By the induction hypothesis, the expressions in braces are $t_{A\setminus a}(D_{A\setminus a})$ and $t_{A\setminus b}(D_{A\setminus b})$, respectively, which yields the desired equality.
V. The validity of (8.24) for every A ⊆ N implies (8.22): put A = i ∪ B and a = i. This in combination with (8.18) gives (8.20).
VI. The proof of the claim concerning a strongly regular criterion is omitted. It is analogous: one simply writes $ct_A(D)$ instead of $D_A$, $\bar q$ instead of q and $\bar t$ instead of t.

The functions $t_A$, A ⊆ N, inducing a regular criterion are not uniquely determined. To characterize those collections of functions which induce the same quality criterion, the following lemma characterizing special modular functions (see p. 90) is needed.

Lemma 8.4. Let $L^*(N)$ denote the class of modular functions l such that l(N) = l(∅). Then $L^*(N)$ has the form
$$\{\, l \in \mathbb{R}^{\mathcal{P}(N)} : \langle l, \delta_N - \delta_\emptyset\rangle = 0 \ \text{ and } \ \forall\, G \in \mathrm{DAGS}(N)\ \ \langle l, u_G\rangle = 0 \,\}\,. \qquad (8.25)$$
Moreover, $L^*(N)$ is a linear subspace of $\mathbb{R}^{\mathcal{P}(N)}$ of dimension |N| and
$$l \in L^*(N) \ \Leftrightarrow\ l = w_\emptyset \cdot m_{\emptyset\uparrow} + \sum_{i\in N} w_i \cdot m_{\{i\}\uparrow}\,, \qquad (8.26)$$
where $w_\emptyset, w_i \in \mathbb{R}$, i ∈ N, are numbers satisfying $\sum_{i\in N} w_i = 0$.
Proof. I. To show (8.25), it suffices to show that $l \in \mathbb{R}^{\mathcal{P}(N)}$ is modular iff $\langle l, u_G\rangle = 0$ for every G ∈ DAGS(N). Indeed, by Proposition 5.1, l ∈ L(N) iff $\langle l, u\rangle = 0$ for every u ∈ E(N), respectively for every u ∈ S(N). However, by Remark 7.3 and Lemma 7.1, $E(N) \subseteq \{u_G\,;\ G \in \mathrm{DAGS}(N)\} \subseteq S(N)$.
II. By Lemma 5.2, L(N) is a linear subspace of dimension |N| + 1. Thus, $L^*(N) \subset L(N)$ implies that $L^*(N)$ is a linear subspace whose dimension does not exceed |N|. To show that its dimension is |N|, it suffices to construct a linearly independent subset of $L^*(N)$ of cardinality |N|. To this end, fix a total ordering $b_1, \ldots, b_{|N|}$ of N and put
$$l_1 = m_{\emptyset\uparrow}\,, \qquad l_j = m_{\{b_{j-1}\}\uparrow} - m_{\{b_j\}\uparrow} \quad \text{for } j = 2, \ldots, |N|\,. \qquad (8.27)$$
By Lemma 5.2, $l_1, \ldots, l_{|N|} \in L^*(N)$.
III. Observe that they form a linearly independent set. Indeed, suppose that for some $\alpha_j \in \mathbb{R}$ one has $\sum_{j=1}^{|N|} \alpha_j\cdot l_j(A) = 0$ for any A ⊆ N. If |N| = 1, then this implies $\alpha_1 = 0$. Suppose |N| ≥ 2, put A = {b_j} for j = 1, …, |N| and obtain $0 = \alpha_1 + \alpha_2$, $0 = \alpha_1 - \alpha_j + \alpha_{j+1}$ for j = 2, …, |N|−1, and $0 = \alpha_1 - \alpha_{|N|}$. Sum these equalities to get $0 = |N|\cdot\alpha_1$. Then substitute $\alpha_1 = 0$ into the equalities and show by induction that $\alpha_j = 0$ for j = 2, …, |N|. Thus, $\alpha_1 = \ldots = \alpha_{|N|} = 0$, which concludes the proof of the fact that $l_1, \ldots, l_{|N|}$ are linearly independent.
IV. The above observation means that $l_1, \ldots, l_{|N|}$ is a linear basis of $L^*(N)$. Therefore, $l \in L^*(N)$ iff $l = \sum_{j=1}^{|N|} \alpha_j\cdot l_j$ for some $\alpha_j \in \mathbb{R}$. Substitute (8.27) into this formula and get (8.26) where $w_\emptyset = \alpha_1$, $w_{\{b_1\}} = \alpha_2$, $w_{\{b_j\}} = -\alpha_j + \alpha_{j+1}$ for j = 2, …, |N|−1 and $w_{\{b_{|N|}\}} = -\alpha_{|N|}$. Because the above correspondence between the w's and the α's is one-to-one, a general expression for $l \in L^*(N)$ is given by (8.26).

Corollary 8.3. Let Q be a regular quality criterion for learning DAG models given by (8.20) for some functions $t_A$, A ⊆ N. Then a collection of functions $\tilde t_A : \mathrm{DATA}(A,d) \to \mathbb{R}$, A ⊆ N (where $\tilde t_\emptyset$ is a constant depending on d and the
sample space) defines Q by (8.20) iff there exist constants $w_\emptyset(d)$ and $w_i(d)$, i ∈ N, such that $\sum_{i\in N} w_i(d) = 0$ and
$$\tilde t_A(D_A) = t_A(D_A) + w_\emptyset(d) + \sum_{i\in A} w_i(d) \qquad \text{for } A \subseteq N,\ D \in \mathrm{DATA}(N,d)\,. \qquad (8.28)$$
Proof. Let us fix the sample space and the length d of a database. Then the collection of functions $t_A$, A ⊆ N, can be interpreted as a real function on DATA(N,d) × P(N) which assigns the value $t_A(D_A) \equiv t^D(A)$ to D ∈ DATA(N,d) and A ⊆ N. Moreover, this function, for every ∅ ≠ A ⊆ N, depends on $D_A$, and it is constant on DATA(N,d) if A = ∅. The collection $\tilde t_A$, A ⊆ N, can be represented in an analogous way. Now, by (8.20) and the definition of a standard imset (see p. 135), derive that $t_A$, A ⊆ N, and $\tilde t_A$, A ⊆ N, both induce Q iff
$$\forall\, D \in \mathrm{DATA}(N,d)\ \ \forall\, G \in \mathrm{DAGS}(N) \qquad \langle \tilde t^D - t^D,\ \delta_N - \delta_\emptyset - u_G\rangle = 0\,.$$
I. Observe that this condition is equivalent to the requirement that $l^D \equiv \tilde t^D - t^D \in L^*(N)$ for every D ∈ DATA(N,d). Indeed, for a fixed D ∈ DATA(N,d), choose G ∈ DAGS(N) with $u_G = 0$ and observe $\langle l^D, \delta_N - \delta_\emptyset\rangle = 0$. Hence $\langle l^D, u_G\rangle = 0$ for every G ∈ DAGS(N). These two conditions are equivalent to $l^D \in L^*(N)$ by (8.25) in Lemma 8.4.
II. Now, let us apply (8.26) to every $l^D$, D ∈ DATA(N,d), to see that the requirement in Step I is equivalent to the existence of some real functions $w_\emptyset, w_i : \mathrm{DATA}(N,d) \to \mathbb{R}$, i ∈ N, such that $\sum_{i\in N} w_i(D) = 0$ and
$$l^D(A) = w_\emptyset(D)\cdot m_{\emptyset\uparrow}(A) + \sum_{i\in N} w_i(D)\cdot m_{\{i\}\uparrow}(A) \qquad (8.29)$$
for D ∈ DATA(N,d), A ⊆ N. The point is that the functions $w_\emptyset$ and $w_i$, i ∈ N, have to be constant on DATA(N,d). Indeed, the substitution A = ∅ in (8.29) gives $w_\emptyset(D) = l^D(\emptyset) = \tilde t_\emptyset(D_\emptyset) - t_\emptyset(D_\emptyset)$, which is a constant. Analogously, by putting A = {i} for i ∈ N, we derive that $w_i$ depends on $D_i$. Suppose for contradiction that $w_i$ is not constant for some i ∈ N. Then databases $D^1 : x^1, \ldots, x^d$ and $D^2 : y^1, \ldots, y^d$ exist such that $w_i(D^1) \neq w_i(D^2)$. Construct a database $D^3 : z^1, \ldots, z^d$ by putting $z^\ell = [x^\ell_{N\setminus i},\, y^\ell_i]$ for ℓ = 1, …, d. Thus, $D^3_{N\setminus i} = D^1_{N\setminus i}$ and $D^3_i = D^2_i$. As the $w_{i'}$, i′ ∈ N, depend on $D_{i'}$, one has
$$\sum_{i'\in N} w_{i'}(D^3) - \sum_{i'\in N} w_{i'}(D^1) = w_i(D^2) - w_i(D^1) \neq 0\,,$$
which contradicts the assumption $\sum_{i\in N} w_i(D) = 0$ for any D ∈ DATA(N,d).
III. On the other hand, the validity of (8.29) for every D ∈ DATA(N,d) and constant functions $w_\emptyset$ and $w_i$, i ∈ N, is nothing but (8.28).

Remark 8.8. The class $L^*(N)$ induces an equivalence relation: $t - \tilde t \in L^*(N)$ for $t, \tilde t \in \mathbb{R}^{\mathcal{P}(N)}$. One can then introduce a reasonable “standard” representative of
every equivalence class, namely the only $t^*$ within the class such that $t^*(A) = 0$ for A ⊆ N with |A| = 1. However, this idea is not applicable in the situation where one is looking for a “standardized” collection of functions $t_A$, A ⊆ N, inducing a regular criterion. The reason is that $t_i$ need not be constant for i ∈ N, while the available “standardization” operation (8.28) deals with constants. Thus, one can achieve by “standardization” $t_i(D_i) = 0$ for every i ∈ N and any chosen database D ∈ DATA(N,d), but then another database D′ may exist for which this condition is not valid.

The formulas derived in Section 8.2.1 allow one to verify that basic information criteria are strongly regular.

Proposition 8.1. The maximized log-likelihood criterion (MLL) is a strongly regular criterion for learning DAG models.

Proof. Suppose the situation described by Convention 4. Put
$$\bar t_A(\mathrm{d}) = \begin{cases} \sum_{x\in X_A} \mathrm{d}(x)\cdot\ln \mathrm{d}(x) & \text{if } \emptyset \neq A \subseteq N,\ \mathrm{d} \in \mathrm{CONT}(A,d),\\ d\cdot\ln d & \text{if } A = \emptyset, \end{cases} \qquad (8.30)$$
where the convention 0 · ln 0 ≡ 0 is accepted. By (8.9), the formula $d_{ij} = \sum_{k=1}^{r(i)} d_{ijk}$ and Convention 4, we can write
$$\mathrm{MLL}(G,D) = \sum_{i\in N}\sum_{j=1}^{q(i,G)}\sum_{k=1}^{r(i)} d_{ijk}\cdot \ln\frac{d_{ijk}}{d_{ij}} = \sum_{i\in N}\sum_{j=1}^{q(i,G)}\sum_{k=1}^{r(i)} d_{ijk}\cdot \ln d_{ijk} \;-\; \sum_{i\in N}\sum_{j=1}^{q(i,G)} \ln d_{ij}\cdot \sum_{k=1}^{r(i)} d_{ijk}$$
$$= \sum_{i\in N}\Big\{\ \underbrace{\sum_{j=1}^{q(i,G)}\sum_{k=1}^{r(i)} d_{ijk}\cdot \ln d_{ijk}}_{\bar t_{i\cup pa_G(i)}(ct_{i\cup pa_G(i)}(D))} \;-\; \underbrace{\sum_{j=1}^{q(i,G)} d_{ij}\cdot \ln d_{ij}}_{\bar t_{pa_G(i)}(ct_{pa_G(i)}(D))}\ \Big\}$$
for G ∈ DAGS(N) and D ∈ DATA(N, d).
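The proof translates into a few lines of code. This is a minimal sketch of (8.30) and the additive form (8.19), assuming databases are lists of configuration tuples and graphs map node indices to parent tuples (both representation choices are mine):

```python
from math import log
from collections import Counter

def t_bar(ct):
    """The function of (8.30): sum of d(x)*ln d(x) over the cells of a
    (marginal) contingency table ct, with the convention 0*ln 0 = 0;
    for A = empty set all items fall into one cell, giving d*ln d."""
    return sum(c * log(c) for c in ct.values() if c > 0)

def marginal_ct(D, coords):
    """ct_A(D): configuration counts of the restriction to the coordinates."""
    return Counter(tuple(x[i] for i in coords) for x in D)

def MLL(G, D):
    """Maximized log-likelihood assembled per node as in (8.19)."""
    return sum(t_bar(marginal_ct(D, (i,) + pa)) - t_bar(marginal_ct(D, pa))
               for i, pa in G.items())

D = [(0, 0), (0, 1), (1, 1), (1, 1)]   # a database of length 4 over two
G = {0: (), 1: (0,)}                   # binary variables; the DAG a -> b
print(MLL(G, D))                       # equals -6*ln 2 here
```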
The mapping which assigns the effective dimension of $\mathbf{M}_G$ to every G ∈ DAGS(N) can be viewed as a special quality criterion for learning DAG models, namely a criterion which does not depend on data.

Proposition 8.2. The effective dimension (DIM) is a strongly regular criterion for learning DAG models. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are strongly regular criteria as well. In particular, provided that r(i) ≥ 2 for i ∈ N, they are score equivalent and strongly decomposable.
Proof. Supposing Convention 4, put
$$\bar t_A(\mathrm{d}) = \begin{cases} \prod_{l\in A} r(l) & \text{if } \emptyset \neq A \subseteq N, \text{ for every } \mathrm{d} \in \mathrm{CONT}(A,d),\\ 1 & \text{if } A = \emptyset. \end{cases} \qquad (8.31)$$
By (8.11) and Convention 4, for every G ∈ DAGS(N) and D ∈ DATA(N, d) we can now write
$$\mathrm{DIM}(G,D) = \sum_{i\in N} (r(i)-1)\cdot q(i,G) = \sum_{i\in N} \{\, r(i)\cdot q(i,G) - q(i,G) \,\} = \sum_{i\in N}\Big\{\ \underbrace{\prod_{l\in\{i\}\cup pa_G(i)} r(l)}_{\bar t_{i\cup pa_G(i)}(ct_{i\cup pa_G(i)}(D))} \;-\; \underbrace{\prod_{l\in pa_G(i)} r(l)}_{\bar t_{pa_G(i)}(ct_{pa_G(i)}(D))}\ \Big\}\,.$$
Thus, the DIM criterion is strongly regular. This fact, together with Proposition 8.1 and the definitions of the AIC criterion and the BIC criterion, implies the same conclusion for these criteria. Finally, use Lemma 8.3.

Remark 8.9. In my view, the logarithm of the marginal likelihood (LML) mentioned in Remark 8.5 can also be shown to be strongly regular if relevant assumptions on the priors $\pi^G$, G ∈ DAGS(N), are accepted. More exactly, I have in mind the assumptions explicated in Dawid and Lauritzen [30], namely the technical assumption that the $\pi^G$ are Dirichlet measures and the assumption that the priors are compatible (see § 6 of [30]). Actually, as mentioned on p. 54 of [20], there are examples of Bayesian criteria which are both decomposable and score equivalent. Therefore, they are regular by Lemma 8.3.
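A numerical sketch of (8.31) and of the derivation above, under the same assumed graph representation as in the earlier snippets; note that the product over an empty index set is 1, matching the case A = ∅:

```python
from math import prod, log

def dim(G, r):
    """Effective dimension of the DAG model: the sum over nodes of
    (r(i)-1)*q(i,G) = prod_{l in {i} u pa(i)} r(l) - prod_{l in pa(i)} r(l)."""
    return sum(prod(r[l] for l in (i,) + pa) - prod(r[l] for l in pa)
               for i, pa in G.items())

r = {0: 2, 1: 3, 2: 2}                 # sample-space sizes r(i)
G = {0: (), 1: (0,), 2: (0, 1)}        # a -> b, a -> c, b -> c
print(dim(G, r))                       # (2-1)*1 + (3-1)*2 + (2-1)*6 = 11

# BIC as a linear combination of two strongly regular criteria, hence
# itself strongly regular (d = database length, mll_value = MLL(G, D)):
def bic(mll_value, G, r, d):
    return mll_value - 0.5 * log(d) * dim(G, r)
```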
8.3 Inclusion neighborhood

The binary relation $M_K \subseteq M_L$ for K, L ∈ DAGS(N) defines the inclusion quasi-ordering on the set of acyclic directed graphs over N. It can also be viewed as a partial ordering on the set of Markov equivalence classes of acyclic directed graphs over N, respectively on the set of DAG models over N. In this chapter, every G ∈ DAGS(N) is also interpreted as a statistical model $\mathbf{M}_G$ on a (fixed) discrete sample space $X_N = \prod_{i\in N} X_i$ (see p. 165). Clearly, $M_K \subseteq M_L$ implies $\mathbf{M}_K \supseteq \mathbf{M}_L$, and the converse is true if $|X_i| \ge 2$ for every i ∈ N because of the existence of a perfectly Markovian measure on $X_N$ for every G ∈ DAGS(N) [90]. In particular, the strict inclusion of formal independence models $M_K \subset M_L$ for K, L ∈ DAGS(N) is then equivalent to the strict inclusion $\mathbf{M}_K \supset \mathbf{M}_L$ of statistical models.

Given K, L ∈ DAGS(N), we say that they are inclusion neighbors and write $M_K \prec M_L$ if $M_K \subset M_L$ and there is no G ∈ DAGS(N) such that $M_K \subset M_G \subset M_L$. More precisely, we say then that $M_L$ is an upper neighbor of $M_K$ or, dually, that $M_K$ is a lower neighbor of $M_L$. The inclusion neighborhood of
$M_G$, G ∈ DAGS(N), consists of the union of the collection of upper neighbors and the collection of lower neighbors.

Remark 8.10. The inclusion quasi-ordering can be understood in two ways: either in the sense $M_K \subseteq M_L$ or in the sense $\mathbf{M}_K \supseteq \mathbf{M}_L$. Thus, it is a matter of taste how this asymmetric binary relation of graphs K and L is reflected in the terminology, notation and pictures. Whittaker [157] had in mind the interpretation of graphs in terms of statistical models. Thus, in his view, if $\mathbf{M}_K \supseteq \mathbf{M}_L$ then K represents a bigger statistical model than L. Therefore, he called the saturated model, which is represented by a complete graph, the maximal graphical model (see p. 185 of [157]) and always placed the pictorial representatives of this model at the top of his diagrams. Chickering [23] used the formula L ≤ K to denote $M_K \subseteq M_L$. It seems that he was mainly led by graphical pictorial representation. Indeed, $M_K \subseteq M_L$ implies that K contains at least those edges which are present in L. Similarly, in Kočka [57] and Castelo [20], graphs with the maximal number of edges are at the top of the respective Hasse diagrams. However, CI interpretation is the main issue in this monograph. Therefore I prefer the interpretation in terms of formal independence models, that is, $M_K \subseteq M_L$. One reason for my preference is that this point of view admits a wider perspective: graphical models can be interpreted as elements of the lattice of CI structures. This approach is naturally reflected in terminology and pictures. Therefore, if $M_K \prec M_L$, then $M_L$ is called an upper neighbor of $M_K$ and the pictorial representative of $M_L$ is put above the pictorial representative of $M_K$ in all diagrams in this book. In particular, the saturated model, which represents the largest class of probability measures but empty CI information, is always placed at the bottom of a Hasse diagram. This is also natural from the point of view of an arithmetic approach to the description of these models, which is explained in Section 8.4. Indeed, if models are described by standard imsets then the zero imset is placed at the bottom and a move towards an upper neighbor corresponds to adding an (elementary) imset. Note that in order to avoid confusion I intentionally do not use a notation like L ≤ K.

The concept of an inclusion neighborhood is certainly important from the theoretical point of view. Nevertheless, there are practical reasons for which it deserves serious attention. These reasons are strongly related to the local search methods mentioned on p. 162. It is claimed on p. 66 of Castelo [20] that the “inclusion boundary condition” from Kočka [57] is a necessary condition for a neighborhood structure to avoid an unpleasant situation in which a local search procedure finds a local maximum of a quality criterion which is not the global maximum. This condition requires, for every G ∈ DAGS(N), that the inclusion neighborhood of $M_G$ should be covered by $\{M_H\,;\ H \in nei(G)\}$ where nei(G) is the class of neighboring graphs for G given by the respective local search method.
Actually, the significance of this condition was implicitly recognized already by Meek [91] in connection with a particular local search procedure named the greedy equivalence search (GES) algorithm. The goal of the GES algorithm is learning DAG models. Every DAG model over N can be represented graphically by any member of the respective equivalence class of acyclic directed graphs over N. Therefore, the quality criteria used in this procedure are supposed to be both decomposable and score equivalent. The states of the procedure are in fact equivalence classes of graphs, although the procedure formally works with individual elements of DAGS(N). The algorithm has two phases. In the first phase edges are inserted and in the second phase edges are removed. Thus, in each phase a different part of the overall neighborhood structure of a “current” graph is considered.

(i) The first phase starts with the acyclic directed graph G over N which has no arrows. In this phase, the set of neighboring graphs of a current G ∈ DAGS(N) consists of those K ∈ DAGS(N) such that there exists L ∈ DAGS(N) which is equivalent to G and K is obtained from L by one arrow addition. Provided there exists a neighboring graph K which yields a strictly positive increase Q(K,D) − Q(G,D), D ∈ DATA(N,d), of the quality criterion value, one chooses K with the maximum increase and moves from G to K, which becomes the new current graph. The procedure is repeated until the quality criterion value of the current graph cannot be increased in this way.

(ii) The second phase starts with the result of the first phase. It employs the complementary part of the overall neighborhood structure. In the second phase, the set of graphs neighboring the current G ∈ DAGS(N) consists of those L ∈ DAGS(N) such that there exists K ∈ DAGS(N) which is equivalent to G, and L is obtained from K by the removal of an arrow. Again, one tries to maximize the increase Q(L,D) − Q(G,D) of the quality criterion value and the procedure stops when the value of Q for the current graph cannot be increased in this way.

The output of the GES algorithm is the equivalence class of acyclic directed graphs over N which contains the last current graph G ∈ DAGS(N). Meek [91] formulated a conjecture about a transformational graphical characterization of the inclusion quasi-ordering for acyclic directed graphs (see Lemma 8.5 below). As reported in Chickering [23], Meek also showed that if his conjecture is true and many other assumptions are fulfilled, then the GES algorithm finds the “optimal solution”. That is, one should reach the equivalence class of acyclic directed graphs with respect to which the probability measure which “generates” the data is perfectly Markovian. Note that his assumptions involve the data faithfulness assumption (see p. 156), several other technical statistical assumptions and the assumption that the length of the database D tends to infinity. The validity of the conjecture is needed to show that the GES algorithm cannot terminate with a model that is not the optimal solution – for details see the proof of Lemma 9 in [23].
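In outline, the two phases share one greedy loop. The following Python-style schema is only an illustration of the control flow described in (i) and (ii): the neighborhood enumerators `insert_neighbors` and `delete_neighbors`, which would generate the graphs reachable through equivalent representations, are assumed to be supplied, and nothing here is claimed about the actual implementations in [91, 23].

```python
def ges(empty_graph, D, Q, insert_neighbors, delete_neighbors):
    """Schematic greedy equivalence search: greedy ascent first over the
    insertion part of the neighborhood structure, then over the deletion
    part, as in phases (i) and (ii)."""
    G = empty_graph
    for neighbors in (insert_neighbors, delete_neighbors):
        while True:
            # the candidate with the maximum criterion value among neighbors
            best = max(neighbors(G), key=lambda K: Q(K, D), default=None)
            if best is None or Q(best, D) <= Q(G, D):
                break              # no strictly positive increase is possible
            G = best               # move to the neighbor with maximum increase
    return G                       # a representative of the output class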
The conjecture about a transformational characterization of the inclusion quasi-ordering has recently been confirmed by Chickering [23] with the help of a method used in Kočka et al. [58] to verify the conjecture in a special case. The characterization is based on the operation of a legal arrow reversal (see p. 49) and the operation of a legal arrow addition, by which an arrow addition into an acyclic directed graph is meant such that the resulting directed graph remains acyclic. However, I prefer a simpler formulation of the result, which uses the inverse operation. By an arrow removal the following change of an acyclic directed graph K over N into a directed graph L over N is understood: one simply removes an arrow a → b in K. Evidently, L is again an acyclic directed graph. Moreover, one has $M_K \subset M_L$ (which follows from the following lemma).

Lemma 8.5. Suppose that K, L ∈ DAGS(N). Then $M_K \subseteq M_L$ iff there exists a sequence $K = G_1, \ldots, G_n = L$, n ≥ 1, of acyclic directed graphs over N such that for every i = 1, …, n−1
• either $G_{i+1}$ is obtained from $G_i$ by a legal arrow reversal, or
• $G_{i+1}$ is obtained from $G_i$ by an arrow removal.
In particular, $M_K \subset M_L$ iff a sequence of graphs of this type exists such that n ≥ 2 and the arrow removal operation is applied at least once in the sequence. Finally, $M_K \prec M_L$ iff there exist K′, L′ ∈ DAGS(N) such that K′ is equivalent to K, L′ is equivalent to L, and L′ is obtained from K′ by an arrow removal.

Proof. I. The first claim of the lemma is proved in Chickering [23]. However, it is formulated in the converse order; namely, $M_K \subseteq M_L$ iff there exists a sequence $L = H_1, \ldots, H_n = K$, n ≥ 1, such that, for every j = 1, …, n−1, $H_{j+1}$ is obtained from $H_j$ either by a legal arrow reversal or by a legal arrow addition.
II. The second claim of the lemma can be verified as follows: the necessity of the condition follows from the first claim and the transformational characterization of equivalence of acyclic directed graphs mentioned on p. 49.
III. The sufficiency follows from the equivalence characterization and the observation that if L is obtained from K ∈ DAGS(N) by the removal of an edge then $M_K \subset M_L$. Indeed, $M_K \subseteq M_L$ follows from the definition of the induced DAG model (p. 48): every route that is active in L is an active route in K, which means that every route blocked in K has to be blocked in L. The fact $M_K \neq M_L$ follows from the graphical characterization of independence equivalence of acyclic directed graphs (p. 48): equivalent graphs have the same underlying graph.
IV. The condition in the third claim is equivalent to the requirement that there exists a sequence $K = G_1, \ldots, G_n = L$, n ≥ 2, of the considered type
such that the arrow removal operation is applied exactly once (use the transformational characterization of equivalence). Indeed, if there are two arrow removals in the sequence, then one has $M_K \subset M_G \subset M_L$ for any graph G in the sequence which occurs after the first removal and before the last removal. Conversely, if $M_K \subset M_G \subset M_L$ for an acyclic directed graph G, then by the second claim of the lemma |A(K)| > |A(G)| > |A(L)|, where A(G) denotes the collection of arrows in G. This implies that K has at least two more edges than L.
V. Thus, the third claim can be derived from the second one.

Thus, a transformational characterization of the inclusion quasi-ordering for acyclic directed graphs is available. However, it does not seem very suitable for testing inclusion of DAG models, that is, for the task of deciding, given K, L ∈ DAGS(N), whether $M_K \subseteq M_L$ holds or not. Indeed, one has to try to construct a sequence of graphs between K and L and it is not clear how to do it. The question of whether a suitable direct graphical characterization of the inclusion quasi-ordering for acyclic directed graphs exists remains open – see Theme 2 in Chapter 9. On the other hand, a special arithmetic criterion is proposed in the next section.
8.4 Standard imsets and learning

The method of structural imsets can be applied in the area of learning DAG models. It brings a certain simplifying perspective.

8.4.1 Inclusion neighborhood characterization

Standard imsets for acyclic directed graphs (see p. 135) allow one to characterize the inclusion quasi-ordering in an arithmetic way.

Proposition 8.3. Let K, L ∈ DAGS(N) be such that L is obtained from K by the removal of an arrow a → b in K. Then $u_L - u_K = u_{\langle a,b|C\rangle}$ where $C = pa_K(b) \setminus \{a\} = pa_L(b)$.

Proof. Since $pa_L(c) = pa_K(c)$ for c ∈ N \ b, $pa_L(b) = C$ and $pa_K(b) = aC$, by (7.2) we get
$$u_L - u_K = \{\delta_{pa_L(b)} - \delta_{b\,pa_L(b)}\} - \{\delta_{pa_K(b)} - \delta_{b\,pa_K(b)}\} = \delta_C - \delta_{bC} - \delta_{aC} + \delta_{abC}\,.$$

Lemma 8.6. Suppose K, L ∈ DAGS(N). Then $M_K \subseteq M_L$ (≡ $\mathbf{M}_K \supseteq \mathbf{M}_L$ provided that r(i) ≥ 2 for i ∈ N in Convention 4) iff $u_L - u_K$ is a combinatorial imset over N. Moreover, $M_K \subset M_L$ (≡ $\mathbf{M}_K \supset \mathbf{M}_L$) iff $u_L - u_K$ is a non-zero combinatorial imset.
Proof. If $M_K \subseteq M_L$ then apply Lemma 8.5 and consider the respective sequence $K = G_1, \ldots, G_n = L$, n ≥ 1. If $G_{i+1}$ is obtained from $G_i$ by a legal arrow reversal, then $G_{i+1}$ is Markov equivalent to $G_i$ and $u_{G_{i+1}} = u_{G_i}$ by Corollary 7.1. If $G_{i+1}$ is obtained from $G_i$ by an arrow removal, then $u_{G_{i+1}} - u_{G_i} \in E(N)$ by Proposition 8.3. Thus $u_L - u_K = \sum_{i=1}^{n-1} (u_{G_{i+1}} - u_{G_i})$ is a combinatorial imset. Conversely, if $u_L - u_K \in C(N)$, then by Lemma 6.1 $u_L$ independence-implies $u_K$, that is, $M_{u_K} \subseteq M_{u_L}$. This means $M_K \subseteq M_L$ by Lemma 7.1. Thus, the proved fact implies that $M_K \subset M_L$ iff both $u_L - u_K \in C(N)$ and $u_K - u_L \notin C(N)$. However, it follows from Proposition 4.4 that, for w ∈ C(N), $-w \in C(N) \Leftrightarrow w = 0$. Hence, the latter condition says that $u_L - u_K$ is a non-zero combinatorial imset.

Note that the above characterization gives a simple arithmetic criterion for testing inclusion of DAG models. Indeed, as mentioned in Remark 6.3, the question of whether an imset is a combinatorial imset is decidable.

Corollary 8.4. Suppose K, L ∈ DAGS(N). Then $M_K \prec M_L$ if and only if $u_L - u_K$ is an elementary imset.

Proof. To prove the necessity of the condition, use the third claim of Lemma 8.5: by Corollary 7.1 and Proposition 8.3 we get $u_L - u_K = u_{L'} - u_{K'} \in E(N)$. To prove the sufficiency, use Lemma 8.6 to derive $M_K \subset M_L$ and suppose for contradiction that $M_K \subset M_G \subset M_L$ for some G ∈ DAGS(N). Then observe, again by Lemma 8.6, that $u_L - u_K = (u_L - u_G) + (u_G - u_K)$ is the sum of two non-zero combinatorial imsets. Thus, $\deg(u_L - u_K) \ge 2$ and $u_L - u_K \notin E(N)$ by Corollary 4.2, which contradicts the assumption.

Of course, given a pair K, L ∈ DAGS(N) such that $M_K \prec M_L$, the elementary imset $u_L - u_K$ is determined uniquely. It will be called a differential imset for K and L.

Remark 8.11. The concept of a differential imset brings an interesting interpretation of the links in the Hasse diagram of the poset of DAG models over N. Every link of this diagram can be annotated by the respective elementary triplet, and one can interpret the links of the diagram in terms of elementary conditional (in)dependence statements. An example of a diagram of this type for N with |N| = 3 is given in Figure 8.1. Note that the DAG models are represented by standard imsets in the figure and that graphical representatives, namely the essential graphs, which coincide with the largest chain graphs for |N| ≤ 3, are attached for the reader's convenience. The above-mentioned interpretation becomes relevant in connection with local search methods for learning DAG models (see p. 162). It was explained in Section 8.3 that, from both the theoretical and practical points of view, it is natural to consider the search space (of unique representatives) of DAG models endowed with the neighborhood structure given by the concept of an inclusion neighborhood. This search space is nothing but the Hasse diagram mentioned above.
[Figure 8.1 appears here: the Hasse diagram of standard imsets over N = {a, b, c}, with each link annotated by the respective elementary triplet (e.g. ⟨a, b|∅⟩, ⟨a, c|b⟩, ⟨b, c|a⟩), essential graphs attached to the imsets as graphical representatives, and a key indicating the direction of growth of DIM(G) and MLL(G, D).]

Fig. 8.1. The search space of standard imsets over N = {a, b, c} (rotated).
The point is that the “moves” in the search space between neighboring states have a natural interpretation in terms of elementary conditional (in)dependence statements. This fact becomes even easier to see when one considers a particular regular criterion for learning DAG models – see Remark 8.13. Note, however, that this particular CI interpretation of moves between states of the search space is only possible in the context of DAG models – see Example 9.5.

8.4.2 Regular criteria and standard imsets

Lemma 8.7. Let us accept Convention 4 and let Q be a regular quality criterion for learning DAG models. Then there exist two statistics, namely a function $s : \mathrm{DATA}(N,d) \to \mathbb{R}$ and a mapping $t : D \in \mathrm{DATA}(N,d) \mapsto t_D \in \mathbb{R}^{\mathcal{P}(N)}$, such that
• $t_D(A) = 0$ for every A ⊆ N with |A| ≤ 1 and every D ∈ DATA(N,d),
• for every A ⊆ N with |A| ≥ 2, the coordinate mapping D ↦
$t_D(A)$ depends on $D_A$,
and the formula
$$Q(G,D) = s(D) - \langle t_D,\ u_G\rangle \qquad (8.32)$$
holds for every G ∈ DAGS(N) and D ∈ DATA(N,d). The function s and the mapping t are uniquely determined by these two requirements. Moreover, if Q is strongly regular, then
• for every A ⊆ N with |A| ≥ 2, the mapping D ↦ $t_D(A)$ depends on the contingency table $ct_A(D)$.

Proof. Supposing (8.20) holds for a collection $t_A : \mathrm{DATA}(A,d) \to \mathbb{R}$, A ⊆ N, let us put
$$t^D(A) = t_A(D_A) \qquad \text{for } D \in \mathrm{DATA}(N,d),\ A \subseteq N\,. \qquad (8.33)$$
Thus, $t^D \in \mathbb{R}^{\mathcal{P}(N)}$ for every database D. Introduce, for every D ∈ DATA(N,d),
$$s(D) = t^D(N) - t^D(\emptyset) = \langle t^D,\ \delta_N - \delta_\emptyset\rangle \qquad (8.34)$$
and
$$t_D = t^D - t^D(\emptyset)\cdot m_{\emptyset\uparrow} - \sum_{i\in N} \{\, t^D(\{i\}) - t^D(\emptyset) \,\}\cdot m_{\{i\}\uparrow} \qquad (8.35)$$
(see p. 39). This definition ensures $t_D(A) = 0$ for every D and |A| ≤ 1. Moreover, for every D and |A| ≥ 2 one has
$$t_D(A) = t_A(D_A) - \sum_{i\in A} t_{\{i\}}(D_i) + (|A|-1)\cdot t_\emptyset(D_\emptyset)\,,$$
which implies that the mapping D ↦ $t_D(A)$ depends on $D_A$. Observe that if Q is strongly regular then $t_D(A)$ depends on $ct_A(D)$, since $ct_B(D)$ is a function of $ct_A(D)$ whenever B ⊆ A.
By Lemma 5.2, $t^D - t_D \in L(N)$ for every D, which implies by Proposition 5.1 that $\langle t^D - t_D, u\rangle = 0$ for any u ∈ S(N). Hence, $\langle t^D, u_G\rangle = \langle t_D, u_G\rangle$ for every G ∈ DAGS(N), D ∈ DATA(N,d), by Lemma 7.1. Thus, owing to (7.2), (8.20) takes the form
$$Q(G,D) = \langle t^D,\ \delta_N - \delta_\emptyset\rangle - \langle t^D,\ u_G\rangle = s(D) - \langle t_D,\ u_G\rangle\,.$$
To verify the uniqueness of s, take H ∈ DAGS(N) with $u_H = 0$ and observe s(D) = Q(H,D) for every D by (8.32). To verify the uniqueness of t, suppose that another mapping t̃, assigning to every D ∈ DATA(N,d) an ℓ-standardized vector $\tilde t_D \in \mathbb{R}^{\mathcal{P}(N)}$, satisfies (8.32). Then $\langle t_D - \tilde t_D, u_G\rangle = 0$ for every D ∈ DATA(N,d) and G ∈ DAGS(N). By Remark 7.3, $E(N) \subseteq \{u_G\,;\ G \in \mathrm{DAGS}(N)\}$ and therefore $t_D - \tilde t_D \in L(N)$ for every D (use Proposition 5.1). As $t_D - \tilde t_D$ is ℓ-standardized, $t_D - \tilde t_D \equiv 0$ by Proposition 5.4.

Remark 8.12. The condition (8.32) from Lemma 8.7 is necessary for the regularity of a quality criterion but it alone is not sufficient. To see this in the case of non-trivial sample spaces, that is, if $|X_i| \ge 2$ for every i ∈ N, consider a statistic $s : \mathrm{DATA}(N,d) \to \mathbb{R}$ which cannot be written in the form of the sum $s(D) = \sum_{i\in N} s_i(D_i)$ where $s_i : \mathrm{DATA}(\{i\},d) \to \mathbb{R}$, i ∈ N. The existence of such an s follows easily from a comparison of the dimensions of the two linear spaces of functions. Then put $t_D = 0$ for any D and define Q by (8.32): Q(G,D) = s(D) for any G ∈ DAGS(N), D ∈ DATA(N,d). However, Q is not regular because s was constructed not to be decomposable – consider a graph G ∈ DAGS(N) which has no arrow. Another warning is that, even if Q is a regular criterion given by (8.32), the mapping A ↦ $t_D(A)$ from Lemma 8.7 need not induce Q in the sense of (8.20), because (8.34) need not hold for $t_D$ in place of $t^D$.

Given a particular quality criterion Q for learning DAG models, the symbol $s^Q$, respectively $t^Q$, will be used to denote the unique function s, respectively mapping t, satisfying the requirements from Lemma 8.7. The function $s^Q$ will be called the saturating function (of the criterion Q) and the mapping $t^Q$ the ℓ-standardized transformation of data (relative to Q). Moreover, given a database D ∈ DATA(N,d), the vector $[t^Q_D(A)]_{A\subseteq N} \in \mathbb{R}^{\mathcal{P}(N)}$ will be called the (ℓ-standardized) data vector (relative to Q).

Corollary 8.5. Let us accept Convention 4. Let Q be a regular quality criterion for learning DAG models and K, L ∈ DAGS(N) be such that $M_K \prec M_L$. Moreover, let $t^Q_D$ denote the data vector relative to Q given by D ∈ DATA(N,d) and $u_{\langle a,b|C\rangle}$ be the differential imset for K and L. Then
$$Q(K,D) - Q(L,D) = \langle t^Q_D,\ u_{\langle a,b|C\rangle}\rangle\,. \qquad (8.36)$$
Proof. This immediately follows from (8.32) and the definition of the differential imset $u_L - u_K = u_{\langle a,b|C\rangle}$.
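These last few results are directly implementable. Below is a sketch (using my own dictionary-based representation of imsets, keyed by frozensets; none of the function names are from the book) of the standard imset formula (7.2), the inclusion-neighbor test of Corollary 8.4 and the scalar product from (8.36):

```python
from collections import defaultdict

def standard_imset(G):
    """u_G per (7.2): delta_N - delta_empty + sum_i (delta_pa(i) - delta_{i u pa(i)});
    G maps every node of N to the tuple of its parents."""
    u = defaultdict(int)
    N = frozenset(G)
    u[N] += 1
    u[frozenset()] -= 1
    for i, pa in G.items():
        u[frozenset(pa)] += 1
        u[frozenset(pa) | {i}] -= 1
    return {A: v for A, v in u.items() if v != 0}

def elementary_imset(a, b, C):
    """u_<a,b|C> = delta_{abC} + delta_C - delta_{aC} - delta_{bC}."""
    C = frozenset(C)
    return {C | {a, b}: 1, C: 1, C | {a}: -1, C | {b}: -1}

def diff(u, v):
    """The imset u - v, with zero entries omitted."""
    keys = set(u) | set(v)
    return {A: u.get(A, 0) - v.get(A, 0) for A in keys
            if u.get(A, 0) != v.get(A, 0)}

# K: a -> b -> c;  L: the graph K with the arrow b -> c removed.
K = {'a': (), 'b': ('a',), 'c': ('b',)}
L = {'a': (), 'b': ('a',), 'c': ()}
# By Corollary 8.4, K and L are inclusion neighbors, with differential
# imset u_<b,c|empty> (here C = pa_K(c) \ {b} = empty set):
assert diff(standard_imset(L), standard_imset(K)) == elementary_imset('b', 'c', ())

def score_difference(t_D, u):
    """<t_D, u> as in (8.36); t_D maps frozensets to reals."""
    return sum(v * t_D.get(A, 0.0) for A, v in u.items())
```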
Remark 8.13. The method of structural imsets leads to the following proposal of how to modify and implement a local search method for learning DAG models (see p. 162).
• The states of the search space can be represented by standard imsets.
• The moves between states of the space can be represented by differential imsets.
• Given a regular criterion Q for learning DAG models, data can be represented by the respective ℓ-standardized data vector.
The formula (8.32) says that Q is the sum of a constant depending on the data and a linear function of those state representatives and data representatives. Moreover, owing to (8.36), the change in the value of Q after a move can be interpreted in terms of CI. Indeed, the value $\langle t^Q_D, u_{\langle a,b|C\rangle}\rangle$ can be viewed as the qualitative evaluation of the step in which the hypothesis a ⊥⊥ b | C is either rejected or accepted. This interpretation is indeed possible in the case of the MLL criterion – see Remark 8.14.

Proposition 8.4. The saturating function and the data vector relative to the MLL criterion are as follows:
$$s^{\mathrm{MLL}}(D) = d\cdot H(\hat P\,|\,\upsilon)\,, \qquad t^{\mathrm{MLL}}_D(A) = d\cdot m_{\hat P}(A) \qquad \text{for } D \in \mathrm{DATA}(N,d),\ A \subseteq N\,, \qquad (8.37)$$
where $\hat P$ is the empirical measure computed from D (see Section A.9.1), υ the counting measure on $X_N$ (see p. 227) and $m_{\hat P}$ the empirical multiinformation function (see Section 2.3.4).

Proof. Apply the procedure used in the proof of Lemma 8.7. By (8.33) and (8.30), and using Convention 4, write for D ∈ DATA(N,d)
$$t^D(A) = \sum_{x\in X_A} d[x]\cdot\ln d[x] \quad \text{if } \emptyset \neq A \subseteq N\,, \qquad t^D(\emptyset) = d\cdot\ln d\,.$$
Hence, by (8.34) and the definition of the empirical measure,
$$s^{\mathrm{MLL}}(D) = \sum_{x\in X_N} d[x]\cdot\ln d[x] - d\cdot\ln d = \sum_{x\in X_N} d[x]\cdot\ln d[x] - \sum_{x\in X_N} d[x]\cdot\ln d$$
$$= \sum_{x\in X_N} d[x]\cdot\ln\frac{d[x]}{d} = d\cdot\sum_{x\in X_N} \frac{d[x]}{d}\cdot\ln\frac{d[x]}{d} = d\cdot\sum_{x\in X_N} \hat p_N(x)\cdot\ln \hat p_N(x)\,.$$
The following formula for $t^{\mathrm{MLL}}_D$ is implied by (8.35) using Convention 4:
$$t^{\mathrm{MLL}}_D(A) = \sum_{x\in X_A} d[x]\cdot\ln d[x] - \sum_{i\in A}\sum_{y\in X_i} d[y]\cdot\ln d[y] + (|A|-1)\cdot d\cdot\ln d$$
$$= \sum_{x\in X_A} d[x]\cdot\ln d[x] - \sum_{i\in A}\sum_{x\in X_A} d[x]\cdot\ln d[x_i] + (|A|-1)\cdot\sum_{x\in X_A} d[x]\cdot\ln d$$
$$= \sum_{x\in X_A} d[x]\cdot\ln \frac{d[x]\cdot d^{|A|-1}}{\prod_{i\in A} d[x_i]} = d\cdot\sum_{x\in X_A} \frac{d[x]}{d}\cdot\ln\frac{d[x]\cdot d^{-1}}{\prod_{i\in A} d[x_i]\cdot d^{-1}}$$
$$= d\cdot\sum_{x\in X_A} \hat p_A(x)\cdot\ln\frac{\hat p_A(x)}{\prod_{i\in A}\hat p_i(x_i)} = d\cdot m_{\hat P}(A)\,.$$
Thus, the formulas in (8.37) were obtained.
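For illustration, the data vector of (8.37) can be computed directly from a database. This is a minimal sketch, assuming the same list-of-tuples representation of databases as in the earlier snippets:

```python
from math import log
from collections import Counter

def empirical_multiinformation(D, A):
    """m_Phat(A) = sum_x phat_A(x) * ln( phat_A(x) / prod_{i in A} phat_i(x_i) ),
    computed from the marginal contingency tables of the database D."""
    d = len(D)
    ct_A = Counter(tuple(x[i] for i in A) for x in D)
    ct_i = {i: Counter(x[i] for x in D) for i in A}
    m = 0.0
    for cell, count in ct_A.items():
        p = count / d
        q = 1.0
        for i, xi in zip(A, cell):
            q *= ct_i[i][xi] / d      # product of one-dimensional marginals
        m += p * log(p / q)
    return m

D = [(0, 0), (0, 1), (1, 1), (1, 1)]
print(4 * empirical_multiinformation(D, (0, 1)))  # t^MLL_D(A) = d * m_Phat(A)
```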
Remark 8.14. It follows from Proposition 8.4 and Corollary 2.2 that $t^{\mathrm{MLL}}_D$ is a supermodular function for every D ∈ DATA(N,d). More specifically, supposing K, L ∈ DAGS(N) are such that $M_K \prec M_L$ and $u_{\langle a,b|C\rangle}$ is the differential imset for K and L, the formula (8.36) in Corollary 8.5 implies
$$\mathrm{MLL}(K,D) - \mathrm{MLL}(L,D) = \langle t^{\mathrm{MLL}}_D,\ u_{\langle a,b|C\rangle}\rangle = d\cdot\langle m_{\hat P},\ u_{\langle a,b|C\rangle}\rangle \ \ge\ 0\,.$$
Thus, the MLL criterion is non-increasing with respect to the inclusion quasi-ordering. No matter what data are given, it attains its maximum for graphs whose underlying graph is complete (see Figure 8.1 for illustration). Moreover, it follows from the proof of Corollary 2.2 that the number $\langle m_{\hat P}, u_{\langle a,b|C\rangle}\rangle$ is just the relative entropy of $\hat P^{abC}$ with respect to the conditional product $\hat Q$ of $\hat P^{aC}$ and $\hat P^{bC}$ (see Remark 2.11 for this concept). Therefore,
$$\mathrm{MLL}(K,D) - \mathrm{MLL}(L,D) = d\cdot\langle m_{\hat P},\ u_{\langle a,b|C\rangle}\rangle = d\cdot H(\hat P^{abC}\,|\,\hat Q)\,.$$
Because $\hat Q$ is nothing but the fitted empirical measure defined in Section A.9.1, the latter expression is one half of the value of the $G^2$ statistic for testing the CI statement a ⊥⊥ b | C. This observation is interesting in connection with Whittaker's approach [157] to learning UG models by the minimization of the deviance (see Section 8.1). As a matter of fact, if one is interested in learning decomposable models, which are known to be special cases of DAG models (see Section 3.4.1), then, by Remark 8.2, the value of the deviance of a statistical model described by an undirected graph H over N can be expressed as $k - 2\cdot\mathrm{MLL}(G,D)$, where k is a constant which does not depend on the model and G ∈ DAGS(N) induces the same model as H. Thus, the deviance difference for a pair of triangulated undirected graphs which differ in the presence of one line is nothing but twice the difference in the value of the MLL criterion for a pair of acyclic directed graphs which are inclusion neighbors. In particular, as explained above, the deviance difference is the value of the $G^2$ statistic for testing the respective elementary CI statement. This explains the phenomenon observed by Whittaker [157] and pinpointed at the end of Remark 8.2.

Corollary 8.6. The saturating function and the data vector relative to the DIM criterion are as follows:
$$s^{\mathrm{DIM}}(D) = -1 + \prod_{i\in N} r(i)\,, \qquad t^{\mathrm{DIM}}_D(A) = |A| - 1 + \prod_{i\in A} r(i) - \sum_{i\in A} r(i) \qquad (8.38)$$
for any D ∈ DATA(N,d), A ⊆ N. The formulas for the data vectors relative to the AIC criterion and the BIC criterion are as follows:
$$t^{\mathrm{AIC}}_D(A) = d\cdot m_{\hat P}(A) - |A| + 1 - \prod_{i\in A} r(i) + \sum_{i\in A} r(i)\,, \qquad (8.39)$$
$$t^{\mathrm{BIC}}_D(A) = d\cdot m_{\hat P}(A) - \frac{\ln d}{2}\cdot\Big\{\, |A| - 1 + \prod_{i\in A} r(i) - \sum_{i\in A} r(i) \,\Big\}\,.$$

Proof. The formula (8.38) can be obtained by substituting (8.31) into (8.34) and (8.35). The other formulas follow from Proposition 8.4 using the definition of information criteria (see Section A.9.3).

Note that it follows from Corollary 8.6 that
$$\langle t^{\mathrm{DIM}}_D,\ u_{\langle a,b|C\rangle}\rangle = (r(a)-1)\cdot(r(b)-1)\cdot\prod_{c\in C} r(c) \qquad \text{for every } u_{\langle a,b|C\rangle} \in E(N)\,.$$
In particular, $t^{\mathrm{DIM}}_D$ is also supermodular and the DIM criterion is non-increasing with respect to the inclusion quasi-ordering (see Figure 8.1 for illustration). On the other hand, this is not true for the AIC and BIC criteria: each of these two criteria is defined as a difference of two non-increasing criteria. Thus, either of these criteria can attain its maximal values inside the Hasse diagram of the inclusion ordering for DAG models. As indicated in Remark 8.9, I believe that formulas analogous to those in Corollary 8.6 can be derived for some of the Bayesian criteria for learning DAG models.

Remark 8.15. One possible objection against the use of imsets is that the memory demands for their representation in a computer are too high. However, if one is only interested in learning DAG models, then standard imsets for acyclic directed graphs can be represented effectively in the memory of a computer. Indeed, any standard imset over N has at most 2·|N| + 2 non-zero values – see (7.2). To represent it in computer memory, one can assign to every subset A ⊆ N a numerical code code(A) and represent an imset u by a list of at most 2·|N| + 2 items, where each item is a pair [code(A), u(A)] for a set A ⊆ N with u(A) ≠ 0. Thus, the number of bytes needed to represent a standard imset is a polynomial function of |N|, which means that the memory demands are essentially equivalent to those in the case of the use of graphs over N. As concerns the computer representation of data, a data vector $t_D$ (see p. 185) can be represented by a list of $2^{|N|}$ items of the form [code(A), $t_D(A)$] for A ⊆ N. This may appear to be even more effective than the traditional way of representing data in the form of a contingency table because $2^{|N|} \le |X_N|$. On the other hand, a contingency table is an integral function on $X_N$ while $t_D$ is a real function on P(N). One can also keep a contingency table in the memory of a computer and compute any value of $t_D(A)$, A ⊆ N, each time it is needed. Thus, the memory demands for representing data are equal to those in the case of the use of graphs for representing models.
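To make the representation discussed in this remark concrete, here is one possible encoding; the bitmask choice of code(A) and the dictionary input format are mine, not prescribed by the text:

```python
def code(A, order):
    """Numerical code of A, a subset of N: a bitmask w.r.t. a fixed ordering."""
    return sum(1 << order[i] for i in A)

def to_sparse(u, order):
    """A standard imset as a list of [code(A), u(A)] pairs for u(A) != 0;
    at most 2*|N| + 2 items by (7.2)."""
    return sorted((code(A, order), v) for A, v in u.items() if v != 0)

order = {'a': 0, 'b': 1, 'c': 2}
u = {frozenset({'a', 'b', 'c'}): 1, frozenset({'b'}): 1,
     frozenset({'a', 'b'}): -1, frozenset({'b', 'c'}): -1}  # u_K from above
print(to_sparse(u, order))             # [(2, 1), (3, -1), (6, -1), (7, 1)]
```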
9 Open Problems
The goal of this chapter is to gather open problems and present a few topics omitted in the previous chapters. Open problems are classified, according to their degree of specificity, into three categories. Questions are clear inquiries formulated as mathematical problems. Formal definitions of the related concepts are given and the expected answer is yes or no. Themes (of research) are wider areas of mutually related problems. Their formulation is slightly less specific (but still in mathematical terms) and they may deserve some clarification of the involved concepts. Directions (of research) are wide groups of problems with a recognized common source of motivation. They are formulated quite vaguely and may become a topic of research in forthcoming years. The secondary criterion of classification of open problems is their topic: the division of this chapter into sections was inspired by the motivational thoughts presented in Section 1.1.
9.1 Theoretical problems

In this section, open problems concerning theoretical foundations are gathered. Some of them were already mentioned earlier. They are classified by their topics.

9.1.1 Miscellaneous topics

Multiinformation

There are some open problems related to the concept of multiinformation.

Question 1. Let P and Q be probability measures over N defined on the product of measurable spaces $(X_N, \mathcal{X}_N) = \prod_{i\in N} (X_i, \mathcal{X}_i)$ that have finite multiinformation (p. 24). Does their convex combination $\alpha\cdot P + (1-\alpha)\cdot Q$, α ∈ [0,1], have finite multiinformation as well?
Question 2. Let $K_{\mathrm{mult}}(N)$ denote the conical closure of the set of multiinformation functions induced by discrete probability measures over N (see p. 11). Is $K_{\mathrm{mult}}(N)$ a rational polyhedral cone?

The answer to Question 2 is positive in the case |N| ≤ 3, but I do not know the answer if |N| = 4. The significance of this question consists in the fact that discrete CI models can be characterized properly if the answer is positive.

Proposition 9.1. If the answer to Question 2 is positive then there exists a non-empty finite set $S \subseteq \mathbb{Z}^{\mathcal{P}(N)} \setminus \{0\}$ such that every s ∈ S generates an extreme ray of $K_{\mathrm{mult}}(N)$ and $K_{\mathrm{mult}}(N) = \mathrm{con}(S)$. Then the following conditions are equivalent for M ⊆ T(N):
(i) M is a CI model induced by a discrete probability measure over N,
(ii) M is produced by an element of $K_{\mathrm{mult}}(N)$,
(iii) M has the form $M = \bigcap_{t\in T} M_t$ where T ⊆ S.

Proof. Because $K_{\mathrm{mult}}(N)$ is contained in the cone of ℓ-standardized supermodular functions, which is pointed (Lemma 5.3), $K_{\mathrm{mult}}(N)$ is pointed as well; being rational polyhedral by assumption, it is thus a pointed rational polyhedral cone. As mentioned in Section A.5.2, this implies that $K_{\mathrm{mult}}(N)$ has finitely many extreme rays and every extreme ray is generated by a non-zero integral vector. Moreover, $K_{\mathrm{mult}}(N)$ is their conical closure. The implication (i) ⇒ (ii) follows directly from Corollary 2.2. To prove (ii) ⇒ (iii), suppose $M = M_m$ where $m = \sum_{s\in S} \alpha_s\cdot s$ with $\alpha_s \ge 0$, and put $T = \{t \in S\,;\ \alpha_t > 0\}$. Using the fact that $K_{\mathrm{mult}}(N)$ is contained in the cone of ℓ-standardized supermodular functions, together with Proposition 5.1(ii), we can derive $M = \bigcap_{t\in T} M_t$. To prove (iii) ⇒ (i), observe that every s ∈ S has the form $\alpha\cdot m_P$ for a discrete probability measure P over N and α > 0 (since s generates an extreme ray of $K_{\mathrm{mult}}(N)$). Thus, for every s ∈ S ∪ {0}, $M_s$ is a discrete CI model and we can use Lemma 2.9 to derive (i).

Moreover, it seems that if Question 2 has a positive answer then discrete CI models and the inclusions between them can be characterized in terms of an arithmetic relationship between certain special imsets over N. What follows is more likely an intuitive plan than a list of verified claims. Roughly speaking, the plan is to repeat with the cone $K_{\mathrm{mult}}(N)$ something analogous to what was done with the supermodular cone in Chapters 5 and 6. However, it is quite possible that some of the steps indicated below cannot be made. The first observation should be that the cone $K_{\mathrm{mult}}(N)$ has finitely many faces and each of them is generated by a finite subset T ⊆ S. The second step should be to establish a one-to-one correspondence between discrete CI models and faces of $K_{\mathrm{mult}}(N)$: every M ⊆ T(N) is assigned the face $\{t \in K_{\mathrm{mult}}(N)\,;\ M \subseteq M_t\}$ and every face $F \subseteq K_{\mathrm{mult}}(N)$, generated by T ⊆ S, is assigned the model $\bigcap_{t\in T} M_t$. The conjecture is that this should define a Galois connection in the sense of Section 5.4. The third possible step is to introduce a suitable pointed rational polyhedral cone $K^*(N)$, which should correspond to the dual cone of $K_{\mathrm{mult}}(N)$. The cone $K^*(N)$
should be an analog of the cone con(E(N)) in the case of the supermodular cone; perhaps it can also be introduced as the cone dual to $K_{\mathrm{mult}}(N) \oplus L(N)$. Faces of $K^*(N)$ should correspond to faces of $K_{\mathrm{mult}}(N)$ – the respective Galois connection should be given by the incidence relation $\langle m, u\rangle = 0$ for $m \in K_{\mathrm{mult}}(N)$ and $u \in K^*(N)$. Moreover, every face of $K^*(N)$ should be generated by an element of $K^*(N) \cap \mathbb{Z}^{\mathcal{P}(N)}$. The fourth step should be to characterize the extreme rays of $K^*(N)$ and choose a non-zero normalized imset in every extreme ray of the cone. These imsets should be analogous to elementary imsets, while the imsets in $K^*(N) \cap \mathbb{Z}^{\mathcal{P}(N)}$ should be analogous to structural imsets. The last step should be to characterize the inclusion of faces of $K^*(N)$ as an arithmetic relation between (generating) elements of $K^*(N) \cap \mathbb{Z}^{\mathcal{P}(N)}$ – this should be an analog of the direct characterization of independence implication mentioned in Section 6.2.1. The definition of the formal independence model induced by an element of $K^*(N) \cap \mathbb{Z}^{\mathcal{P}(N)}$ can be obtained as a special case of this characterization – like the definition in Section 4.4.1. The expected advantage of this approach is that the obtained universum of special imsets from $K^*(N) \cap \mathbb{Z}^{\mathcal{P}(N)}$ would achieve both completeness and faithfulness relative to the discrete distribution framework (see Section 1.1). Moreover, the conjectured arithmetic characterization of the inclusion of faces should offer an arithmetic way of computer implementation. Note that Question 2 was formulated for the discrete distribution framework, but an analogous question can be raised for any other distribution framework Ψ which satisfies the condition (6.21).

Theme 1. Is there any (direct) formula for the multiinformation function of a positive CG measure (see p. 66) in terms of its canonical or moment characteristics? Alternatively, is there any (iterative) method of computing it?

Note that, owing to Lemma 2.7, an equivalent formulation of Theme 1 is as follows: to find a formula for the entropy of a CG measure P with respect to $\prod_{i\in N}\mu_i$, where $\{\mu_i\,;\ i \in N\}$ is the standard reference system for P (see p. 76). I am rather skeptical about the existence of a direct formula of this kind.

Formal independence models

Two open problems concern the concept of a formal independence model (see Section 2.2.1).

Question 3. Is it true that every formal independence model induced by a regular Gaussian measure over N is induced by a discrete probability measure over N? If so, can one even find a positive binary probability measure inducing the model?

Note that the converse implication does not hold: an example of a positive binary measure over N with |N| = 3 inducing a formal independence model
which is not induced by any Gaussian measure was given in Example 2.1. In this case, the reason is that the composition property of Gaussian CI models (see p. 33) is not valid universally. However, one can also offer a counterexample based on the weak transitivity property (see p. 34). Indeed, the following example shows that this property need not hold for binary measures provided that the conditioning set C is non-empty.

Example 9.1. There exists a binary measure P over the set N = {a,b,c,d} such that a ⊥⊥ b | {c} [P], a ⊥⊥ b | {c,d} [P], a ⊥̸⊥ d | {c} [P] and d ⊥̸⊥ b | {c} [P]. Indeed, put $X_i = \{0,1\}$ for i ∈ N and assign probability 1/8 to each of the following configurations of values (the order of variables is a, b, c, d): (0,0,0,0), (0,1,0,0), (1,0,0,1), (1,1,0,1), (0,0,1,0), (1,0,1,0), (0,1,1,1) and (1,1,1,1). ♦

Note that the original version of Question 3 (in the manuscript of the book) was whether every Gaussian CI model is induced by a binary measure. However, this question has recently been answered negatively by P. Šimeček [119], who found the following counterexample.

Example 9.2. There exists a singular Gaussian measure over N = {a,b,c,d} such that its induced formal independence model is not induced by any discrete probability measure. Indeed, let us put P = N(0,Σ) where $\Sigma = (\sigma_{ij})_{i,j\in N}$ is a symmetric matrix given by $\sigma_{ii} = 5$ for i ∈ N, $\sigma_{ab} = \sigma_{cd} = 0$, $\sigma_{ac} = \sigma_{bd} = 3$, $\sigma_{ad} = 4$ and $\sigma_{bc} = -4$. One can verify that Σ is a positive semi-definite matrix and its main submatrices $\Sigma_{A\cdot A}$ for A ⊆ N, |A| = 2, are regular. Thus, one can show using Lemma 2.8 that a ⊥⊥ b | ∅ [P], c ⊥⊥ d | ∅ [P], a ⊥̸⊥ c | ∅ [P], a ⊥⊥ c | {b,d} [P] and b ⊥⊥ d | {a,c} [P]. Suppose for contradiction that $M_P$ is induced by a discrete measure Q over N. Since Q has finite multiinformation (see Section 4.1.1), by Corollary 2.2 observe $\langle m_Q, u\rangle = 0$ where
$$u = u_{\langle a,b|\emptyset\rangle} + u_{\langle c,d|\emptyset\rangle} + u_{\langle a,c|\{b,d\}\rangle} + u_{\langle b,d|\{a,c\}\rangle}\,.$$
However, it is easy to see that
$$u = u_{\langle a,c|\emptyset\rangle} + u_{\langle b,d|\emptyset\rangle} + u_{\langle a,b|\{c,d\}\rangle} + u_{\langle c,d|\{a,b\}\rangle}\,,$$
which, by Corollary 2.2, implies $\langle m_Q, u_{\langle a,c|\emptyset\rangle}\rangle = 0$ and, therefore, a ⊥⊥ c | ∅ [Q]. This leads to the contradictory conclusion a ⊥⊥ c | ∅ [P]. ♦

Note that the validity of the conjecture in Question 3 was confirmed for |N| = 4 [119].

Direction 1. Let us consider an abstract family of formal independence models, that is, a mapping which assigns a collection $\{M_\alpha \subseteq T(N)\,;\ \alpha \in \Xi(N)\}$ of formal independence models over N to every non-empty finite set of variables N. Find out what conditions on an abstract family of independence models ensure that there exists a quasi-axiomatic characterization of formal
independence models in the family by means of a (possibly infinite) system of inference rules of type (3.1) (see Remark 3.5). Try to characterize those abstract families which have a quasi-axiomatic characterization of this kind. Find out whether the class of discrete probabilistic CI models falls within this scope.

A basic conjecture is that if an abstract family of formal independence models admits a quasi-axiomatic characterization of that kind, then it has to be closed under the operation of restriction (see p. 12) and under the operation attributed to a permutation of variables as follows. Every permutation of variables π : N → N induces a transformation of a formal independence model M ⊆ T(N) into the model $\{\langle\pi(A), \pi(B)|\pi(C)\rangle\,;\ \langle A,B|C\rangle \in M\}$, where π(A) = {π(i) ; i ∈ A} for A ⊆ N. It may be the case that these two conditions already characterize abstract families of formal independence models admitting a quasi-axiomatic characterization of the above kind.

Graphs

Further open problems are related to Chapter 3. The following problem, named the “inclusion problem” in Kočka et al. [58], can be viewed as an advanced subquestion of the equivalence question (see Section 1.1).

Theme 2. Let K, L be acyclic directed graphs over N (see p. 220). Is there any direct graphical characterization of the inclusion $M_K \subseteq M_L$ (see Section 3.2)?

Note that a suitable graphical characterization of independence equivalence of acyclic directed graphs is known, namely the coincidence of their underlying graphs and immoralities – see p. 48. By a direct characterization of the inclusion $M_K \subseteq M_L$, a generalization of that equivalence characterization is meant, that is, a characterization in terms of conditions on induced subgraphs of K and L. More specifically, I have in mind a collection of conditions on K, L ∈ DAGS(N), each of which says something like this: if L has a certain induced subgraph for T ⊆ N, then K has a certain induced subgraph for a subset of T. A concrete conjecture of this kind is formulated on p. 35 in Kočka et al. [58]. Note that an indirect transformational characterization of the inclusion of DAG models is mentioned in Lemma 8.5 and a characterization in terms of standard imsets is given by Lemma 8.6.

Theme 3. Let K, L be chain graphs over N (see p. 221). Is there any transformational graphical characterization of independence equivalence of chain graphs? If yes, can it be extended to a transformational characterization of the inclusion $M_K \subseteq M_L$ generalizing the result in Lemma 8.5?

Note that I have in mind a generalization of the transformational characterization of independence equivalence of acyclic directed graphs presented on p.
49. In that characterization, just one edge is changed in every step. However, it is clear that this cannot be achieved for chain graphs: consider an equivalence class of chain graphs over N = {a, b, c} whose underlying graph is complete and realize that the complete undirected graph over N differs from the other chain graphs in the equivalence class in at least two arrows. Thus, the problem above includes the task of finding out what the respective elementary transformations of chain graphs are which keep graphs independence equivalent.

Semi-graphoids

The concept of a semi-graphoid (see Section 2.2.2) can be viewed as a special concept of discrete mathematics which has its own significance. Several recent papers devoted to the topic of semi-graphoid inference [138, 33, 88] indicate that it is a challenging topic of research. As the intersection of any class of semi-graphoids is a semi-graphoid, the set of disjoint semi-graphoids over N is a (finite) complete lattice. A natural question is: what is the relationship between this lattice and the lattice of structural independence models introduced in Section 5.4.2?

Question 4. Is it true that every coatom of the lattice of disjoint semi-graphoids over N is a structural independence model over N?

Note that the condition in Question 4 is equivalent to the condition that the set of (all) coatoms of the semi-graphoid lattice coincides with the set of (all) coatoms of the structural model lattice – this can be verified using the fact that every structural model is a semi-graphoid (see Lemma 4.6). An analogous question for atoms of the semi-graphoid lattice has a positive answer – use Lemma 4.6 and Lemma 4.5 to show that every $M_v$, v ∈ E(N), is both an atom of the semi-graphoid lattice and a structural model, and Lemma 2.2 to show that every atom of that lattice has this form. The above conjecture is true for |N| ≤ 4. A computer program [11] made it possible to find all 37 coatoms of the semi-graphoid lattice for |N| = 4. These semi-graphoids can be shown to be structural models over N using the results of the analysis made in Studený [131]. However, I do not know whether the conjecture is true for |N| = 5.

Remark 9.1. If |N| ≤ 3, then the semi-graphoid lattice and the structural model lattice coincide. If |N| = 4, then they are similar: they have the same unit element, the same null element, the same set of atoms and the same set of coatoms. The basic difference is as follows. While the structural model lattice is both atomistic and coatomistic (see Theorem 5.3), the semi-graphoid lattice is only atomistic (use Lemma 2.2 to verify this fact). However, it is not coatomistic, as the following example shows. Consider the semi-graphoid generated by the following list of disjoint triplets over N = {a, b, c, d}:
⟨a, b|{c, d}⟩, ⟨a, b|{c}⟩, ⟨a, b|{d}⟩, ⟨c, d|{a}⟩, ⟨c, d|{b}⟩, ⟨a, b|∅⟩, ⟨c, d|∅⟩.
This semi-graphoid is not an intersection of coatomistic semi-graphoids over N because each of those is a structural model and, therefore, their intersection is a structural model, for which reason it satisfies the formal property mentioned on p. 16 in Section 2.2.4. Indeed, to show the last claim one can use Proposition 5.1 and the equality
$$u_{\langle A,B|CD\rangle} + u_{\langle C,D|A\rangle} + u_{\langle C,D|B\rangle} + u_{\langle A,B|\emptyset\rangle} = u_{\langle C,D|AB\rangle} + u_{\langle A,B|C\rangle} + u_{\langle A,B|D\rangle} + u_{\langle C,D|\emptyset\rangle}\,.$$
Actually, it can be shown that the above-mentioned semi-graphoid over N is an infimum-irreducible element of the semi-graphoid lattice. Thus, if the answer to Question 4 is yes, then the set of structural independence models can equivalently be introduced as follows: these are the semi-graphoids over N which can be obtained as intersections of coatomistic semi-graphoids (the empty intersection is T(N) by a convention).

Structural imsets

There are some open problems related to Chapter 7. The first group of these problems concerns the concept of a baricentral imset (see p. 131).

Theme 4. Let G be a chain graph over N (see p. 221 and Section 3.3). Is there any direct formula for the baricentral imset u over N that satisfies $M_u = M_G$? Can every supermodular function m over N (see pp. 87–88) be effectively “translated” into the baricentral imset u over N with $M_u = M_m$? Is there any effective criterion which decides whether a given structural imset is a baricentral imset?

Note that a positive solution to the first question in Theme 4 could have an impact on methods of learning graphical models. Chain graph models could then be represented by baricentral imsets – see the thoughts about the universum of structural imsets on p. 161 and the note after Direction 9.

Question 5. Let ℘ be an independence equivalence class of structural imsets over N (see p. 113) and u ∈ ℘ be a combinatorial imset which is a minimal element of ℘ with respect to the ordering on ℘ ∩ C(N) introduced on p. 142. Is u an imset of the smallest degree in ℘?

Note that the converse implication is true – see Proposition 7.2.

9.1.2 Classification of skeletal imsets

The following is a basic problem related to the concept of a skeletal imset (see Section 5.2).

Theme 5. Is there any suitable characterization of skeletal imsets which allows us to find the ℓ-skeleton $K^{\ell}(N)$ for any finite non-empty set of variables N? How does $|K^{\ell}(N)|$ depend on |N|?
Note that the paper by Rosenmüller and Weidner [109] offers a characterization of extreme supermodular functions, but that result rather gives a criterion for deciding whether a given ℓ-standardized supermodular function is skeletal; more precisely, the result of the paper can be utilized for that purpose. However, that criterion does not seem suitable for the purpose of computer implementation. Therefore, the result of [109] does not solve the problem of finding the skeleton for every N. A promising idea of how to tackle the problem is indicated in the rest of this section.

A related task is that of classifying coatomic structural models. One can fix a way of standardization of skeletal imsets (see Remark 5.6), since coatomic (= submaximal) structural models are in a one-to-one correspondence with the elements of the respective skeleton, say the ℓ-skeleton.

Every permutation π : N → N on a set of variables N can be extended to a permutation on the power set π : P(N) → P(N). This step allows one to introduce permutation equivalence on the class of normalized standardized skeletal imsets: any such skeletal imset m is permutation equivalent in this sense to the composition mπ (which is also standardized in the same way and normalized – see [145]). Of course, every permutation of such an imset m defines a transformation of the produced independence model M_m. A basic way to classify skeletal imsets is the division of the class of (normalized standardized) skeletal imsets into the classes of permutation equivalence. Every permutation equivalence class then represents a type of a skeletal imset. For example, the 5 elements of the ℓ-skeleton break into 3 types in the case |N| = 3, while 37 ℓ-skeletal imsets break into 10 types in the case |N| = 4 and 117978 ℓ-skeletal imsets break into 1319 types in the case |N| = 5 – see [145].

Remark 9.2. Of course, permutation equivalence can be viewed as an equivalence on classes of qualitative equivalence of skeletal imsets. Actually, one can introduce it without fixing a way of standardization as follows. One says that skeletal imsets m and r are permutation equivalent if r = mπ for some permutation π on N. Then one can show using Corollary 5.2 that, given a permutation π on N, skeletal imsets m_1 and m_2 are qualitatively equivalent iff m_1 π and m_2 π are qualitatively equivalent. Thus, the operation of composition with π can be viewed as an operation with qualitative equivalence classes. Since standardization and normalization are preserved by composition with π, it gives rise to the above equivalence on K_ℓ(N). This consideration also shows that the way of standardization is not substantial in the definition of permutation equivalence.

Level equivalence

Nevertheless, perhaps an even more telling way to classify skeletal imsets exists. Suppose that m ∈ K(N) is a skeletal imset over N; let the respective symbols m_ℓ, m_u and m_o denote the respective qualitatively equivalent elements of the ℓ-skeleton, the u-skeleton and the o-skeleton obtained by the formulas from
Remark 5.6 (pp. 98–99). Thus, m, or more precisely the respective qualitative equivalence class of skeletal imsets, defines a certain equivalence on the class of subsets of N:

∀ S, T ⊆ N   S ∼_m T ⇔ [ m_o(S) = m_o(T), m_ℓ(S) = m_ℓ(T) and m_u(S) = m_u(T) ].  (9.1)

The equivalence classes of ∼_m could be interpreted as the areas in which the considered standardized skeletal imsets have the same values; in other words, they correspond to some value levels. Two skeletal imsets over N will be called level equivalent if they induce the same equivalence on P(N). Of course, qualitatively equivalent skeletal imsets are level equivalent by definition; the converse is not true (see Example 9.3). Thus, level equivalence can be viewed as an equivalence on equivalence classes of qualitative equivalence of skeletal imsets, in particular, on the ℓ-skeleton.

Proposition 9.2. Let m_1, m_2 be level equivalent skeletal imsets over N and π be a permutation on N (extended to P(N)). Then m_1 π and m_2 π are level equivalent.

Proof. This is only a hint. Given a skeletal imset m over N, put r = mπ and, with the help of the formulas from Remark 5.6, observe that r_ℓ = m_ℓ π, r_u = m_u π and r_o = m_o π. Hence, for every S, T ⊆ N one has S ∼_r T iff π(S) ∼_m π(T), which implies the desired fact immediately.

Remark 9.3. Another interesting operation with supermodular functions can be introduced with the aid of a special self-transformation ι of P(N):

ι(S) = N \ S   for every S ⊆ N.
Given a supermodular function m over N one can introduce z = mι and observe (see § 5.1.3 in [145]) that z is also a supermodular function over N, called the reflection of m. Note that the reflection of z is again m. One can show that the reflection of a skeletal imset is a skeletal imset. Moreover, using the formulas from Remark 5.6 one can show that z_ℓ = m_u ι, z_u = m_ℓ ι and z_o = m_o ι. Consequently, for every S, T ⊆ N we get S ∼_z T iff ι(S) ∼_m ι(T). Thus, whenever two skeletal imsets are level equivalent, their reflections are level equivalent. An interesting fact is that, in the case |N| ≤ 4, S ∼_m T iff N \ S ∼_m N \ T holds true for every m ∈ K_o(N) (see Example 9.3 below). In particular, m and z = mι are level equivalent in this case. Nevertheless, the question of whether the above hypothesis holds in general is open.

Question 6. Let m ∈ K_o(N) and S, T ⊆ N be such that S ∼_m T. Is it true that N \ S ∼_m N \ T holds then?
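The partition (9.1) is easy to compute once the three standardized versions of a skeletal imset are at hand. The following Python fragment is only an illustrative sketch of mine (the encoding of m_ℓ, m_u and m_o as dictionaries keyed by frozensets is an assumption, not the book's notation); it builds the level partition and tests the symmetry asked about in Question 6.

    from itertools import combinations

    def subsets(N):
        # all subsets of the variable set N, as frozensets
        return [frozenset(c) for r in range(len(N) + 1)
                for c in combinations(sorted(N), r)]

    def level_partition(N, m_l, m_u, m_o):
        # group S by the value triple (m_o(S), m_l(S), m_u(S)), cf. (9.1)
        levels = {}
        for S in subsets(N):
            levels.setdefault((m_o[S], m_l[S], m_u[S]), set()).add(S)
        return list(levels.values())

    def reflection_symmetric(N, partition):
        # does S ~ T always imply (N \ S) ~ (N \ T)?  (Question 6)
        N = frozenset(N)
        block = {S: i for i, b in enumerate(partition) for S in b}
        return all(block[N - S] == block[N - T]
                   for b in partition for S in b for T in b)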
Supertypes

A natural consequence of Proposition 9.2 is that the concept of permutation equivalence can be extended to classes of level equivalence. Every class of this extended permutation equivalence breaks into several classes of level equivalence, and these break into individual (standardized) skeletal imsets. Thus, every class of permutation equivalence of this kind represents a supertype. For example, two supertypes exist in the case |N| = 3 and five supertypes in the case |N| = 4.

An interesting fact is that if |N| = 4 then every equivalence on P(N) induced by a skeletal imset m through (9.1) can be described by means of at most two "cardinality intersection" criteria. These criteria distribute sets S ⊆ N to their equivalence classes (= value levels) on the basis of the cardinality of the intersection of S with one or two given disjoint subsets of N. Every equivalence of this kind on P(N) is therefore determined by a certain system of disjoint subsets of N having at most two components. This phenomenon is illustrated by the following example.

Example 9.3. One can distinguish five groups of "cardinality intersection" criteria distributing subsets S of N = {a, b, c, d} to levels, which correspond to five supertypes of skeletal imsets. The analysis is based on a catalog of ℓ-skeletal imsets from the Appendix of [145].
1. The criterion |S ∩ {a, b}| divides P(N) into 3 levels – see the upper picture in Figure 9.1. The corresponding class of level equivalence has 1 standardized imset but the class of permutation equivalence has 6 classes of level equivalence. Therefore, the respective supertype involves 6 standardized skeletal imsets.
2. The criterion |S ∩ {a, b, c}| divides P(N) into 4 levels – see the lower picture in Figure 9.1. The corresponding class of level equivalence has 2 imsets, and the class of permutation equivalence has 4 classes of level equivalence. Hence, the supertype involves 8 imsets. An example of a permuted skeletal imset of this type is shown in Figure 6.4 (both the ℓ-standardized and the u-standardized versions are there).
3. The criterion |S ∩ {a, b, c, d}| divides P(N) into 5 levels – see the upper picture in Figure 9.2. The corresponding class of level equivalence has 3 imsets while the corresponding class of permutation equivalence has just one level equivalence class. Thus, the supertype involves 3 imsets.
4. The composed criterion [ |S ∩ {a, b, c}|, |S ∩ {d}| ] divides P(N) into 8 levels – see the lower picture in Figure 9.2. The corresponding class of level equivalence has 2 imsets, the class of permutation equivalence has 4 classes of level equivalence and the supertype involves 8 imsets. An example of a permuted imset of this kind is m° in the right-hand picture in Figure 6.3.
5. The composed criterion [ |S ∩ {a, b}|, |S ∩ {c, d}| ] divides P(N) into 9 levels – see Figure 9.3. The corresponding class of level equivalence has 4 imsets while the corresponding class of permutation equivalence has 3
classes of level equivalence. The supertype involves 12 imsets; an example is the imset m† from Figure 4.3. ♦

The endeavor described in Section 9.1.2 can be summarized in the following open problem.

Theme 6. Can the classification of supertypes of skeletal imsets by cardinality intersection criteria be extended to the case of general N?
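Cardinality intersection criteria of the kind used in Example 9.3 are straightforward to evaluate. The following Python fragment is an illustrative sketch only (the encoding is mine); it distributes the subsets of N into levels by such a criterion and reproduces the level counts from items 1 and 5 above.

    from itertools import combinations

    def levels_by_criterion(N, *As):
        # partition P(N) by the vector of cardinalities [ |S ∩ A| for A in As ]
        levels = {}
        for r in range(len(N) + 1):
            for c in combinations(sorted(N), r):
                S = set(c)
                key = tuple(len(S & set(A)) for A in As)
                levels.setdefault(key, []).append(frozenset(S))
        return levels

    N = {'a', 'b', 'c', 'd'}
    print(len(levels_by_criterion(N, {'a', 'b'})))              # 3 levels
    print(len(levels_by_criterion(N, {'a', 'b'}, {'c', 'd'})))  # 9 levels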
9.2 Operations with structural models

There are various basic operations with structural models – some of them are realized by means of operations with structural imsets and some of them by means of operations with supermodular functions. An overview is given in § 8.2 of Studený [146]. The aim of this section is to recall some of these operations and to formulate relevant open problems.

9.2.1 Reductive operations

The operations of this kind assign a formal independence model over a set T, ∅ ≠ T ⊆ N, to a structural model over N. A typical example of an operation of this kind is the operation of restriction to a set ∅ ≠ T ⊆ N mentioned on p. 12, which assigns the model M_T ≡ M ∩ T(T) to a formal independence model M ⊆ T(N). The basic observation is that a restriction of a structural model is a structural model.

Proposition 9.3. If M ∈ U(N) and ∅ ≠ T ⊆ N then M_T ∈ U(T).

Proof. Given m ∈ R^{P(N)}, consider its restriction m_T to P(T). By Proposition 5.1, m ∈ K(N) implies m_T ∈ K(T). Moreover, ⟨m, u⟨A,B|C⟩⟩ = ⟨m_T, u⟨A,B|C⟩⟩ for every ⟨A,B|C⟩ ∈ T(T). Thus, the model produced by m_T coincides with the restriction of the model produced by m to T. Hence, Proposition 9.3 follows from (5.15).

Theme 7. Let u be a baricentral imset over N and ∅ ≠ T ⊆ N. Is there any direct formula for the baricentral imset inducing (M_u)_T in terms of u?

One can consider an alternative version of the above problem: provided u ∈ S(N), is there an arithmetic formula which always gives a structural imset over T that induces the restriction of M_u to T?

Remark 9.4. In § 8.2.1 of [146] two other reductive operations with structural models are defined and shown to yield structural models. One of them is an operation which corresponds to the concept of a minor of a semi-graphoid introduced in Matúš [86].
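In the dictionary encoding used in the earlier sketches (again an assumption of these illustrations, not a notation from the book), the two restrictions appearing in the proof of Proposition 9.3 take one line each:

    def restrict_function(m, T):
        # restriction m_T of a set function m over N to P(T): keep S ⊆ T
        T = frozenset(T)
        return {S: v for S, v in m.items() if S <= T}

    def restrict_model(M, T):
        # restriction M_T = M ∩ T(T): keep triplets <A,B|C> with ABC ⊆ T
        T = frozenset(T)
        return {(A, B, C) for (A, B, C) in M if A | B | C <= T}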
[Figure not reproduced: two Hasse diagrams of P(N) for N = {a, b, c, d}; the upper picture shows the value levels of the criterion |S ∩ {a, b}|, the lower one those of the criterion |S ∩ {a, b, c}|.]
Fig. 9.1. Cardinality intersection criteria and respective levels for N = {a, b, c, d}.
[Figure not reproduced: two Hasse diagrams of P(N) for |N| = 4; the upper picture shows the value levels of the criterion |S ∩ {a, b, c, d}|, the lower one those of the composed criterion [ |S ∩ {a, b, c}|, |S ∩ {d}| ].]
Fig. 9.2. Further cardinality intersection criteria and respective levels for |N| = 4.
[Figure not reproduced: the Hasse diagram of P(N) for |N| = 4 with the value levels of the composed criterion [ |S ∩ {a, b}|, |S ∩ {c, d}| ].]
Fig. 9.3. The last cardinality intersection criterion and respective levels for |N| = 4.
9.2.2 Expansive operations

Operations of this kind assign a formal independence model over N to a formal independence model over T, ∅ ≠ T ⊆ N. Four examples of these operations are given in § 8.2.2 of Studený [146]. It is also shown there that those four operations ascribe structural models to structural models. One of the operations is the ascetic extension as(M, N) ≡ T∅(N) ∪ M of M ⊆ T(T) with ∅ ≠ T ⊆ N. This concept is a natural counterpart of the concept of restriction mentioned in Section 9.2.1. Actually, it is the least semi-graphoid extension over N of a semi-graphoid M over T, that is, the least semi-graphoid whose restriction to T is M.

Proposition 9.4. If M ∈ U(T), ∅ ≠ T ⊆ N, then as(M, N) ∈ U(N) and as(M, N)_T = M.

Proof. Let M be induced by v ∈ S(T). Consider the zero extension u ∈ S(N) of v given by

u(S) = v(S) if S ⊆ T, and u(S) = 0 otherwise, for S ⊆ N,

and observe that as(M, N) is induced by u. Note that to this end Lemma 6.2 can be used. Another important fact is that every r ∈ K(T) can be extended to m ∈ K(N) by the formula m(S) = r(S ∩ T), S ⊆ N.
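Both extensions used in the proof can be spelled out explicitly. The following Python sketch (an illustration under the same dictionary encoding as before) does so; the helper subsets is the one from the earlier fragments.

    from itertools import combinations

    def subsets(N):
        return [frozenset(c) for r in range(len(N) + 1)
                for c in combinations(sorted(N), r)]

    def zero_extension(v, N):
        # u over N with u(S) = v(S) for S ⊆ T and u(S) = 0 otherwise,
        # as in the proof of Proposition 9.4; v is keyed by frozensets S ⊆ T
        u = {S: 0 for S in subsets(N)}
        u.update(v)
        return u

    def supermodular_extension(r, N, T):
        # m over N given by m(S) = r(S ∩ T), S ⊆ N, with r defined on P(T)
        T = frozenset(T)
        return {S: r[S & T] for S in subsets(N)}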
9.2.3 Cumulative operations

Operations of this kind ascribe a formal independence model to a pair of formal independence models. More specifically, given U, V ⊆ N such that UV = N, a cumulative operation should ascribe a model M_3 over N to a pair M_1 ⊆ T(U), M_2 ⊆ T(V) which is consonant, that is, (M_1)_{U∩V} = (M_2)_{U∩V}. The intended interpretation is that the knowledge represented by M_1 and M_2 is put together in M_3. Note that the motivation source for the idea to introduce some cumulative operations with structural independence models is an abstract concept of conditional product introduced in Dawid and Studený [32] and the intention to utilize this operation for the purpose of decomposition of a structural model (see Section 9.2.4). The following concept could serve as a simple example of a cumulative operation with structural imsets. Given ⟨A,B|C⟩ ∈ T(N) and M_1 ∈ U(AC), M_2 ∈ U(BC) such that (M_1)_C = (M_2)_C, by a weak composition of M_1 and M_2 we will understand the structural model M_1 ⊗ M_2 over ABC given by

M_1 ⊗ M_2 = cl_{U(ABC)}( as(M_1, ABC) ∪ as(M_2, ABC) ∪ {⟨A,B|C⟩} ).  (9.2)

In words, both M_1 and M_2 are embedded into U(ABC) by the ascetic extension, then the triplet ⟨A,B|C⟩ is added and the structural closure operation (see p. 143) is applied. It follows directly from the definition that M_1 ⊗ M_2 ∈ U(ABC).

Theme 8. Suppose ⟨A,B|C⟩ ∈ T(N), M_1 ∈ U(AC), M_2 ∈ U(BC) with (M_1)_C = (M_2)_C. Is it true that (M_1 ⊗ M_2)_{AC} = M_1 and (M_1 ⊗ M_2)_{BC} = M_2? Can the domain of the operation ⊗ defined by (9.2) be suitably restricted so that the axioms of (an abstract operation of) conditional product from [32] are fulfilled for ⊗? Is there an algebraic formula for the baricentral imset of M_1 ⊗ M_2 on the basis of the baricentral imsets inducing M_1 and M_2?

9.2.4 Decomposition of structural models

The goal of the considerations concerning operations with structural models is to develop an effective method of "local" representation of structural independence models. More precisely, we would like to represent a structural model over a large set of variables in computer memory by means of a list of structural models' representatives over small sets of variables. To this end, we need a suitable concept of decomposition of a structural model. The first attempt at the definition of this concept is the following; to a great extent it is related to the concept of a weak composition introduced in the previous section. Let M be a structural model over N and U, V ⊆ N be such that UV = N. We say that the pair (M_U, M_V) is a weak decomposition of M if M = M_U ⊗ M_V. This weak decomposition is proper if U \ V ≠ ∅ ≠ V \ U. Note that (M_U)_{U∩V} = (M_V)_{U∩V}, which means that the
weak composition M_U ⊗ M_V is always defined. Clearly, a necessary condition for the existence of a weak decomposition (M_U, M_V) is the CI statement U \ V ⊥⊥ V \ U | U ∩ V [M] (see p. 12).

Motivational thoughts about the concept of decomposition

The motive for the open problems formulated below is the concept of a decomposition of a UG model from Lauritzen [70] and Cowell et al. [26]. Let us say that an undirected graph G over N has a proper decomposition if there exists ⟨A,B|C⟩ ∈ T(N) \ T∅(N) such that A ⊥⊥ B | C [G] (see p. 43) and G_C is a complete graph. The graphs G_{AC} and G_{BC} then form a (proper) decomposition of G. The above concept allows us to classify subsets of N. A set T ⊆ N is called a prime set relative to G if G_T has no proper decomposition. An undirected graph G over N is called a prime graph if the set N is a prime set relative to G. It is evident that every complete set in an undirected graph H is a prime set relative to H; but the converse implication does not hold, as the following example shows.

Example 9.4. There exists a prime graph over N = {a, b, c, d} which is not a complete graph. Let us consider the graph shown in Figure 9.4. The example also shows that a subset of a prime set need not be a prime set. Indeed, the set {a, b, c} is not a prime set in the considered graph. ♦
[Figure not reproduced: an undirected graph over the nodes a, b, c, d.]
Fig. 9.4. A prime graph over N = {a, b, c, d} which is not complete.
The maximal sets (with respect to inclusion) which are prime sets relative to an undirected graph G over N will be called prime components of G. The collection of prime components will be denoted by P^max_pri(G).

It is clear that we can try to apply the operation of proper decomposition to any undirected graph G over N; if successful, we can then try to apply it to one of the induced subgraphs that were obtained. Thus, we can successively decompose G into a system of prime graphs. Nevertheless, the entire result of that successive decomposition need not be unique. Indeed, let us consider the graph G over N = {a, b, c} which has only the line b – c. The first option is that one decomposes G into G_{a,b} and G_{b,c} using a ⊥⊥ c | b [G] and then decomposes G_{a,b} into G_{a} and
G_{b} because of a ⊥⊥ b | ∅ [G]. The resulting list of prime graphs is G_{a}, G_{b} and G_{b,c}. The second option is to decompose G into G_{a} and G_{b,c} using a ⊥⊥ {b, c} | ∅ [G]. The graph G_{b} in the first list of prime graphs seems to be superfluous because {b} is not a prime component of G. Thus, we are interested in those successive decompositions of an undirected graph G into subgraphs that lead to prime components.

In my view, the result from Leimer [72] and Matúš [83] can be paraphrased as follows: for every undirected graph G over N, the class of sets P^max_pri(G) satisfies the running intersection property (see p. 55) and, therefore, it is a system of cliques of a triangulated undirected graph H over N, which can be called a canonical triangulation of G (evidently G is a subgraph of H). Then one can say that the collection of induced subgraphs {G_T ; T ∈ P^max_pri(G)} together with the triangulated graph H defines a canonical decomposition of G into prime components.

The concept of (proper) decomposition of an undirected graph G into its induced subgraphs G_{AC} and G_{BC} has a special significance. The existence of the decomposition allows one to simplify the task of finding the maximum likelihood estimate (see p. 246) in the statistical model M_G (= the class of Markovian probability measures with respect to G) to the task of finding the maximum likelihood estimates in the statistical models M_{G_{AC}} and M_{G_{BC}} – see § 4.3 in Lauritzen [70]. The desired "global" estimate in M_G can be obtained from "local" estimates in M_{G_T}, T ∈ P^max_pri(G), by means of the product formula (3.4) for the triangulated graph H. The point is as follows: if the prime components of G are sets of small cardinality then the task of computing the maximum likelihood estimate is manageable, unlike the task of finding the maximum likelihood estimate in M_G, which involves too many variables. The same idea is utilized in the paper by Matúš [83], where the existence of a canonical decomposition of an undirected graph is utilized to transform the task of computing the maximum entropy extension of a collection of consonant discrete probability measures for a large set of variables to the task of computing several maximum entropy extensions for small sets of variables.

Moreover, an analogous idea is behind the method of local computation [26], which is a classic method used to overcome computational problems in the case of DAG models. The essential idea of this method is that the original representation of a discrete probability measure over N in the form of a recursive factorization (8.1) is transformed by a series of graphical procedures into a representation in the form of potentials {g_T ; T ∈ C} where C is the collection of cliques of a triangulated graph over N. These potentials could be marginals of the original probability measure P, i.e., the measure can then be expressed in terms of its marginals for T ∈ C by means of the above-mentioned product formula (3.4).

Remark 9.5. One equivalent definition of a triangulated graph is as follows (cf. [83]). An undirected graph is triangulated iff its prime components are complete sets in the graph. In other words, a graph is triangulated iff
complete graphs can be obtained by successive application of the operation of proper decomposition (see Proposition 2.5 in Lauritzen [70]). That is why some authors call triangulated graphs decomposable. In my view, this terminology is slightly misleading: the adjective "decomposable" usually means that a graph can be decomposed, which in the considered situation means that it is not a prime graph. However, the above terminology is generally accepted in the field of graphical models.

The concept of a canonical decomposition of an undirected graph G can be interpreted as follows. The graph G can be viewed as a mathematical object which somehow represents our knowledge about the UG model M_G, or our knowledge about the statistical model M_G. The decomposition can be viewed as a form of parsimonious representation of the knowledge about the "big" model M_G by means of
• a system of knowledge representatives of small "local" models M_{G_T}, and
• a "formula" to "compute" a knowledge representative of the "big" global model on the basis of "local" representatives.
The way to "compute" the global knowledge is highly effective – it is nothing but the application of the method of local computation.

The above-mentioned interpretation leads to a natural conjecture that an undirected graph can possibly be replaced by another object of discrete mathematics which describes the models of CI structure. This step should enlarge the applicability of the method of local computation to more general classes of models of CI structure. I think that the concept of weak decomposition defined above should have some connection to a hypothetical concept of decomposition of a structural model. This motivated me to put forth the following open problem.

Theme 9. Let G be an undirected graph over N and let a non-trivial triplet ⟨A,B|C⟩ ∈ T(N) \ T∅(N) define a proper decomposition of G. Is the pair ((M_G)_{AC}, (M_G)_{BC}) a weak decomposition of M_G? Is there an analog of the concept of a prime graph within the framework of (classic) chain graph models?

However, the goal of the motivational thoughts above is the following open problem.

Direction 2. Is there a concept of decomposition of a structural model which generalizes the concept of decomposition of a UG model such that an analog of the result about the existence of a canonical decomposition holds? Try to find sufficient conditions for the existence of a decomposition of this type which can be verified by statistical tests of CI or on the basis of knowledge provided by human experts.

If there is a generalization of the result on canonical decomposition of a UG model, then every structural model over N should have a canonical decomposition into a decomposable model M_H over N and a series of structural
models M_T over T ∈ C, where C is the class of cliques of H. This should make it possible to develop a generalized local computation method applicable to a wider class of structural models, namely to those for which the sets T ∈ C in the hypothetical decomposition have small cardinality. The next step should be an analog of Shenoy's pictorial method of valuation networks [117] for local representation of structural imsets and Markovian measures in the memory of a computer.
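The running intersection property mentioned above is the key technical condition here. The following Python fragment is a simple greedy sketch of mine for finding such an ordering of a given class of sets; it only illustrates the property and is not a complete decision procedure (the greedy choice may fail on some set families even when an admissible ordering exists).

    def rip_order(sets):
        # try to order the sets so that each one meets the union of its
        # predecessors inside a single predecessor (running intersection)
        sets = [frozenset(C) for C in sets]
        if not sets:
            return []
        order, rest = [sets[0]], sets[1:]
        while rest:
            union = frozenset().union(*order)
            for C in rest:
                if any((C & union) <= P for P in order):
                    order.append(C)
                    rest.remove(C)
                    break
            else:
                return None   # greedy search failed to place a next set
        return order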
9.3 Implementation tasks

These open problems are motivated by the task of implementing independence implication on a computer. The most important question is probably the next one.

Question 7. Is every structural imset u (see p. 73) over N already a combinatorial imset (see p. 72) over N?

If the answer to Question 7 is negative then the following two problems may become a topic of immediate interest.

Theme 10. Given a finite non-empty set of variables N, find the least finite class H(N) of structural imsets such that

∀ u ∈ S(N)   u = Σ_{v ∈ H(N)} k_v · v for some k_v ∈ Z⁺.

Recall that the existence of the class H(N), named a minimal integral Hilbert basis of con(E(N)), follows from Theorem 16.4 in Schrijver [113]. One has E(N) = H(N) iff S(N) = C(N).

Theme 11. Given a finite non-empty set of variables N, determine the smallest n* ∈ N such that an imset over N is structural iff its multiple n* · u is a combinatorial imset, that is,

∀ u ∈ Z^{P(N)}   u ∈ S(N) ⇔ n* · u ∈ C(N).

Determine the smallest n** ∈ N satisfying

∀ u ∈ Z^{P(N)}   u ∈ S(N) ⇔ [ ∃ n ∈ N, n ≤ n**, with n · u ∈ C(N) ].

Find out how the values n* and n** depend on |N|.

Note that the existence of n* and n** follows from Lemma 6.3 and one has n** ≤ n*. At this stage, I am not able to decide whether the inequality is strict. Indeed, n* = 1 ⇔ n** = 1 ⇔ S(N) = C(N). Another important question concerns the ℓ-skeleton.
Question 8. Let K_ℓ(N) be the ℓ-skeleton over N (see p. 93) and E(N) the class of elementary imsets over N (p. 69). Is the condition

min { ⟨m, u⟩ ; u ∈ E(N), ⟨m, u⟩ ≠ 0 } = 1

fulfilled for every m ∈ K_ℓ(N)?

Note that if the condition in Question 8 is true then gra(N) = gra*(N) (cf. Remark 6.7 on p. 123). The following problem becomes relevant if both Question 7 and Question 8 have negative answers.

Theme 12. How does the value of the smallest l ∈ N satisfying the condition

∀ u ∈ S(N) ∀ v ∈ E(N)   u ⇛ v ⇔ l · u − v ∈ S(N)  (9.3)

depend on |N| (see Section 6.3.2)? Can we determine gra(N) directly without finding the skeleton, that is, without solving Theme 5?

Recall that if either Question 7 or Question 8 has a positive answer then the smallest l ∈ N satisfying (9.3) is gra*(N) (cf. Lemma 6.4 and Remark 6.7). The following open problem is a modification of the preceding problem. It may appear sensible if the answer to Question 7 is negative.

Theme 13. Is there the smallest l* ∈ N such that

∀ u ∈ S(N) ∀ v ∈ E(N)   u ⇛ v ⇔ l* · u − v ∈ C(N) ?

How does l* depend on |N| then? Is there a structural imset u ∈ S(N) such that the condition (6.4) in Remark 6.3 is not fulfilled?

The last open problem of this section also concerns independence implication.

Direction 3. Is there a method of testing independence implication which combines direct and skeletal criteria (see Lemma 6.1 on p. 115 and Lemma 6.2 on p. 118) and which is more suitable for efficient implementation on a computer?

The above formulation is partially vague; let me specify in more detail what I have in mind. The direct criterion of the implication u ⇛ v consists in testing whether k · u − v ∈ C(N) for some k ∈ N. This can be tested recursively as mentioned in Remark 6.3. However, the majority of the "transient" imsets obtained during the "decomposition" procedure are not combinatorial imsets. The fact that a "transient" imset w is not a combinatorial imset can often be recognized immediately by means of Theorem 5.1, which is a basis of the skeletal criterion for testing independence implication. That can save superfluous steps of the recursive "decomposition" procedure. The desired conclusion about w can be made on the basis of the fact that ⟨m, w⟩ < 0 for a supermodular imset m over N taken from a prescribed class of imsets, for example, the
imset m_{A↑} resp. m_{A↓} for A ⊆ N (p. 39) or m_l for l = 0, …, |N| − 2 (p. 70). The point is that we need not have the whole skeleton at our disposal to do that. In fact, Remark 6.8 is based just on observations of this type. Note that another conceivable method for testing independence implication is to transform it to a classic maximization problem of linear programming – see § 5 of Studený [147] for a discussion of this topic.
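To make the combined procedure concrete, here is a rough Python sketch of the direct criterion together with the skeletal-type pruning step just described. It is only my illustration: imsets are dictionaries over frozensets, the pruning uses the imsets m_{A↑} alone, and the search is exponential.

    from itertools import combinations

    def elementary_imsets(N):
        # all u_<a,b|C> = d(abC) + d(C) - d(aC) - d(bC) over N
        N = sorted(N)
        for a, b in combinations(N, 2):
            rest = [x for x in N if x not in (a, b)]
            for r in range(len(rest) + 1):
                for C in combinations(rest, r):
                    C = frozenset(C)
                    yield {C | {a, b}: 1, C: 1, C | {a}: -1, C | {b}: -1}

    def degree(w):
        # <m*, w> with m*(S) = |S|(|S|-1)/2; every elementary imset has
        # degree 1, so any decomposition of w has exactly degree(w) summands
        return sum(v * len(S) * (len(S) - 1) // 2 for S, v in w.items())

    def is_combinatorial(w, N):
        # recursive test whether w is a sum of elementary imsets (Remark 6.3)
        if any(sum(v for S, v in w.items() if A <= S) < 0  # <m_{A-up}, w>
               for A in w):                                # necessary condition
            return False
        d = degree(w)
        if d <= 0:
            return all(v == 0 for v in w.values())
        for u in elementary_imsets(N):
            w2 = dict(w)
            for S, c in u.items():
                w2[S] = w2.get(S, 0) - c
            if is_combinatorial(w2, N):
                return True
        return False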
9.4 Interpretation and learning tasks

In this section, open problems loosely motivated by "practical" questions of interpretation and learning from Section 1.1 are gathered.

9.4.1 Meaningful description of structural models

The following two open problems are motivated by the concept of a standard imset for an acyclic directed graph from Section 7.2.1.

Question 9. Let G be an acyclic directed graph over N (p. 220). Is it true that the standard imset for G (see p. 135) is the only imset from the class of combinatorial imsets inducing M_G that is simultaneously an imset of the smallest degree (p. 141) and an imset with the least lower class (p. 146)?

In my view, the concept of a standard imset for an acyclic directed graph emphasizes some interpretable aspects of the respective DAG model. Thus, a natural question arises whether the concept of a standard imset for a DAG model can be generalized.

Direction 4. Is there a consistent principle of unique choice of representatives of classes of independence equivalence (see p. 113) such that, for every acyclic directed graph G over N, the standard imset u_G is chosen from the class ℘ = {u ∈ S(N) ; M_u = M_G}?

Another open problem is motivated by the ideas taken from Section 6.4.

Direction 5. Look for conditions that are implied by independence equivalence of structural imsets, are formulated in terms of invariants of independence equivalence, are easy to verify and offer a clear interpretation. The aim is to find a complete set of conditions of this type, that is, a set of conditions that is able to recognize every pair of structural imsets which are not equivalent.

The desired complete set of interpretable invariants could then become a basis of an alternative way to describe structural models, which should be suitable from the point of view of interpretation.
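For DAG models the representative in question is explicit. The following small Python sketch (the encoding is mine) computes the standard imset of an acyclic directed graph from its parent sets, assuming the formula u_G = δ_N − δ_∅ + Σ_{i∈N} ( δ_{pa_G(i)} − δ_{{i}∪pa_G(i)} ) recalled from Section 7.2.1.

    def standard_imset(parents):
        # parents maps each node of an acyclic directed graph to its parent set
        N = frozenset(parents)
        u = {}
        def add(S, c):
            S = frozenset(S)
            u[S] = u.get(S, 0) + c
            if u[S] == 0:
                del u[S]
        add(N, 1)
        add(frozenset(), -1)
        for i, pa in parents.items():
            add(pa, 1)
            add(set(pa) | {i}, -1)
        return u

    # the collider a -> c <- b yields u = d({a,b}) + d(0) - d({a}) - d({b}),
    # that is, the elementary imset u_<a,b|0> expressing a ⊥⊥ b
    print(standard_imset({'a': set(), 'b': set(), 'c': {'a', 'b'}}))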
9.4.2 Tasks concerning distribution frameworks

The open problems mentioned below are more or less concerned with the concept of a distribution framework (see Section A.9.5 and Section 6.5).

Theme 14. Let Ψ be a class of probability measures over N satisfying the conditions (6.20) and (6.21) on p. 127. Let Ψ(u) denote the class of Markovian measures with respect to u ∈ S(N) given by (6.1) on p. 113 and S_Ψ(N) the class of Ψ-representable structural imsets over N (p. 127). Is the condition

∀ u, v ∈ S_Ψ(N)   u ⇛ v ⇔ Ψ(u) ⊆ Ψ(v)

fulfilled then? If not, what additional assumptions on Ψ are needed to ensure the validity of the condition?

Question 10. Let M be a structural model over N, U = U_u be the upper class (p. 73) of u ∈ S(N) with M_u = M and D ⊆ U be a unimarginal class for M (see Section 7.4.1, p. 145). Is D then necessarily a determining class for M?

Note that the above question can also be formulated relative to a distribution framework Ψ (see Remark 7.7 on p. 146).

Theme 15. Let Ψ_1, Ψ_2 be classes of probability measures satisfying (6.20) and M be a structural model over N. May it happen that minimal unimarginal classes for M relative to Ψ_1 and Ψ_2 differ? More specifically, what is the answer if the class of discrete measures (p. 11) is put in place of Ψ_1 and the class of regular Gaussian measures (p. 30) is put in place of Ψ_2?

The last two open problems in this section are related to mathematical statistics. The first one is a "parameterization problem".

Direction 6. Find out for which structural imsets u over N and for which classes Ψ of probability measures with a prescribed sample space (X_N, 𝒳_N) = ∏_{i∈N} (X_i, 𝒳_i) a suitable parameterization of the class Ψ(u) of Markovian measures with respect to u can be defined.

Note that one is interested in parameterization by means of "independent" parameters, that is, parameterizations in which elements of Ψ(u) correspond to parameters belonging to a polyhedron in R^n for some n ∈ N – see, for example, the parameterization (8.2) of M_G for G ∈ DAGS(N). Another typical example is the parameterization of a regular Gaussian measure which is Markovian with respect to an acyclic directed graph [9, 107].

Direction 7. Can the informal concept of a distribution framework (see Section A.9.5) be formalized and defined in an axiomatic way? Try to clarify under which operations a general class of probability measures should be closed to constitute what is meant by a distribution framework.
9.4.3 Learning tasks

The open problems gathered in this section are motivated by Chapter 8. The first two of them concern the concept of a regular quality criterion from Section 8.2.4. The following problem was indicated in Remark 8.9.

Theme 16. Clarify what assumptions on the priors π_G, G ∈ DAGS(N) (see Remark 8.9), ensure that the respective LML criterion for learning DAG models is strongly regular. Derive the respective formula for the data vector relative to the LML criterion (see p. 185) in terms of the hyperparameters of the Dirichlet measures.

The conjectured formula for the data vector in terms of hyperparameters should be an analog of the formula (8.39) and could be a basis for comparison of different criteria for learning DAG models. The next open problem is motivated by the observations made in Section 8.4.

Direction 8. Can the fact derived in Lemma 8.7, namely, that a regular criterion for learning DAG models is a shifted linear function, be utilized in an alternative method for finding the maximal value of the criterion by means of procedures of integer programming?

One of the typical tasks of integer programming is to find the maximum of a linear function on the set of integral vectors from a certain bounded polyhedron. Perhaps some of the algorithms developed in that area can be successfully applied to the similar problem mentioned here. The other motive for Direction 8 is that the links in the Hasse diagram of the poset of DAG models correspond to elementary imsets (see Remark 8.11) and the number of elementary imsets is limited. Perhaps this observation can be utilized somehow.

The next open problem concerns the local search method (see p. 162) and the idea of applying the method of structural imsets in this area (see Remark 8.13). It can also be viewed as an extension of the problem of characterization of the inclusion neighborhood (see Section 8.3 and Theme 2).

Theme 17. Let u_G be the standard imset for an acyclic directed graph G over N (see p. 135). Is it possible to characterize the inclusion neighbors of M_G (see p. 177) directly in terms of u_G? In other words, is there any criterion to find, on the basis of u_G, all differential imsets (see p. 182) that correspond to the moves from M_G to its lower neighbors and to its upper neighbors in the sense of inclusion neighborhood?

Motivational thoughts about learning general CI structures

The last open problem concerns a particular phenomenon mentioned in Remarks 8.11 and 8.14, namely the fact that the moves between neighboring
DAG models have a CI interpretation if one uses a regular quality criterion for learning DAG models. This is a very pleasant property because it is advantageous both from the point of view of interpretation and from the point of view of computer implementation of a local search method (see p. 162). The consequence of the above fact about the universum of DAG models is that the same statement is true for the universum of decomposable models, which is embedded in that universum (see Figure 3.6). Note that this statement is valid regardless of which regular criterion is taken into consideration. Thus, the local search method combined with a CI interpretation of moves can also be successfully applied within the universum of decomposable (UG) models if classic quality criteria for learning decomposable models are used.

A natural question is whether these classic criteria can be extended to a wider universum of models of CI structure in such a way that the interpretation of moves to neighboring models is kept. Unfortunately, the answer is negative. The following example shows that the classic criteria cannot be extended from the universum of decomposable graphs to the universum of undirected graphs in that way.

Example 9.5. The concept of a standard imset for a triangulated undirected graph (see Section 7.2.2) cannot be extended to general undirected graphs in such a way that the difference between imsets representing models which are in inclusion would be a structural imset. In particular, the consequence is that usual quality criteria like the MLL criterion, the AIC criterion and the BIC criterion, which have the form (8.32), cannot be extended to the universum of undirected graphs in such a way that a CI interpretation of moves between neighboring models in the sense of Corollary 8.5 is possible.

To show this, consider the universum of undirected graphs over N = {a, b, c, d} and the undirected graphs K, L, G, H shown in Figure 9.5. Observe that M_K ⊆ M_H, M_G ⊆ M_L, and that K, L and H are triangulated. Suppose that there exists ū ∈ S(N) such that u_L − ū ∈ S(N), ū − u_K ∈ S(N) and ū induces M_G. It suffices to show that these assumptions lead to a contradiction.

Indeed, by Proposition 6.2 we derive that v ≡ ū − u_K ∈ C(N) and w ≡ u_L − ū ∈ C(N). Thus, put u ≡ u⟨a,d|c⟩ + u⟨a,b|d⟩ = u_L − u_K = w + v. The next step is to show that one of the following four options occurs: v = 0, v = u, v = u⟨a,b|d⟩ or v = u⟨a,d|c⟩. To this end, observe that the level-degrees of u (see p. 72) are as follows: deg(u, 0) = deg(u, 2) = 0 and deg(u, 1) = 2. Hence, deg(v, 0) = deg(w, 0) = 0, deg(v, 2) = deg(w, 2) = 0 and 0 ≤ deg(v, 1) ≤ 2 by Corollary 4.2. If deg(v, 1) = 0 then deg(v) = 0, which means v = 0. If deg(v, 1) = 2 then deg(w, 1) = deg(u, 1) − deg(v, 1) = 0 implies deg(w) = 0, which means w = 0 and v = u. If deg(v, 1) = 1 then deg(w, 1) = 1 and one can conclude v, w ∈ E_1(N). Since 1 = ⟨m_{abd↑}, u⟩ = ⟨m_{abd↑}, v⟩ + ⟨m_{abd↑}, w⟩ by Proposition 5.1, one has ⟨m_{abd↑}, v⟩ ∈ {0, 1}. If ⟨m_{abd↑}, v⟩ = 1 then observe that the only imsets in E_1(N) which satisfy this equality are u⟨a,b|d⟩, u⟨a,d|b⟩ and u⟨b,d|a⟩. However, ⟨m_{a↓}, u⟩ = 0 = ⟨m_{b↓}, u⟩ together with Proposition 5.1 implies ⟨m_{a↓}, v⟩ = 0 = ⟨m_{b↓}, v⟩, which means that one necessarily obtains v = u⟨a,b|d⟩. If ⟨m_{abd↑}, v⟩ = 0 then ⟨m_{abd↑}, w⟩ = 1 and the same consideration gives w = u⟨a,b|d⟩, which means v = u⟨a,d|c⟩. That is what was needed to show.

Now, the cases v = 0, v = u and v = u⟨a,b|d⟩ are excluded because G differs from K, L and H. The only remaining option is v = u⟨a,d|c⟩. However, the respective imset ū = u_K + v = u⟨b,c|ad⟩ + u⟨a,d|c⟩ does not induce M_G because ⟨a,d|{b,c}⟩ ∈ M_G \ M_ū. To evidence this, observe that, for every k ∈ N, ⟨m_{abd↑}, k·(u_K + v) − u⟨a,d|{b,c}⟩⟩ = −1, which means that k·(u_K + v) − u⟨a,d|{b,c}⟩ ∉ S(N) by Proposition 4.4. ♦
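The scalar products used in the example are mechanical to verify. The following short Python fragment (an illustrative sketch; the indicator form m_{A↑}(S) = 1 iff A ⊆ S is the one recalled on p. 39) reproduces the value ⟨m_{abd↑}, u⟩ = 1 for u = u⟨a,d|c⟩ + u⟨a,b|d⟩.

    def elem(a, b, C):
        # elementary imset u_<a,b|C> = d(abC) + d(C) - d(aC) - d(bC)
        C = frozenset(C)
        return {C | {a, b}: 1, C: 1, C | {a}: -1, C | {b}: -1}

    def m_up(A):
        # supermodular indicator m_{A-up}: m(S) = 1 iff A ⊆ S
        return lambda S: 1 if frozenset(A) <= S else 0

    def scalar(m, u):
        # <m, u> = sum of m(S) * u(S)
        return sum(m(S) * c for S, c in u.items())

    u = {}
    for t in (elem('a', 'd', {'c'}), elem('a', 'b', {'d'})):
        for S, c in t.items():
            u[S] = u.get(S, 0) + c

    print(scalar(m_up({'a', 'b', 'd'}), u))   # prints 1, as in the example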
[Figure not reproduced: the four undirected graphs L, H, G and K over N = {a, b, c, d} from Example 9.5; the diagram also shows the imsets w = u_L − ū and v = ū − u_K, the moves labeled u⟨a,d|c⟩ and u⟨a,b|d⟩, and the unknown ū = ?]
Fig. 9.5. Undirected graphs from Example 9.5.
The preceding consideration and counterexample, therefore, lead to the following open problem.

Direction 9. Is there any other way to derive quality criteria for learning models of CI structure such that the moves between neighboring models (in the sense of inclusion neighborhood) have a CI interpretation? Propose a suitable method of learning structural models on the basis of data.

One of the possible ways to derive a quality criterion which admits a CI interpretation of "moves" is as follows. One chooses a suitable representative of
every structural model over N such that the difference between representatives of neighboring models is always a structural imset. An example of such a choice is the choice of a baricentral imset (see Proposition 7.1). One can also choose a special way to represent data and penalize model complexity by introducing a special formula for a data vector, which should be a real vector whose components correspond to subsets of N. The criterion can then be introduced as the scalar product of the data vector and the imsetal representative of the model (cf. Section 8.4.2).

An alternative methodological approach to learning statistical models of CI structure could be as follows. One can introduce a suitable distance on the set of probability measures belonging to a considered distribution framework (with a fixed sample space). Then one can compute the distance of the empirical measure from the respective statistical model of CI structure, which is the set of Markovian measures with respect to a respective structural imset.
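As a toy illustration of the first proposal (everything here – the encoding, the names and the shape of the data vector – is an assumption of mine, not a worked-out method), the criterion and one step of the local search could look as follows in Python.

    def criterion(data_vector, representative):
        # hypothetical quality criterion: the scalar product of a data vector,
        # with components indexed by subsets of N, and an imsetal
        # representative of the model (cf. Section 8.4.2)
        return sum(data_vector.get(S, 0.0) * c
                   for S, c in representative.items())

    def local_search_step(current, neighbors, data_vector):
        # move to the best-scoring neighboring representative, if it improves
        best = max(neighbors, key=lambda u: criterion(data_vector, u))
        if criterion(data_vector, best) > criterion(data_vector, current):
            return best
        return None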
A Appendix
University graduates in mathematics should be familiar with the majority of the concepts and facts gathered in the Appendix. However, certain misunderstandings can occur regarding their exact meanings and, moreover, graduates in other fields, in particular in computer science and statistics, may not be familiar with all basic facts. Thus, to avoid misunderstanding and to facilitate reading I decided to recall these concepts here. The aim is to provide the reader with a reference source for well-known facts. It can be easily utilized with the help of the Index.
A.1 Classes of sets

By a singleton a set containing only one element is understood; the symbol ∅ is reserved for the empty set. The symbol S ⊆ T (also T ⊇ S) denotes that S is a subset of T (alternatively, T is a superset of S), which involves the situation S = T. However, strict inclusion is denoted as follows: S ⊂ T or T ⊃ S means that S ⊆ T but S ≠ T. The power set of a non-empty set X is the class of all of its subsets {T ; T ⊆ X}, denoted by P(X). The symbol ⋃D denotes the union of a class of sets D ⊆ P(X); the symbol ⋂D denotes the intersection of a class D ⊆ P(X). Supposing N is a non-empty finite set (of variables) and A, B ⊆ N, the juxtaposition AB will be used as a shorthand for A ∪ B. A class D ⊆ P(N) is called ascending if it is closed under supersets, that is,

∀ S, T ⊆ N   S ∈ D, S ⊆ T ⇒ T ∈ D.

Given D ⊆ P(N), the induced ascending class, denoted by D↑, is the least ascending class containing D, that is, D↑ = {T ⊆ N ; ∃ S ∈ D  S ⊆ T}.
Analogously, a class D ⊆ P(N) is called descending if it is closed under subsets, that is,

∀ S, T ⊆ N   S ∈ D, T ⊆ S ⇒ T ∈ D.
Given D ⊆ P(N), the induced descending class D↓ consists of subsets of sets in D, that is, D↓ = {T ⊆ N ; ∃ S ∈ D  T ⊆ S}. A set S ∈ D, where D ⊆ P(N), is called a maximal set of D if ∀ T ∈ D  S ⊆ T ⇒ S = T; the class of maximal sets of D is denoted by D^max. Clearly, D^max = (D↓)^max and D↓ = (D^max)↓. Dually, a set S ∈ D is called a minimal set of D if ∀ T ∈ D  T ⊆ S ⇒ S = T, and D^min denotes the class of minimal sets of D. By a permutation on a finite non-empty set N we will understand a one-to-one mapping π : N → N. It can also be viewed as a mapping on the power set P(N) which assigns the set π(S) ≡ {π(s); s ∈ S} to every set S ⊆ N. Then, given a real function m : P(N) → R, the juxtaposition mπ will denote the composition of m and π defined by S → m(π(S)) for S ⊆ N.
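These class operations are trivial to compute for a small N; the following Python fragment (an illustration, not part of the book) implements D↑ and D^max directly from the definitions.

    from itertools import combinations

    def power_set(X):
        return [frozenset(c) for r in range(len(X) + 1)
                for c in combinations(sorted(X), r)]

    def ascending_class(D, N):
        # D-up: the least ascending class containing D
        return {T for T in power_set(N) if any(S <= T for S in D)}

    def maximal_sets(D):
        # D-max: members of D not strictly contained in another member
        return {S for S in D if not any(S < T for T in D)}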
A.2 Posets and lattices

A partially ordered set (L, ⪯), briefly a poset, is a non-empty set L endowed with a partial ordering ⪯, that is, a binary relation on L which is
(i) reflexive: ∀ x ∈ L  x ⪯ x,
(ii) transitive: ∀ x, y, z ∈ L  x ⪯ y, y ⪯ z ⇒ x ⪯ z,
(iii) antisymmetric: ∀ x, y ∈ L  x ⪯ y, y ⪯ x ⇒ x = y.
The phrase total ordering is used if, moreover, ∀ x, y ∈ L either x ⪯ y or y ⪯ x. A quasi-ordering is a binary relation which satisfies (i) and (ii). Given a poset (L, ⪯) and x, y ∈ L, one writes x ≺ y for x ⪯ y and x ≠ y. If x ≺ y and there is no z ∈ L such that x ≺ z and z ≺ y, then x is called a lower neighbor of y and y an upper neighbor of x. Given M ⊆ L, an element x ∈ M is a minimal element of M with respect to ⪯ if there is no z ∈ M with z ≺ x; y ∈ M is a maximal element of M with respect to ⪯ if there is no z ∈ M with z ≻ y. A finite poset L can be represented by a special diagram, sometimes named the Hasse diagram (see Faure and Heurgon [35], § I.1.10). In a diagram of this kind, elements of L are represented by ovals so that the oval representing y is higher than the oval representing x whenever y ≻ x. Moreover, a segment is drawn between the ovals representing x and y if y is an upper neighbor of x.

Given M ⊆ L, a supremum of M in L, denoted by sup M and alternatively called the least upper bound of M, is an element y ∈ L such that z ⪯ y for every z ∈ M but y ⪯ y′ for each y′ ∈ L with z ⪯ y′ for every z ∈ M. Owing to the antisymmetry of ⪯, a supremum of M is determined uniquely if it exists. Given x, y ∈ L, their join, denoted by x ∨ y, is the supremum of the set {x} ∪ {y}. A poset in which every pair of elements has a join is called a join semi-lattice. Note that it can be equivalently introduced as a pair (L, ∨)
where L is a non-empty set and ∨ a binary operation on L which satisfies some axioms – see § II.1.1 in Faure and Heurgon [35]. Analogously, an infimum of M ⊆ L, denoted by inf M and also called the greatest lower bound of M, is an element x ∈ L such that x ⪯ z for every z ∈ M but x′ ⪯ x for each x′ ∈ L with x′ ⪯ z for every z ∈ M. It is also determined uniquely if it exists. A meet of elements x, y ∈ L, denoted by x ∧ y, is the infimum of the set {x} ∪ {y}. A lattice is a poset (L, ⪯) such that, for every x, y ∈ L, there exist both the supremum x ∨ y and the infimum x ∧ y in L. The concept of a lattice can also be equivalently introduced as a set L endowed with two binary operations ∨ and ∧ which satisfy some axioms – see § II.1.2 in [35]. A lattice (L, ⪯) is distributive if for every x, y, z ∈ L

x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z)  and  x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z).

A typical example of a distributive lattice is a ring of subsets of a finite non-empty set N, that is, a collection R ⊆ P(N) which is closed under (finite) intersection and union. In particular, P(N), ordered by inclusion ⊆, is a distributive lattice.

A complete lattice is a poset (L, ⪯) such that every subset M ⊆ L has a supremum and an infimum in L. Note that to show that L is a complete lattice it suffices to show that every M ⊆ L has an infimum. Any finite lattice is an example of a complete lattice. By a null element of a complete lattice L the least element in L is understood, that is, x_0 ∈ L such that x_0 ⪯ z for every z ∈ L; it is nothing but the supremum of the empty set in L. By a unit element the greatest element in L is understood, that is, y_1 ∈ L such that z ⪯ y_1 for every z ∈ L. An element x of a complete lattice L is supremum-irreducible if x ≠ sup {z ∈ L ; z ≺ x} and infimum-irreducible if x ≠ inf {z ∈ L ; x ≺ z}. It is easy to see that an element of a finite lattice is supremum-irreducible iff it has exactly one lower neighbor, and it is infimum-irreducible iff it has exactly one upper neighbor. The set of supremum-irreducible elements in a finite lattice (L, ⪯) is the least set M ⊆ L that is supremum-dense, by which is meant that, for every x ∈ L, there exists M′ ⊆ M such that x = sup M′ (see Proposition 2 in § 0.2 of Ganter and Wille [42]). Analogously, the set of infimum-irreducible elements in L is the least set M ⊆ L which is infimum-dense, that is, for every y ∈ L, there exists M′ ⊆ M with y = inf M′. A standard example of a supremum-irreducible element in a complete lattice L is an atom of L, defined as an upper neighbor of the null element. By a coatom of L a lower neighbor of the unit element of L is understood. A complete lattice L is atomistic if the set of its atoms is supremum-dense; equivalently, if the only supremum-irreducible elements in L are its atoms. A complete lattice is coatomistic if the set of its coatoms is infimum-dense, that is, the only infimum-irreducible elements in L are its coatoms.

Two posets (L_1, ⪯_1) and (L_2, ⪯_2) are order-isomorphic if there exists a mapping φ : L_1 → L_2 onto L_2 such that
x ⪯_1 y ⇔ φ(x) ⪯_2 φ(y)  for every x, y ∈ L_1.

The mapping φ is then a one-to-one mapping between L_1 and L_2 and it is called an order-isomorphism. If the poset (L_1, ⪯_1) is a complete lattice then (L_2, ⪯_2) is also a complete lattice and φ is even a (complete) lattice isomorphism, by which is meant that

φ(sup M) = sup {φ(z); z ∈ M}  and  φ(inf M) = inf {φ(z); z ∈ M}  for every M ⊆ L_1.
A general example of a complete lattice can be obtained by means of a closure operation on subsets of a non-empty set X, by which is meant a mapping cl : P(X) → P(X) that is
(i) isotonic: ∀ S, T ⊆ X  S ⊆ T ⇒ cl(S) ⊆ cl(T),
(ii) extensive: ∀ S ⊆ X  S ⊆ cl(S),
(iii) idempotent: ∀ S ⊆ X  cl(cl(S)) = cl(S).
A set S ⊆ X is called closed with respect to cl if S = cl(S). Given a closure operation cl on subsets of X, the collection K ⊆ P(X) of closed sets with respect to cl is closed under set intersection:

D ⊆ K ⇒ ⋂D ∈ K  (by a convention ⋂D = X if D = ∅).

Every collection K ⊆ P(X) satisfying this requirement is called a closure system of subsets of X. The correspondence between a closure operation cl and a closure system K is one-to-one since the formula

cl_K(S) = ⋂ {T ∈ K ; S ⊆ T}  for S ⊆ X,

defines a closure operation on subsets of X having K as the collection of closed sets with respect to cl_K (see Theorem 1 in Ganter and Wille [42] or § V.1 of Birkhoff [10]). The poset (K, ⊆) is then a complete lattice in which

sup D = cl_K(⋃D)  and  inf D = ⋂D  for every D ⊆ K.

Every complete lattice is lattice isomorphic to a lattice of this type – see Proposition 3 in Chapter 1 of Ganter and Wille [42].
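The passage from a closure system to its closure operation is short in code; the following Python sketch (mine, for illustration) implements cl_K directly from the displayed formula.

    def closure_from_system(K, X):
        # cl_K(S): intersection of all T in the closure system K with S ⊆ T;
        # by the convention above, the empty intersection equals X itself
        X = frozenset(X)
        def cl(S):
            S = frozenset(S)
            out = X
            for T in K:
                if S <= T:
                    out = out & T
            return out
        return cl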
A.3 Graphs

A graph is specified by a non-empty finite set of nodes N and by a set of edges consisting of pairs of distinct elements taken from N. Several types of edges are mentioned in this book, but classic graphs admit only two basic types of edges. An undirected edge or a line over N is an unordered pair {a, b} where a, b ∈ N, a ≠ b, that is, a two-element subset of N. A directed edge or an arrow over N is an ordered pair (a, b) where a, b ∈ N, a ≠ b. Pictorial
representation is clear: nodes are represented by small circles and edges by the corresponding links between them. Note that the explicit requirement a ≠ b excludes the occurrence of a loop, that is, an edge connecting a node with itself (but loops are allowed in some non-classic graphs). A graph with mixed edges over (a set of nodes) N is given by a set of lines L over N and by a set of arrows A over N. Supposing G = (N, L, A) is a graph of this kind, one writes "a – b in G" in the case {a, b} ∈ L and says that there exists a line between a and b in G. Similarly, in the case (a, b) ∈ A we say that there exists an arrow from a to b in G and write "a → b in G" or "b ← a in G". Pictorial representation naturally reflects notation in both cases. Two examples of graphs with mixed edges are given in Figure A.1.
[Figure not reproduced: two graphs with mixed edges; the right-hand graph is an induced subgraph of the left-hand graph.]
Fig. A.1. Two graphs with mixed edges.
If either a – b in G, a → b in G or a ← b in G, then one briefly says that [a, b] is an edge in G. Note explicitly that this definition still allows, for a pair of distinct nodes a, b ∈ N, that each of a – b, a → b and a ← b is simultaneously an edge in G. If ∅ ≠ T ⊆ N, then the induced subgraph of G for T is the graph G_T = (T, L_T, A_T) over T where L_T (A_T) is the set of those lines (arrows) over T which are also in L (in A). A hybrid graph over N is a graph with mixed edges which has no multiple edges. That means, for an ordered pair of distinct nodes (a, b), a, b ∈ N, at most one of the three above-mentioned options can occur. An example of a hybrid graph is given on the right-hand side of Figure A.1. It is an induced subgraph of the graph on the left-hand side of Figure A.1.

A route from a node a to a node b (or between nodes a and b) in a graph G with mixed edges is a sequence of nodes c_1, …, c_n ∈ N, n ≥ 1, together with a sequence of edges ε_1, …, ε_{n−1} ∈ L ∪ A (possibly empty in the case n = 1) such that a = c_1, b = c_n and ε_i is either c_i – c_{i+1}, c_i → c_{i+1} or c_i ← c_{i+1} for i = 1, …, n − 1. A route is called undirected if ε_i is c_i – c_{i+1} for i = 1, …, n − 1, descending if ε_i is either c_i – c_{i+1} or c_i → c_{i+1} for i = 1, …, n − 1, and strictly descending if n ≥ 2 and ε_i is c_i → c_{i+1} for i = 1, …, n − 1. In particular, every undirected route is a descending route. A path is a route in which all nodes c_1, …, c_n are distinct. The adjectives
undirected and (strictly) descending are used for paths as well. A cycle is a route where n ≥ 3, c_1 = c_n and c_1, …, c_{n−1} are distinct such that, in the case n = 3, ε_2 is not a reverse copy of ε_1 (this implies that a – b – a, a → b ← a and a ← b → a are not cycles while a – b → a and a → b → a are supposed to be cycles). A directed cycle is a cycle which is a descending route in which ε_i is c_i → c_{i+1} at least once. An undirected cycle is a cycle which is an undirected route, that is, it consists of lines.

Example A.1. The sequence j ← i ← h – a → e ← a → d – i is an example of a route in the graph G shown in the left-hand picture of Figure A.1. An example of a descending path in G is h → i – d → g; the path a → e → g is strictly descending. There are several cycles in G: i → j → k – i is directed and d – i – k – d is undirected. ♦

A node a is a parent of a node b in G, or b is a child of a, if a → b in G; a is an ancestor of b in G, and dually b is a descendant of a, if there exists a descending route (equivalently, a descending path) from a to b in G. The set of parents of a node b in G will be denoted by pa_G(b). Supposing A ⊆ N, the symbol an_G(A) will denote the set of ancestors of the nodes of A in G. Analogously, a is a strict ancestor of b (b is a strict descendant of a) if there exists a strictly descending route from a to b. Similarly, a is connected to b in G if there exists an undirected route (equivalently, an undirected path) between a and b. Clearly, the relation "to be connected" is an equivalence relation which decomposes N into equivalence classes, named connectivity components.

Example A.2. The set of parents of the node e in the graph H shown on the right-hand side of Figure A.1 consists of two nodes: pa_H(e) = {a, c}. The node a is an ancestor of the node f in H: a ∈ an_H(f). It is not a strict ancestor; an example of a strict descendant of a in H is the node g. The graph H has 4 connectivity components: {a, b, c}, {e, f}, {k} and {g}. ♦

An undirected graph is a graph containing only lines (that is, A = ∅); a directed graph is a graph containing only arrows (that is, L = ∅). The underlying graph H of a graph with mixed edges G is the undirected graph H over N such that a – b in H iff [a, b] is an edge in G. An undirected graph is connected if every pair of its nodes is connected, that is, if N is the only connectivity component. A forest is an acyclic undirected graph, that is, an undirected graph without (undirected) cycles. A tree is a forest which is connected. A set A ⊆ N in an undirected graph H is complete if a – b for every a, b ∈ A, a ≠ b; a clique of H is a maximal complete subset of N. An undirected graph H over N is complete if N is a complete set in H.

An acyclic directed graph over N is a directed graph over N without directed cycles. It can be equivalently introduced as a directed graph G whose nodes can be ordered in a sequence a_1, …, a_k, k ≥ 1, such that if [a_i, a_j] is an edge in G for i < j, then a_i → a_j in G. Then we say that the total ordering a_1, …, a_k is consonant with the direction of arrows. In particular, every acyclic directed
graph G has at least one terminal node, that is, a node which has no child in G. A chain for a hybrid graph G over N is a partition of N into an ordered sequence of non-empty disjoint subsets B_1, …, B_n, n ≥ 1, called blocks, such that
• if [a, b] is an edge in G with a, b ∈ B_i then a – b, and
• if [a, b] is an edge in G with a ∈ B_i, b ∈ B_j, i < j, then a → b.
A chain graph is a hybrid graph which admits a chain. It can be equivalently introduced as a hybrid graph without directed cycles (see Lemma 2.1 in [139]). Clearly, every undirected graph and every acyclic directed graph is a chain graph. An example of a chain graph which is neither undirected nor directed is given on the right-hand side of Figure A.1. The graph has a chain with 3 blocks: B_1 = {a, b, c}, B_2 = {e, f} and B_3 = {k, g}. Note that various other types of edges are used in advanced graphical approaches (see Section 3.5), e.g., bidirected edges, dashed lines, dashed arrows or even loops. From a purely mathematical point of view these edges can also be introduced as either ordered or unordered pairs of nodes, but their meaning is different. Thus, due to their different interpretation they have to be carefully distinguished from the above-mentioned "classic" edges. However, most of the concepts introduced in Section A.3 can be naturally extended to the graphs allowing edges of additional types.
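The characterization of acyclic directed graphs by an ordering consonant with the arrows is also the usual way to test acyclicity. The following Python sketch (an illustration using Kahn's algorithm, not a construction from the book) attempts to build such an ordering and reports failure when a directed cycle exists.

    def consonant_ordering(nodes, arrows):
        # order the nodes so that every arrow (a, b) goes forward;
        # such an ordering exists iff the directed graph is acyclic
        indeg = {n: 0 for n in nodes}
        for a, b in arrows:
            indeg[b] += 1
        order = []
        ready = [n for n in nodes if indeg[n] == 0]
        while ready:
            n = ready.pop()
            order.append(n)
            for a, b in arrows:
                if a == n:
                    indeg[b] -= 1
                    if indeg[b] == 0:
                        ready.append(b)
        return order if len(order) == len(nodes) else None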
A.4 Topological concepts

A metric space (X, ρ) is a non-empty set X endowed with a distance ρ, which is a non-negative real function ρ : X × X → [0, ∞) such that ∀ x, y, z ∈ X
(i) ρ(x, y) = 0 iff x = y,
(ii) ρ(x, y) = ρ(y, x),
(iii) ρ(x, z) ≤ ρ(x, y) + ρ(y, z).
A set G ⊆ X is called open in (X, ρ) if, for every x ∈ G, there exists ε > 0 such that the open ball U(x, ε) ≡ {y ∈ X ; ρ(x, y) < ε} with the center x and the radius ε belongs to G. We write Uρ(x, ε) if we want to make the distance ρ explicit. A set F ⊆ X is closed if its complement X \ F is open. A metric space is separable if it has a countable dense set, that is, a set S ⊆ X such that ∀ x ∈ X ∀ ε > 0 there exists y ∈ S ∩ U(x, ε). A metric space is complete if every Cauchy sequence x1, x2, … of elements of X, that is, a sequence satisfying ∀ ε > 0 ∃ n ∈ N such that ∀ k, l ≥ n ρ(xk, xl) < ε, converges to an element x ∈ X, that is, ∀ ε > 0 ∃ n ∈ N such that for every k ≥ n one has ρ(xk, x) < ε.
A classic example of a separable complete metric space is an arbitrary non-empty finite set X endowed with the discrete distance δ defined as follows: δ(x, y) = 0 if x = y, and δ(x, y) = 1 otherwise.
Another common example is the set of n-dimensional real vectors Rn, n ≥ 1 endowed with the Euclidean distance

ρ(x, y) = √( Σ_{i=1}^n (xi − yi)² )  for x = [x1, …, xn], y = [y1, …, yn].
The set of real numbers R with ρ(x, y) = |x − y| is a special case of that. A topological space (X, τ) is a non-empty set X endowed with a topology τ, which is a class of subsets of X closed under finite intersection, arbitrary union, and involving both the empty set ∅ and X itself. Every metric space (X, ρ) is an example of a topological space because the class of open sets in (X, ρ) is a topology. A topological space of this kind is called metrizable and its topology is induced by the distance ρ. For instance, the set of real numbers R is often automatically understood as a topological space endowed with the Euclidean topology induced by the Euclidean distance. A product of topological spaces (X1, τ1) and (X2, τ2) is the Cartesian product X1 × X2 endowed with the product topology, that is, the class of sets G ⊆ X1 × X2 such that ∀ (x1, x2) ∈ G there exist G1 ∈ τ1, G2 ∈ τ2 with (x1, x2) ∈ G1 × G2 ⊆ G. A product ∏_{i∈N} (Xi, τi) of any finite collection (Xi, τi), i ∈ N, |N| ≥ 2 of topological spaces is defined analogously. For example, Rn (n ≥ 2) endowed with the topology induced by the Euclidean distance can be viewed as a product of topological spaces Xi = R, i ∈ {1, …, n}. A real function f : X → R on a topological space (X, τ) is continuous if {x ∈ X; f(x) < r} belongs to τ for every r ∈ R.
A.5 Finite-dimensional subspaces and convex cones

Throughout this section the set of n-dimensional real vectors Rn, n ≥ 1 is fixed. It is a topological space endowed with the Euclidean topology. Given x, y ∈ Rn and α ∈ R, the sum of vectors x + y ∈ Rn and the scalar multiple α·x ∈ Rn are defined componentwise. The symbol 0 denotes the zero vector which has 0 as all its components. Given A ⊆ Rn, the symbol −A denotes the set {−x ; x ∈ A} where −x denotes the scalar multiple (−1)·x. The scalar product of two vectors x = [xi]_{i=1}^n and y = [yi]_{i=1}^n is the number ⟨x, y⟩ = Σ_{i=1}^n xi·yi.

A.5.1 Linear subspaces

A set L ⊆ Rn is a linear subspace if 0 ∈ L and L is closed under linear combinations, that is,

∀ x, y ∈ L ∀ α, β ∈ R   α·x + β·y ∈ L.
Every linear subspace of Rn is a closed set with respect to the Euclidean topology. A set A ⊆ Rn linearly generates a subspace L ⊆ Rn if every element of L is a linear combination of elements of A, that is,

∀ x ∈ L ∃ B ⊆ A finite such that x = Σ_{y∈B} αy·y for some αy ∈ R, y ∈ B.

By a convention, 0 is an empty linear combination, which means that ∅ linearly generates the subspace {0}. A finite set A ⊆ Rn is linearly independent if

∀ αy ∈ R, y ∈ A   Σ_{y∈A} αy·y = 0 ⇒ [αy = 0 for every y ∈ A].
In particular, a set containing 0 is never linearly independent. A linear basis of a subspace L ⊆ Rn is any finite linearly independent set A ⊆ L which (linearly) generates L. Every linear subspace L ⊆ Rn has a basis, which is possibly empty in the case L = {0}. Different bases of L have the same number of elements, called the dimension of L (see Theorem 1 in § 8 of Halmos [50]). The dimension is a number between 0 (for L = {0}) and n (for L = Rn). One says that a subspace L ⊆ Rn is a direct sum of subspaces L1, L2 ⊆ Rn and writes L = L1 ⊕ L2 if L1 ∩ L2 = {0}, L1 ⊆ L, L2 ⊆ L and L1 ∪ L2 generates L. Then every x ∈ L can be written in the form x = y + z where y ∈ L1, z ∈ L2 and this decomposition of x is unique. Moreover, the dimension of L is the sum of the dimensions of L1 and L2. An orthogonal complement of a set A ⊆ Rn is the set

A⊥ = {x ∈ Rn ; ⟨x, y⟩ = 0 for every y ∈ A}.

It is always a linear subspace. Moreover, for every linear subspace L ⊆ Rn one has Rn = L ⊕ L⊥ and L = (L⊥)⊥. By an affine subspace in Rn we understand a set A ⊆ Rn of the form {x} + L, that is, the set {z ; z = x + y, y ∈ L}, where x ∈ Rn and L is a linear subspace of Rn. The linear subspace L is then determined uniquely and the dimension of A is defined as the dimension of L. An affine subspace generated by a set G is the least affine subspace of Rn containing G.

A.5.2 Convex sets and cones

A set S ⊆ Rn, n ≥ 1 is called convex if ∀ x, y ∈ S and α ∈ [0, 1] one has α·x + (1 − α)·y ∈ S. An example of a convex set in R is the interval [0, ∞). Another example is a convex cone. A set K ⊆ Rn is a convex cone if 0 ∈ K and K is closed under conical combinations, that is,

∀ x, y ∈ K ∀ α, β ≥ 0   α·x + β·y ∈ K.

By a closed convex cone we understand a convex cone which is a closed set with respect to the Euclidean topology on Rn. A linear subspace is an example
of a closed convex cone. Another example is the dual cone A* to a set A ⊆ Rn defined by

A* = {y ∈ Rn ; ⟨x, y⟩ ≥ 0 for every x ∈ A}.

This is a general example, as K ⊆ Rn is a closed convex cone iff K = A* for some A ⊆ Rn (see Consequence 1 in Studený [134]). Another example of a convex cone is the conical closure con(B) of a non-empty set ∅ ≠ B ⊆ Rn (con(∅) = {0} by a convention):

con(B) = {x ∈ Rn ; x = Σ_{z∈C} αz·z for some αz ≥ 0 and finite ∅ ≠ C ⊆ B}.
Note that if B ⊆ Rn is finite then con(B) = B**, which implies that it is a closed convex cone (see Fact 6 and Proposition 1 in [134]). Any cone con(B) with finite B is called a polyhedral cone; by a rational polyhedral cone we understand the conical closure of a finite set of rational vectors B ⊆ Qn. A basic fact is that a set K ⊆ Rn is a polyhedral cone iff K = A* for a finite A ⊆ Rn (cf. Corollary 7.1a in Schrijver [113]). Analogously, K is a rational polyhedral cone iff K = A* for a finite set of rational vectors A ⊆ Qn (see Proposition 5 in Studený [134]). Note that these facts can be viewed as consequences (or analogs) of a well-known result from convex analysis saying that polytopes coincide with bounded polyhedra.
(Figure: two pictures in the (x1, x2)-plane showing the vectors [−2, 0] and [1, 1] together with the respective cones.)
Fig. A.2. Two rational cones in R².
Example A.3. To illustrate the concept of a dual cone and the concept of conical closure consider the set A = {[−2, 0], [1, 1]} ⊆ R². The conical closure of A has the form con(A) = {[x1, x2] ; x2 ≥ max{x1, 0}} and is shown on the left-hand side of Figure A.2. It is a rational polyhedral cone owing to its definition. The dual cone to A has the form

A* = {[x1, x2] ; −2x1 ≥ 0 and x1 + x2 ≥ 0} = {[x1, x2] ; −x2 ≤ x1 ≤ 0}

and is shown on the right-hand side of Figure A.2. It is also a rational polyhedral cone. ♦
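Both characterizations in Example A.3 can be checked numerically; the following is a minimal sketch (assuming Python with NumPy and SciPy available; the helper names are hypothetical). Membership in con(A) is decided by a non-negative least-squares fit, membership in A* by checking the generators only, which suffices by linearity.

    import numpy as np

    A = np.array([[-2.0, 0.0], [1.0, 1.0]])   # generators from Example A.3

    def in_conical_closure(x, gens):
        """x lies in con(gens) iff a non-negative combination fits exactly."""
        from scipy.optimize import nnls
        coef, resid = nnls(gens.T, x)
        return resid < 1e-9

    def in_dual_cone(y, gens):
        """<x, y> >= 0 for every x in the cone; enough to check generators."""
        return bool(np.all(gens @ y >= -1e-9))

    print(in_conical_closure(np.array([0.0, 1.0]), A))   # True:  x2 >= max{x1, 0}
    print(in_conical_closure(np.array([1.0, 0.5]), A))   # False
    print(in_dual_cone(np.array([-1.0, 2.0]), A))        # True:  -x2 <= x1 <= 0
    print(in_dual_cone(np.array([1.0, 1.0]), A))         # False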
A closed cone K ⊆ Rn is pointed if K ∩ (−K) = {0}. An (apparently stronger) equivalent definition says that a closed cone K is pointed iff there exists y ∈ Rn such that ⟨x, y⟩ > 0 for every x ∈ K \ {0} (see Proposition 2 in Studený [134]). Note that both cones shown in Figure A.2 are pointed. By a ray generated by a non-zero vector 0 ≠ x ∈ Rn we understand the set Rx = {α·x ; α ≥ 0}. Clearly, every cone contains a whole ray R together with any of its non-zero vectors 0 ≠ x ∈ R, which then necessarily generates R. Given a closed convex cone K ⊆ Rn, a ray R ⊆ K is called extreme (in K) if

∀ x, y ∈ K   x + y ∈ R implies x, y ∈ R.

A closed cone K has extreme rays iff it is pointed and contains a non-zero vector 0 ≠ x ∈ K (see Proposition 4 in [134]). Moreover, every pointed closed convex cone K ⊆ Rn is a conical combination of its extreme rays; more exactly, K = con(B) for every B ⊆ K such that B ∩ (R \ {0}) ≠ ∅ for each extreme ray R ⊆ K. Note that this fact can be viewed as a consequence of the well-known Krein-Milman theorem for bounded closed convex sets (see Proposition 4 in [134]). A pointed closed cone is a polyhedral cone iff it has finitely many extreme rays. Moreover, it is a rational polyhedral cone iff it has finitely many extreme rays and each of them is generated by a rational vector (see Consequence 5 in [134]). Basic results of Section 5.2 are based on the following special property of pointed rational polyhedral cones (see Consequence 8 in [134]).

Lemma A.1. Let K ⊆ Rn be a pointed rational polyhedral cone and R be an extreme ray of K. Then there exists q ∈ Qn such that ⟨q, x⟩ = 0 for any x ∈ R and ⟨q, y⟩ > 0 whenever 0 ≠ y belongs to another (extreme) ray of K.

Another useful fact is that every conical combination of integral vectors which has integral components is necessarily a rational conical combination (see Lemma 10 in [134]).

Lemma A.2. Supposing B ⊆ Zn, every x ∈ con(B) ∩ Zn has the form x = Σ_{y∈C} αy·y where C ⊆ B is finite and αy ∈ Q, αy ≥ 0 for every y ∈ C.

Let us call a face of a polyhedral cone K a convex cone F ⊆ K such that

∀ x, y ∈ K   x + y ∈ F implies x, y ∈ F.

This is a modification of a usual definition of a face of a closed convex set from Brøndsted [16]; these definitions coincide for non-empty subsets F of a polyhedral cone K. One can show that a face of a polyhedral cone is a polyhedral cone (cf. Consequence 8.4 in [16]). Examples of faces of a pointed polyhedral cone K are its extreme rays, the set {0} and K itself. Note that a different definition of a face was given by Schrijver [113]. However, one can show using Theorem 7.5 in Brøndsted [16] that F is a face of a pointed polyhedral cone K iff it has the form F = {x ∈ K ; ⟨x, z⟩ = 0} where z ∈ Rn satisfies K ∩ {x ∈ Rn ; ⟨x, z⟩ < 0} = ∅. Using this observation one can show that for a pointed polyhedral cone K the given definition of a face is equivalent to the one from § 8.3 in Schrijver [113].
A.6 Measure-theoretical concepts

A measurable space (X, X) is a non-empty set X endowed with a σ-algebra X (= sigma-algebra) over X, which is a class of subsets of X involving X itself and closed under complement and countable union. Given a class A of subsets of X, the least σ-algebra over X containing A, i.e., the intersection of all σ-algebras containing A, is called the σ-algebra generated by A and denoted by σ(A). In particular, if (X, τ) is a topological space, then the σ-algebra generated by its topology τ is the Borel σ-algebra or the σ-algebra of Borel sets. Given a measurable space (X, X), the class of all σ-algebras S ⊆ X, ordered by inclusion, is a lattice. Indeed, for σ-algebras S, T ⊆ X, their infimum S ∧ T is simply the intersection S ∩ T while their supremum S ∨ T is the σ-algebra generated by S ∪ T. Actually, it is a complete lattice since every collection of σ-algebras has an infimum, namely their intersection. The null element of the lattice is the trivial σ-algebra over X, that is, the class {∅, X}. In other words, it is the σ-algebra generated by the empty class. The unit element of the lattice is the σ-algebra X.

The product of measurable spaces (X1, X1) and (X2, X2) is the Cartesian product X1 × X2 endowed with the product σ-algebra X1 × X2, which is generated by measurable rectangles, that is, by sets of the form A × B where A ∈ X1 and B ∈ X2. The product (∏_{i∈N} Xi, ∏_{i∈N} Xi) of an arbitrary finite collection of measurable spaces (Xi, Xi), i ∈ N where |N| ≥ 2 is defined analogously. If, moreover, A ⊆ N then the respective coordinate σ-algebra for A is a product of σ-algebras Yi, i ∈ N where Yi = Xi for i ∈ A and Yi = {∅, Xi} for i ∈ N \ A.

A real function f : X → R on a measurable space (X, X) is measurable (sometimes one writes X-measurable) if {x ∈ X ; f(x) < r} belongs to X for every r ∈ R. A typical example is the indicator χA of a set A ∈ X defined as follows: χA(x) = 1 if x ∈ A, and χA(x) = 0 if x ∈ X \ A. Given a real measurable function f : X → R, its positive part f⁺ and negative part f⁻ are non-negative measurable functions defined by

f⁺(x) = max{f(x), 0},  f⁻(x) = max{−f(x), 0}  for x ∈ X,

and one has f = f⁺ − f⁻ and |f| = f⁺ + f⁻. A well-known auxiliary result in probability theory (see Theorem 2 in § II.2 of Shiryayev [118] or § I.1.5 of Štěpán [126]) is as follows.

Lemma A.3. Let X be a non-empty set, G and H classes of subsets of X such that G ⊆ H and G is closed under finite set intersection, that is, A, B ∈ G implies A ∩ B ∈ G. Moreover, assume that X ∈ H and H is closed under proper set difference and monotone countable union:

A, A′ ∈ H, A ⊆ A′ ⇒ A′ \ A ∈ H,
An ∈ H, An ⊆ An+1 for n = 1, 2, … ⇒ ⋃_{n=1}^∞ An ∈ H.   (A.1)

Then σ(G) ⊆ H.
The above fact is often tacitly used in probability theory to prove that a certain property is valid for a wider class of sets. Some authors call results of this type monotone class theorems – see § 0.2.4 in Florens et al. [37].

A.6.1 Measure and integral

A non-negative measure on a measurable space (X, X) is a function µ defined on X, taking values in the interval [0, ∞] (infinite values are allowed), which satisfies µ(∅) = 0 and is countably additive, that is, the equality

µ( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ µ(Ai)

holds for every countable collection of pairwise disjoint sets A1, A2, … in X. It is a finite measure if µ(X) < ∞ and a σ-finite measure if there exists a sequence B1, B2, … of sets in X such that X = ⋃_{i=1}^∞ Bi and µ(Bi) < ∞ for every i ∈ N. A trivial example of a finite measure is a non-empty finite set X endowed with the counting measure υ on (X, P(X)) defined by υ(A) = |A| for every A ⊆ X. A classic example of a σ-finite measure is the Lebesgue measure on Rn, n ≥ 1, endowed with the Borel σ-algebra Bn. This measure can be introduced as the only non-negative measure λ on (Rn, Bn) ascribing to every n-dimensional interval its volume, that is,

λ( ∏_{i=1}^n [ri, si) ) = ∏_{i=1}^n (si − ri)  whenever ri, si ∈ R, ri < si, i = 1, …, n.
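As an aside, the σ-algebra σ(A) generated by a class of sets, introduced at the beginning of this section, can be computed by brute force when X is finite, since every σ-algebra over a finite set is finite. The following minimal sketch (plain Python, illustrative names only) iterates the closure under complement and union to a fixed point.

    def generated_sigma_algebra(X, generators):
        """Closure of `generators` under complement and union on a finite X."""
        X = frozenset(X)
        sigma = {frozenset(), X} | {frozenset(g) for g in generators}
        changed = True
        while changed:
            changed = False
            current = list(sigma)
            for a in current:
                comp = X - a
                if comp not in sigma:
                    sigma.add(comp); changed = True
                for b in current:
                    u = a | b
                    if u not in sigma:
                        sigma.add(u); changed = True
        return sigma

    # sigma-algebra on {1,2,3,4} generated by {{1}, {1,2}} has atoms {1},{2},{3,4}:
    for s in sorted(generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}]),
                    key=lambda s: (len(s), sorted(s))):
        print(sorted(s))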
A probability measure is a non-negative measure µ satisfying µ(X) = 1. It is concentrated on a set B ∈ X if µ(B) = 1 or equivalently µ(X \ B) = 0. Two real measurable functions f and g on (X, X) are equal µ-almost everywhere if µ({x ∈ X; f(x) ≠ g(x)}) = 0. Then one writes f = g µ-a.e. Clearly, it is an equivalence relation. By a measurable mapping of a measurable space (X, X) into a measurable space (Y, Y) a mapping t : X → Y is understood such that, for every B ∈ Y, the set t⁻¹(B) ≡ {x ∈ X; t(x) ∈ B} belongs to X. Note that a measurable function is a special case of this: if Y is R endowed with the Borel σ-algebra. Every probability measure P on (X, X) then induces through t a probability measure Q on (Y, Y) defined by Q(B) = P(t⁻¹(B)) for every B ∈ Y. Two measurable spaces (X, X) and (Y, Y) are isomorphic if there exists a one-to-one mapping ς : X → Y which is onto Y and preserves countable union and complement. Then A ⊆ B implies ς(A) ⊆ ς(B) for A, B ∈ X, ς(X) = Y, ς(∅) = ∅ and countable intersection is also preserved by ς. The inverse mapping preserves these operations as well, which implies that ς is an order-isomorphism of the poset (X, ⊆) and the poset (Y, ⊆). It is easy to see that each measure µ on (X, X) corresponds to a measure ν on (Y, Y) given by
ν(B) = µ(ς⁻¹(B))  for B ∈ Y,
and conversely. For example, given measurable spaces (X1, X1) and (X2, X2), the space (X1, X1) is isomorphic to (X1 × X2, X̄1) endowed with the σ-algebra X̄1 ≡ {A × X2 ; A ∈ X1} ⊆ X1 × X2.

The concept of integral is understood in the sense of Lebesgue. Given a non-negative measure µ on (X, X) this construction (described, for example, in Rudin [111], Chapter 1 or in § II.6 of Shiryayev [118]) assigns a value ∫_A f(x) dµ(x) from [0, ∞], called the integral of f over A with respect to µ, to every non-negative measurable function f and arbitrary A ∈ X (f can only be defined on A). A real measurable function f on (X, X) is called µ-integrable if the integral of its absolute value ∫_X |f(x)| dµ(x) is finite. A finite integral ∫_A f(x) dµ(x), i.e., a real number, is then defined for every µ-integrable function f and A ∈ X. Note that supposing f is µ-integrable and g is an X-measurable function on X one has f = g µ-a.e. iff ∫_A f(x) dµ(x) = ∫_A g(x) dµ(x) for every A ∈ X (and g is µ-integrable in both cases) – see § 1.35 and § 1.39 in [111]. Remark that in the case (X, X) = (∏_{i∈N} Xi, ∏_{i∈N} Xi) this is equivalent to an apparently weaker requirement that the equality of integrals only holds for every measurable rectangle A. This follows from the fact that every two finite non-negative measures on (∏_{i∈N} Xi, ∏_{i∈N} Xi) must coincide if they equal each other on all measurable rectangles. This fact is a consequence of Lemma A.3 – the class of measurable rectangles is closed under finite set intersection and the class of measurable sets on which the measures coincide satisfies (A.1).

Sometimes, one needs to introduce the (possibly infinite) integral even for a non-integrable real measurable function f : X → R by the formula

∫_X f(x) dµ(x) = ∫_X f⁺(x) dµ(x) − ∫_X f⁻(x) dµ(x),

provided that at least one of the integrals on the right-hand side is finite. Then one says that f is µ-quasi-integrable and the integral ∫_X f(x) dµ(x) is defined as a value in the interval [−∞, +∞]. The reader is referred to Rudin [111], Chapter 1 for elementary properties of the Lebesgue integral.

A.6.2 Basic measure-theoretical results

Supposing µ and ν are measures on (X, X) one says that ν is absolutely continuous with respect to µ and writes ν ≪ µ if µ(A) = 0 implies ν(A) = 0 for every A ∈ X. A basic measure-theoretical result is the Radon-Nikodym theorem (see Rudin [111], § 6.9 and 6.10).

Theorem A.1. Supposing ν is a finite measure and µ a σ-finite measure on (X, X) such that ν ≪ µ, there exists a non-negative µ-integrable function f, called the Radon-Nikodym derivative of ν with respect to µ, such that
ν(A) = ∫_A f(x) dµ(x)  for every A ∈ X.
Moreover, one can show (using Theorem 1.29 in [111]) that, for every X-measurable function g on X, g is ν-integrable iff g·f is µ-integrable and

∫_A g(x) dν(x) = ∫_A g(x)·f(x) dµ(x)  for every A ∈ X.
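On a finite space the Radon-Nikodym derivative is simply a pointwise ratio of masses, and the defining property of Theorem A.1 can be verified directly. A minimal sketch (plain Python; the function name is illustrative only):

    def radon_nikodym(nu, mu):
        """Discrete Radon-Nikodym derivative f = d(nu)/d(mu).

        `nu` and `mu` map points of a finite space X to masses; nu must
        be absolutely continuous w.r.t. mu.
        """
        f = {}
        for x, m in mu.items():
            if m > 0:
                f[x] = nu.get(x, 0.0) / m
            elif nu.get(x, 0.0) > 0:
                raise ValueError("nu is not absolutely continuous w.r.t. mu")
            else:
                f[x] = 0.0   # value on a mu-null set is arbitrary (mu-a.e. uniqueness)
        return f

    mu = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    nu = {'a': 0.1, 'b': 0.9, 'c': 0.0}
    f = radon_nikodym(nu, mu)
    # check nu(A) equals the integral of f over A w.r.t. mu for A = {'a','b'}:
    print(sum(f[x] * mu[x] for x in ['a', 'b']), nu['a'] + nu['b'])  # 1.0 1.0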
According to a note in Section A.6.1, the Radon-Nikodym derivative is determined uniquely only within equivalence µ-a.e. One writes f = dν/dµ to denote that a non-negative X -measurable function f is (a version of) the RadonNikodym derivative of ν with respect to µ. The product of σ-finite measures µ1 on (X1 , X1 ) and µ2 on (X2 , X2 ) is the unique measure µ1 ×µ2 on (X1 ×X2 , X1 ×X2 ) defined on measurable rectangles as follows: (µ1 × µ2 ) (A × B) = µ1 (A) · µ2 (B)
whenever A ∈ X1 , B ∈ X2 .
The reader is referred to [111] (§ 7.6, 7.7 and Exercise 7 in Chapter 5) for the proof of the existence and the uniqueness of the (necessarily σ-finite) product measure µ1 × µ2. The product ∏_{i∈N} µi of finitely many σ-finite measures, |N| ≥ 2, can be introduced analogously. Another basic measure-theoretical result is the Fubini theorem (see Rudin [111], § 7.8).

Theorem A.2. Let µ1 be a σ-finite measure on (X1, X1) and µ2 a σ-finite measure on (X2, X2). Suppose that f is a non-negative X1 × X2-measurable function on X1 × X2. Then the function x1 ↦ ∫_{X2} f(x1, x2) dµ2(x2) is X1-measurable, the function x2 ↦ ∫_{X1} f(x1, x2) dµ1(x1) is X2-measurable and one has

∫_{X1×X2} f(x1, x2) d(µ1 × µ2)([x1, x2]) = ∫_{X1} ( ∫_{X2} f(x1, x2) dµ2(x2) ) dµ1(x1) = ∫_{X2} ( ∫_{X1} f(x1, x2) dµ1(x1) ) dµ2(x2).
Whenever f is a µ1 × µ2 -integrable real function on X1 × X2 , the same conclusion holds with the proviso that the respective functions on Xi are defined µi -almost everywhere (i = 1, 2). Supposing µ is a non-negative measure on the product of measurable spaces (X1 × X2 , X1 × X2 ), the marginal measure ν on (X1 , X1 ) is defined as follows: ν(A) = µ(A × X2 ) for every A ∈ X1 .
Every X1-measurable function h on X1 can be viewed as an X1 × X2-measurable function on X1 × X2. Then h is ν-integrable iff it is µ-integrable and

∫_{X1×X2} h(x1) dµ([x1, x2]) = ∫_{X1} h(x1) dν(x1).
If P is a probability measure on the product (∏_{i∈N} Xi, ∏_{i∈N} Xi) of measurable spaces and ∅ ≠ A ⊆ N then its marginal measure on (∏_{i∈A} Xi, ∏_{i∈A} Xi) will be denoted by P^A. The operation P ↦ P^A is then the operation of "marginalizing".

A real function ϕ : S → R, where S ⊆ Rn, n ≥ 1 is a convex set, is called convex if for all x, y ∈ S and α ∈ [0, 1]

ϕ(α·x + (1 − α)·y) ≤ α·ϕ(x) + (1 − α)·ϕ(y),

and it is called concave if the converse inequality holds instead. Moreover, ϕ is called strictly convex, respectively strictly concave, if the inequality is strict whenever x ≠ y and α ∈ (0, 1). Another basic result is the Jensen inequality (one can modify the proof from Rudin [111], § 3.3).

Theorem A.3. Let µ be a probability measure on (X, X), f : X → [0, ∞) a µ-integrable function and ϕ : [0, ∞) → R a convex function. Then

ϕ( ∫_X f(x) dµ(x) ) ≤ ∫_X ϕ(f(x)) dµ(x).
In the case ϕ is strictly convex, the equality occurs if and only if f is constant µ-a.e., more exactly, if f(x) = k for µ-a.e. x ∈ X where k = ∫_X f(x) dµ(x).

A.6.3 Information-theoretical concepts

Suppose that P is a finite measure and µ a σ-finite measure on a measurable space (X, X). If P ≪ µ, choose a version of the Radon-Nikodym derivative dP/dµ, accept the convention 0·ln 0 ≡ 0 and introduce

H(P|µ) = ∫_X (dP/dµ)(x) · ln( (dP/dµ)(x) ) dµ(x).   (A.2)

Provided that the function (dP/dµ)·ln(dP/dµ) is µ-quasi-integrable, let us call the integral the relative entropy of P with respect to µ. Of course, the quasi-integrability and the value of H(P|µ) do not depend on the choice of a version of dP/dµ. It follows from the definition of the Radon-Nikodym derivative that H(P|µ) can be equivalently introduced as the integral

H(P|µ) = ∫_X ln( (dP/dµ)(x) ) dP(x),   (A.3)
provided that ln(dP/dµ) is P-quasi-integrable. Hence, the relative entropy of P with respect to µ is finite iff ln(dP/dµ) is P-integrable. Let us note that, in general, P ≪ µ does not imply the existence of the integral in (A.2) and if H(P|µ) is defined, then it can take any value in the interval [−∞, +∞]. However, when both P and µ are probability measures (and P ≪ µ), the existence of H(P|µ) is guaranteed and it can serve as a measure of similarity of P to µ.

Lemma A.4. Supposing P and µ are probability measures on (X, X) such that P ≪ µ, the relative entropy of P with respect to µ is defined and H(P|µ) ≥ 0. Moreover, H(P|µ) = 0 iff P = µ.

Proof. Apply the Jensen inequality to the case f = (dP/dµ) and ϕ(r) = r·ln r for r > 0, ϕ(0) = 0. Since P is a probability measure, ∫_X f(x) dµ(x) = 1 and ϕ(1) = 0 gives the lower estimate. Moreover, since ϕ is strictly convex, H(P|µ) = 0 iff f(x) = 1 for µ-a.e. x ∈ X, which is equivalent to the requirement P = µ.

Note that, as concerns probability measures P and µ, the definition of the relative entropy is sometimes extended by a convention that provided P is not absolutely continuous with respect to µ one puts H(P|µ) = +∞. In particular, the assumption that H(P|µ) is finite includes the requirement P ≪ µ. Relative entropy is a theoretical basis for various numerical characteristics used in information theory, which is a mathematical theory of communication [25, 158]. For example, if X is finite, P a probability measure on it and υ the counting measure on X then the number −H(P|υ) is called the entropy (of P). This (non-negative) finite number is used as a measure of uncertainty in the respective information source – see § 2.2 in Yeung [158]. If X1, X2 are non-empty finite sets, P is a probability measure on X1 × X2 and µ = P1 × P2 is the product of its marginal measures (on Xi), then the relative entropy H(P|P1 × P2) is called mutual information (between P1 and P2). It can be interpreted as a measure of dependence between the respective information sources. Its natural generalization to any finite number of sources is named multiinformation – see Section 2.3.3. The inequality reported in Lemma A.4 is often named the information inequality – see Theorem 2.6.3 in Cover and Thomas [25]. One of its useful consequences is as follows.

Corollary A.1. Let ak ≥ 0, bk ≥ 0, k ∈ K be a non-empty finite collection of non-negative real numbers such that Σ_{k∈K} ak = Σ_{k∈K} bk = 1. Then

Σ_{k∈K} bk·ln ak ≤ Σ_{k∈K} bk·ln bk.   (A.4)
Proof. Without loss of generality assume bk > 0 for every k ∈ K: otherwise replace K by {k ∈ K : bk > 0}, which is possible owing to the convention
0·ln 0 ≡ 0. This implies −∞ < ln bk ≤ 0 and ln ak ≤ 0 for k ∈ K. If al = 0 for some l ∈ K then bl·ln al = −∞ means that the left-hand side of (A.4) is −∞. Thus, the inequality is strict in this case. Therefore, one can assume ak > 0 for every k ∈ K. Then put X = K, P({k}) = bk and µ({k}) = ak. One has P ≪ µ, which allows one to use Lemma A.4 to show (A.4).

A.6.4 Conditional probability

Supposing P is a probability measure on a measurable space (X, X) and A ⊆ X is a σ-algebra over X, the restriction of P to A will be denoted by P^A. Given B ∈ X, the conditional probability of B given A with respect to P is an A-measurable function h : X → [0, 1] such that

P(A ∩ B) = ∫_A h(x) dP(x)  for every A ∈ A.   (A.5)

One can use the Radon-Nikodym theorem with (X, A), µ = P^A, and ν(A) = P(A ∩ B) for A ∈ A, to show that a function h satisfying (A.5) exists and is determined uniquely within the equivalence P^A-a.e. Let us write h = P(B|A) to denote that an A-measurable function h : X → [0, 1] is (a version of) conditional probability of B given A. The following lemma recalls two equivalent definitions of conditional probability. One of them, the condition (W), is apparently weaker; the other one, the condition (S), is apparently stronger.

Lemma A.5. Let P be a probability measure on a measurable space (X, X), A ⊆ X be a σ-algebra, B ∈ X and h : X → [0, 1] an A-measurable function. Then h = P(B|A) iff one of the following two conditions holds:
(W) the equality in (A.5) holds for every A ∈ G, where G ⊆ A is a class of sets closed under finite intersection such that X ∈ G and σ(G) = A,
(S) for every non-negative A-measurable real function g : X → [0, ∞) and A ∈ A one has

∫_{A∩B} g(x) dP(x) = ∫_A g(x)·h(x) dP(x) ≡ ∫_A g(x)·h(x) dP^A(x).
Proof. I. The necessity of the condition (W) is evident. To show its sufficiency introduce the class H of sets A ∈ X such that P(A ∩ B) = ∫_A h(x) dP(x). Observe that H is closed under proper set difference and monotone countable union. The condition (W) implies X ∈ G ⊆ H. Thus, Lemma A.3 implies σ(G) ⊆ H. Since σ(G) = A by (W), this implies A ⊆ H, which gives (A.5).
II. The sufficiency of the condition (S) is evident: let g ≡ 1. To show its necessity let us introduce the class F of functions f : X → [0, ∞) such that

∫_B f(x) dP(x) = ∫_X f(x)·h(x) dP(x).   (A.6)
Basic properties of the Lebesgue integral (see Chapter 1 in Rudin [111]) imply that F is a convex cone closed under increasing pointwise convergence:

f, f′ ∈ F, α, β ≥ 0 ⇒ α·f + β·f′ ∈ F,
fn ∈ F, fn ≤ fn+1 for n = 1, 2, …, ∀ x ∈ X  lim_{n→∞} fn(x) ≡ f(x) < ∞ ⇒ f ∈ F.

Observe that (A.5) implies that {χE ; E ∈ A} ⊆ F. A well-known elementary fact used in the construction of the Lebesgue integral (see Theorem 1.17 in [111]) is that every non-negative A-measurable function f can be obtained as an increasing pointwise limit of functions which are non-negative finite combinations of indicators of sets in A. Thus, the observations above imply that F includes all non-negative A-measurable functions on X, which concludes the proof of (S) – it suffices to put f = χA·g.

It follows from the definition of conditional probability that whenever S ⊆ T ⊆ X are σ-algebras and B ∈ X then every S-measurable version of P(B|T) is a version of P(B|S). Sometimes it happens that a certain fact or a value of an expression does not depend on the choice of a version of conditional probability. In this case the symbol P(B|A) is used in the corresponding formula to substitute an arbitrary version of conditional probability of B given A (w.r.t. P).

Until now, the set B ∈ X has been fixed (within this section). Nevertheless, a wider understanding of the concept of conditional probability is often accepted. One can understand conditional probability as a mapping which ascribes to every set B of a certain σ-algebra B ⊆ X a version of P(B|A). Thus, it can then be viewed as a real function of two variables: of B ∈ B and of x ∈ X (while A plays the role of a parameter). Actually, this is how the concept of conditional probability is understood in this book. The main speciality is that we consider a probability measure P on the product of a finite collection of measurable spaces (∏_{i∈N} Xi, ∏_{i∈N} Xi) and coordinate σ-algebras on it – for details see Section 2.1, Remark 2.1.

Remark A.1. Having fixed P on (X, X) and a σ-algebra A ⊆ X, by a regular version of conditional probability given A we understand a function which ascribes to every B ∈ X a version of P(B|A) such that, for every x ∈ X, the mapping B ↦ P(B|A)(x) is a probability measure on (X, X). Note that this concept is taken from Loève [74], § 26.1 and that a regular version of conditional probability need not exist in general (e.g. Example VI.1.35 in Štěpán [126]). However, under certain topological assumptions, namely that X is a separable complete metric space and X is the class of Borel sets in X, its existence is guaranteed (see either [126], Theorem VI.1.21 or Neveu [96], Consequence of Theorem V.4.4).
A.7 Conditional independence in terms of σ-algebras

Let A, B, C ⊆ X be σ-algebras in a measurable space (X, X) and P be a probability measure on it. One can say that A is conditionally independent of B given C with respect to P and write A ⊥⊥ B | C [P] if, for every A ∈ A and B ∈ B, one has

P(A ∩ B|C)(x) = P(A|C)(x) · P(B|C)(x)  for P^C-a.e. x ∈ X.   (A.7)

Note that an apparently weaker equivalent formulation is as follows. It suffices to verify (A.7) only for A ∈ Ã and B ∈ B̃ where Ã ⊆ A, respectively B̃ ⊆ B, are classes closed under finite intersection such that σ(Ã) = A, respectively σ(B̃) = B. The proof of equivalence of these two definitions can again be done by standard extension arguments based on Lemma A.3.

A typical special case considered in this book is as follows: (X, X) is the product (∏_{i∈N} Xi, ∏_{i∈N} Xi) of a finite collection of measurable spaces (Xi, Xi), i ∈ N and A, B, C are coordinate σ-algebras for pairwise disjoint sets A, B, C ⊆ N. For details see Section 2.1 – note that the condition (2.1) there is nothing else than the condition (A.7) specialized to that case.

Lemma A.6. Under the above assumptions A ⊥⊥ B | C [P] occurs iff for every A ∈ A there exists a C-measurable version of P(A|B ∨ C).

Proof. To show the necessity of the condition, fix A ∈ A and choose a version f of P(A|C). Write, for every B ∈ B and C ∈ C, by the definition of P(A ∩ B|C) and the condition (A.7)

P(A ∩ B ∩ C) = ∫_C P(A ∩ B|C)(x) dP(x) = ∫_C P(A|C)(x)·P(B|C)(x) dP(x),

and continue by Lemma A.5, the condition (S) applied to P(B|C), and by the fact f = P(A|C):

∫_C P(A|C)(x)·P(B|C)(x) dP(x) = ∫_{B∩C} P(A|C)(x) dP(x) = ∫_{B∩C} f(x) dP(x).

Since the class G = {B ∩ C; B ∈ B, C ∈ C} is closed under finite intersection, X ∈ G and B ∨ C = σ(G), by Lemma A.5, the condition (W) applied to P(A|B ∨ C), conclude that f = P(A|B ∨ C).
To show the sufficiency of the condition in Lemma A.6, fix A ∈ A and B ∈ B. Take a C-measurable version f of P(A|B ∨ C) and observe f = P(A|C). Then write, by the definition of P(A ∩ B|C) and the fact f = P(A|B ∨ C), for every C ∈ C

∫_C P(A ∩ B|C)(x) dP^C(x) = P(A ∩ B ∩ C) = ∫_{B∩C} f(x) dP(x),

and continue by Lemma A.5, the condition (S) applied to P(B|C), and by the fact f = P(A|C):

∫_{B∩C} f(x) dP(x) = ∫_C f(x)·P(B|C)(x) dP^C(x) = ∫_C P(A|C)(x)·P(B|C)(x) dP^C(x).

Thus, the equality

∫_C P(A ∩ B|C)(x) dP(x) = ∫_C P(A|C)(x)·P(B|C)(x) dP^C(x)

has been verified for every C ∈ C, which implies (A.7).
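In the discrete special case the defining condition (A.7) can be verified directly on the joint probability table. The following minimal sketch (Python with NumPy, illustrative names only; not part of the formal development) checks the factorization of every conditional slice:

    import numpy as np

    def is_ci(p, tol=1e-12):
        """Check X1 _||_ X2 | X3 for a joint pmf p[i, j, k] over a finite space.

        Discrete version of (A.7): for every k with p(k) > 0 the
        conditional slice must factorize, p(i, j | k) = p(i | k) * p(j | k).
        """
        for k in range(p.shape[2]):
            pk = p[:, :, k].sum()
            if pk == 0:
                continue                       # the condition holds P^C-a.e.
            joint = p[:, :, k] / pk
            prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))
            if not np.allclose(joint, prod, atol=tol):
                return False
        return True

    # p(i, j, k) = p(k) p(i|k) p(j|k) is conditionally independent by construction:
    pk = np.array([0.3, 0.7])
    pik = np.array([[0.2, 0.8], [0.5, 0.5]])   # rows indexed by k
    pjk = np.array([[0.9, 0.1], [0.4, 0.6]])
    p = np.einsum('k,ki,kj->ijk', pk, pik, pjk)
    print(is_ci(p))                            # True
    p[0, 0, 0] += 0.01; p[1, 1, 0] -= 0.01     # perturbation destroys factorization
    print(is_ci(p))                            # False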
The next lemma describes basic properties of conditional independence in terms of σ-algebras.

Lemma A.7. Supposing P is a probability measure on (X, X) and A, B, C, D, E, F, G ⊆ X are σ-algebras, it holds
(i) B ⊆ C ⇒ A ⊥⊥ B | C [P],
(ii) A ⊥⊥ B | C [P] ⇒ B ⊥⊥ A | C [P],
(iii) A ⊥⊥ E | C [P], F ⊆ E, C ⊆ G ⊆ E ∨ C ⇒ A ⊥⊥ F | G [P],
(iv) A ⊥⊥ B | D ∨ C [P], A ⊥⊥ D | C [P] ⇒ A ⊥⊥ B ∨ D | C [P].

Proof. The condition (ii) follows immediately from symmetry in (A.7). For other properties use the equivalent definition from Lemma A.6. For (i) realize that B ∨ C = C and therefore every version of P(A|B ∨ C) is C-measurable. In the case (iii) observe S ≡ F ∨ G ⊆ E ∨ C ≡ T. The assumption A ⊥⊥ E | C [P] implies, for every A ∈ A, the existence of a C-measurable version of P(A|T). Since C ⊆ G ⊆ S it is both G-measurable and S-measurable. Hence, it is a version of P(A|S). The existence of a G-measurable version of P(A|S) implies A ⊥⊥ F | G [P]. To show (iv), fix A ∈ A and by A ⊥⊥ B | D ∨ C [P] derive the existence of a (D ∨ C)-measurable version f of P(A|B ∨ D ∨ C). Similarly, by A ⊥⊥ D | C [P] derive the existence of a C-measurable version g of P(A|D ∨ C). Observe that f is a version of P(A|D ∨ C) and by the "uniqueness" of P(A|D ∨ C) derive that f = g P^{D∨C}-a.e. Hence, f and g equal each other P^{B∨D∨C}-a.e. and by the "uniqueness" of P(A|B ∨ D ∨ C) conclude that g is its version. This implies A ⊥⊥ B ∨ D | C [P].

Corollary A.2. Supposing P is a probability measure on (X, X), semigraphoid properties for σ-algebras hold, that is, one has for σ-algebras A, B, C, D ⊆ X (the symbol [P] is omitted):
1. triviality: A ⊥⊥ B | C whenever B ∨ C = C,
2. symmetry: A ⊥⊥ B | C ⇒ B ⊥⊥ A | C,
3. decomposition: A ⊥⊥ B ∨ D | C ⇒ A ⊥⊥ D | C,
4. weak union: A ⊥⊥ B ∨ D | C ⇒ A ⊥⊥ B | D ∨ C,
5. contraction: A ⊥⊥ B | D ∨ C & A ⊥⊥ D | C ⇒ A ⊥⊥ B ∨ D | C.
Proof. Use Lemma A.7; for the decomposition use (iii) with E = B∨D, F = D, G = C; for the weak union put E = B ∨ D, F = B, G = D ∨ C instead.
A.8 Concepts from multivariate analysis

The concepts and facts mentioned in this section are commonly used in mathematical statistics, in particular, in its special area known as multivariate analysis. The proofs of the facts from Section A.8.1 can be found in textbooks on matrix calculus, for example Fiedler [36], Chapters 1 and 2. The proofs of basic facts in Section A.8.3 can be found in any reasonable textbook of statistics, see e.g. Anděl [4], § V.1.

A.8.1 Matrices

Given non-empty finite sets N, M, by a real N × M-matrix a real function on N × M will be understood, that is, an element of R^{N×M}. The corresponding values are indicated by subscripts so that one writes Σ = (σij)_{i∈N,j∈M} to explicate the components of a matrix Σ of this type. Note that this approach slightly differs from the classic understanding of the concept of a matrix where the index sets have settled pre-orderings, e.g., N = {1, 2, …, n} and M = {1, …, m}. This enables one to write certain formulas involving matrices in a much more elegant way.

The result of matrix multiplication of an N × M-matrix Σ and an M × K-matrix Γ (where N, M, K are non-empty finite sets) is an N × K-matrix denoted by Σ·Γ. A real vector v over N, that is, an element of R^N, will be understood here as a column vector so that it should appear in matrix multiplication with an N × N-matrix Σ from the left: Σ·v. The null matrix or a vector having zeros as all its components is denoted by 0; the unit matrix by I. An N × N-matrix Σ = (σij)_{i,j∈N} is symmetric if σij = σji for every i, j ∈ N; it is regular if there exists an inverse N × N-matrix Σ⁻¹ such that Σ·Σ⁻¹ = I = Σ⁻¹·Σ (if an inverse matrix exists, it is uniquely determined). The determinant of Σ will be denoted by det(Σ), the transpose of Σ by Σ′: Σ′ = (σ̄ij)_{i,j∈N} where σ̄ij = σji for i, j ∈ N.

Given an N × N-matrix Σ = (σij)_{i,j∈N} and non-empty A, B ⊆ N, the symbol Σ_{A·B} will be used to denote the A × B-submatrix, that is, Σ_{A·B} = (σij)_{i∈A,j∈B}. Note that one can also find the notation Σ_{AB} in the literature. However, in this book a dot is used to separate the symbols A and B in order to avoid confusion because of the special meaning of the juxtaposition AB ≡ A ∪ B accepted throughout the book (Convention 1).
By a generalized inverse of a real N × N-matrix Σ we will understand any N × N-matrix Σ⁻ such that Σ·Σ⁻·Σ = Σ. A matrix of this sort always exists, but it is not determined uniquely unless Σ is regular, in which case it coincides with Σ⁻¹ (see Rao [104], § 1b.5). However, the expressions in which generalized inverses are commonly used do not usually depend on their choice.

A real symmetric N × N-matrix Σ will be called positive semi-definite if v′·Σ·v ≥ 0 for every v ∈ R^N, and positive definite if v′·Σ·v > 0 for every v ∈ R^N, v ≠ 0. An equivalent definition is the requirement det(Σ_{A·A}) ≥ 0 for every ∅ ≠ A ⊆ N in the case of a positive semi-definite matrix and the condition det(Σ_{A·A}) > 0 for ∅ ≠ A ⊆ N in the case of a positive definite matrix. Note that Σ is positive definite iff it is regular and positive semi-definite. In that case Σ⁻¹ is positive definite as well. Supposing Σ is positive definite (semi-definite) and ∅ ≠ A ⊆ N, its main submatrix Σ_{A·A} is positive definite (semi-definite) as well. Note that the operation Σ ↦ Σ_{A·A} sometimes plays the role of "marginalizing" (but only for positive semi-definite matrices). On the other hand, supposing Σ is only regular, Σ_{A·A} need not be regular.

Suppose that Σ is a real N × N-matrix, non-empty sets A, C ⊆ N are disjoint and Σ_{C·C} is regular. Then one can introduce the Schur complement Σ_{A|C} as the following A × A-matrix:

Σ_{A|C} = Σ_{A·A} − Σ_{A·C}·(Σ_{C·C})⁻¹·Σ_{C·A}.

If C = ∅ then we accept a convention Σ_{A|∅} ≡ Σ_{A·A}. Note that Σ_{AC·AC} is regular iff Σ_{A|C} (and Σ_{C·C}) is regular, and then (Σ_{A|C})⁻¹ = ((Σ_{AC·AC})⁻¹)_{A·A} (see Theorem 1.23 in Fiedler [36]). Moreover, the following "transitivity principle" holds: supposing A, B, C ⊆ N are pairwise disjoint and Σ is an N × N-matrix such that both Σ_{C·C} and Σ_{BC·BC} are regular, one has Σ_{A|BC} = (Σ_{AB|C})_{A|B} (see Theorem 1.25 in [36]). An important fact is that whenever Σ is positive definite then Σ_{A|C} is positive definite as well. Thus, the operation Σ_{AC·AC} ↦ Σ_{A|C} often plays the role of "conditioning" (for positive definite matrices only).

However, one sometimes needs to define the "conditional" matrix Σ_{A|C} even if Σ_{C·C} is not regular. Thus, supposing Σ is a positive semi-definite matrix, one can introduce Σ_{A|C} by means of a generalized inverse (Σ_{C·C})⁻ as follows:

Σ_{A|C} = Σ_{A·A} − Σ_{A·C}·(Σ_{C·C})⁻·Σ_{C·A}.

Note that this matrix does not depend on the choice of a generalized inverse (Σ_{C·C})⁻ and it is again a positive semi-definite matrix (one can show these facts using what is said in § 8a.2(V) of Rao [104]). Of course, in the case of a positive definite matrix Σ it coincides with the above-mentioned Schur complement. The concept of "conditioning" is thus extended to positive semi-definite matrices.
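The Schur complement and the block-inverse identity (Σ_{A|C})⁻¹ = ((Σ_{AC·AC})⁻¹)_{A·A} are easy to verify numerically; a minimal sketch (Python with NumPy; the function name is illustrative, not part of the original text):

    import numpy as np

    def schur_complement(sigma, A, C):
        """Sigma_{A|C} = Sigma_AA - Sigma_AC (Sigma_CC)^{-1} Sigma_CA."""
        S_AA = sigma[np.ix_(A, A)]
        S_AC = sigma[np.ix_(A, C)]
        S_CC = sigma[np.ix_(C, C)]
        return S_AA - S_AC @ np.linalg.solve(S_CC, S_AC.T)

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    sigma = M @ M.T + 4 * np.eye(4)          # a positive definite 4x4 matrix
    A, C = [0, 1], [2, 3]
    S = schur_complement(sigma, A, C)

    # check (Sigma_{A|C})^{-1} = ((Sigma_{AC.AC})^{-1})_{A.A} with AC the full index set:
    lhs = np.linalg.inv(S)
    rhs = np.linalg.inv(sigma)[np.ix_(A, A)]
    print(np.allclose(lhs, rhs))             # True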
A.8.2 Statistical characteristics of probability measures

Remark A.2. An elementary concept of mathematical statistics is a random variable, which is a real measurable function ξ on a certain (intentionally unspecified) measurable space (Ω, A) where Ω is interpreted as the "universum" of elementary events and A as the collection of "observable" random events. Moreover, it is assumed that (Ω, A) admits a probability measure P̄. Then every random vector, that is, a finite collection of random variables ξ = [ξi]_{i∈N} where |N| ≥ 2, can be viewed as a measurable mapping of (Ω, A) into (R^N, B^N), where R^N ≡ ∏_{i∈N} Xi with Xi = R is endowed with the Borel σ-algebra B^N (≡ the product of Borel σ-algebras on R in this case). Then P̄ induces through ξ a probability measure P, called the distribution of ξ:

P(A) = P̄({ω ∈ Ω; ξ(ω) ∈ A})  for every Borel set A ⊆ R^N.

The measurable space (R^N, B^N) is then called the (joint) sample space. Note that generalized random variables taking values in alternative sample measurable spaces, e.g., in finite sets Xi, i ∈ N instead of R, are also often considered. The area of interest of mathematical statistics is not the "underlying" theoretical probability P̄ but the induced probability measure P on the sample space. Indeed, despite the fact that textbooks of statistics introduce various numerical characteristics of random vectors, these numbers actually do not characterize random vectors themselves but their distributions, that is, induced Borel probability measures on R^N. The purpose of many statistical methods is simply to estimate these numerical characteristics from data. Definitions of basic ones are recalled in the rest of this section.

Let P be a probability measure on (∏_{i∈N} Xi, ∏_{i∈N} Xi) = (R^N, B^N) where |N| ≥ 2. Let xi denote the i-th component (i ∈ N) of a vector x ∈ R^N. If the function x ↦ xi, x ∈ R^N (which is B^N-measurable) is P-integrable for every i ∈ N, one can define the expectation as a real vector e = [ei]_{i∈N} ∈ R^N with the components

ei = ∫_{R^N} xi dP(x) = ∫_{Xi} y dP^{{i}}(y)  for i ∈ N.

If, moreover, the function x ↦ (xi − ei)·(xj − ej) is P-integrable for every i, j ∈ N, then one defines the covariance matrix of P as an N × N-matrix Σ = (σij)_{i,j∈N} with elements

σij = ∫_{R^N} (xi − ei)·(xj − ej) dP(x) = ∫_{Xi×Xj} (y − ei)·(z − ej) dP^{{i,j}}(y, z),

where the latter formula holds for distinct i, j ∈ N. If i = j then

σii = ∫_{R^N} (xi − ei)² dP(x) = ∫_{Xi} (y − ei)² dP^{{i}}(y)
is called the variance of the i-th component. Alternative names of the matrix Σ are variance matrix, dispersion matrix [104], or even variance-covariance matrix [157]. An elementary fact is that the covariance matrix is always positive semi-definite; the converse is also valid (see Section A.8.3). Supposing P has a covariance matrix Σ = (σij)_{i,j∈N} such that σii > 0 for every i ∈ N, one can introduce a correlation matrix Γ = (ρij)_{i,j∈N} by the formula

ρij = σij / √(σii·σjj)  for i, j ∈ N.

Note that the above situation occurs whenever Σ is regular (= positive definite) and Γ is then a positive definite matrix with ρii = 1 for every i ∈ N.

A.8.3 Multivariate Gaussian distributions

The definition of a general Gaussian measure on R^N is not straightforward. First, one has to introduce a one-dimensional Gaussian measure N(r, s) on R with parameters r, s ∈ R, s ≥ 0. In the case s > 0, one can do so by defining the Radon-Nikodym derivative with respect to the Lebesgue measure on R

f(x) = (1/√(2πs)) · exp( −(x − r)²/(2s) )  for x ∈ R,

where π is the Ludolph constant. In the case s = 0, N(r, 0) is defined as a Borel probability measure on R concentrated on {r}. Then, supposing e ∈ R^N and Σ is a positive semi-definite N × N-matrix (|N| ≥ 1), one can introduce the Gaussian measure N(e, Σ) as a Borel measure P on R^N such that, for every v ∈ R^N, P induces through the measurable mapping x ↦ v′·x, x ∈ R^N a one-dimensional Gaussian measure N(v′·e, v′·Σ·v) on R. Let us note that a probability measure of this kind always exists and it is determined uniquely by the above requirement. Moreover, P then has the expectation e and the covariance matrix Σ (see § V.1 in Anděl [4]). This explains why these parameters were designed in this way and shows that every positive semi-definite matrix is the covariance matrix of a Gaussian measure. A linear transformation of a Gaussian measure N(e, Σ) by a mapping x ↦ y + Λ·x, x ∈ R^N where y ∈ R^M, Λ ∈ R^{M×N}, |M| ≥ 1 is again a Gaussian measure N(y + Λ·e, Λ·Σ·Λ′) – see Theorem 4 in § V.1 of [4]. In particular, a marginal of a Gaussian measure is again a Gaussian measure:

P = N(e, Σ), ∅ ≠ A ⊆ N ⇒ P^A = N(e_A, Σ_{A·A}).   (A.8)
Note that this explains the interpretation of Σ A·A as a “marginal” matrix of Σ (see Section A.8.1). A very important fact is that independence is characterized by means of the covariance matrix: provided that P = N (e, Σ), A, B ⊆ N and A ∩ B = ∅ one has (cf. Theorem 8 in § V.1 of [4])
P^{AB} = P^A × P^B  iff  Σ_{A·B} = 0.   (A.9)
In general, a Gaussian measure N(e, Σ) is concentrated on a certain affine subspace, namely {e + Σ·t ; t ∈ R^N}. In other words, it is the set {e} + L where L ⊆ R^N is the linear subspace generated by the columns Σ_{N·j}, j ∈ N of the matrix Σ or, equivalently, by its rows (since Σ is a symmetric matrix). It can also be described as follows:

{x ∈ R^N ; ∀ v ∈ R^N  v′·Σ = 0 ⇒ v′·(x − e) = 0}.   (A.10)

In the case of a regular Σ, the subspace is the whole R^N and P = N(e, Σ) can be introduced directly by defining the Radon-Nikodym derivative with respect to the Lebesgue measure on R^N:

f_{e,Σ}(x) = (1/√((2π)^{|N|}·det(Σ))) · exp( −(x − e)′·Σ⁻¹·(x − e)/2 )  for x ∈ R^N.
The respective Gaussian measure on R^N is then called regular. This version of the Radon-Nikodym derivative is strictly positive and continuous with respect to the Euclidean topology on R^N. Moreover, it is the unique continuous version within the class of all possible versions of the Radon-Nikodym derivative of P with respect to the Lebesgue measure λ. This simple fact motivates an implicit convention used commonly in the statistical literature: continuous versions, called (marginal) densities, are exclusively taken into consideration. The convention is in concordance with the usual way of "marginalizing" since, for ∅ ≠ A ⊂ N, by integrating a continuous density f, that is,

f_A(x) = ∫_{X_{N\A}} f(x, y) dλ(y)  for x ∈ R^A,

one again gets a continuous strictly positive function, i.e., a marginal density. This also motivates a natural way of defining a (continuous) conditional density for disjoint A, C ⊆ N by the formula

f_{A|C}(x|z) = f_{AC}(xz) / f_C(z)  for x ∈ R^A, z ∈ R^C.

The definition of the conditional measure for every z ∈ R^C is then

P_{A|C}(A|z) = ∫_A f_{A|C}(x|z) dλ_A(x)  for every Borel set A ⊆ R^A,

which appears to be a regular version of conditional probability (on R^A) given B^C (see Remark A.1 for this concept). Let us emphasize that just the acceptance of the above convention leads to its "uniqueness" for every z ∈ R^C. It is again a Gaussian measure, sometimes called the conditioned Gaussian measure.
P = N(e, Σ), A, C ⊆ N, A ∩ C = ∅ ≠ A ⇒ P_{A|C}(·|z) = N(e_A + Σ_{A·C}·(Σ_{C·C})⁻¹·(z − e_C), Σ_{A|C});   (A.11)

for a proof see Rao [104], § 8a.2(V). An important feature is that its covariance matrix Σ_{A|C} does not depend on z. This may explain the interpretation of the Schur complement Σ_{A|C} in terms of "conditioning" – that is why it is sometimes called the conditional covariance matrix. However, the operation of "conditioning" can be introduced even in the case of a singular Gaussian measure, that is, a Gaussian measure N(e, Σ) with a covariance matrix which is not regular. Nevertheless, it is defined "uniquely" only for those z ∈ R^C which belong to the respective affine subspace on which the marginal N(e_C, Σ_{C·C}) is concentrated – cf. (A.10). It is again a Gaussian measure, given by (A.11), but (Σ_{C·C})⁻¹ is replaced by a generalized inverse (Σ_{C·C})⁻. Since the matrix Σ_{A|C} does not depend on the choice of (Σ_{C·C})⁻ (see Section A.8.1), its covariance matrix is uniquely determined. Even more, it can be shown using what is said in § 8a.2(V) in [104] that the expectation vector e_A + Σ_{A·C}·(Σ_{C·C})⁻·(z − e_C) also does not depend on the choice of (Σ_{C·C})⁻ for those z ∈ R^C which belong to the subspace. Thus, for these z ∈ R^C the conditioned Gaussian measure P_{A|C}(·|z) is really uniquely determined. However, this may not be true for z ∈ R^C outside the considered subspace. The last important fact is that if Σ is positive definite then the measure P = N(e, Σ) has finite relative entropy with respect to the Lebesgue measure λ on R^N, namely

H(P | λ) = −(|N|/2)·ln(2π) − |N|/2 − (1/2)·ln(det(Σ)),   (A.12)
see Rao [104], § 8a.6 (note that Rao’s entropy is nothing but minus relative entropy).
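The conditioning formula (A.11) is straightforward to implement for a regular Σ; a minimal sketch (Python with NumPy, illustrative names only):

    import numpy as np

    def condition_gaussian(e, sigma, A, C, z):
        """Parameters of the conditioned Gaussian measure P_{A|C}(.|z), cf. (A.11).

        Returns the conditional expectation e_A + S_AC S_CC^{-1} (z - e_C)
        and the conditional covariance, the Schur complement Sigma_{A|C},
        which indeed does not depend on z.
        """
        S_AC = sigma[np.ix_(A, C)]
        S_CC = sigma[np.ix_(C, C)]
        cond_mean = e[A] + S_AC @ np.linalg.solve(S_CC, z - e[C])
        cond_cov = sigma[np.ix_(A, A)] - S_AC @ np.linalg.solve(S_CC, S_AC.T)
        return cond_mean, cond_cov

    e = np.array([0.0, 1.0, -1.0])
    sigma = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])       # positive definite
    mean, cov = condition_gaussian(e, sigma, A=[0], C=[1, 2], z=np.array([2.0, 0.0]))
    print(mean, cov)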
A.9 Elementary statistical concepts The task of statistics is to examine data. The aim is often to extract structural information about the relations among variables of our interest. For this purpose, a quite complicated collection of mathematical ideas, concepts and assumptions was established in order to substitute a desired relationship between reality and theory. What follows is an overview of some of these elementary statistical concepts which is, however, adapted to the goals of this monograph. Intentionally, a distinction is made between empirical concepts, that is, concepts which can solely be introduced in terms of observed data, and the concepts which are based on statistical assumptions.
A.9.1 Empirical concepts

In general, a (joint) sample space over a non-empty finite set (of variables) N can be any Cartesian product (∏_{i∈N} Xi, ∏_{i∈N} Xi) of some measurable spaces (Xi, Xi), i ∈ N, that is, of the sample spaces for individual variables. Nevertheless, two typical instances most often occur in practice:
• Xi is a finite non-empty set and Xi = P(Xi) for every i ∈ N, which is the case of a discrete sample space over N,
• Xi = R and Xi is the σ-algebra of Borel sets for every i ∈ N, which is the case of a continuous sample space over N.

Data over N are expected in the form of a finite sequence of elements of a fixed joint sample space over N. If the sample space is fixed then the symbol DATA(N, d) where d ∈ N will denote the collection of all ordered sequences

x¹, …, x^d  where x^ℓ ∈ ∏_{i∈N} Xi for every ℓ = 1, …, d,

that is, the collection of all possible databases of the length d. The vector x^ℓ = [x^ℓ_i]_{i∈N} represents the ℓ-th observation, respectively the result of the ℓ-th measurement. In this monograph, only the case of a complete database is considered, unlike the case of data with missing values, sometimes called the case of missing data in the literature, in which case x^ℓ ∈ ∏_{i∈N} Xi is replaced by x̃^ℓ ∈ ∏_{i∈A(ℓ)} Xi where ∅ ≠ A(ℓ) ⊆ N for every ℓ = 1, …, d.

Any measurable mapping from DATA(N, d) ≡ (∏_{i∈N} Xi)^d to a measurable space (T, T) is called a statistic. Statistics are mostly real functions – they form a basis of various statistical tests and estimates. Simple examples of statistics used in the case of a continuous sample space over N are the sample expectation ê = d⁻¹ · Σ_{ℓ=1}^d x^ℓ (with x^ℓ ∈ R^N) and the sample covariance matrix defined by

Σ̂ = (σ̂ij)_{i,j∈N}  where σ̂ij = d⁻¹ · Σ_{ℓ=1}^d (x^ℓ_i − êi)·(x^ℓ_j − êj).
A.9 Elementary statistical concepts
ctA [D] :
y = [yi ]i∈A ∈
243
Xi → |{ ; ∀ i ∈ A x i = yi }| .
i∈A
The convention that ct∅ [D] ≡ d for D ∈ DATA(N, d) is sometimes advantageous. Note that the order of the items in a database is often unimportant, in which case we assume the data are directly taken in the form of an element of CONT(N, d). A derived concept is that of empirical measure on i∈N Xi , which is a probability measure given by the formula Pˆ (A) = d−1 · |{ ; x ∈ A}| = d−1 · ctN [D](y) for A ⊆ Xi . y∈A
i∈N
Observe that the density of the marginal measure of Pˆ on i∈A Xi with respect to the corresponding counting measure is given by Xi , A ⊆ N . pˆA (y) = d−1 · ctA [D](y) for y ∈ i∈A
Given a triplet a, b|C where a, b ∈ N are distinct and C ⊆ N \ {a, b}, the respective fitted empirical measure on i∈abC Xi is a probability measure given by its density with respect to the counting measure: pˆaC ([yi ]i∈aC )·pˆbC ([yi ]i∈bC ) if pˆC ([yi ]i∈C ) > 0, pˆC ([yi ]i∈C ) pˆ a,b|C (y) = 0 if pˆC ([yi ]i∈C ) = 0, for y = [yi ]i∈abC ∈ i∈abC Xi . Note that the marginal measure of the empir ical measure Pˆ on i∈abC Xi is absolutely continuous with respect to the fitted empirical measure. The G2 -statistic (see Spirtes et al. [122] p. 129) is defined as a multiple by 2d of the respective relative entropy: pˆabC (y) ; G2 = 2d · { pˆabC (y) · ln pˆ a,b|C (y) y∈ Xi , pˆabC (y) > 0 } . i∈abC
Finally, Pearson’s X 2 -statistic (see Whittaker [157] p. 216) is defined by X2 = d ·
( pˆabC (y) − pˆ a,b|C (y) )2 ; { pˆ a,b|C (y) y∈ Xi , pˆ a,b|C (y) > 0 } . i∈abC
Clearly, these two statistics take values in [0, ∞). The respective statistical tests have the following form: one chooses a critical value t ∈ (0, ∞) and rejects the corresponding conditional independence hypothesis if the value of the statistics exceeds t. However, the exact specification of the critical value t depends on mysterious statistical assumptions and considerations (see Section A.9.4).
244
A Appendix
A.9.2 Statistical conception A basic statistical idea is that data are supposed to be values of random variables with a shared unknown distribution. The idea that a probability measure P “generates” data, sometimes also expressed by an alternative phrase “data are sampled from P ”, is usually formalized by thefollowingconcept. By a random sample from a probability measure P on ( i∈N Xi , i∈N Xi ) of the length d ∈ N we understand a series ξ 1 , . . . , ξ d of generalized random variables which take values in i∈N Xi , have a shared distribution P and are mutually independent, that is, P ({ω ∈ Ω ; ∀ = 1, . . . , d ξ (ω) ∈ A }) =
d =1
P ({ω ∈ Ω ; ξ (ω) ∈ A }) ≡
d
P (A )
=1
holds for every sequence A1 , . . . , Ad ∈ i∈N Xi . The idea that the probability measure P is only partially unknown is formalized assumption that P by the belongs to a set M of probability measures on ( i∈N Xi , i∈N Xi ) which could then be called a statistical model . Typically, a statistical model of this kind consists of a parameterized class of probability measures which are absolutely continuous with respect to a given σ-finite product measure µ on the joint sample space. The parameters belong to a convex subset Θ of the Euclidean space Rn for some n ∈ N, mostly to an n-dimensional interval (but it can be a polytope too). Thus, a statistical model M = {Pθ ; θ ∈ Θ} is typically determined by a collection of densities dPθ (x) = f (x, θ) for µ-a.e. x ∈ Xi , θ ∈ Θ} where Xi . {f (x, θ); x ∈ dµ i∈N
i∈N
Remark A.3. The above description is perhaps a too crude simplification of what is meant by a model in the statistical literature. In elementary statistical textbooks, a general phrase “model” is often used to name the collection of assumptions about functional relationships among considered random variables in which some unknown parameters occur; for example, the linear regression model in Chapter VI of Andˇel [4]. However, the respective complex collection of assumptions about a model of this kind can usually be reformulated as the requirement that the joint distribution of the “observed” random variables belongs to a parameterized class of probability measures. Therefore, a lot of models from the statistical literature can be conceived in the way described above. The understanding of a statistical model as a parameterized class of probability measures also allows one to introduce further classic statistical concepts in Section A.9.3. Moreover, traditional graphical models, such as graphical Gaussian (covariance selection) models and graphical log-linear models (see Chapters 6 and 7 of Whittaker [157]), certainly fall within this scope.
A.9 Elementary statistical concepts
245
One meets an even more complex situation in the area of graphical models. A whole family of statistical models of the above-mentioned kind with a fixed joint sample space is at our disposal and the task is to select one of the possible candidate models as the best explanation of data. That task is sometimes called the problem of model choice – see Cowell et al. [26]. There are two basic methods to tackle this problem. The first one is to design an information criterion, which is a real function of candidate models and data expressing the extent to which the data support particular models. The goal is to find a model maximizing the criterion or (all) adequate models, that is, models which are good enough from the point of view of the information criterion. Basic information criteria are mentioned in Section A.9.3. The second method is to use statistical tests for pairwise comparison of candidate models. Each of the tests is designed to compare two competing models on the basis of data; it is constructed as a classic test to verify a certain statistical hypothesis [71] – for details see Section A.9.4. The pair of competing models is typically a pair of nested models, which means that one class of probability measures is contained in the other one. Thus, the whole collection of considered candidate models is a poset: the natural ordering is dictated by the inclusion (of classes of probability measures). Typically, there exists a greatest element in this poset of candidate models, which is called the saturated model.

A.9.3 Likelihood function

Let M = {Pθ ; θ ∈ Θ} be a statistical model determined by the collection of densities $\{\,f(x,\theta)\ ;\ x\in \prod_{i\in N} X_i,\ \theta\in\Theta\,\}$. The likelihood function is a function which ascribes, to a parameter θ and data D, the likelihood of the data provided that they are "generated" from Pθ:

$$L(\theta, D) \;=\; \prod_{\ell=1}^{d} f(x^{\ell},\theta) \qquad \text{where } \theta\in\Theta,\ d\in\mathbb{N},\ D \in \mathrm{DATA}(N,d) : x^{1},\dots,x^{d}\,.$$
Indeed, in the case of a discrete sample space over N, the value L(θ, D) is nothing but the probability of the occurrence of the database D in a random sample from Pθ. The likelihood function is a theoretical basis for classic statistical estimates and tests.

Remark A.4. Some statistical textbooks introduce the likelihood function as follows: given a single observation $x \in \prod_{i\in N} X_i$, it is a function on Θ which ascribes f(x, θ) to θ ∈ Θ. However, this elementary definition is usually extended later to the case of d observations: $\prod_{i\in N} X_i$ is replaced by $(\prod_{i\in N} X_i)^d$ and Pθ by $\prod_{\ell=1}^{d} P_\theta$. In this monograph, as in some advanced statistical books, the extended definition of the likelihood function is accepted.
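In code, the (log-)likelihood of a database is then a one-liner. This is a sketch re-using the hypothetical model representation from the previous listing; the logarithm is preferred because a product of d densities easily underflows floating-point arithmetic for long databases.

```python
import math

def log_likelihood(theta, data, f):
    """ln L(theta, D) for a database D : x^1, ..., x^d with densities f(., theta)."""
    return sum(math.log(f(x, theta)) for x in data)
```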
By a maximum likelihood estimate in M on the basis of data D ∈ DATA(N, d) we understand any probability measure Pθ̂, θ̂ ∈ Θ, such that L(θ, D) ≤ L(θ̂, D) for every θ ∈ Θ. Note that the existence and the uniqueness of the value θ̂ which achieves the maximum of L is not ensured in general. However, the logarithm of the likelihood function for fixed data is very often a strictly concave function on a convex set Θ and its maximum is then typically attained at a unique point of Θ.

The likelihood function also allows one to derive various criteria to tackle the problem of model choice. If one has a fixed distribution framework (see Section A.9.5 for an explanation of what is meant by this general phrase) and considers a finite family of statistical models within this framework such that the maximum likelihood estimate exists in every model from the family, then the maximized log-likelihood criterion, given by

$$\mathrm{MLL}(M, D) \;=\; \max_{\theta\in\Theta}\,\ln L(\theta, D) \qquad \text{where } M = \{P_\theta\,;\ \theta\in\Theta\},\ D\in\mathrm{DATA}(N,d)\,,$$

can be regarded as a basic information criterion to be maximized. However, if the family of competing models is a poset with the greatest element, then the maximum of the MLL criterion is always achieved in the saturated model, although it may not be the only such model. Since it is not desirable to choose the most complex saturated model as the explanation of data, this basic criterion is modified by subtracting a penalization term which reflects the complexity of the model. A classic measure of complexity is the effective dimension DIM(M) of the model M = {Pθ ; θ ∈ Θ}, which is typically the dimension of the affine subspace generated by Θ. The most popular criteria include Akaike's information criterion [2], given by

$$\mathrm{AIC}(M, D) \;=\; \mathrm{MLL}(M, D) - \mathrm{DIM}(M) \tag{A.13}$$

for M = {Pθ ; θ ∈ Θ}, D ∈ DATA(N, d), and the Jeffreys-Schwarz criterion [114], also named the Bayesian information criterion, given by

$$\mathrm{BIC}(M, D) \;=\; \mathrm{MLL}(M, D) - \tfrac{1}{2}\,\mathrm{DIM}(M)\cdot \ln d \tag{A.14}$$

for M = {Pθ ; θ ∈ Θ}, D ∈ DATA(N, d).
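A minimal sketch of the three criteria follows. The maximization over a finite grid of parameter values is purely illustrative; in concrete models the maximum likelihood estimate is usually available analytically or via convex optimization, and the effective dimension must be supplied by the analysis of the model.

```python
import math

def mll(data, f, thetas):
    """MLL(M, D): maximized log-likelihood over a grid approximating Theta."""
    return max(sum(math.log(f(x, th)) for x in data) for th in thetas)

def aic(data, f, thetas, dim):
    """Akaike's information criterion (A.13)."""
    return mll(data, f, thetas) - dim

def bic(data, f, thetas, dim):
    """Bayesian information criterion (A.14)."""
    return mll(data, f, thetas) - 0.5 * dim * math.log(len(data))
```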
A.9.4 Testing statistical hypotheses

Given a statistical model M = {Pθ ; θ ∈ Θ}, a statistical hypothesis H = {Pθ ; θ ∈ ΘH} is specified by a non-empty proper subset ΘH ⊂ Θ of the set of parameters. The respective alternative A = {Pθ ; θ ∈ ΘA} is given by the complementary set of parameters ΘA = Θ \ ΘH. If one deals with the problem of model choice, the hypothesis H often represents a submodel of a temporarily considered model M, so that the alternative is A = M \ H. Then testing of H on the basis of data is to answer the question of whether a simpler model
H can possibly replace M. Nevertheless, it may also be the case that H is a temporarily considered model and M is a wider model containing it. Then testing of H is to answer the question of whether M gives a better explanation of the data and should, therefore, replace H.

As mentioned in Section A.9.1, a usual statistical test is based on a suitable statistic S. Given a critical value t ∈ R, the respective critical region {D ∈ DATA(N, d) ; S(D) ≥ t} is the set of databases on the basis of which H is rejected. The statistic S is designed in such a way that, on condition the data are "generated" from Pθ, its value reflects whether θ ∈ ΘH. If the assumption that D ∈ DATA(N, d) is a random sample from Pθ, θ ∈ Θ, is accepted, then the statistic S becomes a random variable with a distribution $Q^{S,d}_{\theta}$, and the probability $\mathbf{P}(\{\omega\in\Omega\,;\ S(\xi^1(\omega),\dots,\xi^d(\omega)) \geq t\})$ of the rejection of the hypothesis H can be expressed as follows:

$$\Big(\prod_{\ell=1}^{d} P_\theta\Big)\big(\{\,[x^1,\dots,x^d]\ ;\ S(x^1,\dots,x^d) \geq t\,\}\big) \;=\; Q^{S,d}_{\theta}([t,\infty))\,.$$
The error of the first kind occurs when the hypothesis is rejected even if it holds. Its least upper bound $\sup\,\{\,Q^{S,d}_{\theta}([t,\infty))\ ;\ \theta\in\Theta_H\,\}$ is called the size of the test. The error of the second kind occurs when the hypothesis is not rejected while it should be. It is characterized by the power function of the test, which ascribes to every θ ∈ ΘA the probability of rejection $Q^{S,d}_{\theta}([t,\infty))$, that is, the probability that the error of the second kind is not made. Thus, the higher the values of the power function are, the better the test is from the point of view of the error of the second kind. A classic method of derivation of a critical value t is to choose a significance level α ∈ (0, 1), typically α = 0.05, and to look for the most powerful test among tests based on S whose size does not exceed α. This typically leads to the critical value

$$t^{*}_{d}(\alpha) \;=\; \inf\,\{\,t\in\mathbb{R}\ ;\ Q^{S,d}_{\theta}([t,\infty)) \leq \alpha\ \text{ for every }\theta\in\Theta_H\,\}\,.$$
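When no analytic expression for $Q^{S,d}_{\theta}$ is at hand, the critical value can at least be approximated by simulation. The sketch below assumes the simplest situation, a hypothesis consisting of a single measure from which one can sample; the function names are hypothetical.

```python
def critical_value(sample_one, statistic, d, alpha, n_sim=10000):
    """Monte Carlo approximation of t*_d(alpha): simulate the null
    distribution of S and return (roughly) its (1 - alpha)-quantile."""
    values = sorted(statistic([sample_one() for _ in range(d)])
                    for _ in range(n_sim))
    index = min(int((1 - alpha) * n_sim), n_sim - 1)
    return values[index]
```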
Remark A.5. Note that the roles of a hypothesis H and an alternative A are not interchangeable in statistical testing. This is because of the special way of designing statistical tests described above, where the error of the first kind is implicitly supposed to be more harmful than the error of the second kind. Thus, the task of testing H against A is not equivalent to the task of testing A against H.

What was described above is, however, an idealized theoretical way to design a statistical test on the basis of a statistic S – this is how it is described in statistics textbooks – see Chapter 3 in Lehmann [71]. In practice it is often infeasible to get an analytic expression for the probability $Q^{S,d}_{\theta}([t,\infty))$ for all values of d and θ. Therefore, the "exact" critical values t*_d(α) can
hardly be obtained. The usual trick used to avoid this complication is to use the idea of an asymptotic distribution of S. To this end one usually needs a theoretical result saying that there exists a probability measure $Q^S$ on R, absolutely continuous with respect to the Lebesgue measure on R, such that

$$\lim_{d\to\infty} Q^{S,d}_{\theta}([t,\infty)) \;=\; Q^{S}([t,\infty)) \qquad \text{for every } \theta\in\Theta_H,\ t\in\mathbb{R}\,.$$

An interesting fact is that the asymptotic distribution $Q^S$ usually does not depend on the choice of θ ∈ ΘH and its analytic expression is known. Then an approximate critical value

$$t^{*}_{\infty}(\alpha) \;=\; \inf\,\{\,t\in\mathbb{R}\ ;\ Q^{S}([t,\infty)) \leq \alpha\,\}$$

can be taken instead of the "exact" value t*_d(α). Quite often, the asymptotic distribution of the statistic S under consideration is the χ²-distribution with r degrees of freedom. Its density with respect to the Lebesgue measure on R is

$$f_r(x) \;=\; \begin{cases} \dfrac{1}{2^{r/2}\,\Gamma(r/2)}\cdot x^{(r/2)-1}\cdot e^{-x/2} & \text{if } x > 0,\\[1ex] 0 & \text{if } x \leq 0, \end{cases} \qquad \text{for } r\in\mathbb{N},\ x\in\mathbb{R}\,,$$

where $\Gamma(a) = \int_{0}^{\infty} z^{a-1}\cdot e^{-z}\,dz$, a > 0, is the value of the Gamma function Γ at a. Note that this measure can equivalently be introduced as the distribution of a random variable ξ which is a sum of r squares of independent N(0, 1)-distributed random variables: $\xi = \sum_{i=1}^{r} (\xi_i)^2$ – see Theorem 10 in § V.2 of Anděl [4].

To illustrate the overall procedure, let us describe how a statistical conditional independence test can be obtained. Consider discrete sample spaces Xi, i ∈ N, the G²-statistic, the saturated statistical model M and the hypothesis H specified by the conditional independence restriction given by a triplet
⟨a, b|C⟩. The respective theoretical result says that the asymptotic distribution of the G²-statistic is the χ²-distribution with r = DIM(M) − DIM(H) degrees of freedom – see § 9.3(iii) of Cox and Hinkley [27] for the respective arguments. This result perhaps gives an interpretation to the value of r ∈ N: it is the number of parameters of the saturated model M which have to be fixed (= set to zero) to get the submodel H. An analogous claim for the X²-statistic is proved in § 6b.2(I) of Rao [104]. Thus, one can obtain r = DIM(M) − DIM(H), which appears to be $r = (|X_a| - 1)\cdot(|X_b| - 1)\cdot \prod_{i\in C} |X_i|$ (cf. Proposition 7.6.2 in Whittaker [157]), and then calculate the respective approximate critical value on the basis of the χ²-distribution with r degrees of freedom.
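For illustration, the resulting recipe fits in a few lines (a sketch; scipy is an assumed dependency used here only for the χ²-quantile function):

```python
from scipy.stats import chi2

def g2_critical_value(card_a, card_b, cards_C, alpha=0.05):
    """Approximate critical value for the G^2-test of <a, b|C> with
    r = (|X_a| - 1) * (|X_b| - 1) * prod_{i in C} |X_i| degrees of freedom."""
    r = (card_a - 1) * (card_b - 1)
    for card in cards_C:
        r *= card
    return chi2.ppf(1 - alpha, r)   # the (1 - alpha)-quantile of chi^2_r
```

The hypothesis ⟨a, b|C⟩ is then rejected whenever the computed value of G² exceeds this critical value.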
A.9.5 Distribution framework

To compare different classes of probability measures, one needs to have a common distribution framework, that is, a sufficiently comprehensive set of probability measures such that competing statistical models in the procedure of model choice are defined as subsets of that set of measures. Of course, a certain implicit consistency is desirable: the considered set of probability measures for a set of variables N should be "of the same type" as the respective set of measures for another set of variables N′. For example, one should certainly avoid strange combinations, such as all involved probability measures having discrete sample spaces if three variables are considered but a continuous sample space if four variables are considered.

In this monograph, the concept of a distribution framework, that is, a rule which ascribes to every finite non-empty set of variables N the respective comprehensive set of probability measures over N (together with their sample spaces), is understood in a slightly vague way. The reason is that it has not been completely clarified under which operations the ascribed set of measures should be closed (e.g. the operation of permutation of variables) and what should be the relation of the ascribed sets of measures for different sets of variables – see an open problem, Direction 7 in Chapter 9. Thus, instead of giving either a descriptive or an axiomatic definition of this concept, six specific examples of a distribution framework are given.

• The discrete distribution framework includes, for each particular non-empty set of variables N, all probability measures on an arbitrary discrete sample space over N.
• The positive discrete distribution framework consists, for each N, of the set of all probability measures P on any discrete sample space $\prod_{i\in N} X_i$ such that P({x}) > 0 for every $x \in \prod_{i\in N} X_i$.
• The (general) Gaussian distribution framework has, for every N, a fixed continuous sample space over N: it consists of the class of Gaussian measures on R^N.
• The positive Gaussian distribution framework also has the sample space R^N for every N, but it only involves regular Gaussian measures (whose covariance matrices are positive definite).
• The binary distribution framework consists, for each N, of the set of all probability measures P on a discrete sample space $\prod_{i\in N} X_i$ with |Xi| ≤ 2 for every i ∈ N.
• The positive binary distribution framework is defined analogously; the additional requirement is P({x}) > 0 for every $x \in \prod_{i\in N} X_i$.

However, there are other options. Some authors seem to consider the situation when both the set of variables N is fixed and the collection of sample spaces (Xi, 𝒳i), i ∈ N, is prescribed. One can then introduce several particular distribution frameworks.

• A discrete distribution framework with prescribed sample spaces is determined by a given non-empty finite set of variables N and by a collection of non-empty finite sets Xi, i ∈ N, which are viewed as individual discrete sample spaces. The framework consists of probability measures on the respective joint sample space $(\prod_{i\in N} X_i, \prod_{i\in N}\mathcal{X}_i)$.
• A positive discrete distribution framework with prescribed sample spaces is defined analogously. The only modification is that those probability
measures P on $(\prod_{i\in N} X_i, \prod_{i\in N}\mathcal{X}_i)$ are considered for which P({x}) > 0 whenever $x \in \prod_{i\in N} X_i$.
• A discrete distribution framework with prescribed one-dimensional marginals is determined by a given finite set of variables N, by a collection of discrete sample spaces (Xi, 𝒳i), i ∈ N, and by a collection of probability measures Pi on (Xi, 𝒳i). It consists of those probability measures P on $(\prod_{i\in N} X_i, \prod_{i\in N}\mathcal{X}_i)$ such that, for every i ∈ N, the marginal measure of P on (Xi, 𝒳i) is Pi.

These particular distribution frameworks can perhaps also be interpreted as rules which ascribe to every non-empty subset N′ of N the respective set of probability measures over N′.
B List of Notation
Simple conventional symbols

≪  symbol for absolute continuity of measures 228
⊥⊥  symbol for conditional independence 12 (in LaTeX \perp\!\!\!\perp)
⊥̸⊥  symbol for conditional dependence (negation of ⊥⊥) 12
⊗  symbol for weak composition of structural models 203
⊕  symbol for direct sum of linear subspaces 223
∅  symbol for empty set 215
symbol for independence equivalence 113
symbol for independence implication 114
∞  symbol for infinity
∫  symbol for the Lebesgue integral ∫_A f(x) dµ(x) 228
∼m  symbol for level equivalence induced by a skeletal imset m 197
≈  symbol for equivalence of conditional probabilities 25
∧, ∨  symbols for meet (infimum) and join (supremum) 216–217 (226)
·  symbol for multiplication of matrices 236, (scalar) multiple 222
+  symbol for summing numbers and vectors 222
¬  symbol for negation
≺  symbol for partial ordering 216
×  symbol for product (in general)
\  symbol for set difference, e.g. A \ B
⊆, ⊇  symbols for set inclusion 215
⊂, ⊃  [Chapter 8] symbols for inclusion neighbors 177
∪, ∩  symbols for set union and intersection 215
symbol for decomposition implication 142
0  zero, zero imset 39
0  zero vector 222, null matrix 236
end of a proof
end of a remark
♦  end of an example or a convention
Composite conventional symbols

| |  absolute value, cardinality 9
⌊ ⌋  lower integer part: ⌊a⌋ = max {z ∈ Z ; z ≤ a} for a ∈ R
⌈ ⌉  upper integer part: ⌈a⌉ = min {z ∈ Z ; a ≤ z} for a ∈ R
(n over k)  combination number: $\binom{n}{k} = \frac{1\cdot\ldots\cdot n}{(1\cdot\ldots\cdot k)\cdot(1\cdot\ldots\cdot(n-k))}$ for n ∈ N, k ∈ Z+, k ≤ n
–  line (undirected edge): a – b is a line between a and b 219
→  arrow (directed edge): a → b is an arrow from a to b 219
( , )  open interval, ordered pair 218
[ , ]  closed interval, edge in a graph 219
⟨ , ⟩  scalar product 41, 222
{ , …, }  set containing elements of the list
Symbols from other languages

𝒜  set of attributes in a formal context 102
Γ  generic symbol for a set of continuous variables 66
Γ  symbol of the Gamma function 248
Γ = (ρij)i,j∈N  generic symbol for a correlation matrix and its elements 239
δ  discrete distance 221
δA  identifier of a set A 39
∆  generic symbol for a set of discrete variables 66
θ  vector parameter in a statistical model MG, G ∈ DAGS(N) 165
θijk  single parameter in a statistical model MG, G ∈ DAGS(N) 165
θ̂, θ̂ijk  parameters of a maximum likelihood estimate in MG 167
Θ  generic symbol for a set of parameters of a statistical model 244
ΘG  the set of parameters of MG, G ∈ DAGS(N) 165
ι  reflection operation 197
incidence relation in a formal context 102
λ  the Lebesgue measure 227
µ  generic symbol for a non-negative measure 227
µA  dominating measure for a marginal (see Convention 2) 20
µ-a.e.  µ-almost everywhere 227
ξ, ξ  generic symbol for a random variable and a random vector 238
𝒪  set of objects in a formal context 102
π  the Ludolph constant
π  generic symbol for a permutation on a finite set 216
π, πG  generic symbol for a prior probability measure 168
ρ  generic symbol for a distance in a metric space 221
Euclidean distance 222
σ-  (= sigma) generic symbol for a countably infinite operation
σ(A)  σ-algebra generated by a class of sets A 226
Σ = (σij)i,j∈N  generic symbol for a covariance matrix and its elements 238
Σ̂ = (σ̂ij)i,j∈N  generic symbol for a sample covariance matrix 242
Σ⁻¹  inverse of a matrix Σ 236
Σ⁻  generalized inverse of a matrix Σ 237
ΣA·B  submatrix of a matrix Σ 236
ΣA|B  Schur complement 237
Σ⊤, v⊤  transpose of a matrix Σ and of a vector v 236
ς  generic symbol for an isomorphism of measurable spaces 227
υ  counting measure 227
φ  generic symbol for an order-isomorphism of posets 217
χA  indicator of a set A 226
χ²  (= chi-square) traditional symbol for a certain distribution 248
Ψ  distribution framework (a class of measures over N) 111
Ψ(u)  class of Markovian measures with respect to u in Ψ 113
Ω  "universum" of elementary events 238
℘  generic symbol for a set of independence equivalence classes 127
generic symbol for an independence equivalence class 127
Symbols in alphabetic order

A  symbol for a class of sets (σ-algebra) 226
A  generic symbol for a set of arrows 219
A  generic symbol for an alternative in statistical testing 246
−A  set {−x ; x ∈ A} 222
A⊥  orthogonal complement of a set A 223
A∗  dual cone to a set A 224
⟨A, B|C⟩  disjoint triplet over N 12
A ⊥⊥ B | C [G]  CI statement represented in a graph G 43 (48, 53)
A ⊥⊥ B | C [o]  (conditional) independence statement 12
A ⊥̸⊥ B | C [o]  (conditional) dependence statement 12
A ⊥⊥ B | C [P]  CI statement with respect to a probability measure P 10
A ⊥⊥ B | C [P]  conditional independence for σ-algebras 234
A ⊥⊥ B | C [u]  CI statement represented in a structural imset u 78
⟨a, b|K⟩  elementary triplet over N 15
a ⊥⊥ b | K  elementary independence statement over N 15
a.e.  almost everywhere 227
AIC(M, D)  Akaike's information criterion 246
anG(A)  set of ancestors of a set of nodes A in a graph G 220
as(M, N)  ascetic extension of a structural model M to N 202
BIC(M, D)  Bayesian information criterion 246
Bⁿ  Borel σ-algebra on Rⁿ, n ≥ 1 227
C  [occasionally] set of cliques of an undirected graph 55
C(N)  class of combinatorial imsets over N 72
cl  generic symbol for a closure operation 218
clU(N)  structural closure 143
con(B)  conical closure of a set B 224
CONT(N, d)  collection of contingency tables over N induced by databases of the length d 242
ctA[D]  marginal contingency table for A induced by a database D 243
d  conventional symbol for database length 242
D↑, D↓  induced ascending and descending class of sets 215–216
DA  projection of a database D onto a set of variables A 170
DAGS(N)  collection of acyclic directed graphs over N 163
DATA(N, d)  collection of databases over N of the length d 242
deg(u, l), deg(u)  (level-)degree of a combinatorial imset u 72
det(Σ)  determinant of a matrix Σ 236
dij, dijk, d[x]  [Chapter 8] numbers of configuration occurrences 164
DIM(M)  effective dimension of a statistical model M 246
Dmax, Dmin  classes of maximal and minimal sets of a class D ⊆ P(N) 216
Du+, Du−  positive and negative domain of an imset u 39
Du∗  effective domain of a structural imset u 124
dν/dµ  Radon-Nikodym derivative (density) of ν with respect to µ 228
e  generic symbol for an expectation vector 238
ê  generic symbol for a sample expectation vector 242
E(N), El(N)  class of elementary imsets over N (of level l) 70
exp  symbol for the exponential function
f+, f−  positive and negative part of a function f 226
fA, f∅  generic symbol for a marginal density 20
f↓A  projection of a density f (for a set A) 20
fA|C  generic symbol for a conditional density 164
fe,Σ  density of a regular Gaussian measure N(e, Σ) 240
fr  density of the χ²-distribution with r degrees of freedom 248
f(x, θ)  generic notation of a density in a statistical model 244
fθ  [Chapter 8] density ascribed to a vector parameter θ 165
gra(N)  grade of a set of variables N 122
gra∗(N)  modified grade of a set of variables N 123
GT  induced subgraph of G for T 219
GT  marginal undirected graph 46
H  generic symbol for a hypothesis in statistical testing 246
H(N)  minimal integral Hilbert basis of the cone con(E(N)) 121
H(P|µ)  relative entropy of P with respect to µ 230
H(P|µ : Q)  Q-perturbated relative entropy of P with respect to µ 67
hP,µ  entropy function of P relative to µ 83
Hu  coportrait of a structural imset 150
i  [Chapter 8] conventional index indicating variables (nodes) 164
I  unit matrix 236
inf  infimum, the greatest lower bound 217
j  [Chapter 8] conventional index for parent configurations 164
j(i, x)  [Chapter 8] code of a parent configuration 164
k  [Chapter 8] conventional index for node configurations 164
k(i, x)  [Chapter 8] code of a node configuration 164
Kℓ(N)  class of ℓ-standardized supermodular functions over N 92
Kℓ(N)  the ℓ-skeleton 93
K(N)  class of supermodular functions over N 87
K(N)  conical closure of discrete multiinformation functions over N 190
Ko(N), Ku(N)  the o-skeleton, the u-skeleton 97
L  generic symbol for a set of lines 219
ℓ-  generic symbol for "lower" standardization 40
LML(G, D)  logarithm of the marginal likelihood 169
ln  symbol for the (natural) logarithm
L(N)  the class of modular functions over N 90
L∗(N)  [Chapter 8] a class of special modular functions over N 173
L(S)  auxiliary notation in Remark 7.1 133
Lu  the lower class of a structural imset u 73
L(θ, D)  generic notation of a likelihood function 245
M  generic symbol for a formal independence model 12
M  generic symbol for a statistical model 244
mA↓, mA↑  identifiers of classes of subsets and supersets of a set A 39
max, Dmax  maximum, the class of maximal sets in D 216
MG  independence model induced by a graph G 43 (48, 53)
MG  statistical model given by G ∈ DAGS(N) 163
min, Dmin  minimum, the class of minimal sets in D 216
ml, m∗  (level-)degree detectors 70
mℓ, mu, mo  elements of the ℓ-, u- and o-skeleton corresponding to m 98–99
MLL(M, D)  maximized log-likelihood criterion 246
Mm  independence model produced by a supermodular function m 88
Mo  independence model induced by an object of discrete mathematics o 12
mP  multiinformation function induced by P 27
MP  independence model induced by a probability measure P 14
MT  restriction of a model M to T 12
Mu  independence model induced by a structural imset u 79
mπ  composition of a function m and a permutation π 216
m†  special multiset utilized in Example 4.1 (Figure 4.3) 82
m◦  special multiset utilized in Example 6.3 (Figure 6.3) 117
N  generic symbol for a non-empty finite set of variables (factors) 9
N  set of natural numbers 9
nei(G)  [Chapter 8] class of neighboring graphs for a graph G 162
N(N)  set from Theorem 5.1 95
N(e, Σ)  Gaussian measure with expectation e and covariance matrix Σ 239
N(r, s)  one-dimensional Gaussian measure 239
o-  generic symbol for "orthogonal" standardization 40
P  generic symbol for a probability measure over N 9
P  underlying theoretical probability 238
P̂, p̂A  empirical measure and its marginal density for A 243
PA  marginal of a measure P for a set A 9
PA  restriction of a probability measure P to a σ-algebra A 232
P[A], P[L]  discrete measures constructed in Section 2.3.7 38
p̂⟨a,b|C⟩  fitted empirical measure for a triplet ⟨a, b|C⟩ 243
PA|C  conditional probability on XA given C 10
P̄A|C  [in the Gaussian case] regular version of conditional probability on XA given C 31
P-a.e. (µ-a.e.)  almost everywhere with respect to P (µ) 227
paG(b)  set of parents of a node b in a graph G 220
P(B|A)  conditional probability of a set B given a σ-algebra A with respect to P 232
P_pri^max  class of prime components of a graph G 204
P(X), P(N)  power set of the set X, respectively N 215
Pθ  probability measure from MG given by a vector parameter θ 165
Q  generic symbol for a quality criterion 163
Q, Qⁿ  set of rational numbers 9, rational vectors 224
qi|B, q̄i|B  [Chapter 8] components of a (strongly) decomposable criterion 170
q(i, G)  [Chapter 8] number of parent configurations for i ∈ N in G 164
Q^{S,d}_θ  distribution of a statistic S given a parameter θ 247
R, Rⁿ  set of real numbers 9, real vectors 222
(R, B), (Rⁿ, Bⁿ)  the space of real numbers (vectors) with the Borel σ-algebra 227
r(i)  [Chapter 8] number of node configurations for i ∈ N 164
R^{P(N)}  collection of real functions on the power set P(N)
Ru  range of a structural imset u 74
Ru  region of a structural imset u 124
Rx  ray generated by a vector x 225
S  [occasionally] set of separators of a triangulated graph 55
Sℓ(N)  class of ℓ-standardized set functions 91
S(N)  class of structural imsets over N 73
So(N)  class of o-standardized set functions 91
sQ  [Chapter 8] saturating function of a quality criterion Q 185
Su(N)  class of u-standardized set functions 91
sup  supremum, the least upper bound 216
SΨ(N)  class of Ψ-representable structural imsets over N 127
T(N)  class of disjoint triplets over N 12
tA, t̄A  [Chapter 8] components of a (strongly) regular criterion 171–172
tQ  [Chapter 8] ℓ-standardized transformation of data relative to a criterion Q 185
[tQ_D(A)]A⊆N  [Chapter 8] data vector relative to Q 185
Tø(N), T(N)  classes of trivial and elementary disjoint triplets over N 15
u-  generic symbol for "upper" standardization 40
u+, u−  positive and negative part of an imset u 39
u⟨a,b|K⟩  elementary imset 69
u⟨A,B|C⟩  semi-elementary imset 71
uG, uH  standard imsets for graphs G and H 135, 137
U(N)  class of structural independence models over N 104
Uu  upper class of a structural imset u 73
U(x, ε), Uρ(x, ε)  open ball with center x and radius ε 221
w(S)  [occasionally] multiplicity of a separator S 55
xA  projection of a configuration x onto a set A 20
XA  generic symbol for a sample space for A 9
XA, X̄A  product σ-algebra for A 9, coordinate σ-algebra 10
(XA, XA)  conventional shortened notation 9
(X, X)  generic symbol for a measurable space 226
X∅  trivial σ-algebra 226
Galois connection 102
yik  [Chapter 8] k-th node configuration for i ∈ N 164
Z, Z+  the set of integers, the set of non-negative integers 9
zij  [Chapter 8] j-th parent configuration for i ∈ N 164
Z^{P(N)}  the class of imsets over N 39
C List of Lemmas, Propositions etc.
Conventions
Convention 1  p. 9
Convention 2  p. 20
Convention 3  p. 77
Convention 4  p. 164

Corollaries
Corollary 2.1  p. 24
Corollary 2.2  p. 27
Corollary 2.3  p. 32
Corollary 2.4  p. 33
Corollary 2.5  p. 34
Corollary 2.6  p. 35
Corollary 2.7  p. 38
Corollary 4.1  p. 69
Corollary 4.2  p. 73
Corollary 4.3  p. 81
Corollary 4.4  p. 82
Corollary 5.1  p. 93
Corollary 5.2  p. 95
Corollary 5.3  p. 101
Corollary 6.1  p. 118
Corollary 6.2  p. 120
Corollary 6.3  p. 122
Corollary 6.4  p. 122
Corollary 6.5  p. 128
Corollary 7.1  p. 136
Corollary 7.2  p. 140
Corollary 7.3  p. 142
Corollary 7.4  p. 145
Corollary 7.5  p. 149
Corollary 8.1  p. 167
Corollary 8.2  p. 168
Corollary 8.3  p. 174
Corollary 8.4  p. 182
Corollary 8.5  p. 185
Corollary 8.6  p. 187
Corollary A.1  p. 231
Corollary A.2  p. 235

Directions
Direction 1  p. 192
Direction 2  p. 206
Direction 3  p. 208
Direction 4  p. 209
Direction 5  p. 209
Direction 6  p. 210
Direction 7  p. 210
Direction 8  p. 211
Direction 9  p. 213
Examples
Example 2.1  p. 34
Example 2.2  p. 35
Example 2.3  p. 35
Example 3.1  p. 50
Example 4.1  p. 81
Example 5.1  p. 107
Example 6.1  p. 112
Example 6.2  p. 113
Example 6.3  p. 116
Example 6.4  p. 119
Example 6.5  p. 125
Example 7.1  p. 132
Example 7.2  p. 136
Example 7.3  p. 141
Example 7.4  p. 144
Example 7.5  p. 146
Example 9.1  p. 192
Example 9.2  p. 192
Example 9.3  p. 198
Example 9.4  p. 204
Example 9.5  p. 212
Example A.1  p. 220
Example A.2  p. 220
Example A.3  p. 224
Lemmas
Lemma 2.1  p. 14
Lemma 2.2  p. 15
Lemma 2.3  p. 19
Lemma 2.4  p. 20
Lemma 2.5  p. 23
Lemma 2.6  p. 25
Lemma 2.7  p. 28
Lemma 2.8  p. 31
Lemma 2.9  p. 36
Lemma 2.10  p. 38
Lemma 4.1  p. 67
Lemma 4.2  p. 68
Lemma 4.3  p. 74
Lemma 4.4  p. 77
Lemma 4.5  p. 79
Lemma 4.6  p. 79
Lemma 4.7  p. 80
Lemma 5.1  p. 89
Lemma 5.2  p. 90
Lemma 5.3  p. 92
Lemma 5.4  p. 93
Lemma 5.5  p. 94
Lemma 5.6  p. 94
Lemma 5.7  p. 100
Lemma 6.1  p. 115
Lemma 6.2  p. 118
Lemma 6.3  p. 121
Lemma 6.4  p. 123
Lemma 6.5  p. 124
Lemma 6.6  p. 127
Lemma 7.1  p. 136
Lemma 7.2  p. 139
Lemma 7.3  p. 140
Lemma 7.4  p. 141
Lemma 7.5  p. 143
Lemma 7.6  p. 148
Lemma 8.1  p. 165
Lemma 8.2  p. 167
Lemma 8.3  p. 172
Lemma 8.4  p. 173
Lemma 8.5  p. 180
Lemma 8.6  p. 181
Lemma 8.7  p. 184
Lemma A.1  p. 225
Lemma A.2  p. 225
Lemma A.3  p. 226
Lemma A.4  p. 231
Lemma A.5  p. 232
Lemma A.6  p. 234
Lemma A.7  p. 235
Propositions
Proposition 2.1  p. 29
Proposition 2.2  p. 37
Proposition 2.3  p. 37
Proposition 4.1  p. 70
Proposition 4.2  p. 71
Proposition 4.3  p. 72
Proposition 4.4  p. 73
Proposition 4.5  p. 74
Proposition 4.6  p. 77
Proposition 4.7  p. 79
Proposition 4.8  p. 79
Proposition 4.9  p. 84
Proposition 5.1  p. 87
Proposition 5.2  p. 88
Proposition 5.3  p. 89
Proposition 5.4  p. 90
Proposition 5.5  p. 92
Proposition 5.6  p. 99
Proposition 5.7  p. 107
Proposition 5.8  p. 109
Proposition 6.1  p. 113
Proposition 6.2  p. 121
Proposition 7.1  p. 132
Proposition 7.2  p. 142
Proposition 7.3  p. 143
Proposition 7.4  p. 146
Proposition 8.1  p. 176
Proposition 8.2  p. 176
Proposition 8.3  p. 181
Proposition 8.4  p. 186
Proposition 9.1  p. 190
Proposition 9.2  p. 197
Proposition 9.3  p. 199
Proposition 9.4  p. 202
Questions
Question 1  p. 189
Question 2  p. 190
Question 3  p. 191
Question 4  p. 194
Question 5  p. 195
Question 6  p. 197
Question 7  p. 207
Question 8  p. 208
Question 9  p. 209
Question 10  p. 210
Remarks
Remark 2.1  p. 10
Remark 2.2  p. 11
Remark 2.3  p. 12
Remark 2.4  p. 13
Remark 2.5  p. 14
Remark 2.6  p. 16
Remark 2.7  p. 17
Remark 2.8  p. 19
Remark 2.9  p. 20
Remark 2.10  p. 24
Remark 2.11  p. 26
Remark 2.12  p. 33
Remark 2.13  p. 39
Remark 3.1  p. 44
Remark 3.2  p. 45
Remark 3.3  p. 46
Remark 3.4  p. 49
Remark 3.5  p. 50
Remark 3.6  p. 53
Remark 3.7  p. 54
Remark 3.8  p. 59
Remark 4.1  p. 66
Remark 4.2  p. 69
Remark 4.3  p. 76
Remark 4.4  p. 83
Remark 5.1  p. 88
Remark 5.2  p. 88
Remark 5.3  p. 91
Remark 5.4  p. 96
Remark 5.5  p. 96
Remark 5.6  p. 97
Remark 5.7  p. 101
Remark 5.8  p. 103
Remark 5.9  p. 106
Remark 5.10  p. 107
Remark 5.11  p. 109
Remark 6.1  p. 112
Remark 6.2  p. 114
Remark 6.3  p. 115
Remark 6.4  p. 117
Remark 6.5  p. 118
Remark 6.6  p. 121
Remark 6.7  p. 123
Remark 6.8  p. 125
Remark 6.9  p. 128
Remark 6.10  p. 128
Remark 7.1  p. 132
Remark 7.2  p. 136
Remark 7.3  p. 137
Remark 7.4  p. 139
Remark 7.5  p. 140
Remark 7.6  p. 143
Remark 7.7  p. 146
Remark 7.8  p. 149
Remark 7.9  p. 150
Remark 7.10  p. 154
Remark 8.1  p. 157
Remark 8.2  p. 160
Remark 8.3  p. 162
Remark 8.4  p. 166
Remark 8.5  p. 168
Remark 8.6  p. 170
Remark 8.7  p. 171
Remark 8.8  p. 175
Remark 8.9  p. 177
Remark 8.10  p. 178
Remark 8.11  p. 182
Remark 8.12  p. 185
Remark 8.13  p. 186
Remark 8.14  p. 187
Remark 8.15  p. 188
Remark 9.1  p. 194
Remark 9.2  p. 196
Remark 9.3  p. 197
Remark 9.4  p. 199
Remark 9.5  p. 205
Remark A.1  p. 233
Remark A.2  p. 238
Remark A.3  p. 244
Remark A.4  p. 245
Remark A.5  p. 247
Themes
Theme 1  p. 191
Theme 2  p. 193
Theme 3  p. 193
Theme 4  p. 195
Theme 5  p. 195
Theme 6  p. 199
Theme 7  p. 199
Theme 8  p. 203
Theme 9  p. 206
Theme 10  p. 207
Theme 11  p. 207
Theme 12  p. 208
Theme 13  p. 208
Theme 14  p. 210
Theme 15  p. 210
Theme 16  p. 211
Theme 17  p. 211
Theorems
Theorem 4.1  p. 84
Theorem 5.1  p. 95
Theorem 5.2  p. 101
Theorem 5.3  p. 104
Theorem A.1  p. 228
Theorem A.2  p. 229
Theorem A.3  p. 230
References
1. M. Aigner: Combinatorial Theory, Springer-Verlag 1979.
2. H. Akaike: A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (1974), pp. 716–722.
3. Z. An, D. A. Bell, J. G. Hughes: On the axiomatization of conditional independence, Kybernetes 21 (1992), n. 7, pp. 48–58.
4. J. Anděl: Mathematical Statistics (in Czech), SNTL (Prague) 1985.
5. S. A. Andersson, M. D. Perlman: Lattice models for conditional independence in multivariate normal distributions, Annals of Statistics 21 (1993), pp. 1318–1358.
6. S. A. Andersson, D. Madigan, M. D. Perlman: A characterization of Markov equivalence classes for acyclic digraphs, Annals of Statistics 25 (1997), n. 2, pp. 505–541.
7. S. A. Andersson, D. Madigan, M. D. Perlman: On the Markov equivalence classes for chain graphs, undirected graphs and acyclic digraphs, Scandinavian Journal of Statistics 24 (1997), n. 1, pp. 81–102.
8. S. A. Andersson, D. Madigan, M. D. Perlman, C. M. Triggs: A graphical characterization of lattice conditional independence models, Annals of Mathematics and Artificial Intelligence 21 (1997), pp. 27–50.
9. S. A. Andersson, D. Madigan, M. D. Perlman: Alternative Markov properties for chain graphs, Scandinavian Journal of Statistics 28 (2001), n. 1, pp. 33–85.
10. G. Birkhoff: Lattice Theory – third edition, AMS Colloquium Publications 25, 1995.
11. P. Boček: SGPOKUS, a computer program, Institute of Information Theory and Automation, June 1994.
12. P. Boček: GENERATOR, a computer program, Institute of Information Theory and Automation, March 2001.
13. R. R. Bouckaert: IDAGs: a perfect map for any distribution, in Symbolic and Quantitative Approaches to Reasoning and Uncertainty (M. Clarke, R. Kruse, S. Moral eds.), Lecture Notes in Computer Science 747, Springer-Verlag 1993, pp. 49–56.
14. R. R. Bouckaert: Bayesian belief networks – from construction to inference, PhD thesis, University of Utrecht 1995.
15. R. R. Bouckaert, M. Studený: Chain graphs: semantics and expressiveness, in Symbolic and Quantitative Approaches to Reasoning and Uncertainty (C. Froidevaux, J. Kohlas eds.), Lecture Notes in Artificial Intelligence 946, Springer-Verlag 1995, pp. 67–76.
16. A. Brøndsted: An Introduction to Convex Polytopes, Springer-Verlag 1983 (Russian translation, Mir 1988).
17. L. M. de Campos, J. F. Huete, S. Moral: Probability intervals, a tool for uncertain reasoning, a technical report DECSAI-93206, July 1993, University of Granada.
18. L. M. de Campos: Independency relationships in possibility theory and their application to learning belief networks, in Mathematical and Statistical Methods in Artificial Intelligence (G. Della Riccia, R. Kruse, R. Viertl eds.), Springer-Verlag 1995, pp. 119–130.
19. L. M. de Campos: Characterization of decomposable dependency models, Journal of Artificial Intelligence Research 5 (1996), pp. 289–300.
20. R. Castelo: The discrete acyclic digraph Markov model in data mining, PhD thesis, University of Utrecht 2002.
21. J. Cheng, R. Greiner: Comparing Bayesian network classifiers, in Uncertainty in Artificial Intelligence 15 (K. B. Laskey, H. Prade eds.), Morgan Kaufmann 1999, pp. 101–107.
22. D. M. Chickering: A transformational characterization of equivalent Bayesian network structures, in Uncertainty in Artificial Intelligence 11 (P. Besnard, S. Hanks eds.), Morgan Kaufmann 1995, pp. 87–98.
23. D. M. Chickering: Optimal structure identification with greedy search, Journal of Machine Learning Research 3 (2002), pp. 507–554.
24. G. F. Cooper, E. Herskovits: A Bayesian method for induction of probabilistic networks from data, Machine Learning 9 (1992), pp. 309–341.
25. T. M. Cover, J. A. Thomas: Elements of Information Theory, John Wiley 1991.
26. R. G. Cowell, A. P. Dawid, S. L. Lauritzen, D. J. Spiegelhalter: Probabilistic Networks and Expert Systems, Springer-Verlag 1999.
27. D. R. Cox, D. V. Hinkley: Theoretical Statistics, Chapman and Hall 1982.
28. D. R. Cox, N. Wermuth: Multivariate Dependencies – Models, Analysis and Interpretation, Chapman and Hall 1996.
29. A. P. Dawid: Conditional independence in statistical theory, Journal of the Royal Statistical Society B 41 (1979), n. 1, pp. 1–31.
30. A. P. Dawid, S. L. Lauritzen: Hyper Markov laws in the statistical analysis of decomposable graphical models, Annals of Statistics 21 (1993), pp. 1272–1317.
31. A. P. Dawid: Conditional independence, in Encyclopedia of Statistical Science Update, Vol. 2 (S. Kotz, C. B. Read, D. L. Banks eds.), John Wiley 1999, pp. 146–155.
32. A. P. Dawid, M. Studený: Conditional products: an alternative approach to conditional independence, in Artificial Intelligence and Statistics 99, Proceedings of the 7th Workshop (D. Heckerman, J. Whittaker eds.), Morgan Kaufmann 1999, pp. 32–40.
33. A. P. Dawid: Separoids: a mathematical framework for conditional independence and irrelevance, Annals of Mathematics and Artificial Intelligence 32 (2001), n. 1–4, pp. 335–372.
34. A. P. Dempster: Covariance selection, Biometrics 28 (1972), pp. 157–175.
35. R. Faure, E. Heurgon: Structures Ordonnées et Algèbres de Boole (in French), Gauthier-Villars 1971 (Czech translation, Akademia 1984).
36. M. Fiedler: Special Matrices and their Use in Numerical Mathematics (in Czech), SNTL (Prague) 1981.
37. J.-P. Florens, M. Mouchart, J.-M. Rolin: Elements of Bayesian Statistics, Marcel Dekker 1990.
38. N. Friedman, D. Geiger, M. Goldszmidt: Bayesian network classifier, Machine Learning 29 (1997), pp. 131–163.
39. M. Frydenberg: The chain graph Markov property, Scandinavian Journal of Statistics 17 (1990), n. 4, pp. 333–353.
40. M. Frydenberg: Marginalization and collapsibility in graphical interaction models, Annals of Statistics 18 (1990), n. 2, pp. 790–805.
41. L. C. van der Gaag, J.-J. Ch. Meyer: Informational independence, models and normal forms, International Journal of Intelligent Systems 13 (1998), n. 1, pp. 83–109.
42. B. Ganter, R. Wille: Formal Concept Analysis – Mathematical Foundations, Springer-Verlag 1999.
43. D. Geiger, T. Verma, J. Pearl: Identifying independence in Bayesian networks, Networks 20 (1990), n. 5, pp. 507–534.
44. D. Geiger, J. Pearl: On the logic of causal models, in Uncertainty in Artificial Intelligence 4 (R. D. Shachter, T. S. Lewitt, L. N. Kanal, J. F. Lemmer eds.), North-Holland 1990, pp. 3–14.
45. D. Geiger, A. Paz, J. Pearl: Axioms and algorithms for inferences involving probabilistic independence, Information and Computation 91 (1991), n. 1, pp. 128–141.
46. D. Geiger, J. Pearl: Logical and algorithmic properties of conditional independence and graphical models, Annals of Statistics 21 (1993), n. 4, pp. 2001–2021.
47. D. Geiger, A. Paz, J. Pearl: On testing whether an embedded Bayesian network represents a probability model, in Uncertainty in Artificial Intelligence 10 (R. L. de Mantaras, D. Poole eds.), Morgan Kaufmann 1994, pp. 244–252.
48. P. Giudici, P. J. Green: Decomposable graphical Gaussian determination, Biometrika 86 (1999), pp. 785–801.
49. P. Hájek, T. Havránek, R. Jiroušek: Uncertain Information Processing in Expert Systems, CRC Press 1992.
50. P. R. Halmos: Finite-Dimensional Vector Spaces, Springer-Verlag 1973.
51. D. Heckerman: A tutorial on learning Bayesian networks, technical report MSR-TR-95-06, Microsoft Research, Redmond, March 1995.
52. R. Jiroušek: Solution of the marginal problem and decomposable distributions, Kybernetika 27 (1991), pp. 403–412.
53. K. G. Jöreskog, D. Sörbom: LISREL 7 – A Guide to the Program and Application, SPSS Inc. 1989.
54. G. Kauermann: On a dualization of graphical Gaussian models, Scandinavian Journal of Statistics 23 (1996), n. 1, pp. 105–116.
55. H. G. Kellerer: Verteilungsfunktionen mit gegebenen Marginalverteilungen (in German), Z. Wahrscheinlichkeitstheorie 3 (1964), pp. 247–270.
56. H. Kiiveri, T. P. Speed, J. B. Carlin: Recursive causal models, Journal of Australian Mathematical Society A 36 (1984), pp. 30–52.
57. T. Kočka: Graphical models – learning and applications, PhD thesis, University of Economics Prague (Czech Republic) 2001.
58. T. Kočka, R. R. Bouckaert, M. Studený: On the inclusion problem, research report n. 2010, Institute of Information Theory and Automation, Prague, February 2001.
59. J. T. A. Koster: Gibbs factorization and the Markov property, unpublished manuscript.
60. J. T. A. Koster: Markov properties of nonrecursive causal models, Annals of Statistics 24 (1996), n. 5, pp. 2148–2177.
61. J. T. A. Koster: Marginalizing and conditioning in graphical models, Bernoulli 8 (2002), n. 6, pp. 814–840.
62. I. Kramosil: A note on non-axiomatizability of independence relations generated by certain probabilistic structures, Kybernetika 24 (1988), n. 2, pp. 439–446.
63. W. Lam, F. Bacchus: Learning Bayesian belief networks, an approach based on the MDL principle, Computational Intelligence 10 (1994), pp. 269–293.
64. S. L. Lauritzen, N. Wermuth: Mixed interaction models, research report R-84-8, Inst. Elec. Sys., University of Aalborg 1984. Note that this report was later modified and became a basis of the paper [67].
65. S. L. Lauritzen, T. P. Speed, K. Vijayan: Decomposable graphs and hypergraphs, Journal of Australian Mathematical Society A 36 (1984), n. 1, pp. 12–29.
66. S. L. Lauritzen, D. J. Spiegelhalter: Local computation with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society B 50 (1988), n. 2, pp. 157–224.
67. S. L. Lauritzen, N. Wermuth: Graphical models for associations between variables, some of which are qualitative and some quantitative, Annals of Statistics 17 (1989), n. 1, pp. 31–57.
68. S. L. Lauritzen: Mixed graphical association models, Scandinavian Journal of Statistics 16 (1989), n. 4, pp. 273–306.
69. S. L. Lauritzen, A. P. Dawid, B. N. Larsen, H.-G. Leimer: Independence properties of directed Markov fields, Networks 20 (1990), n. 5, pp. 491–505.
70. S. L. Lauritzen: Graphical Models, Clarendon Press 1996.
71. E. L. Lehmann: Testing Statistical Hypotheses, John Wiley 1957.
72. H.-G. Leimer: Optimal decomposition by clique separators, Discrete Mathematics 113 (1993), pp. 99–123.
73. M. Levitz, M. D. Perlman, D. Madigan: Separation and completeness properties for AMP chain graph Markov models, Annals of Statistics 29 (2001), n. 6, pp. 1751–1784.
74. M. Loève: Probability Theory, Foundations, Random Processes, D. van Nostrand 1955.
75. D. Madigan, J. York: Bayesian graphical models for discrete data, International Statistical Review 63 (1995), pp. 215–232.
76. F. M. Malvestuto: Theory of random observables in relational data bases, Information Systems 8 (1983), n. 4, pp. 281–289.
77. F. M. Malvestuto: A unique formal system for binary decomposition of database relations, probability distributions and graphs, Information Sciences 59 (1992), pp. 21–52. + F. M. Malvestuto, M. Studený: Comment on "A unique formal ... graphs", Information Sciences 63 (1992), pp. 1–2.
78. J. L. Massey: Causal interpretation of random variables (in Russian), Problemy Peredachi Informatsii 32 (1996), n. 1, pp. 112–116.
79. F. Matúš: Ascending and descending conditional independence relations, in Information Theory, Statistical Decision Functions and Random Processes, Transactions of the 11th Prague Conference, Vol. B (S. Kubík, J. Á. Víšek eds.), Kluwer 1992, pp. 189–200.
80. F. Matúš: On equivalence of Markov properties over undirected graphs, Journal of Applied Probability 29 (1992), n. 3, pp. 745–749.
81. F. Matúš: Probabilistic conditional independence structures and matroid theory, backgrounds, International Journal of General Systems 22 (1994), n. 2, pp. 185–196.
82. F. Matúš: Stochastic independence, algebraic independence and abstract connectedness, Theoretical Computer Science A 134 (1994), n. 2, pp. 445–471.
83. F. Matúš: On the maximum-entropy extensions of probability measures over undirected graphs, in Proceedings of WUPES94, September 11–15, 1994, Třešť, Czech Republic, pp. 181–198.
84. F. Matúš, M. Studený: Conditional independences among four random variables I., Combinatorics, Probability and Computing 4 (1995), n. 4, pp. 269–278.
85. F. Matúš: Conditional independences among four random variables II., Combinatorics, Probability and Computing 4 (1995), n. 4, pp. 407–417.
86. F. Matúš: Conditional independence structures examined via minors, Annals of Mathematics and Artificial Intelligence 21 (1997), pp. 99–128.
87. F. Matúš: Conditional independences among four random variables III., final conclusion, Combinatorics, Probability and Computing 8 (1999), n. 3, pp. 269–276.
88. F. Matúš: Lengths of semigraphoid inferences, Annals of Mathematics and Artificial Intelligence 35 (2002), pp. 287–294.
89. C. Meek: Causal inference and causal explanation with background knowledge, in Uncertainty in Artificial Intelligence 11 (P. Besnard, S. Hanks eds.), Morgan Kaufmann 1995, pp. 403–410.
90. C. Meek: Strong completeness and faithfulness in Bayesian networks, in Uncertainty in Artificial Intelligence 11 (P. Besnard, S. Hanks eds.), Morgan Kaufmann 1995, pp. 411–418.
91. C. Meek: Graphical models, selecting causal and statistical models, PhD thesis, Carnegie Mellon University 1997.
92. E. Mendelson: Introduction to Mathematical Logic – second edition, D. van Nostrand 1979.
93. M. Mouchart, J.-M. Rolin: A note on conditional independence with statistical applications, Statistica 44 (1984), n. 4, pp. 557–584.
94. M. Mouchart, J.-M. Rolin: Letter to the editor, Statistica 45 (1985), n. 3, pp. 427–430.
95. J. Moussouris: Gibbs and Markov properties over undirected graphs, Journal of Statistical Physics 10 (1974), n. 1, pp. 11–31.
96. J. Neveu: Bases Mathématiques du Calcul des Probabilités (in French), Masson et Cie 1964.
97. A. Paz, R. Y. Geva, M. Studený: Representation of irrelevance relations by annotated graphs, Fundamenta Informaticae 42 (2000), pp. 149–199.
98. A. Paz: An alternative version of Lauritzen et al's algorithm for checking representation of independencies, Journal of Soft Computing 7 (2003), n. 5, pp. 344–349.
99. J. Pearl, A. Paz: Graphoids, graph-based logic for reasoning about relevance relations, in Advances in Artificial Intelligence II (B. Du Boulay, D. Hogg, L. Steels eds.), North-Holland 1987, pp. 357–363.
100. J. Pearl: Probabilistic Reasoning in Intelligent Systems – Networks of Plausible Inference, Morgan Kaufmann 1988.
101. A. Perez: ε-admissible simplifications of the dependence structure of a set of random variables, Kybernetika 13 (1979), pp. 439–449.
102. M. D. Perlman, L. Wu: Lattice conditional independence models for contingency tables with non-monotone missing data pattern, Journal of Statistical Planning 79 (1999), pp. 259–287.
103. C. van Putten, J. H. van Schuppen: Invariance properties of conditional independence relation, Annals of Probability 13 (1985), n. 3, pp. 934–945. Note that Mouchart and Rolin claim in [94] that most of the results of [103] are almost identical with the results of their paper [93]. The aim of the note [94] is to emphasize the priority of its authors in achieving those results.
104. C. R. Rao: Linear Statistical Inference and Its Applications, John Wiley 1965.
105. T. S. Richardson: A polynomial-time algorithm for deciding Markov equivalence of directed cyclic graphical models, in Uncertainty in Artificial Intelligence 12 (E. Horvitz, F. Jensen eds.), Morgan Kaufmann 1996, pp. 462–469.
106. T. S. Richardson: A discovery algorithm for directed cyclic graphs, in Uncertainty in Artificial Intelligence 12 (E. Horvitz, F. Jensen eds.), Morgan Kaufmann 1996, pp. 454–461.
107. T. Richardson, P. Spirtes: Ancestral graph Markov models, Annals of Statistics 30 (2002), n. 4, pp. 962–1030.
108. T. Richardson: Markov properties for acyclic directed mixed graphs, Scandinavian Journal of Statistics 30 (2003), n. 1, pp. 145–157.
109. J. Rosenmüller, H. G. Weidner: Extreme convex set functions with finite carrier – general theory, Discrete Mathematics 10 (1974), n. 3–4, pp. 343–382.
110. A. Roverato: A unified approach to the characterisation of equivalence classes of DAGs, chain graphs with no flags and chain graphs, to appear in Scandinavian Journal of Statistics (2005).
111. W. Rudin: Real and Complex Analysis, McGraw-Hill 1974.
112. Y. Sagiv, S. F. Walecka: Subset dependencies and completeness result for a subclass of embedded multivalued dependencies, Journal of Association for Computing Machinery 29 (1982), n. 1, pp. 103–117.
113. A. Schrijver: Theory of Linear and Integer Programming, John Wiley 1986.
114. G. Schwarz: Estimating the dimension of a model, Annals of Statistics 6 (1978), pp. 461–464.
115. G. Shafer: Probabilistic Expert Systems, CBMS-NSF Regional Conference Series in Applied Mathematics 67, SIAM 1996.
116. L. S. Shapley: Cores of convex games, International Journal of Game Theory 1 (1971/1972), pp. 11–26.
117. P. P. Shenoy: Conditional independence in valuation-based systems, International Journal of Approximate Reasoning 10 (1994), n. 3, pp. 203–234.
118. A. N. Shiryayev: Probability (a translation from Russian), Springer-Verlag 1984.
119. P. Šimeček, R. Lněnička, private communication, February 2004.
120. T. P. Speed: A note on nearest-neighbour Gibbs and Markov probabilities, Sankhyā A 41 (1979), pp. 184–197.
121. D. J. Spiegelhalter, S. L. Lauritzen: Sequential updating of conditional probabilities on directed graphical structures, Networks 20 (1990), n. 5, pp. 579–605.
122. P. Spirtes, C. Glymour, R. Scheines: Causation, Prediction and Search, Lecture Notes in Statistics 81, Springer-Verlag 1993.
123. P. Spirtes: Directed cyclic graphical representations of feedback models, in Uncertainty in Artificial Intelligence 11 (P. Besnard, S. Hanks eds.), Morgan Kaufmann 1995, pp. 491–498.
124. W. Spohn: Stochastic independence, causal independence and shieldability, Journal of Philosophical Logic 9 (1980), n. 1, pp. 73–99.
125. W. Spohn: On the properties of conditional independence, in Patrick Suppes, Scientific Philosopher, Vol. 1, Probability and Probabilistic Causality (P. Humphreys ed.), Kluwer 1994, pp. 173–196.
126. J. Štěpán: Probability Theory – Mathematical Foundations (in Czech), Academia (Prague) 1987.
127. Š. Štěpánová: Equivalence of chain graphs (in Czech), diploma thesis, Faculty of Mathematics and Physics, Charles University Prague 2003.
128. R. Strausz: On separoids, PhD thesis, Universidad Nacional Autonoma de Mexico 2004.
129. M. Studený: Asymptotic behaviour of empirical multiinformation, Kybernetika 23 (1987), n. 2, pp. 124–135.
130. M. Studený: Multiinformation and the problem of characterization of conditional independence relations, Problems of Control and Information Theory 18 (1989), n. 1, pp. 3–16.
131. M. Studený: Convex set functions I. and II., research reports n. 1733 and n. 1734, Institute of Information Theory and Automation, Prague, November 1991.
132. M. Studený: Conditional independence relations have no finite complete characterization, in Information Theory, Statistical Decision Functions and Random Processes, Transactions of the 11th Prague Conference, Vol. B (S. Kubík, J. Á. Víšek eds.), Kluwer 1992, pp. 377–396.
133. M. Studený: Formal properties of conditional independence in different calculi of AI, in Symbolic and Quantitative Approaches to Reasoning and Uncertainty (M. Clarke, R. Kruse, S. Moral eds.), Lecture Notes in Computer Science 747, Springer-Verlag 1993, pp. 341–348.
134. M. Studený: Convex cones in finite-dimensional real vector spaces, Kybernetika 29 (1993), n. 2, pp. 180–200.
135. M. Studený: Structural semigraphoids, International Journal of General Systems 22 (1994), n. 2, pp. 207–217.
136. M. Studený, P. Boček: CI-models arising among 4 random variables, in Proceedings of WUPES94, September 11–15, 1994, Třešť, Czech Republic, pp. 268–282.
137. M. Studený: Description of structures of conditional stochastic independence by means of faces and imsets (a series of 3 papers), International Journal of General Systems 23 (1994/1995), n. 2–4, pp. 123–137, 201–219, 323–341.
138. M. Studený: Semigraphoids and structures of probabilistic conditional independence, Annals of Mathematics and Artificial Intelligence 21 (1997), n. 1, pp. 71–98.
139. M. Studený: A recovery algorithm for chain graphs, International Journal of Approximate Reasoning 17 (1997), n. 2–3, pp. 265–293.
140. M. Studený: On marginalization, collapsibility and precollapsibility, in Distributions with Given Marginals and Moment Problems (V. Beneš, J. Štěpán eds.), Kluwer 1997, pp. 191–198.
141. M. Studený, R. R. Bouckaert: On chain graph models for description of conditional independence structures, Annals of Statistics 26 (1998), n. 4, pp. 1434–1495.
142. M. Studený: Bayesian networks from the point of view of chain graphs, in Uncertainty in Artificial Intelligence 14 (G. F. Cooper, S. Moral eds.), Morgan Kaufmann 1998, pp. 496–503.
143. M. Studený: Complexity of structural models, in Prague Stochastics 98, August 23–28, Prague 1998, pp. 521–528.
144. M. Studený, J. Vejnarová: The multiinformation function as a tool for measuring stochastic dependence, in Learning in Graphical Models (M. I. Jordan ed.), Kluwer 1998, pp. 261–298.
145. M. Studený, R. R. Bouckaert, T. Kočka: Extreme supermodular set functions over five variables, research report n. 1977, Institute of Information Theory and Automation, Prague, January 2000.
146. M. Studený: On mathematical description of probabilistic conditional independence structures, DrSc thesis, Institute of Information Theory and Automation, Prague, May 2001.
147. M. Studený: Structural imsets, an algebraic method for describing conditional independence structures, in Proceedings of IPMU 2004 (10th Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems) (B. Bouchon-Meunier, G. Coletti, R. R. Yager eds.), pp. 1323–1330.
148. I. Vajda: Theory of Statistical Inference and Information, Kluwer 1989.
149. J. Vejnarová: Conditional independence in possibility theory, in Proceedings of ISIPTA 99 (1st International Symposium on Imprecise Probabilities and their Applications) (G. de Cooman, F. G. Cozman, S. Moral, P. Walley eds.), pp. 343–351.
150. T. Verma, J. Pearl: Causal networks, semantics and expressiveness, in Uncertainty in Artificial Intelligence 4 (R. D. Shachter, T. S. Lewitt, L. N. Kanal and J. F. Lemmer eds.), North-Holland 1990, pp. 69–76.
151. T. Verma, J. Pearl: Equivalence and synthesis of causal models, in Uncertainty in Artificial Intelligence 6 (P. P. Bonissone, M. Henrion, L. N. Kanal, J. F. Lemmer eds.), Elsevier 1991, pp. 220–227.
152. T. Verma, J. Pearl: An algorithm for deciding if a set of observed independencies has a causal explanation, in Uncertainty in Artificial Intelligence 8 (D. Dubois, M. P. Wellman, B. D'Ambrosio, P. Smets eds.), Morgan Kaufmann 1992, pp. 323–330.
153. M. Volf, M. Studený: A graphical characterization of the largest chain graphs, International Journal of Approximate Reasoning 20 (1999), n. 3, pp. 209–236.
154. S. Watanabe: Information theoretical analysis of multivariate correlation, IBM Journal of Research and Development 4 (1960), pp. 66–81.
155. D. J. A. Welsh: Matroid Theory, Academic Press 1976.
156. N. Wermuth: Analogies between multiplicative models for contingency tables and covariance selection, Biometrics 32 (1976), pp. 95–108.
157. J. Whittaker: Graphical Models in Applied Multivariate Statistics, John Wiley 1990.
158. R. W. Yeung: A First Course in Information Theory, Kluwer 2002.
159. Y. Xiang, S. K. M. Wong, N. Cercone: Critical remarks on single link search in learning belief networks, in Uncertainty in Artificial Intelligence 12 (E. Horvitz, F. Jensen eds.), Morgan Kaufmann 1996, pp. 564–571.
Index
The list of notions in alphabetic order is given here. The reference usually indicates the page containing the definition. Several entries are in italics; these indicate either concepts from the literature which are not studied in detail in this book (they may only be mentioned, without a definition) or vaguely defined concepts.
A
absolute value |x| 9
absolutely continuous measures ν ≪ µ 228
abstract semi-graphoid 14
active route in a directed graph 48
acyclic
  directed graph 220 (46)
  directed mixed graphs 62
  hypergraph 55
  undirected graph (= forest) 220
adequate statistical model 160 (245)
a.e. (= almost everywhere) 227
affine subspace 223
AIC criterion = Akaike's information criterion
  in general 246
  for G ∈ DAGS(N) 168
algebra: σ-algebra 226
(µ-)almost everywhere (a.e.) 227
alternative
  chain graph 60
  in statistical testing 246
ancestor of a node anG(b) 220
ancestral graph 62
annotated graph 60
  annotation algorithm 61
  membership algorithm 61
antecedent 50
antisymmetric binary relation 216
approximate critical value 248
arrow 218
  notation a → b 219
  removal 180
  reversal 49
ascending class of sets 215
ascetic extension as(M, N) 202
asymptotic distribution of a statistic 248
atom of a complete lattice 217
atomic set in an inference rule 50
atomistic lattice 217
attribute (in a formal context) 102
augmentation criterion 60
axiomatic characterization 16 (50)
B backward phase of a simple two step procedure 159 ball U (x, ε) Uρ (x, ε) 221 basis linear 223 Hilbert H(N ) 121 baricentral imset 131 dual baricentral imset 152
274
Index
Bayesian criteria 162 for G ∈ DAGS(N ) (LML) 168–169 information criterion (BIC) in general 246 for G ∈ DAGS(N ) 168 network 46 BIC criterion = Bayesian information criterion 246 binary distribution framework 249 probability measures 191–192 block of a chain 221 blocked route (path) in an acyclic directed graph 48 Borel σ-algebra, sets 226 bubble graph 57
C

canonical
  characteristics of a CG measure 66
  decomposition, triangulation of an undirected graph 205
cardinality |A| 9
  intersection criteria 198
causal
  input list 49
  interpretation 5
CG (= conditional Gaussian) measure
  general, positive 66
CG (= chain graph) model 53
chain
  for a hybrid graph 221
  graph 221
child of a node 220
chi-square (χ²-)distribution 248
chordal undirected graph 55
CI = conditional independence
  interpretation of moves in a local search method 211–213, 186
  model 12
  structure 2
clique (in an undirected graph) 220
closed
  convex cone 223
  set in metric space 221
  w.r.t. a closure operation 218
closure
  conical closure con(B) 224
  operation, system 218
  structural closure cl_U(N) 143
coatom of a complete lattice 217
coatomistic lattice 217
codes of variables and configurations in Chap. 8: i, k, j 164
collapsed measure 112
collider
  node 47
  section 53
combination
  linear 222
  conical 223
combinatorial imsets C(N) 72
complete
  database 242
  graph 220
  lattice 217
  metric space 221
  set of nodes 220
completed pattern 157
completeness
  of a graphical criterion 45
  question (motivation) 3 (161)
complex (in a chain graph) 54
complexity of a structural model 143
complies
  a probability measure complies with a structural imset 83
component
  connectivity components 220
  prime component of an undirected graph 204
composition
  of a function and a permutation 216
  property (= "axiom" of independence models) 33
  weak composition of structural models M1 ⊗ M2 203
concave function 230
concentrated measure (on a set) 227
concentration
  graph 59
  matrix (for a regular Gaussian measure) 30
concept lattice 103
conditional
  covariance matrix Σ_{A|C} 31 (241)
  density f_{A|C}
    in discrete case 164
    in Gaussian case 240
  dependence statement A ⊥⊥ B | C [o] 12
  Gaussian (= CG) measure 66
  independence
    in terms of σ-algebras A ⊥⊥ B | C [P] 234
    for probability measures over a set of variables A ⊥⊥ B | C [P] 10
    model M_o 12
    statement A ⊥⊥ B | C [o] 11–12
  mutual information 27
  probability
    - a general definition 232–233
    on X_A given C, P_{A|C} 10
    regular version 233
  product 26 (203)
conditioning
  for Gaussian measures 240–241
  for graphs 62
  for matrices 237
configurations 164
conical
  closure con(B) 224
  combination 223
connected
  nodes in a graph 220
  undirected graph 220
connectivity components 220
consequent 50
consistency question (motivation) 3
consonant
  ordering of nodes 220
  probability measures 27
  structural models 203
contingency table 242 (1)
continuous
  function 222
  marginal density 77
  reference system of measures 77
  sample space 242
  variables Γ 66
contraction property (= "axiom" of independence models) 13
convex
  cone 223
  function 230
  game, set function 88
  set 223
coordinate σ-algebra 226
coportrait H_u 150
correlation matrix Γ 239
countably additive measure 227
counting measure υ 227
covariance
  graph 59
  matrix Σ 238
    for a Gaussian measure 239, 30
covered arc 49
criterion
  AIC, BIC, MLL 246
  LML 169
critical
  region (of a statistical test) 247
  value (of a statistical test) 243
    approximate critical value 248
c-separation (for chain graphs) 53
cumulative operations with structural models 203
cycle in a graph 220
D

DAG - an abbreviation 46
  model 48
    with hidden variables 61
dashed arrow, line 58
data
  faithfulness assumption 156
  over N 242
  vector (relative to a criterion Q) [t^Q_D(A)]_{A⊆N} 185
    for AIC, BIC, DIM criteria 187
    for MLL criterion 186
database of the length d 242
decomposable
  model 55, 5
  quality criterion 170
  undirected graph 55
decomposition
  canonical decomposition of an undirected graph 205
  implication of imsets 142
  proper decomposition of an undirected graph 204
  property (= "axiom" of independence models) 13
  weak decomposition of a structural model 203
degree
  detectors m^∗, m_l 73, 70
  of a combinatorial imset deg(u) 72
  of freedom (of χ²-distribution) 248
  smallest degree imset 141
dense set
  infimum-dense set in a lattice 217
  in topological sense 221
  supremum-dense set in a lattice 217
density of a probability measure - a terminological remark 18
  continuous marginal density 77
  general definition (density of P with respect to µ, dP/dµ) 19
  in discrete case 164
  in Gaussian case 240
dependence statement A ⊥⊥ B | C [o] 12
descendant of a node 220
descending
  class of sets 215–216
  path, route 219
detector of (level-)degree m^∗, m_l 73, 70
determinant of a matrix det(Σ) 236
determining class of sets (for a structural model) 145
deviance
  difference 159
  of a statistical model 158
differential imset 182
dimension of a linear, respectively affine, subspace 223
direct
  characterization of independence implication 115
  sum of linear subspaces L1 ⊕ L2 223
directed
  cycle 220
  edge (= arrow) 218
    - a notation a → b 219
  graph 220
discrete
  distance δ 221
  distribution framework 249
    with prescribed sample spaces 249
    with prescribed one-dimensional marginals 250
  measure over N 11, 65
    positive discrete measure over N 29
  sample space 242
  variables ∆ 66
disjoint
  semi-graphoid 13
  triplet over N ⟨A, B|C⟩ 12
distance 221
  discrete 221
  Euclidean 222
distribution - a terminological note 17
  equivalence 111
  framework 249, 2, 44, 126–128
  of a random variable 238
distributive lattice 217
domain of an imset
  effective D^∗_u 124
  positive, negative D^+_u, D^−_u 39
dominated experiment 19
dominating measure 19
d-separation criterion 47–48, 57, 61
dual
  baricentral imset 152
  cone A^∗ 224
  description of models 149
E

edge 218
  directed (= arrow) 218
  in a mixed graph 219
  undirected (= line) 218
effective
  dimension (DIM) 246
    of M_G, G ∈ DAGS(N) 168
  domain of a structural imset D^∗_u 124
elementary
  disjoint triplet, statement 15
  imset E(N) 69–70
  generator of a structural model 143
  statement mode of representing a semi-graphoid 16
embedded Bayesian network 61
empirical
  measure P̂ 243
    fitted empirical measure 243
  multiinformation 25
    function m_P̂ 186
empty set ∅ 215
entropy
  function h_{P,µ} 83
  of a probability measure 231
  perturbated relative entropy H(P|µ : Q) 67
  relative entropy H(P|µ) 230
equality µ-almost everywhere 227
equivalence
  distribution equivalence 111
  factorization equivalence 46, 54
  independence equivalence 46, 48, 113
  level equivalence 197
  Markov equivalence 46, 48, 54, 113
  parameterization equivalence 112
  permutation equivalence 196
  qualitative equivalence of supermodular functions 88
  quantitative equivalence of supermodular functions 90
  question (motivation) 3
error of the first and second type 247
essential
  arrow 157
  graph
    for acyclic directed graphs 157 (49)
    for alternative chain graphs 60
Euclidean distance, topology 222
exclusivity of standard imsets 148
expansive operations with structural models 202
expectation vector e 238
  for a Gaussian measure 239, 30
extensive mapping 218
extent (of a formal concept) 103
extreme ray of a cone 225
F

face
  of a polyhedral cone 225
  lattice 107
factorizable measure
  after a class of sets 22
  w.r.t. a chain graph 54
  w.r.t. an acyclic directed graph (= recursive factorization) 164, 49
  w.r.t. an undirected graph 46
factorization equivalence
  in general 111
  of chain graphs 54
  of undirected graphs 46
faithfulness question (motivation) 3
finite measure 227
fitted empirical measure p̂_{a,b|C} 243
fixed-context CI statement 17
forest 220
formal
  concept 103
  context 102
  independence model 12
forward phase of a simple two step procedure 159
Fubini theorem 229
functional dependence statement 12
G

Galois connection 102
Gamma function Γ 248
Gaussian
  distribution framework 249
  measure - a definition N(e, Σ) 239
    conditional (= CG measure) 66
    conditioned 240
    over N 30
general directed graphs 57
generalized
  inverse matrix Σ^− 237
  random variable 238
generated σ-algebra σ(A) 226
generator of a structural model
  elementary, minimal, structural 143
GES algorithm 179
global Markov property 44
grade of a set of variables gra(N) 122
  modified gra^∗(N) 123
graph 218
  acyclic directed 220
  chain 221
  directed 220
  hybrid 219
  summary 62
  underlying, undirected 220
  with mixed edges 219
graphoid 29
greatest
  element in a complete lattice 217
  lower bound inf M 217
greedy equivalence search (GES) algorithm 179
G²-statistic 243
H

Hasse diagram 216 (39)
hidden variable 61
Hilbert basis 121 (207)
history of a separator H_i 55
hybrid graph 219
hypothesis in statistical testing 246
I

idempotent mapping 218
identifier of a class of sets m_A↓, m_A↑ 39
identifier of a set δ_A 39
i-implication (u i-implies v) 114
immorality in a graph 48
implementation question (motivation) 5
imset 39
  combinatorial 71–72
  differential 182
  elementary 69–70
  of the smallest degree 141
  represented in Ψ, Ψ-representable 127
  standard
    for an acyclic directed graph 135
    for a triangulated graph 137
  structural 73
  with the least, respectively a minimal, lower class 146
incidence relation in a formal context 102
inclusion
  boundary condition 178
  neighborhood, neighbors 177
  quasi-ordering 177
incremental search procedure 162
independence
  equivalence
    in general 111
    of acyclic directed graphs 48
    of structural imsets 113
    of undirected graphs 46
  implication 114
  linearly independent set 223
  model 12
indicator of a set χ_A 226
induced
  ascending class D↑ 215
  descending class D↓ 216
  measure (through a mapping) 227
  model 12
    - a terminological note 88
  subgraph G_T 219
  topology (by a distance) 222
inference rules 50
infimum
  of a set in a poset inf M 217
  of σ-algebras S ∧ T 226
infimum-dense set in a finite lattice 217
infimum-irreducible element in a complete lattice 217
information
  criteria 245, 246, 162
  theory 231
information-theoretical tools 28
input list (causal) 49
integers Z (non-negative Z+) 9
(µ-)integrable function 228
integral (adjective) = related to integers
integral (noun): Lebesgue integral 228
intent (of a formal concept) 103
intercepted route (path) in a chain graph 53
interpretability question (motivation) 4
interpretation: causal 5
intersection
  of a class of sets ⋂D 215
  property (= "axiom" of independence models) 29
inverse matrix Σ^{-1} 236
  generalized inverse matrix Σ^− 237
irreducible
  infimum-irreducible, supremum-irreducible element 217
isomorphism
  lattice-isomorphism 218
  of measurable spaces 227
  order-isomorphism 218
isotonic mapping 218
J

Jensen inequality 230
join
  in a poset x ∨ y 216
  semi-lattice 216
joint sample space 238
joint-response chain graph 58
junction tree 55
juxtaposition
  - a convention to denote union 9, 215
  - a notation for composition 216
K

Kerridge's inaccuracy 67
K2 metric 170
L

larger, largest chain graph 54
latent variables 61
lattice 217
  atomistic, coatomistic, complete 217
  concept lattice 103
  conditional independence (= LCI) models 56
  face lattice 107
  isomorphism 218
  of structural models 104
  structure requirement 158–161
LCI models 56
learning question (motivation) 4
least
  determining, respectively unimarginal, class for a structural model 145–146
  element in a complete lattice 217
  lower class imset 146
  upper bound sup M 216
Lebesgue measure λ 227
legal arrow
  addition 180
  reversal 49
length of a database d 242
level
  equivalence 197
  of an elementary imset E_l(N) 70
level-degree
  detector m_l 73, 70
  of a combinatorial imset deg(u, l) 72
likelihood function 245
line (= undirected edge) 218
  - a notation a – b 219
linear
  basis 223
  combination 222
    generating 223
  subspace 222
linearly independent set 223
LISREL models 58
LML criterion 169 (211)
local
  computation method 205 (5, 55)
  Markov property 44
  search methods 162 (186)
logarithm of the marginal likelihood 169
loop in a graph 219
lower
  class of a structural imset L_u 73
  greatest lower bound 217
  inclusion neighbor 177
  integer part ⌊a⌋ 252
  neighbor in a poset 216
  standardization of a supermodular set function 91
ℓ-skeleton K_ℓ(N) 93
ℓ-standardization
  of an imset 40
  of a supermodular function 91
ℓ-standardized supermodular functions K_ℓ(N) 92
M

MAG (= maximal ancestral graph) 62
main submatrix Σ_{A·A} 237
marginal
  contingency table ct_A[D] 242–243
  density of P for A 20
    in continuous case 77
    in discrete case 163
    in Gaussian case 240
  likelihood 169
  measure 229–230 (9)
  probability measure P^A 9
  undirected graph G_T 46
marginalizing
  for Gaussian measures 239
  for graphs 62 (46)
  for matrices 237
  for probability measures 230
marginally continuous probability measure 19
Markov
  chain Monte Carlo (MCMC) method 163
  equivalence
    in general 111
    of acyclic directed graphs 48
    of chain graphs 54
    of structural imsets 113
    of undirected graphs 46
  network 43
  properties w.r.t. a graph 44
Markovian measure
  w.r.t. a (classic) chain graph 53
  w.r.t. an acyclic directed graph 48
  w.r.t. an undirected graph 45
  w.r.t. a structural imset 81
matrix (N × M-matrix) 236
  concentration matrix 30
  correlation matrix 239
  covariance matrix 238 (30)
  multiplication Σ · Γ 236
maximal
  ancestral graph (= MAG) 62
  element in a poset 216
  sets of a class of sets D_max 216
maximized log-likelihood (= MLL) criterion
  in general 246
  for G ∈ DAGS(N) 167
maximum likelihood estimate P_θ̂ 246
  in M_G, G ∈ DAGS(N) 167
MC graph 62
MCMC method 163
measurable
  function 226
  mapping 227
  rectangles 226
  space (X, X) 226
measure
  complying with a structural imset 83
  concentrated on a set 227
  countably additive 227
  discrete measure 11
  finite measure 227
  Gaussian measure 30
  induced through a measurable mapping 227
  non-negative measure 227
  positive
    CG measure 66
    discrete measure 29
    measure 29
  regular Gaussian measure 240, 30
  σ-finite measure 227
  singular Gaussian measure 241, 31
  with finite multiinformation 28, 65
meet in a poset x ∧ y 217
membership algorithm (for annotated graphs) 61
method of
  local computation 205 (5, 55)
  search 162
metric space 221
metrizable topological space 222
MLL criterion = maximized log-likelihood criterion
  in general 246
  for G ∈ DAGS(N) 167
minimal
  determining class of sets for a structural model 145
  element in a poset 216
  generator of a structural model 143
  lower class imset 146
  sets of a class of sets D_min 216
  structural generator 143
minimum description length (MDL) principle 162
minor of a semi-graphoid 199
missing data 242
mixed graphs (acyclic directed) 62
model - a terminological explanation 13
  choice 245
  formal independence model 12
  induced by a structural imset M_u 79
  produced by a supermodular function M_m 88
  statistical model 244
    of CI structure 13
  structural independence model 104
  with hidden variables (= DAG model with h. v.) 61
modular functions L(N) 90
moment characteristics of a positive CG measure 66
monotone class theorems 227
moral graph of
  a chain graph 52
  an acyclic directed graph 47
moralization criterion
  for acyclic directed graphs 47
  for chain graphs 52
moves (in a local search method) 162
m-separation criterion 62
multiinformation 24
  function m_P 27
  of a regular Gaussian measure 35
multiple
  edges in a graph 219
  scalar multiple of a vector α · x 222
multiplicity of a separator 55 (139)
multiset 39
multivariate analysis 236, 1
mutual information 231 (24)
  conditional 27
N

natural numbers N 9
negative
  domain of an imset D^−_u 39
  part
    of a function f^− 226
    of an imset u^− 39
neighbor in a poset (lower, upper) 216
  inclusion neighbors (= relative to inclusion quasi-ordering) 177
neighborhood structure (in a local search method) 162
node
  configuration 164
  in a graph 218
non-decreasing function 91
non-negative integers Z+ 9
normal distributions 30
normalized imset 41
null
  element of a complete lattice 217
  matrix, vector 0 236
O

object
  in a formal context 102
  of discrete mathematics 3
open
  ball in a metric space U_ρ(x, ε) 221
  set in a metric space 221
ordering (partial, total) 216
order-isomorphism of posets 218
orthogonal
  complement of a set A^⊥ 223
  standardization of a supermodular set function 91
o-skeleton K_o(N) 97
o-standardization
  of imsets 40
  of supermodular functions 91
P

PAG (= partial ancestral graph) 62
pairwise Markov property 44
parameterization
  equivalence 112
  of a statistical model 244
  of M_G, G ∈ DAGS(N) 165
parent
  configuration 164
  of a node pa_G(b) 220
partial
  ancestral graph (PAG) 62
  ordering 216
partially ordered set (= poset) 216
path in a graph 219
pattern of an equivalence class of acyclic directed graphs 156, 49
  completed pattern 157
PC algorithm 156 (157)
perfect class of measures 45
perfectly Markovian measure
  w.r.t. a (classic) chain graph 53
  w.r.t. an acyclic directed graph 48
  w.r.t. an undirected graph 45
  w.r.t. a structural imset 81
permutation
  equivalence 196
  on a finite set N 216
perturbated relative entropy H(P|µ : Q) 67
pointed cone 225
polyhedral cone 224
polymatroid 109
portrait of a structural imset 150
poset 216
positive
  binary distribution framework 249
  CG measure 66
  definite matrix 237
  discrete
    distribution framework 249
    measure 29
  domain of an imset D^+_u 39
  Gaussian distribution framework 249
  measure over N 29
  part
    of a function f^+ 226
    of an imset u^+ 39
  semi-definite matrix 237
posterior probability measure in MCMC method 163
potentials 22
power
  function of a statistical test 247
  set P(X), P(N) 215
prime
  components of G, P^max_pri(G) 204
  graph 204
  set of nodes relative to an undirected graph 204
prior probability measure π_G 168
  in MCMC 162
probabilistic reasoning 1
probability
  distribution - a terminological note 17–18
  measure 227
    over N 9
problem of axiomatic characterization of CI models 16–17
produced model 88
product
  formula induced by a structural imset 75
  of measurable spaces (X × Y, X × Y), (∏_{i∈N} X_i, ∏_{i∈N} X_i) 226
  of (σ-finite) measures µ1 × µ2 229
  of topological spaces 222
  scalar ⟨x, y⟩ 222 (41)
  σ-algebra 226
  topology 222
projection
  of a database D_A 170
  of x onto A, x_A 20
proper
  decomposition of an undirected graph 204
  weak decomposition of a structural model 203
p-separation criterion 60
Q

qualitative equivalence of supermodular functions 88
quality criterion 161 (155)
  for learning DAG models 163
  decomposable 170
  regular 171
  score equivalent 169
  strongly decomposable 170
  strongly regular 172
quantitative equivalence of supermodular functions 90
(µ-)quasi-integrable function 228
quasi-ordering 216
R

Radon-Nikodym derivative, theorem 228
random
  sample 244
  variable, vector 238
    generalized random variable 238
range of a structural imset R_u 74
rational
  numbers Q 9
  polyhedral cone 224
ray (in Euclidean space) R_x 225
real numbers R 9
reciprocal graph 58
recovery algorithms 157
recursive
  causal graph 56
  factorization w.r.t. an acyclic directed graph 164 (49)
reductive operations with structural models 199
reference system of measures 75
  continuous 77
  standard 76
  universal 75
reflection ι 197 (91)
reflexive binary relation 216
region of a structural imset R_u 124
regular
  annotated graph 61
  Gaussian measure 240 (30, 66)
  matrix 236
  quality criterion 171
  version of conditional probability given a σ-algebra 233
relative entropy H(P|µ) 230
  perturbated H(P|µ : Q) 67
(Ψ-)representable structural imset 127
represented triplet
  in a (classic) chain graph 52–53
  in an acyclic directed graph 47–48
  in an undirected graph 43
  in a structural imset 78
  in a supermodular function 88
residual for a separator R_i 55
restriction of
  an independence model M_T 12
  a probability measure P^A 232
ring of subsets 217
route in a graph 219
  active route in a directed graph 48
  superactive route in a chain graph 53
running intersection property 55
S

sample
  covariance matrix Σ̂ 242
  expectation ê 242
  space 238
    continuous, discrete 242
saturated
  independence statements 17
  model 245 (159)
saturating function (of a quality criterion Q) s^Q 185
  for DIM criterion s^DIM 187
  for MLL criterion s^MLL 186
scalar
  multiple of a vector α · x 222
  product of vectors ⟨x, y⟩ 222
    with an imset ⟨m, u⟩ 41
Schur complement Σ_{A|C} 237
score
  criterion, metric 161
  equivalent quality criterion 169
search space (in a local search method) 162
section of a route 52
semi-definite matrix (positive) 237
semi-elementary imset 71
semi-graphoid
  abstract semi-graphoid 14
  - a "historical" note 11
  axioms 14
  disjoint semi-graphoid 13
  properties for σ-algebras 235–236
semi-lattice (join) 216
separable metric space 221
separation criterion
  c-separation 53
  d-separation 47–48, 57, 61
  for undirected graphs 43
  m-separation 62
  p-separation 60
separator for a triangulated graph 55 (139)
separoid 14–15
set of variables N 9
SGS algorithm 156
sigma-algebra (= σ-algebra) 226
  generated by a class σ(A) 226
sigma-finite (= σ-finite) measure 227
significance
  level (for a statistical test) 247
  test 155 (159)
simple two step procedure 158
simultaneous equation systems 58
singleton 215
singular Gaussian measure 241 (31)
size of a statistical test 247
skeletal
  characterization of independence implication 118
  (supermodular) functions 92
skeleton K_ℓ(N), K_u(N), K_o(N) 93, 97
smallest degree imset 141
solid arrow, line 58
span of a structural imset 145
standard
  imset
    for an acyclic directed graph 135
    for a triangulated undirected graph 137
  reference system of measures (for a CG measure) 76
standardization
  of imsets 40
  of supermodular functions (lower, orthogonal, upper) 91
statement
  (conditional) dependence 12
  (conditional) independence 11–12
  elementary 15
  fixed-context 17
  functional dependence 12
  saturated 17
  trivial 15
  unconditional 17
states (in a local search method) 162
statistic 242
statistical
  alternative, hypothesis 246
  model M 244
    M_G, G ∈ DAGS(N) 163
    of CI structure 13
  test 243
strict
  ancestor, descendant 220
  inclusion (⊂, ⊃) 215
    of DAG models 177
strictly
  concave, convex function 230
  descending route (path) 219
strong
  completeness of a graphical criterion 46
  union property (= "axiom" of independence models) 43
strongly
  decomposable quality criterion 170
  regular quality criterion 172
structural
  closure cl_U(N) 143
  generator (of a structural model) 143
  imsets S(N) 73
  independence models U(N) 104
subconcept (of a formal concept) 103
subgraph: induced 219
submatrix Σ_{A·B} 236
  main submatrix Σ_{A·A} 237
submodular function 109
subset ⊆ 215
subspace
  affine 223
  linear 222
sum of vectors x + y 222
summary graphs 62
superactive route in a chain graph 53
supermodular function 87 (27)
  - a notation of the class K(N) 87
superset ⊇ 215
supertype (of a skeletal imset) 198
supremum
  of a set in a poset sup M 216
  of σ-algebras S ∨ T 226
supremum-dense set in a finite lattice 217
supremum-irreducible element in a complete lattice 217
symmetric matrix 236
symmetry property (= "axiom" of independence models) 13
T

TAN models 157–158
terminal node 221
topological space 222
topology 222
  induced by a distance 222
  product topology 222
total ordering 216
trail 53
transformation of data (relative to a quality criterion Q) t^Q 185
  for AIC, BIC, DIM criteria 187
  for MLL criterion t^MLL 186
transformational characterization
  of inclusion quasi-ordering 179–180
  of independence equivalence 49
transitive
  acyclic directed graph 56
  binary relation 216
transitivity
  principle for Schur complement 237
  property ("axiom" of independence models) 44
transpose of a matrix Σ^T 236
tree 220
triangulated undirected graph 55 (205)
triangulation
  canonical triangulation of an undirected graph 205
triplet over N (disjoint) 12
  represented
    in a (classic) chain graph 52–53
    in an acyclic directed graph 47–48
    in an undirected graph 43
    in a structural imset 78
    in a supermodular function 88
trivial
  disjoint triplets T_∅(N), independence statements 15
  σ-algebra 226
triviality property (= "axiom" of independence models) 13
type (of a skeletal imset) 196
U

UG model 43
unconditioned independence statement A ⊥⊥ B | ∅ 17
underlying graph 220
undirected
  cycle 220
  edge (= line) 218
    - a notation a – b 219
  graph 220
  path, route 219
unimarginal class of sets (for a structural model) 145
union
  of a class of sets ⋃D 215
  weak union property (= "axiom" of independence models) 13
uniqueness principle for Markovian probability measures 82
unit
  element of a complete lattice 217
  matrix I 236
universal reference system of measures 75
universum
  of objects of discrete mathematics 3
  of structural imsets 161
upper
  class of a structural imset U_u 73 (124)
  inclusion neighbor 177
  integer part ⌈a⌉ 252
  least upper bound 216
  neighbor in a poset 216
  standardization of a supermodular set function 91
u-skeleton K_u(N) 97
u-standardization
  of imsets 40
  of supermodular functions 91
V

value levels 197
variables N 9
  continuous Γ 66
  discrete ∆ 66
  random variables 238
variance σ_{ii} 238–239
vector 222
  over N 236
  random 238
W

weak
  composition of structural models 203
  decomposition of a structural model 203
  transitivity property (= "axiom" of independence models) 34
  union property (= "axiom" of independence models) 13
X

X²-statistic 243
Z

zero
  imset 0 39
  vector 0 222