Probabilistic Logic in a Coherent Setting
by
Giulianella Coletti and Romano Scozzafava
PROBABILISTIC LOGIC IN A COHERENT SETTING
TRENDS IN LOGIC
Studia Logica Library
VOLUME 15

Managing Editor
Ryszard Wójcicki, Institute of Philosophy and Sociology, Polish Academy of Sciences, Warsaw, Poland

Editors
Daniele Mundici, Department of Computer Sciences, University of Milan, Italy
Ewa Orlowska, National Institute of Telecommunications, Warsaw, Poland
Graham Priest, Department of Philosophy, University of Queensland, Brisbane, Australia
Krister Segerberg, Department of Philosophy, Uppsala University, Sweden
Alasdair Urquhart, Department of Philosophy, University of Toronto, Canada
Heinrich Wansing, Institute of Philosophy, Dresden University of Technology, Germany
SCOPE OF THE SERIES
Trends in Logic is a book series covering essentially the same area as the journal Studia Logica, that is, contemporary formal logic and its applications and relations to other disciplines. These include artificial intelligence, informatics, cognitive science, philosophy of science, and the philosophy of language. However, this list is not exhaustive; moreover, the range of applications, comparisons and sources of inspiration is open and evolves over time.
Volume Editor Heinrich Wansing
The titles published in this series are listed at the end of this volume.
GIULIANELLA COLETTI University of Perugia, Italy
ROMANO SCOZZAFAVA University of Roma "La Sapienza", Italy
PROBABILISTIC LOGIC IN A COHERENT SETTING
KLUWER ACADEMIC PUBLISHERS DORDRECHT/BOSTON/LONDON
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 1-4020-0917-8
Published by Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. Sold and distributed in North, Central and South America by Kluwer Academic Publishers, 101 Philip Drive, Norwell, MA 02061, U.S.A. In all other countries, sold and distributed by Kluwer Academic Publishers, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved © 2002 Kluwer Academic Publishers No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands.
Preface

The theory of probability is usually based on very peculiar and restrictive assumptions: for example, it is maintained that the assessment of probabilities requires an overall design on the whole set of all possible envisaged situations. A "natural" consequence is that the use of probability in the management of uncertainty is often challenged, due to its (putative) lack of "flexibility". Actually, many traditional aspects of probability theory are not as essential as they are usually considered: for example, the requirement that the set of all possible "outcomes" should be endowed with a beforehand given algebraic structure (such as a Boolean algebra), or the aim at getting, for these outcomes, uniqueness of their probability values, with the ensuing introduction of suitable relevant assumptions (such as σ-additivity, conditional independence, maximum entropy, ...), or interpretations (such as a strict frequentist one, which unnecessarily restricts the domain of applicability).

The approach adopted in this book is based on the concept of coherence, which can be framed in the most general view of conditional probability (as proposed by Bruno de Finetti); it is apt to avoid the usual criticisms, and it also makes a clear-cut distinction between the meaning of probability and the various multifaceted methods for its assessment. In other words, referring to de Finetti's approach is not a "semantic" attitude in favour of the subjectivist position; rather, it is mainly a way of exploiting the "syntactic" advantages of this view (which differs radically from the usual one, based on a measure-theoretic framework). For example, in a coherent setting a natural handling of partial probability assessments is possible, and the process of updating is ruled by coherence through an algorithm involving linear systems and linear programming, which does not necessarily lead to unique values of the relevant assessments.

Contrary to what could appear at first glance, dealing with coherence gives rise to a number of delicate and subtle problems, and has little to do with a conventional Bayesian approach. To say the least, in the latter the main emphasis is on the so-called priors and posteriors, which after all are just two particular probability assessments referring to two different "states of information". In our general coherent setting, we refer to an arbitrary family of conditional events and to the corresponding conditional probability assessments, including all their possible revisions. In this way we are able to show how the theory of coherent conditional probability can act as a unifying tool: through a direct assignment of conditional probabilities, we get a general theory of probabilistic reasoning able to encompass also other approaches to uncertain reasoning, such as fuzziness, possibility functions and default reasoning. Furthermore, we put forward a meaningful concept of conditional independence, which avoids many of the usual inconsistencies related to logical dependence. In the last Chapter we give a short account of how to extend our methodology and rules to more general (decomposable) uncertainty measures.

Let us emphasize that we will not attempt here to enter into any controversy concerning whether probability may or may not be the only appropriate tool for reasoning under uncertainty, even if we underline the unifying role of coherent conditional probability.

The book is kept self-contained, provided the reader is familiar with the elementary aspects of propositional calculus, linear algebra and analysis. Much of the material presented has already appeared, possibly in different form, in many published papers, so the main contribution of the book is to assemble it within a unified framework.

Finally, we want to express our thanks to an anonymous referee for many valuable comments, and to Barbara Vantaggi for a careful reading of the manuscript and the ensuing useful suggestions.
Contents

1 Introduction 7
  1.1 Aims and motivation 7
  1.2 A brief historical perspective 12

2 Events as Propositions 17
  2.1 Basic concepts 17
  2.2 From "belief" to logic? 18
  2.3 Operations 20
  2.4 Atoms (or "possible worlds") 21
  2.5 Toward probability 24

3 Finitely Additive Probability 25
  3.1 Axioms 25
  3.2 Sets (of events) without structure 26
  3.3 Null probabilities 27

4 Coherent probability 31
  4.1 Coherence 31
  4.2 Null probabilities (again) 34

5 Betting Interpretation of Coherence 37

6 Coherent Extensions of Probability Assessments 43
  6.1 de Finetti's fundamental theorem 43
  6.2 Probabilistic logic and inference 45

7 Random Quantities 49

8 Probability Meaning and Assessment: a Reconciliation 53
  8.1 The "subjective" view 53
  8.2 Methods of evaluation 55

9 To Be or not To Be Compositional? 57

10 Conditional Events 61
  10.1 Truth values 63
  10.2 Operations 65
  10.3 Toward conditional probability 70

11 Coherent Conditional Probability 73
  11.1 Axioms 73
  11.2 Assumed or acquired conditioning? 74
  11.3 Coherence 76
  11.4 Characterization of a coherent conditional probability 80
  11.5 Related results 90
  11.6 The role of probabilities 0 and 1 94

12 Zero-Layers 99
  12.1 Zero-layers induced by a coherent conditional probability 99
  12.2 Spohn's ranking function 101
  12.3 Discussion 102

13 Coherent Extensions of Conditional Probability 109

14 Exploiting Zero Probabilities 117
  14.1 The algorithm 117
  14.2 Locally strong coherence 122

15 Lower and Upper Conditional Probabilities 127
  15.1 Coherence intervals 127
  15.2 Lower conditional probability 128
  15.3 Dempster's theory 134

16 Inference 137
  16.1 The general problem 137
  16.2 The procedure at work 139
  16.3 Discussion 151
  16.4 Updating probabilities 0 and 1 155

17 Stochastic Independence in a Coherent Setting 163
  17.1 "Precise" probabilities 164
  17.2 "Imprecise" probabilities 179
  17.3 Discussion 186
  17.4 Concluding remarks 190

18 A Random Walk in the Midst of Paradigmatic Examples 191
  18.1 Finite additivity 191
  18.2 Stochastic independence 193
  18.3 A not coherent "Radon-Nikodym" conditional probability 194
  18.4 A changing "world" 197
  18.5 Frequency vs. probability 198
  18.6 Acquired or assumed (again) 202
  18.7 Choosing the conditioning event 202
  18.8 Simpson's paradox 204
  18.9 Belief functions 206

19 Fuzzy Sets and Possibility as Coherent Conditional Probabilities 215
  19.1 Fuzzy sets: main definitions 216
  19.2 Fuzziness and uncertainty 219
  19.3 Fuzzy subsets and coherent conditional probability 225
  19.4 Possibility functions and coherent conditional probability 232
  19.5 Concluding remarks 240

20 Coherent Conditional Probability and Default Reasoning 241
  20.1 Default logic through conditional probability equal to 1 243
  20.2 Inferential rules 247
  20.3 Discussion 251

21 A Short Account of Decomposable Measures of Uncertainty 257
  21.1 Operations with conditional events 258
  21.2 Decomposable measures 262
  21.3 Weakly decomposable measures 266
  21.4 Concluding remarks 270

Bibliography 271

Index 285
Chapter 1

Introduction

1.1 Aims and motivation
The role of probability theory is neither that of creating opinions nor that of formalizing any relevant information in the framework of classical logic; rather, its (seemingly less ambitious) role is to manage "coherently" opinions, using all information that has been anyhow acquired or assumed. The running of this process requires, first of all, overcoming the barriers created by prevailing approaches based on trivially schematic situations, such as those relying just on combinatorial assessments or on frequencies observed in the past.

The starting point is a synthesis of the available information (and possibly also of the modalities of its acquisition), expressing it by one or more events: to this purpose, the concept of event must be given its most general meaning, not just looked on as a possible outcome (a subset of the so-called "sample space"), but expressed by a proposition. Moreover, events play a two-fold role, since we must consider both those events which are the direct object of study and those which represent the relevant "state of information": so conditional events and conditional probability are the tools that allow us to manage specific (conditional) statements and to update degrees of belief on the basis of the evidence. We refer to the state of information (at a given moment) of a real (or fictitious) person, that will be denoted (following de Finetti [53]) by "You".

A typical situation is the following: You are not able to give categorical answers about all the events constituting the relevant environment, and You must therefore act under uncertainty. In fact You have - about the problem - some knowledge that should help in assessing degrees of belief in relevant events, singled out by suitable sentences. Even if beliefs may come from various sources, they can be treated as being of the same quality and nature, since the relevant events (including possibly statistical data) can always be considered as being assumed (and not asserted) propositions. We maintain that these beliefs can be measured (also in the management of partial and revisable information in automated reasoning) by probability (conditional or not).

In the aforementioned typical situation, propositions may include equalities or inequalities involving values taken on by random variables: often the latter are discrete, so that each one has a finite (sometimes, countable) range of possible values. The usual probabilistic models refer to a set of random variables, and the relevant joint probability distribution should completely specify the probability values that You assign to all involved propositions. Even if the joint distribution can in principle answer any question about the whole range, its management becomes intractable as the number of variables grows: therefore conditional independence is often assumed to make probabilistic systems simpler. So a belief network (represented by a suitable graph, a DAG - directed acyclic graph - having no directed cycles) can be used to represent dependencies among variables and to give a concise specification of the joint probability distribution: the set of random variables makes up the nodes of the graph, while some pairs of nodes are connected
by arrows, whose intuitive meaning is that the parent of a node X (i.e., any node having an arrow pointing to it) has some direct "influence" on X itself; moreover, this influence is quantified by a conditional probability table for each relevant node. Essentially, given all possible envisaged situations - which are usually expressed by uncertain conditional statements (which model information in a weaker way than that given in the form of "if-then" rules) - the problem consists in suitably choosing only some of them, concerning "locally" a few variables (where "locally" means that they are regarded as not being "influenced" by too many other variables).

In this book we discuss (in the framework of conditional events and conditional probability, and giving up any "ad hoc" assumption) how to deal with the following problem, which clearly encompasses that sketched above. Given an arbitrary family ℰ of conditional events (possibly just "a few", at least at the initial stage) and a suitable assessment of a real function P defined on ℰ, You must tackle, first of all, the following question: is this assessment coherent? This essentially means that it can be framed in the most general view of conditional probability as proposed by Bruno de Finetti, which differs radically from the usual one (based on a measure-theoretic approach). For example, its direct assessment allows us to deal with conditioning events whose probability can be set equal to zero, a situation which in many respects represents a very crucial feature (even in the case of a finite family of events). In fact (as we will show), if any positivity condition is dropped, the class of admissible conditional probability assessments is larger, that of possible extensions is never empty, the ensuing algorithms are more flexible, and the management of stochastic independence (conditional or not) avoids many of the usual inconsistencies related to logical dependence.

The concept of coherence privileges probability as a linear
operator rather than as a measure, and regards the minimal Boolean algebra (or product of Boolean algebras) spanned by the given events only as a provisional tool to handle coherence, so that this tool can possibly change when new events and new information come to the fore. So, taking de Finetti's approach as a starting point is not just a "semantic" attitude in favour of the subjectivist position; rather, it is mainly a way of exploiting the "syntactic" advantages of this view by resorting to an operational procedure which allows us to consider, for example, partial probability assessments. Moreover, it is possible to suitably "propagate" the above probability assessments to further conditional events preserving coherence (in the relevant literature this result is known, for unconditional events, as de Finetti's fundamental theorem of probabilities). This process of updating is ruled by coherence through an algorithm involving linear systems and linear programming, and does not necessarily lead to unique values. These aspects, both from the syntactic and the semantic point of view, are discussed at length in the expository papers [23] and [115].

Many real examples are given throughout the text: in particular, some referring to medical diagnosis are discussed as Example 4 in Chapter 2, Example 8 in Chapter 4, and Examples 23, 24, 25 in Chapter 16.

The concept of conditional event plays a central role in the probabilistic logic dealt with in this book: we give up (or better, in a sense, we generalize) de Finetti's idea of looking at a conditional event E|H as a three-valued logical entity (true when both E and H are true, false when H is true and E is false, "undetermined" when H is false) by letting the third value suitably depend on the given ordered pair (E, H), and not be just an undetermined common value for all pairs. We introduce suitable (partial) operations of sum and product between conditional events (looked on as random quantities), and this procedure gives
rise to the rules of a coherent conditional probability. Contrary to a conventional Bayesian approach, we will not refer to the rigid schematic view privileging just priors and posteriors (not to mention, by the way, that also the role of the so-called likelihood is crucial in the global check of coherence), and we will also get rid of the simplifying assumption of mutually exclusive and exhaustive events.

Making inference requires a space "larger" than the initial one (i.e., considering "new" conditional events), and in our general context it is possible to take into account as relevant information also a new probability assessment (once its "global" coherence - with respect to the previous assessments - has been checked) without resorting to the so-called "second order probabilities" (in fact, referring to a new probability assessment as a conditioning event is an awkward procedure, since an event is a logical entity that can be either true or false, while a probability assessment has undoubtedly a quite different "status").

Notice that the very concept of conditional probability is deeper than the usual restrictive view emphasizing P(E|H) only as a probability for each given H (looked on as a given fact). Regarding instead also the conditioning event H as a "variable", we get something which is not just a probability (notice that H also - like E - plays the role of an uncertain event whose truth value is not necessarily given and known). So it is possible to represent (through conditional events) and manage (through coherent conditional probability) "vague" statements such as those of fuzzy theory, and to look on possibility functions as particular conditional probabilities; moreover, a suitable interpretation of the extreme values 0 and 1 of P(E|H), for situations which are different, respectively, from the trivial ones E ∧ H = ∅ and H ⊆ E, leads to a "natural" treatment of default logic.

Finally, in Chapter 21 we extend the methodology and rules on which our approach is based to more general uncertainty measures,
starting again from our concept of conditional event, but introducing (in place of the ordinary sum and product) two operations ⊕ and ⊙ for which some of the fundamental properties of sum and product (commutativity, associativity, monotonicity, distributivity of ⊙ over ⊕) are required.
1.2 A brief historical perspective
Bruno de Finetti (1906-1985) lived in the twentieth century, writing extensively and almost regularly from 1926 (at the age of 20) through 1982 (and not only in probability, but also in genetics, economics, demography, educational psychology, and mathematical analysis). He put forward a view that does not identify probability as a measure on a σ-algebra of sets, but rather looks at it (and at its generalization, i.e. the concept of prevision) as a linear operator defined on a family of random quantities (e.g., events, looked on as propositions). He was also challenging (since the mid-1920s) the unnecessary limitations imposed on probability theory by the assumption of countable additivity (or σ-additivity): his ideas came to international attention in a series of articles in which he argued with Maurice Fréchet regarding also the status of events assessed with probability zero. Then Fréchet invited de Finetti for a series of lectures at the Institut Henri Poincaré in Paris in 1935, whose content was later published in the famous paper "La prévision: ses lois logiques, ses sources subjectives" [51], where, through the concept of exchangeability, he also established the important connection between the subjective view of probability and its possible evaluation by means of a past frequency.

In the article [52] published in 1949 (and appearing in English only in 1972), de Finetti critically analyzed the formalistic
axiomatization of Kolmogorov: he was the first to introduce the axioms for a direct definition of conditional probability (for the connections with Popper measure, see Section 10.3), linking it to the concept of coherence, which allows one to manage also "partial" assessments. All his work exhibits an intuitionist and constructivist view, with a natural bent for submitting the mathematical formulation of probability theory only to the needs required by any practical application.

In the preface to his book [53], de Finetti emphasizes how probabilistic reasoning merely stems from our being uncertain about something: it makes no difference whether the uncertainty relates, for instance, to an unforeseeable future, or to an unnoticed past, or to a past doubtfully reported or forgotten. Moreover, probabilistic reasoning is completely unrelated to general philosophical controversies, such as determinism versus indeterminism: for example, in the context of heat diffusion or transmission, it makes no difference to the probabilistic model whether one interprets the underlying process as being random or strictly deterministic; the only relevant thing is uncertainty, since a similar situation would in fact arise if one were faced with the problem of forecasting the digits in a table of numbers, where it makes no difference whether the numbers are random, or are some segment (for example, the digits between the 2001st and the 3000th) of the decimal expansion of π (possibly available somewhere or, in principle, computable, but unknown to You). The actual fact of whether or not the events under consideration are in some sense determined, or known by other people, is for You of no consequence on the assessment of the relevant probabilities.

Probability is the degree of belief assigned by You (the "subject" making the assessment: this is the essential reason why it is called subjective probability) to the "occurrence" (i.e., in being possibly true) of an event.

The most "popular" and well-known methods of assessment are
based on a combinatorial approach or on an observed frequency: de Finetti notes that they essentially suggest taking into account only the most schematic data and information, and in the most schematic manner, which is not necessarily bad, but not necessarily good either. Nevertheless these two approaches can be recovered if looked on as useful (even if very particular) methods of coherent evaluation. They are subjective as well, since it is up to You to judge, for example, the "symmetry" in the combinatorial approach or the existence of "similar" conditions in the different trials of the frequentist approach. Not to mention that they unnecessarily restrict the domain of applicability of probability theory.

On the other hand, the natural condition of coherence leads to the conclusion that subjective probability satisfies the usual and classic properties, i.e.: it is a function whose range is between zero and one (these two extreme values being assumed by - but not kept only for - the impossible and the certain event, respectively), and which is additive for mutually exclusive events. Since these properties constitute the starting point in the axiomatic approach, de Finetti rightly claims that the subjective view can only enlarge and never restrict the practical purport of probability theory.

An important remark (that has a strong connection with our discussion of Section 2.2) is now in order: de Finetti makes absolutely clear the distinction between the subjective character of the notion of probability and the objective character of the elements (events, or any random entities whatsoever) to which it refers. In other words, in the logic of certainty there exist only TRUE and FALSE as final (not asserted!) answers, while with respect to the present knowledge of You there exist, as alternatives, certain or impossible, and possible. Other scholars (he claims), in speaking of a random quantity, assume a probability distribution as already attached to it: so adopting a different view is a consequence of the unavoidable fact that a "belief" can vary (not only from person to
person, but also) with the "information", yet preserving coherence.

Then, besides the above "semantic" argument in favour of keeping "logic" and "belief" distinct, there is also a "syntactic" one: coherence does not single out "a unique probability measure that describes the individual's degrees of belief in the different propositions" (as erroneously stated by Gardenfors in [67], p.36). In this respect, see also Example 8, Chapter 4, and let us quote again from de Finetti's book [53]: "Whether one solution is more useful than another depends on further analysis, which should be done case by case, motivated by issues of substance, and not - as I confess to having the impression - by a preconceived preference for that which yields a unique and elegant answer even when the exact answer should instead be any value lying between specifiable limits".

We are going to deepen these (and other) aspects in this book; other comments on de Finetti's contributions are scattered here and there in the text, while a much more extensive exposition of the development of de Finetti's ideas (but with special attention to statistical inference) is in the long introduction of the book [89] by Frank Lad.
Chapter 2

Events as Propositions

2.1 Basic concepts
An event can be singled out by a (nonambiguous) statement E, that is, a (Boolean) proposition that can be either true or false (corresponding to the two "values" 1 or 0 of the indicator I_E of E). Obviously, different propositions may single out the same event, but it is well known how an equivalence relation can be introduced between propositions through a double implication: recall that the assertion A ⊆ B (A implies B) means that if A is true, then also B is true.
Example 1 - You are guessing on the outcome of "heads" in the next toss of a coin: given the events

A = You guess right ,
B = the outcome of the next toss is heads ,

clearly the two propositions A and B single out the same event. On the other hand, if You are making many guesses and, among them, You guess also on the outcome of "heads" in the next toss of a coin, then B ⊆ A, but not conversely.
Closely connected with each event E is its contrary E^c: if the event E is true, then the event E^c is false, and vice versa. Two particular cases are the certain event Ω (that is always true) and the impossible event ∅ (that is always false): notice that Ω is the contrary of ∅, and vice versa. Notice that only in these two particular cases do the relevant propositions correspond to an assertion. Otherwise the relevant events (including possibly statistical data) always need to be considered (going back to a terminology due to Koopman [87]) as being contemplated (or, similarly, assumed) and not asserted propositions. To make an assertion, we need to say something extralogical or concerning the existence of some logical relation, such as "You know that E is false" (so that E = ∅). Other examples of assertions, given two events A and B, are "A implies B", or "A and B are incompatible" (we mean to assert that it is impossible for them both to occur: the corresponding formal assertion needs the concept of conjunction, see below, Section 2.3).

Remark 1 - In the relevant literature, the word event is often used in a generic sense, for example in statements like "repetitions (or trials) of the same event". We prefer to say (again, following de Finetti) "repetitions of a phenomenon", because in our context "event" means a single event. It is not simply a question of terminology, since in two different trials (for example, tosses of a coin) we may have that "heads" is TRUE (so becoming Ω) in one toss, and FALSE (so becoming ∅) in the other: anyway, two distinct events are always different, even if it may happen that they take the same truth value.
2.2 From "belief" to logic?

It should be clear, from the previous introduction of the main preliminary concepts (see also the discussion in the final part of Section 1.2), that our approach does not follow the lines of those theories (such as that expounded in the book [67] by Gardenfors) that try to explain a "belief" (in particular, probability) for arbitrary objects through the concept of epistemic state, and then to recover the logical and algebraic structure of these objects from the rules imposed on these beliefs. We maintain that the "logic of certainty" deals with TRUE and FALSE as final, and not asserted, possible answers, while with respect to a given state of information there exist, as alternatives concerning an event (and measured, for example, by probabilities), those of being certain or impossible, and possible.

To us the concept of "epistemic state" appears too faint and clumsy to be taken as a starting point, and certainly incompatible with our aim of dealing with partial assessments. In other words, we do not see the advantages of resorting to it in order to (possibly) avoid presupposing a "minimal" logic of propositions. Not to mention the special role (that we will often mention and discuss in the sequel) that events of probability 0 or 1 have in our setting; in fact, in Gardenfors' approach (as in similar theories) the so-called "accepted propositions" are identified with those having maximal probability (p.23 of [67]: on the same page it is claimed that "to accept a proposition is to treat it as true in one way or another"). Moreover, on p.39 of the same book Gardenfors claims: "Some authors, for example de Finetti ... allow that some sentences that have probability 1 are not accepted ... Even if a distinction between acceptability and full belief is motivated in some cases, it does not play any role in this book" (our bold).

On the other hand, in our approach the concept of "accepted" is ... a stranger: a proposition may be "true" (or "false") only if looked on as a contemplated statement, otherwise (if asserted) it reduces to the certain (or impossible) event. Anyway, we do not see how to handle (from a "syntactic" point of view) subtleties
such as "one must distinguish the acceptance of a sentence from the awareness of this acceptance" (cf. again [67], p. 23). We are ready now to recall the classic operations among events, even if we do not presuppose a beforehand given algebraic structure (such as a Boolean algebra, a a-field, etc.) of the given relevant family.
2.3 Operations

We will refer in the sequel to the usual operations among events (such as conjunction, denoted by ∧, and disjunction, denoted by ∨), and we shall call two events A and B incompatible if A ∧ B = ∅ (notice that the implication A ⊆ B can be expressed also by the assertion A^c ∨ B = Ω). The two operations are (as the corresponding ones - intersection and union - between sets) associative, commutative and distributive, and they satisfy the well-known De Morgan's laws.

Considering a family ℰ of events, it may or may not have a specific algebraic structure: for example, a Boolean algebra is a family 𝒜 of events such that, given E ∈ 𝒜, its contrary E^c also belongs to 𝒜, and, given any two events A and B of the family, 𝒜 contains also their conjunction A ∧ B; it follows easily that 𝒜 contains also the disjunction of any two of its events. But it is clearly very significant if You do not assume that the chosen family of events has such a structure (especially from the point of view of any real application, where You need consider only those events that concern that application: see the following Example 4). On the other hand, ℰ can always be extended (by adding "new" or "artificial" events) in such a way that the enlarged family forms a (Boolean) algebra.
2.4 Atoms (or "possible worlds")

Each event can clearly be represented as a set of points, and in the usual approaches it is customary to refer to the so-called "sample space", or "space of alternatives": nevertheless its systematic and indiscriminate use may lead to a too rigid framework. In fact, even if any Boolean algebra can be represented (by Stone's theorem: a relevant reference is [119]) by an algebra of subsets of a given set Ω, the corresponding "analogy" between events and sets is nothing more than an analogy: a set is actually composed of elements (or points), and so its subdivision into subsets necessarily stops when the subdivision reaches its "constituent" points; on the contrary, with an event it is always possible to go on in the subdivision. These aspects are discussed at length and thoroughly by de Finetti in [53], p.33 of the English translation. The following example aims at clarifying this issue.
Example 2 - Let X be the percentage of time during which there are (for instance tomorrow, between 9 a.m. and 1 p.m.) more than 10 people in the line at a counter of the nearest post office, and consider the event E = {X = x0} (for example, x0 = 37%). Then E can be regarded either as an "atomic" event (since a precise value such as 37.0 does not admit a further refinement) or as belonging to an infinite set (i.e., the set of the events {X = x : 0 ≤ x ≤ 100}). However, it also belongs to the family consisting of just the two events E = {X = x0} and E^c = {X ≠ x0}, and it can be decomposed into E = (E ∧ A) ∨ (E ∧ A^c), where A is the event "at least one woman is in the line", or else into E = (E ∧ B) ∨ (E ∧ B^c), where B is the event "outside it is raining", or else with respect to the partition {A ∧ B, A^c ∧ B, A ∧ B^c, A^c ∧ B^c}, and so on.

Another important aspect pointed out in the previous example is that no intrinsic meaning can be given to a distinction between events belonging or not to a finite or infinite family. In the same way, all possible topological properties of the sets representing events are irrelevant, since these properties do not pertain to the logic of probabilistic reasoning. Concerning the aforementioned problem of the choice of atomic events, in any application it is convenient to stop, of course, as soon as the subdivision is sufficient for the problem at hand, but ignoring the arbitrary and provisional nature of this subdivision can be misleading. Not to mention that "new" events may come to the fore not only through a "finer" subdivision, but also by involving what had previously been considered as certain.
Example 3 - Given an election with only three candidates A, B, C, denote by the same symbols also the single events expressing that one of them is elected. We have A ∨ B ∨ C = Ω, the certain event. Now suppose that C withdraws and that we know that then all his votes will go to B: so we need to go outside the initial "space" {A, B, C}, introducing a suitable proposition (representing a new information) which is given by E ⊂ A ∨ B, with E = "C withdraws and all his votes go to B". We will see in Chapter 10 how to manage a new information through the concept of conditional event. This example has been discussed by Schay in [108], in the context of conditional probability: we will deal with it again in Chapter 18, Example 33, challenging Schay's argument.

Let us now write down the formal definition that is needed to refer to the "right" partition in each given problem.
Definition 1 - Given an arbitrary finite family

ℰ = {E1, ..., En}

of events, the atoms A1, ..., Am generated by these events are all the conjunctions E1* ∧ E2* ∧ ... ∧ En*, different from the impossible event ∅, obtained by putting (in all possible ways) in place of each Ei*, for i = 1, 2, ..., n, the event Ei or its contrary Ei^c.
Atoms are also called (mainly in the logicians' terminology) "possible worlds". Notice that m ≤ 2^n, where the strict inequality holds if there exist logical relations among the Ei's (such as: an event implies another one; two or more events are incompatible; ...). When m = 2^n (i.e., we have the maximum number of atoms), the n events are called logically independent. This means that the truth value of each of these events remains unknown, even if we assume to know the truth value of all the remaining ones.
Definition 2 - Given an arbitrary finite family

ℰ = {E1, ..., En}

of events, let A be the set of relevant atoms. We call indicator vector of each Ei (with respect to A) the m-dimensional vector I_Ei = (I_Ei^A1, ..., I_Ei^Am), with

I_Ei^Ar = 1  if  Ar ⊆ Ei ,
I_Ei^Ar = 0  if  Ar ∧ Ei = ∅ .

The usual indicator of an event E corresponds to the trivial partition A = {E, E^c}.
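As a small computational illustration (our own sketch, not code from the book): once an event is identified with the set of atoms it contains, the indicator vector of Definition 2 follows directly from the containment relation. The function name and the labelled-atom representation are illustrative assumptions.

```python
# Indicator vector of an event with respect to a set of atoms (Definition 2):
# component r is 1 if A_r ⊆ E_i and 0 if A_r ∧ E_i = ∅.
def indicator_vector(event, atoms):
    """event: set of atom labels whose union is the event."""
    return [1 if a in event else 0 for a in atoms]

atoms = ["A1", "A2", "A3"]                    # hypothetical three-atom partition
print(indicator_vector({"A1", "A3"}, atoms))  # [1, 0, 1]
```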
Example 4 - A patient feels serious generalised abdominal pains, fever and retches. The doctor puts forth the following hypotheses concerning the possible relevant disease:

H1 = ileum , H2 = peritonitis , H3 = acute appendicitis, with an ensuing local peritonitis.

Moreover the doctor assumes a natural logical condition such as

H3 ⊂ H1^c ∧ H2 ,

so that the given events are not logically independent. Correspondingly there are then five atoms

A1 = H1 ∧ H2 ∧ H3^c ,
A2 = H1 ∧ H2^c ∧ H3^c ,
A3 = H1^c ∧ H2 ∧ H3^c ,
A4 = H1^c ∧ H2 ∧ H3 ,
A5 = H1^c ∧ H2^c ∧ H3^c .

Clearly, the events H1, H2, H3 have been chosen as the most natural according to the doctor's experience: they do not have any specific algebraic structure and do not constitute a partition of the certain event Ω. Moreover, a doctor often assigns degrees of belief directly to sets of hypotheses (for example, he could suspect that the disease the patient suffers from is an infectious one, but he is not able to commit any belief to particular infectious diseases).
2.5 Toward probability

Since in general it is not known whether an event E is true or not, we are uncertain about E. In our framework, probability is looked upon as an "ersatz" for the lack of information on the actual "value" of the event E, and it is interpreted as a measure of the degree of belief in E held by the subject that is making the assessment. As we shall see in the next chapters, we can only judge, concerning a probability assessment over any set of events whatsoever, whether or not it is among those evaluations which are coherent.

Notice also that a careful distinction between the meaning of probability and all its possible methods of evaluation is essential: ignoring this distinction would be analogous to identifying the concept of temperature with the number shown by a thermometer, so being not entitled to speak of temperature in a room without a thermometer (these aspects will be further discussed in Chapter 8).
Chapter 3

Finitely Additive Probability

3.1 Axioms

A usual way of introducing probability is through the following framework: given a non-empty set Ω (representing the certain event) and an algebra 𝒜 of subsets (representing events) of Ω, a probability on (Ω, 𝒜) is a real-valued set function P satisfying the following axioms:

(A1) P(Ω) = 1;
(A2) P(A ∨ B) = P(A) + P(B) for incompatible A, B ∈ 𝒜;
(A3) P(E) is non-negative for any E ∈ 𝒜.
Remark 2 - A simple consequence of (A1)-(A3) is that P(E) = 0 if E = ∅, but (obviously) the converse is not true. Even if we will deal in this book mainly with a "finite world", the consideration of (not impossible) events of zero probability is nevertheless unavoidable (see Section 3.3).

The algebraic condition put on the definition of probability (i.e. the requirement that 𝒜 be an algebra) strengthens the effectiveness of
axioms (A1)-(A3): for instance, a trivial consequence of additivity is the monotonicity of P, that is: A ⊆ B implies P(A) ≤ P(B). But what is more, they imply that, given any finite partition ℬ = {B1, ..., Bn} ⊆ 𝒜 of Ω, the probability of any event E belonging to the algebra spanned by ℬ is completely specified by the probabilities P(Bi), Bi ∈ ℬ, since necessarily

P(E) = Σ_{Bi ⊆ E} P(Bi) .
3.2 Sets (of events) without structure

In many real situations we cannot expect that the family of events we need to deal with has some algebraic structure. So, if ℰ is just any collection of subsets of Ω, representing events and subject only to the requirement that Ω ∈ ℰ, then (A1)-(A3) are insufficient to characterise P as a probability on (Ω, ℰ): for example, if ℰ contains no union of disjoint sets, (A2) is vacuously satisfied. Moreover, it may even happen that there does not exist an extension of P onto an algebra 𝒜 containing ℰ, with P satisfying (A1)-(A3).
Example 5 - Let {F, G, H} be a partition of Ω: consider the family

ℰ = {E1 = F ∨ G , E2 = F ∨ H , E3 = G ∨ H , H , Ω}

and the assignment

P(E1) = 4/9 , P(E2) = 4/9 , P(E3) = 2/3 , P(H) = 5/9 , P(Ω) = 1 .

It can be easily verified that (A1)-(A3) hold on ℰ: nevertheless monotonicity of P does not hold, since P(H) > P(E2) while H ⊂ E2. Now, even if we consider the family ℰ' obtained by deleting the event H from ℰ (giving up also the corresponding assessment P(H) = 5/9, so that monotonicity holds), there does not exist an extension of P onto the algebra

𝒜 = {∅, F, G, H, F ∨ G, F ∨ H, G ∨ H, Ω}

verifying (A1)-(A3). In fact this extension should satisfy the system

P(F) + P(G) = 4/9
P(F) + P(H) = 4/9
P(G) + P(H) = 2/3
P(F) + P(G) + P(H) = 1

while we get (by summation)

2P(F) + 2P(G) + 2P(H) = 4/9 + 4/9 + 2/3 = 14/9 ,

that is

P(F) + P(G) + P(H) = 7/9 < 1 .

So the extension of P from ℰ' to the algebra 𝒜 is not a probability on 𝒜. This can also be expressed by saying that P is not coherent (see Chapter 4) on ℰ'.
3.3 Null probabilities

If You ask a mathematician to choose at his will - in a few seconds - and tell us a natural number n, he could choose and tell any element of ℕ, such as "the factorial of the maximum integer less than e^27". If You judge that choice as not privileging any natural number with respect to any other one, then a probability distribution expressing all these possible choices is necessarily "uniform", i.e. P(n) = 0 for every n.
This also means that a finitely additive setting may be a better framework than the more usual σ-additive one: obviously, the adoption of finite additivity as a general norm does not prevent us from considering probabilities which are σ-additive (possibly with respect to some particular subfamily of events), when this turns out to be suitable. What is essential is that the latter property be seen as a specific feature of the information embodied in that particular situation, and not as a characteristic of every distribution. For a deepening of these aspects, see the interesting debate between de Finetti and Fréchet (the relevant papers must be read in the order [45], [63], [46], [64], [47]), and the expository papers [109],
[110]. A concrete example concerning a statistical phenomenon (the so-called first digit problem) is discussed in Chapter 18, Example 30. A common misunderstanding is one which makes finite or countable additivity correspond to the consideration, respectively, of a finite or infinite set of outcomes: we may instead have an infinite set of possibilities, but this does not imply countable additivity of the relevant probability on this set. We end this Chapter with an example that concerns a zero probability assessment in a "finite world".
Example 6 - You toss a coin twice and consider the following outcomes, for k = 1, 2:

Sk = the coin stands (e.g., leaning against a wall) at the k-th toss,

and, analogously, denote by Hk and Tk, respectively, heads and tails. The "natural" probability assessments are

P(Sk) = 0 , P(Hk) = 1/2 , P(Tk) = 1/2 ,

since the events Sk are not impossible, neither logically nor practically, but the classic probability assignments to heads and tails force P(Sk) = 0. Now, You may wish to assign probabilities to the possible outcomes of the second toss conditionally on the result of the first one, for example conditionally on S1. Even if You had no idea of the formal concept of conditional probability, for You the "natural" and "intuitive" assignments are, obviously,

P(H2|S1) = P(T2|S1) = 1/2 , P(S2|S1) = 0 .

As we shall discuss at length in Chapter 11, it is in fact possible to assign directly (i.e., through the concept of coherence and without resorting to the classic Kolmogorov definition) the above probabilities, even if the conditioning event has zero probability. Other "real" examples of zero probability assignments are in [28], [109], [110], and some will be discussed in detail in Chapter 18.
Chapter 4

Coherent probability

The role of coherence is that of ruling probability evaluations concerning a family containing a "bunch" of events, independently of any requirement of "closure" of the given family with respect to logical operations. Even if its intuitive semantic interpretation can be expressed in terms of a betting scheme (as we shall see in Chapter 5), this circumstance must not hide the fact that its role is essentially syntactic.
4.1 Coherence

To illustrate the concept of coherence, consider, for i = 1, 2, ..., n, an assessment p_i = P(Ei) on an arbitrary finite family

ℰ = {E1, ..., En} ,

and denote by A1, ..., Am the atoms generated by these events.

Definition 3 - An assessment p_i = P(Ei), i = 1, 2, ..., n, on an arbitrary finite family ℰ is called coherent if the function P can be extended from ℰ to the algebra 𝒜 generated by these events in such a way that P is a probability on 𝒜.
In particular, P is then defined on the set of atoms generated by ℰ, and so coherence amounts to the existence of at least one solution of the following system, where x_r = P(Ar):

    Σ_{Ar ⊆ Ei} x_r = p_i ,   i = 1, 2, ..., n ,
                                                        (4.1)
    Σ_{r=1}^{m} x_r = 1 ,   x_r ≥ 0 ,   r = 1, 2, ..., m .
Remark 3 - In the above system (containing also m inequalities), the number of equations is n + 1, where n is the number of events of the family ℰ, and the number of unknowns is equal to the number m of atoms. When the n events are logically independent (see Definition 1), any assessment p_i (i = 1, 2, ..., n) with 0 ≤ p_i ≤ 1 is coherent (cf. de Finetti [53], Vol. 1, p.109 of the English translation).

Example 7 - As we have shown in Example 5, the assessment on ℰ' (and, all the more so, on ℰ) is not coherent. A simple check shows, however, that the assessment obtained by substituting P(E2) = 8/9 in place of P(E2) = 4/9 is instead coherent (even on ℰ). Since the atoms are exactly the same as before, the solution of the corresponding system, that is

P(F) = 1/3 , P(G) = 1/9 , P(H) = 5/9 ,

follows by an elementary computation.

Notice that in the previous examples we have met instances of the following three situations:

• a probability (that is, a P satisfying (A1)-(A3) on an algebra 𝒜) is coherent;
• a coherent function P (on a family ℰ) is the restriction to ℰ of a probability on a Boolean algebra 𝒜 ⊇ ℰ;
• a function P satisfying (A1)-(A3) on ℰ may not be extendible as a probability.

In conclusion, we have (roughly speaking) the following set inclusions

𝒫 ⊂ 𝒞 ⊂ ℱ ,

where 𝒫 is the set of "all" probabilities, 𝒞 the set of "all" coherent assessments, and ℱ the set of "all" functions P just satisfying (A1)-(A3). Clearly, the latter set (as the previous discussion has pointed out) is not interesting, and we shall not deal with it any more.

Remark 4 - In the previous example the system (4.1) has just one solution. This is a very particular circumstance, since in general this system has an infinite number of solutions, so that a coherent assessment is usually the restriction of many (infinitely many) probabilities defined on the algebra generated by the given events.
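The check embodied in system (4.1) can be carried out mechanically. The following sketch is ours, not the book's: it assumes SciPy's linprog is available, encodes events as sets of atom labels, and treats Definition 3 as a pure linear feasibility problem (zero objective). It reproduces the incoherence of Example 5 and the coherence of the corrected assessment of Example 7.

```python
# A sketch of the coherence test of Definition 3 / system (4.1), assuming
# SciPy; events are given as sets of atom labels.
from scipy.optimize import linprog

def is_coherent(events, p, atoms):
    # One equality per assessed event: sum of x_r over atoms A_r ⊆ E_i is p_i.
    A_eq = [[1.0 if a in E else 0.0 for a in atoms] for E in events]
    b_eq = list(p)
    A_eq.append([1.0] * len(atoms))  # normalization: the x_r sum to 1
    b_eq.append(1.0)
    res = linprog(c=[0.0] * len(atoms), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * len(atoms))  # x_r >= 0
    return res.status == 0  # feasible <=> coherent

atoms = ["F", "G", "H"]
events = [{"F", "G"}, {"F", "H"}, {"G", "H"}, {"H"}]
print(is_coherent(events, [4/9, 4/9, 2/3, 5/9], atoms))  # False (Example 5)
print(is_coherent(events, [4/9, 8/9, 2/3, 5/9], atoms))  # True  (Example 7)
```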
where P is the set of "all" probabilities, C the set of "all" coherent assessments, and :F the set of "all" functions P just satisfying (AlA3). Clearly, the latter set (as the previous discussion has pointed out) is not interesting, and we shall not deal any more with it. Remark 4 -In the previous example the system (4.1) has just one solution. This is a very particular circumstance, since in general this system has an infinite number of solutions, so that a coherent assessment is usually the restriction of many (infinite) probabilities defined on the algebra generated by the given events.
The following example (a continuation of Example 4) refers to a situation in which the relevant system (1) has infinite solutions. Example 8 - We go on with Example 4: the doctor gives {initially: we will deal with updating in Chapters 11, 13, and 16) the following probability assessments:
P(H3)
1
=B.
Clearly, this is not a complete assessment (as it has been previously discussed), and so the extension to other events of these evaluations - once coherence is checked -is not necessarily unique. The above (partial) assessment is coherent, since the function P can be extended from the three given events to the set of relevant atoms in such a way that P is a probability on the algebra generated
CHAPTER 4
34
by them, i.e. there exists a solution of the following system with unknowns Xr = P(A,.) +xa = ~ x1 +xa +x4 XI
X4 5
=k
=l
LXr=1 r=l Xr ~
0.
For example, given A, with 0 :5 A :5 XI=
A,
1
xa =--A, 2
to, then
3 xa = 40 - A,
1
X4
=
B'
3 xs = 10 +A,
is such a solution.
Since (as we shall see in the next Chapter) the compatibility of system (4.1) is equivalent to the requirement of avoiding the so-called Dutch Book, this example (of non-uniqueness of a coherent assessment) can be seen also as an instance of the (important) issue raised at the end of Section 1.2, concerning a quotation from Gardenfors' book [67].
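The non-uniqueness just discussed can also be exhibited numerically. The following sketch is our own illustration (again assuming SciPy's linprog, not code from the book): it minimizes and maximizes x1 = P(A1) over the solutions of the above system, recovering the bounds 0 and 3/40 of the parameter λ.

```python
# Minimize and maximize x1 = P(A1) over the solution set of Example 8's
# system: a sketch of how the range 0 <= λ <= 3/40 can be recovered.
from scipy.optimize import linprog

A_eq = [
    [1, 1, 0, 0, 0],  # x1 + x2       = P(H1) = 1/2
    [1, 0, 1, 1, 0],  # x1 + x3 + x4  = P(H2) = 1/5
    [0, 0, 0, 1, 0],  # x4            = P(H3) = 1/8
    [1, 1, 1, 1, 1],  # x1 + ... + x5 = 1
]
b_eq = [1/2, 1/5, 1/8, 1]
bounds = [(0, None)] * 5

low = linprog(c=[1, 0, 0, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
high = linprog(c=[-1, 0, 0, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(low.fun, -high.fun)  # 0.0 and 0.075 = 3/40
```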
4.2 Null probabilities (again)

Notice that coherent assessments (even if strictly positive) may possibly assign (compulsorily!) zero probability to some atoms Ar: for example, let A, B, C be three events, with C ⊂ A ∧ B, and assess P(A) = P(B) = P(C) = 1/2. This assessment may come from a uniform distribution on the square E = [0, 1] × [0, 1] ⊂ ℝ², taking, e.g.,

A = {(x, y) ∈ E : 0 ≤ x < 1 , 0 < y ≤ 1/2} ,
B = {(x, y) ∈ E : 0 < x ≤ 1 , 0 ≤ y < 1/2} ,
C = {(x, y) ∈ E : 0 < x < 1 , 0 < y < 1/2} \ {(x, y) ∈ E : x = 1/4} .

The relevant atoms are

A1 = A ∧ B ∧ C = C , A2 = A ∧ B ∧ C^c , A3 = A^c ∧ B ∧ C^c ,
A4 = A ∧ B^c ∧ C^c , A5 = A^c ∧ B^c ∧ C^c ,

and system (4.1) reads

x1 = 1/2
x1 + x2 + x4 = 1/2
x1 + x2 + x3 = 1/2
Σ_{r=1}^{5} x_r = 1
x_r ≥ 0 .

Its only solution is

x1 = 1/2 , x2 = x3 = x4 = 0 , x5 = 1/2 ,

i.e. it assigns 0 probability to the atoms A2, A3, A4. In conclusion, this is another instance of the fact that dealing with zero probability is unavoidable, even in a "finite world"!
Chapter 5

Betting Interpretation of Coherence

In the relevant literature, the term "coherence" refers to the betting paradigm introduced by de Finetti (see, e.g., [53]). Our aim is now to show that the two concepts of coherence are syntactically equivalent: for this, we need to resort to a classic theorem of convex analysis (also known as the "alternative theorem": for the proof see, for instance, [66]).

Theorem 1 - Let M and N be, respectively, real (k × m) and ((h − k) × m) matrices, x an unknown (m × 1) column vector, and μ and ν, respectively, (1 × k) and (1 × (h − k)) unknown row vectors. Then exactly one of the following two systems of linear inequalities has a solution:

    Mx > 0 ,  Nx ≥ 0 ,  x ≥ 0 ;                  (5.1)

    μM + νN ≤ 0 ,  μ, ν ≥ 0 ,  μ ≠ 0 ,           (5.2)
where 0 is the null row vector.
Remark 5 - The above theorem holds not only for real matrices and vectors, but also if their elements are rational numbers, so that no computational problems, due to the relevant truncations, can arise. By relying on the previous theorem, we can establish the following result:
Theorem 2 - Let B be an (n × m) matrix, y an unknown (m × 1) column vector, and λ an unknown (1 × n) row (real) vector. Then exactly one of the following systems has a solution:

    By = 0 ,  y ≥ 0 ,  ||y|| = 1 ;               (5.3)

    λB > 0 ,                                     (5.4)

with

    ||y|| = Σ_{i=1}^{m} y_i .
Proof - First of all, we note that if (5.1) admits a solution x, then it admits also a solution y, with y_r = x_r/||x||. Take, in the above Theorem, k = 1 and h = 2n + 1, and take M as the unitary (1 × m) row vector, and N as a (2n × m) matrix whose first n lines are those of the matrix B, and the remaining ones those of its opposite −B. Then My > 0 gives Σ_{r=1}^{m} y_r = 1, that is the second line of (5.3), while the (2n × 1) column vector Ny equals (By, −By)ᵀ, so that the first line of (5.3) follows.

On the other hand, μM is a (1 × m) row vector with all components equal to a nonnegative real number μ, and so the first line of (5.2) gives νN < 0; denoting by ν1 the vector whose columns are the first n columns of ν and by ν2 the vector whose columns are the remaining ones, this can be written (ν1 − ν2)B < 0. In conclusion, considering a vector λ = ν1 − ν2 with real components, we get λB < 0, or else (by changing the sign of λ) the form (5.4). ∎

We go back now to system (4.1), which expresses coherence of the assessment p_i = P(Ei), with i = 1, ..., n, on the finite family ℰ: it can be written in the matrix form (5.3), where B denotes the (n × m) matrix whose i-th row (i = 1, 2, ..., n) is I_Ei − p_i, and I_Ei is the indicator vector of Ei (see Definition 2), that is

    I_Ei^A1 − p_i , ... , I_Ei^Am − p_i .
By Theorem 2, it has a solution if and only if the "dual" system

    λB > 0                                        (5.5)

has no solution. Now, putting λB = G, the columns of G are

    g_r = Σ_{i=1}^{n} λ_i (I_Ei^Ar − p_i) ,   r = 1, 2, ..., m ,

i.e. all the possible values, corresponding to all the "outcomes" singled out by the relevant atoms, of the function

    G = Σ_{i=1}^{n} λ_i (I_Ei − p_i) .            (5.6)

So the system (4.1) has a solution if and only if, for any choice of the real numbers λ_i,

    inf_{Ar} G = inf_{Ar} Σ_{i=1}^{n} λ_i (I_Ei − p_i) ≤ 0 .   (5.7)
And what is the meaning of G? First of all, a possible interpretation of p_i = P(Ei) is to regard it as the amount paid to bet on the event Ei, with the proviso of receiving an amount 1 if Ei is true (the bet is won) or 0 if Ei is false (the bet is lost), so that, for any event E, "the indicator I_E is just the amount got back by paying P(E) in a bet on E".
It is possible (and useful) to consider, in a bet, also a "scale factor" (called stake) λ_i, that is, to refer to a payment p_i λ_i in order to receive - when the bet is won - an amount λ_i (we were previously referring to the case λ_i = 1: "wealthy people" would choose a bigger λ_i!). Since each λ_i is a real number, its consideration is useful also to exploit its sign to make bets in both directions (exchanging the role between bettor and bank, that is, the role between the two verbs "to pay" and "to receive").

Following this interpretation, notice that (5.6) represents the random gain for any combination of bets on some (possibly all) events of the given family ℰ: the events Ei on which a bet is actually made are those corresponding to λ_i ≠ 0 (by the way, this is not equivalent to paying 0 for the events on which we do not bet, since we might pay 0 - and bet - also for some of the former Ei, if p_i = 0; for example, betting on the events Sk, Hk, Tk of Example 6, the expression

    G = λ1 (I_Sk − 0) + λ2 (I_Hk − 1/2) + λ3 (I_Tk − 1/2)

represents the relevant random gain).

Then the coherence condition (5.7) - equivalent to the compatibility of system (4.1) - corresponds to the requirement that the choice of the p_i's must avoid the so-called Dutch Book: "possible gains all positive" (or all negative, by changing the sign of the λ_i's). Notice that coherence does not mean that there is at least an outcome in which the gain is negative: it is enough that at least one outcome corresponds to a gain equal to 0. In other words: no sure losers or winners!
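Before the next example, a small numerical illustration (ours, not the book's) of the Dutch Book just described: for the incoherent assessment of Example 5, unit stakes on E1, E2, E3 yield a gain that is strictly positive at every atom, i.e. a sure win for the bettor.

```python
# Gain G = Σ λ_i (I_{E_i} - p_i), evaluated at each atom, for the incoherent
# assessment of Example 5 with unit stakes: a sketch of a Dutch Book.
indicator = {  # indicator vectors of E1, E2, E3 at the atoms F, G, H
    "F": [1, 1, 0],
    "G": [1, 0, 1],
    "H": [0, 1, 1],
}
p = [4/9, 4/9, 2/3]
stakes = [1, 1, 1]

for atom, I in indicator.items():
    gain = sum(l * (ind - pi) for l, ind, pi in zip(stakes, I, p))
    print(atom, round(gain, 4))  # 0.4444 at every atom: the bettor always wins
```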
For example, given A and B, with A ∧ B = ∅, You may bet (as bettor) on A and on B by paying (respectively) p' and p'', and bet (as bank) on A ∨ B by "paying" −p' − p'' (i.e. by receiving p' + p''): this is a coherent combination of bets, since the relevant possible gains are obviously all equal to zero.

Remark 6 - Since coherence requires that (5.5) has no solution, it follows that, for any choice of the unknowns λ_i, the coherent values of the p_i's must render (5.5) not valid. In other words: coherence is independent of the way you bet (that is - according to the sign of λ_i - it is irrelevant whether you are paying money, being the bettor, or whether you are receiving money, being the bank), and it is also independent of your ... "wealth" (that drives the choice of the size of λ_i).

Recall that, given n events, the gain (5.6) refers to any combination of bets on some (possibly all) of these events: they are singled out by choosing λ_i ≠ 0, and so there is no need to mention their number k ≤ n. Conversely, we could undertake a number of bets greater than n, i.e. consider some events more than once, say h times, since this is the same as just summing the corresponding λ_i to get hλ_i. Therefore we can express the definition of coherence (for a finite family of events) taking as number of bets any k ∈ ℕ (choosing from the set {1, 2, ..., n} some indices with possible repetitions).

These (obvious) remarks suggest retaining the definition of coherence (in terms of betting) also for an infinite (arbitrary) family ℰ of events. Therefore (recalling Definition 3) a real function P defined on ℰ is called coherent if, for every finite subfamily ℱ ⊂ ℰ, the restriction of P to ℱ is a coherent probability (i.e., it is possible to extend it as a probability on the algebra 𝒢 spanned by ℱ). We proved elsewhere (for details, see [25]) that this is equivalent (similarly to the finite case) to the existence of an extension f of P from ℰ to the minimal algebra 𝒢 generated by ℰ: we need to resort
to the system

f(I_{E_i}) = p_i ,  E_i ∈ ℰ ,
f(I_Ω) = 1 ,

where f is an unknown linear functional on 𝒢, and the I_{E_i} are the indicator functions (see Chapter 2) of the events E_i, defined on the set A of atoms generated by ℰ (their definition is similar to that of the finite case, but allowing infinite conjunctions). If this system has a solution, the function f is a finitely additive probability on 𝒢 agreeing with P on ℰ. Moreover, by using an alternative theorem for infinite systems (see, for instance, [61], p. 123), it is possible to show that the above system has a solution if and only if the coherence condition (in terms of betting) holds for every finite subfamily of ℰ. Summing up: coherence of a probabilistic assessment on an arbitrary set ℰ (that is, the existence of a finitely additive probability on the algebra 𝒢 spanned by ℰ and agreeing with P on ℰ) is equivalent to coherence on any finite subset of ℰ; it is therefore of paramount importance to draw particular attention also to finite families of events.
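In computational terms, the finite case of this characterization is just a linear feasibility problem. The following is a minimal sketch (not from the book) of that check, under an assumed encoding of events as 0/1 atom-indicator rows; the function name and encoding are illustrative, and scipy is assumed to be available.

# Sketch: coherence of a finite assessment p_1, ..., p_n as feasibility of
# the system f(I_{E_i}) = p_i, f(I_Ω) = 1 over nonnegative atom probabilities.
from scipy.optimize import linprog

def probability_coherent(event_rows, probs):
    """event_rows[i][r] = 1 iff atom r is contained in E_i; probs[i] = p_i."""
    m = len(event_rows[0])
    A_eq = [list(row) for row in event_rows] + [[1.0] * m]  # one row per E_i, plus normalization
    b_eq = list(probs) + [1.0]
    # A zero objective turns linprog into a pure feasibility check.
    res = linprog(c=[0.0] * m, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * m)
    return res.success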
Chapter 6

Coherent Extensions of Probability Assessments

6.1 de Finetti's fundamental theorem
Given a coherent assessment p_i = P(E_i), i = 1, 2, ..., n, on an arbitrary finite family ℰ = {E_1, ..., E_n}, consider a further event E_{n+1} and the corresponding extended family K = ℰ ∪ {E_{n+1}}. If E_{n+1} is logically dependent on the events of ℰ, i.e. E_{n+1} is a union of some of the atoms A_r generated by ℰ, then, putting x_r = P(A_r), we have

p_{n+1} = Σ_{A_r ⊆ E_{n+1}} x_r ,

with p_{n+1} = P(E_{n+1}). Letting the vector (x_1, x_2, ..., x_m) assume each value in the set X of solutions of system (4.1), corresponding to all possible extensions of the initial assessment P to the atoms, the probability p_{n+1} describes an interval [p', p''] ⊆ [0, 1], with

p' = inf_X p_{n+1} ,  p'' = sup_X p_{n+1} ;
in fact, since any point of the interval [p', p''] can obviously be written (for 0 ≤ a ≤ 1) as a convex combination ap' + (1 − a)p'', it follows that it corresponds to the convex combination (with the same a) of two (vector) solutions (x'_1, x'_2, ..., x'_m) and (x''_1, x''_2, ..., x''_m) (and the latter convex combination is also a solution, since the system is linear). On the other hand, if E_{n+1} is not logically dependent on the events of ℰ, we are in the situation discussed at the beginning of Section 2.4: we could go on in the subdivision into atoms, by considering now those generated by the n + 1 events of the family K, so making E_{n+1} logically dependent on the "new" atoms. But we could avoid this procedure by noting instead that, if E_{n+1} is not logically dependent on the events of ℰ, there will exist two events E_* and E^* (possibly E_* = ∅ and E^* = Ω) that are, respectively, the "maximum" union of atoms (generated by the initial family ℰ) contained in E_{n+1} and the "minimum" such union containing it, so that

E_* ⊆ E_{n+1} ⊆ E^* .

Then, given the probabilities x_r of the atoms, coherent assessments of P(E_{n+1}) are all real numbers of the closed interval [P(E_*), P(E^*)], i.e.

Σ_{A_r ⊆ E_*} x_r ≤ p_{n+1} ≤ Σ_{A_r ⊆ E^*} x_r .

Letting again the vector (x_1, x_2, ..., x_m) assume each value in the set X, the probability p_{n+1} describes an interval [p', p''] ⊆ [0, 1], with

p' = inf_X P(E_*) ,  p'' = sup_X P(E^*) .
In conclusion, a coherent assessment of p_{n+1} is any value p ∈ [p', p'']. This result is dubbed the fundamental theorem of probability by de Finetti (see [53]). An extensive discussion of the fundamental theorem, with several computational and geometrical examples, is in [90], [91].
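Computationally, the two bounds are the optimal values of a pair of linear programs over the solution set X. A minimal sketch follows, reusing the atom-row encoding of the feasibility check above; coherent_interval is an illustrative name, not the authors' terminology.

# Sketch: the interval [p', p''] of the fundamental theorem, for an event
# expressed (or bounded) as a union of atoms, via two linear programs.
from scipy.optimize import linprog

def coherent_interval(event_rows, probs, target_row):
    """Minimize and maximize the probability of the event encoded by
    target_row, subject to the assessed values probs on event_rows."""
    m = len(target_row)
    A_eq = [list(row) for row in event_rows] + [[1.0] * m]
    b_eq = list(probs) + [1.0]
    bounds = [(0.0, None)] * m
    lo = linprog(c=list(target_row), A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    hi = linprog(c=[-v for v in target_row], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not (lo.success and hi.success):
        raise ValueError("the initial assessment is not coherent")
    return lo.fun, -hi.fun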
Example 9 - Given two boxes A and B, let r_A and r_B be the (unknown) numbers of red balls, respectively, in A and in B. Consider the events E_1 = {r_A > r_B}, E_2 = {r_B = 0}, and the assignment

p_1 = 0.5 ,  p_2 = 0.2 .   (6.1)

The relevant atoms are

A_1 = E_1 ∧ E_2 = {r_A > r_B = 0} ,  A_2 = E_1 ∧ E_2^c = {r_A > r_B > 0} ,
A_3 = E_1^c ∧ E_2 = {r_A = r_B = 0} ,  A_4 = E_1^c ∧ E_2^c = {r_A ≤ r_B, r_B > 0} .

The assessment (6.1) is coherent (see Remark 3): the system (4.1) has, for any λ with 0.3 ≤ λ ≤ 0.5, the solutions

x_1 = λ − 0.3 ,  x_2 = 0.8 − λ ,  x_3 = 0.5 − λ ,  x_4 = λ .
Consider now a new event E which is not logically dependent on E_1, E_2: in fact we have

E_* = A_1 ∨ A_3 ,  E^* = A_1 ∨ A_3 ∨ A_4 ,

so that 0.2 ≤ p ≤ 0.2 + λ, with p = P(E). In conclusion, any value of the interval [p', p''] = [0.2, 0.7] is a coherent assessment for P(E).
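In the encoding of the sketches above, Example 9 reads as follows (coherent_interval is the hypothetical helper introduced earlier):

# Atoms A1..A4; E1 = A1 ∨ A2, E2 = A1 ∨ A3, with p1 = 0.5, p2 = 0.2.
E1 = [1, 1, 0, 0]
E2 = [1, 0, 1, 0]
# Lower bound from E_* = A1 ∨ A3, upper bound from E^* = A1 ∨ A3 ∨ A4.
p_low, _ = coherent_interval([E1, E2], [0.5, 0.2], [1, 0, 1, 0])
_, p_high = coherent_interval([E1, E2], [0.5, 0.2], [1, 0, 1, 1])
print(p_low, p_high)   # 0.2 0.7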
6.2 Probabilistic logic and inference
Many interesting and unexpected features come to the fore when one tries to extend the above theory to conditional events and to the ensuing relevant concept of conditional probability, once a suitable
extension of the concept of coherence is introduced (this will be the subject of Chapters 10 - 13): to make inference is in fact a problem of extension (of a given probability assessment to "new" events), where a relevant and crucial role is played by the concept of conditioning. In the literature on Artificial Intelligence, a relevant theory by N.J. Nilsson [98] (referring only to unconditional events) is called "probabilistic logic", but it is just a re-phrasing (with different terminology) of de Finetti's theory, as Nilsson himself acknowledges in [99]. Usually, the problem of checking coherence and determining lower and upper bounds p' and p" for the probability of an additional event is formulated as a linear programming problem with a system of constraints depending on an exponential number of real variables. Similar ideas appeared already in the classical work of Boole [12], which attracted little attention until it was revived by Hailperin [79]. Methods of solution based on the so-called "column generation" have been implemented in [82]. A specific problem occurring in signal theory has been studied by the simplex method in [15]. Other relevant approaches are given in [70] and (for solving the problem of checking coherence of a partial probability assessment through a set of simplification rules) in [2], [3]. The circumstance that the coherent values p of a (new) event En+l are such that p E [p', p"] has nothing to do with the so-called "imprecise probabilities" (a terminology introduced by Walley [125]). In particular, we agree with de Finetti ([53], p. 368 of the English translation), whose answer to the question "Do imprecise probabilities exist?" is essentially (as we see it) YES and NO. To clarify this issue, let us take some excerpts from the quoted reference: "The question as it stands is rather ill-defined, and we must first of all make precise what we mean. In actual fact, there
is no doubt that quantities can neither be measured, nor thought of as really defined with the absolute precision demanded by mathematical abstraction ... A subjective evaluation, like that involved in expressing a probability, attracts this criticism to an even greater degree ... It should be sufficient to say that all probabilities, like all quantities, are in practice imprecise, and that in every problem involving probability one should provide, just as one does for other measurements, evaluations whose precision is adequate in relation to the importance of the consequences that may follow ... The question posed originally, however, really concerns a different issue, one which has been raised by several authors: it concerns the possibility of cases in which one is not able to speak of a single value p for a given probability, but rather of two values, p' and p'', which bound an area of indeterminacy, p' ≤ p ≤ p'', possessing some essential significance ... The idea of translating the imprecision into bounds, p' ≤ p ≤ p'' ... is inadequate if one wishes to give an idea of the imprecision with which every quantity is known or can be considered".

In other words, the so-called "imprecision" does not concern individual events or isolated features: one should rather think of the possible links or "freedom" in the choice of the function P deriving from logical or probabilistic relations involving many events. Similar remarks concerning the function P are shared - through some subtle considerations concerning indeterminacy - by Williams (see [127]), who also claims, at the beginning of the quoted paper: "It has been objected against the subjective interpretation of probability that it assumes that a subject's degree of belief P(E) in any event or proposition E is an exact numerical magnitude which might be evaluated to any desired number of decimal places ... The same argument, however, would appear to show that no empirical magnitude can satisfy laws expressed in the classical logico-mathematical framework, so long as it is granted that
indeterminacy, to a greater or lesser extent, is present in all empirical concepts". Nevertheless it could be interesting to study the coherence of a probability assessment possibly involving both "precise" and "imprecise" evaluations: the most genuine situation in an updating process is that in which we get - as (coherent) extension of an initial coherent assessment - an upper and a lower probability (see Chapter 15); so, if we want to go on updating by taking into account new "information" (for example, some further probability values), we need to check the "global" coherence - as lower and upper probability - of the new values together with the previous upper and lower probability. The relevant theory is dealt with, in the more general framework of conditional assessments, in Chapter 16, where some actual cases are also discussed.
Chapter 7

Random Quantities

Given v events E_1, ..., E_v and v real numbers y_1, ..., y_v, a discrete random variable is defined as

Y = Σ_{k=1}^{v} y_k I_{E_k} .   (7.1)

When the coefficients y_1, ..., y_v belong to an arbitrary set, Y is called a random quantity. When the events E_k are a partition of Ω, we say that Y is in "canonical" form. Making this assumption is not restrictive, since any random quantity can be suitably reduced to its canonical form through the atoms generated by the events E_1, ..., E_v (see Definition 1, Chapter 2). Notice that (7.1) can be regarded as the amount got back in a bet on Y - that is, in a combination of bets on the v events E_1, ..., E_v - made by paying amounts p_1 y_1, ..., p_v y_v (i.e. with stakes y_1, ..., y_v). The total amount paid, i.e.

ℙ(Y) = Σ_{k=1}^{v} p_k y_k ,   (7.2)

is the so-called prevision (or expectation) of Y when the set {p_1, ..., p_v} is a coherent probability assessment on the family E_1, ..., E_v. So, in the particular case that the random variable Y is just an event E (its indicator I_E), we have ℙ(Y) = P(E), i.e. prevision
reduces to probability. It follows that eqs. (7.1) and (7.2) can be read as the natural generalizations of the statement concerning the indicator of an event E, given in Chapter 5: "the value of Y is just the amount got back by paying ℙ(Y) in a bet on Y". (In Chapter 10 we shall see that a conditional event can be defined as a particular random quantity, where one of the coefficients is not a given real number, but a real function.) On the other hand, consider now a set 𝒴 of n random variables (in canonical form)

Y_i = Σ_{k=1}^{v_i} y_k^i I_{E_k^i} ,  i = 1, 2, ..., n ,

and define

Y_1 + Y_2 = Σ_{k=1}^{v_1} y_k^1 I_{E_k^1} + Σ_{k=1}^{v_2} y_k^2 I_{E_k^2}

and, for a ∈ ℝ,

aY_i = Σ_{k=1}^{v_i} a y_k^i I_{E_k^i} .

Given a real function ℙ defined on 𝒴, the function ℙ is said to be a (coherent) prevision on 𝒴 if it is possible to extend it as a linear and homogeneous operator on a linear space ℒ containing 𝒴 and a non-zero constant element, so that, for any Y_1, Y_2, Y ∈ ℒ and a ∈ ℝ,

ℙ(Y_1 + Y_2) = ℙ(Y_1) + ℙ(Y_2) ,
ℙ(aY) = a ℙ(Y) .
It can be shown that ℙ is coherent on 𝒴 if and only if there exists a coherent probability on the set of the relevant events E_k^i, with k = 1, 2, ..., v_i and i = 1, 2, ..., n, such that, for any Y_i ∈ 𝒴, one has

ℙ(Y_i) = Σ_{k=1}^{v_i} P(E_k^i) y_k^i .
The requirement of coherence for ℙ is equivalent to the solvability of the system

Σ_k Σ_{A_r ⊆ E_k^i} x_r y_k^i = ℙ(Y_i) ,  i = 1, 2, ..., n ,

Σ_{r=1}^{m} x_r = 1 ,  x_r ≥ 0 ,  r = 1, 2, ..., m ,
where we use again m to denote the number of atoms (generated by the v_1 + v_2 + ... + v_n events E_k^i) and each x_r denotes the corresponding probability. By an argument similar to that relative to events (again involving the alternative theorem), we can conclude that ℙ is coherent if and only if, for any choice of the real numbers λ_i, the values of the random variable (gain)

G = Σ_{i=1}^{n} λ_i (Y_i − ℙ(Y_i))   (7.3)

are not all positive or all negative (where "all" means for every possible outcome, each one being singled out by one of the m atoms).
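As in the unconditional case, this solvability condition is directly checkable as a linear feasibility problem. Here is a minimal sketch under an assumed encoding (each row lists the value of Y_i on every atom; names are illustrative):

# Sketch: coherence of a prevision assessment as feasibility of the system
# sum_r x_r * Y_i(A_r) = IP(Y_i), sum_r x_r = 1, x_r >= 0.
from scipy.optimize import linprog

def prevision_coherent(value_rows, previsions):
    """value_rows[i][r] = value taken by Y_i on atom r."""
    m = len(value_rows[0])
    A_eq = [list(row) for row in value_rows] + [[1.0] * m]
    b_eq = list(previsions) + [1.0]
    res = linprog(c=[0.0] * m, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * m)
    return res.success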
Chapter 8

Probability Meaning and Assessment: a Reconciliation

8.1 The "subjective" view
The point of interpreting events - and then probability - in terms of bets is to get an unmistakable, concrete and operational meaning, valid for any kind of event: in fact betting is conceivable in any circumstance that can be expressed by a sensible proposition, and not only in those corresponding to the classic combinatorial or frequentist evaluations. Moreover, notice that the so-called "subjective view" is based on hypothetical bets: the force of the argument does not depend on whether or not one actually has the possibility or intends to bet. So, even if the above discussion aims at giving a possible "semantic" interpretation of the concept of coherence, nevertheless we may just refer (thanks to the alternative theorem) to the mathematical side of the problem, which rests essentially on the compatibility of system (4.1). In order to fully grasp the richness of this approach, one needs to overcome
barriers created by prevailing opinions: for example, many approaches rely essentially upon a "combinatorial" assessment of probability (assuming equal probability of all possible cases) and upon the possibility of introducing the probability of an event through the frequency observed in the past for other events that are considered, in a sense, "equal" or "similar" to the one of interest. Yet it is not generally underlined that the choice of these events (as the choice, in the combinatorial approach, of the outcomes for which equal probability is assumed) is necessarily subjective ([48], [54]).
Example 10 - An insurance company that needs to evaluate the probability that a given person dies within the year can base its assessment on data referring to individuals of the same town (or region, or district) as the given person, or of the same age, or sex, or civil status, or of the same profession, or income, or having an analogous bodily constitution (height, weight, etc.), and so on, grouping in many different ways some or all of the preceding characteristics, or possibly others; and to each of these (subjective) choices there corresponds in general a different frequency.

In other words, it is essential to give up any artful limitation to particular events (not even clearly definable) and try to ascribe to probability a more general meaning, which after all should be a sensible way to cope with real situations: the degree of belief in the occurrence of an event. It is important to point out that our approach puts in the right perspective all the subjective aspects hidden in the so-called "objectivistic theories" (so our view concerning the meaning of terms such as "random quantity" is in complete agreement with the following position expressed by de Finetti [53]: "the meaning is simply that of 'not known' (for You), and consequently 'uncertain' (for You), but well-determined in itself ... Its truth value is unique, but if You call it random, this means that You do not know this truth value").
8.2 Methods of evaluation
The combinatorial and frequentist methods of evaluation of probabilities can easily be embedded into this general concept of probability. Referring to the former, given n possible outcomes represented by the events E_1, E_2, ..., E_n of a partition of Ω, and an event E which is a union of r among the E_i's, the evaluation P(E) = r/n follows easily from the addition law (A2) of Chapter 3 through the subjective opinion that a symmetry exists and that it implies equality of probabilities, namely P(E_i) = 1/n. As far as the frequentist method is concerned, let us just mention that it is possible - through the concept of exchangeability, introduced in [51] (where "exchangeable" events were called "equivalent") by B. de Finetti - to evaluate a probability taking into account suitable observed frequencies (relative to just a finite number of trials). Recently, in [92] a weak form of exchangeability has been introduced that allows frequentist evaluations also under more general conditions. We give here only a simplified procedure (see [111]) that avoids explicitly resorting to the deeper concept of exchangeability. Given a sequence A_1, A_2, ..., A_{2n} of events ("trials" of a given phenomenon), assume that the outcome of the first n is known, i.e. the corresponding "past frequency"

X = (I_{A_1} + I_{A_2} + ... + I_{A_n}) / n

is, say, k/n, and consider the "future frequency"

Y = (I_{A_{n+1}} + I_{A_{n+2}} + ... + I_{A_{2n}}) / n .
The quantities X and Y are discrete random variables (see Chapter 7); the prevision of Y is

ℙ(Y) = (P(A_{n+1}) + P(A_{n+2}) + ... + P(A_{2n})) / n .
If the above events are judged equally probable (subjective opinion!) we get, denoting by p = P(A_i) this common probability,

ℙ(Y) = p .   (8.1)

Assuming (subjective opinion!) that the probability distribution of the "future" frequency Y is equal to that of the "past" frequency X, whose value is known and equal to k/n (so that, trivially, ℙ(X) = k/n), from (8.1) it follows that p = k/n, i.e. the "frequentist" evaluation of P(A_i) for i ≥ n + 1. In conclusion, if we maintain the necessary distinction between the concept of probability and its multifacet methods of evaluation, a "reconciliation" is clearly possible, so avoiding many misunderstandings and idle discussions about the so-called different approaches (discussions that are often dubbed - improperly - as "frequentist probability versus subjective probability"). Not to mention that, thanks to conditional probability (Chapters 10 - 16), it is possible to give many different probability evaluations P(E|H) for each different "state of information" expressed by H (which may correspond, for example, to statistical data). For a discussion of the connections between frequency and probability in statistical physics, see Example 34 in Chapter 18.
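As a worked instance of the procedure above (the numbers are ours, chosen for illustration): with n = 20 observed trials of which k = 14 occurred, the two subjective judgments give ℙ(Y) = p and ℙ(Y) = ℙ(X) = 14/20, hence p = P(A_i) = 0.7 for every i ≥ 21; the observed frequency enters the evaluation only through those judgments, not through a definition of probability.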
Chapter 9

To Be or not To Be Compositional?

Some remarks concerning the concept of "compositional" (as referred to probability, and also called, in the relevant literature - e.g. in [100] - "truth-functional" belief) are now in order. Consider, for instance, the requirement that P(A ∧ B) should be determined by the values of P(A) and P(B) only: is that possible? The usual answer is NO. But in a coherent framework there are many possible values of P(A ∧ B) that are consistent with P(A) and P(B): in fact, putting P(A) = p_1 and P(B) = p_2, and introducing the relevant atoms A ∧ B, A^c ∧ B, A ∧ B^c, A^c ∧ B^c, with respective probabilities x_1, x_2, x_3, x_4, coherence gives

x_1 + x_3 = p_1 ,
x_1 + x_2 = p_2 ,
Σ_{r=1}^{4} x_r = 1 ,  x_r ≥ 0 ,  r = 1, ..., 4 ,
so that nonnegativity of the x_r's easily implies the following condition for x_1:

max{0, p_1 + p_2 − 1} ≤ x_1 ≤ min{p_1, p_2} ,

that is

max{0, P(A) + P(B) − 1} ≤ P(A ∧ B) ≤ min{P(A), P(B)} .   (9.1)
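A two-line sketch of these bounds (the function name is ours):

# The coherent interval (9.1) for P(A ∧ B), given only P(A) and P(B).
def conjunction_bounds(p_a: float, p_b: float) -> tuple:
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)

# e.g. conjunction_bounds(0.7, 0.6) == (0.3, 0.6): any value in between
# is a coherent choice for P(A ∧ B).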
So, in a sense, we could say that, among all possible extensions of P from A and B to the "new" event A ∧ B, any chosen coherent P(A ∧ B) is weakly compositional, since it is restricted to values belonging to an interval whose end points depend only on P(A) and P(B). (Obviously, a similar computation can easily be done to find the coherent interval for P(A ∨ B).) On the other hand, the motivation brought forward in the relevant literature to maintain that probability is not a truth-functional belief is based on the fact that in general (except, for instance, in the case of stochastic independence) the unique value of P(A ∧ B) cannot be expressed as a function of P(A) and P(B), since it satisfies the well-known formula

P(A ∧ B) = P(A)P(B|A) .

But in our framework even the myth of the uniqueness of the conditional probability P(B|A) can be challenged (see Chapter 11, devoted to coherent conditional probability), not to mention that the argument appears circular: in fact, in the classical approach conditional probability does not come from a direct definition through coherence, but is a derived concept that requires the knowledge of both P(A ∧ B) and P(A)! Let us now underline our agreement with the following statement by Jeff Paris (see [100], p. 53), expressed here in our own notation and referring to an arbitrary uncertainty measure
on the grounds that, for example, it forces degenerate assignments (e.g. collapsing the value of a conjunction to P(A)). [Only fragments of the quoted passage and of its displayed formulas survive in the source.]
Chapter 10

Conditional Events

Does probability extend classical logic? In a recent paper by Dubois and Prade [60], they seem to give a negative answer to this question. If we refer to a formal and strict meaning of the term "classical" logic, we cannot disagree with their conclusion. On the other hand, at the very beginning of a 1935 paper [49] by de Finetti (from which some excerpts are reported also in Dubois and Prade's paper) there is the following statement:

• "it is beyond doubt that probability theory can be considered as a multi-valued logic (precisely: with a continuous range of values), and that this point of view is the most suitable to clarify the foundational aspects of the notion and the logic of probability".

Let us now report two other quotations from de Finetti's paper:

• "Propositions are assigned two values, true or false, and no other, not because there 'exists' an a priori truth called 'excluded middle law', but because we call 'propositions' logical entities built in such a way that only a yes/no answer is possible ... A logic, similar to the usual one, but leaving room for three or more values, could not aim but at compressing
several ordinary propositions into a single many-valued logical entity, which may well turn out to be very useful ...".

• "Even if, in itself, a proposition cannot be but true or false, it may occur that a given person does not know the answer, at least at a given moment. Hence for this person there is a third attitude in front of a proposition. This third attitude does not correspond to a third truth-value distinct from yes or no, but to the doubt between the yes and the no (as people who, due to incomplete or indecipherable information, appear as of 'unknown sex' in a given statistics. They do not constitute a third sex. They only form the group of people whose sex is unknown)".

A careful reading of these words leads us to a conclusion different from what is claimed in the quoted paper [60]: "the distinction between degree of truth and degree of uncertainty that, at least, goes back to de Finetti ...". On the contrary, de Finetti seems to deny the concept of "degree of truth". Here is a further excerpt from de Finetti's quoted paper [49], a few lines below the two aforementioned ones:

• "the three-valued logic with 'doubtful' need not be considered as the modification that could substitute the two-valued logic; the former should simply be superimposed on the latter by looking at a proposition as able in itself to take the two values 'true' and 'false', while the distinction 'doubtful' is just provisional and relative to a given person ... The logic taking an infinite number of values - to which we are led in this way - is again, as the two-valued logic plus 'doubtful', a logic superimposed on the two-valued logic. There are not propositions which are neither true nor false but with a certain degree of probability (our italics): there are only propositions true or false, but a given person may not know whether a proposition is true or false, and, being in doubt, he will give on it
a provisional judgement that is numerically represented by a degree of probability". These considerations constitute the starting point leading de Finetti (in the same paper) to the introduction of conditional events (and to their interpretation through a betting scheme), letting the conditioning event represent the "state of information" of the given person.
10.1 Truth values
In the next Chapter we will deal with a direct introduction of conditional probability: this requires a definition of conditional event, which is not simply an ordered pair (E, H), with H ≠ ∅, where ∅ is the impossible event (we will adopt the usual notation E|H). An interpretation of E|H in terms of a betting scheme (extending that given, for unconditional events, in Chapter 5) may help in clarifying its meaning: we exploit (and generalize) an idea suggested by de Finetti [49]. If an amount p - which should suitably depend on E|H - is paid to bet on E|H, we get, when H is true, either an amount 1 if also E is true (the bet is won) or an amount 0 if E is false (the bet is lost), and we get back the amount p if H turns out to be false (the bet is called off). In short, we may introduce the truth-value T(E|H) of a conditional event E|H - assuming that, for an (unconditional) event E, this is just its indicator I_E = T(E|Ω), equal either to 1 or to 0 according to whether E is, respectively, true (the bet is won) or false (the bet is lost) - in the following way (analogous to that shown for the indicator of E in Chapter 5 and for the value of a random variable Y in Chapter 7):

"the value of T(E|H) is just the amount got back by paying p in a bet on E|H",
so that we may write

T(E|H) = 1 · I_{E∧H} + 0 · I_{E^c∧H} + p · I_{H^c} .

Then, introducing, for the "third" value of T(E|H), the symbol t(E|H) in place of p (to point out that this function depends on the pair E, H; moreover we assume, for any given H, that t(·|H) is not identically equal to zero),

T(E|H) = 1 · I_{E∧H} + 0 · I_{E^c∧H} + t(E|H) · I_{H^c} .   (10.1)

In conclusion, a conditional event E|H (or, better, its truth-value) can be seen as a discrete random variable in canonical form (Chapter 7)

Y = Σ_{k=1}^{v} y_k I_{E_k} ,

taking v = 3, E_1 = E ∧ H, E_2 = E^c ∧ H, E_3 = H^c, and y_1 = 1, y_2 = 0, y_3 = t(E|H). We call Boolean support of the conditional event E|H the (ordered) pair (E, H). The conditional event E|H induces a (unique) partition of the certain event Ω, that is (E ∧ H, E^c ∧ H, H^c): we put in fact

t(E|H) = t(E ∧ H, E^c ∧ H, H^c) .   (10.2)
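As a concrete reading of (10.1), here is a minimal sketch (the encoding and names are ours) of the truth value of E|H on a single outcome:

# T(E|H) on one outcome: 1 if E and H are both true, 0 if H is true and E
# false, and the "third" value t(E|H) if H is false (the bet is called off
# and the amount paid is got back).
def truth_value(e: bool, h: bool, t: float) -> float:
    if h:
        return 1.0 if e else 0.0
    return t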
Remark 7 - Formula (10.1) needs a careful reading in the trivial cases: essentially, the representation (10.1) can be seen as a matrix

( E∧H   E^c∧H   H^c    )
(  1      0     t(E|H) )   (10.3)

in the sense that the multiplication which connects in (10.1) elements of the same column of the above matrix has only an "instrumental" role for the management of the operations we are going to introduce. For example, in the particular case H = Ω, the matrix (10.3) reduces to

( E   E^c   ∅      )
( 1    0    t(E|Ω) )

where t(E|Ω) is the value that had been associated to the partition (E, E^c, ∅). In the betting interpretation t(E|Ω) represents the amount paid to bet (unconditionally, i.e. no calling off of the bet!) on the event E; in particular t(Ω|Ω) = 1. Now, since I_{H^c} = I_∅ = 0, the function t in fact plays no role when the conditional event E|Ω reduces to the (unconditional) one E.
Notice that our concept of conditional event is clearly different from those adopted, for example, by Adams [1], Dubois and Prade [58], Goodman and Nguyen ([77], [78]).
10.2 Operations
Suppose that we have an arbitrary family C of conditional events E|H and consider the relevant set 𝒯 of random variables T(E|H); first of all notice that, by the very definition of conditional event and taking into account (10.2), one has

t(E|H) = t((E ∧ H)|H) ,   (10.4)

since these two conditional events give rise to the same partition of Ω. Now, if we sum (as random variables) two elements of 𝒯, in general we do not obtain an element of 𝒯. We have in fact, by (10.1):

T(E|H) + T(A|K) = 1 · (I_{E∧H} + I_{A∧K}) + 0 · (I_{E^c∧H} + I_{A^c∧K}) + t(E|H) I_{H^c} + t(A|K) I_{K^c} .
But notice that in the case H = K and E ∧ A ∧ H = ∅ we get

T(E|H) + T(A|H) = 1 · I_{(E∨A)∧H} + 0 · I_{(E∨A)^c∧H} + (t(E|H) + t(A|H)) I_{H^c} .

Therefore, since t(E|H) is a function of E|H, if in the family C containing E|H and A|H there is also the conditional event (E ∨ A)|H, we necessarily take as "third" value of T((E ∨ A)|H)

t((E ∨ A)|H) = t(E|H) + t(A|H) .   (10.5)

In particular, for A = ∅, we obtain from (10.5)

t(∅|H) = 0 .   (10.6)

So, taking into account (10.4), if E ∧ H = ∅ then t(E|H) = 0. Let us now multiply (as random variables) two elements of 𝒯: we have, still referring to (10.1):
T(E|H)T(A|K) = 1 · (I_{E∧H} I_{A∧K}) + 0 · (I_{E^c∧H} I_{A^c∧K} + I_{E^c∧H} I_{A∧K} + I_{E∧H} I_{A^c∧K} + I_{H^c} I_{A^c∧K} + I_{K^c} I_{E^c∧H}) + t(A|K) · (I_{E∧H} I_{K^c}) + t(E|H) · (I_{H^c} I_{A∧K}) + t(E|H) t(A|K) · (I_{H^c} I_{K^c}) ,

which is not an element of 𝒯. But if we consider only the particular case K = E ∧ H, we obtain, taking into account that (E ∧ A^c) ∨ E^c = (E ∧ A)^c:

T(E|H) T(A|(E ∧ H)) = 1 · I_{(E∧A)∧H} + 0 · I_{(E∧A)^c∧H} + t(E|H) t(A|(E ∧ H)) I_{(E∧H)^c} I_{H^c} .

Therefore (since I_{(E∧H)^c} I_{H^c} = I_{H^c}), if the family 𝒯 contains also the conditional event (E ∧ A)|H, we need to take as its "third" value

t((E ∧ A)|H) = t(E|H) · t(A|(E ∧ H)) .   (10.7)
If A = E = H, we obtain in particular

t(H|H) = t(H|H) · t(H|H) ,

and so t(H|H) ∈ {0, 1}. Moreover, from the above results it follows easily that

t(∅|H) ≤ t(E|H) ≤ t(H|H) = t(Ω|H) .   (10.8)

On the other hand, for A = E and H = Ω, eq. (10.7) gives

t(E|Ω) = t(E|Ω) · t(E|E) ;

by (10.5) it follows (putting E = H or E = H^c) that

1 = t(Ω|Ω) = t(H|Ω) + t(H^c|Ω) = t(H|Ω)t(H|H) + t(H^c|Ω)t(H^c|H^c) .

If t(H|Ω) > 0, then t(H|H) = 1 and so, by (10.4),

t(Ω|H) = 1 .   (10.9)

If t(H|Ω) = 0, since t(·|H) is not identically equal to zero, there exists an event E such that t(E|H) > 0. From (10.8) we have that also t(H|H) = t(Ω|H) > 0, and so (since t(H|H) ∈ {0, 1}) eq. (10.9) follows again.

Summing up eqs. (10.4), (10.5), (10.7), (10.9), we see that, when the set of Boolean supports of the conditional events E|H of C is C_B = 𝒢 × ℬ⁰, where 𝒢 is a Boolean algebra and ℬ ⊆ 𝒢 is closed with respect to (finite) disjunctions, with ℬ⁰ = ℬ \ {∅}, the third value t(·|·) satisfies "familiar" rules, that is:

(i) t(E|H) = t((E ∧ H)|H), for every E ∈ 𝒢 and H ∈ ℬ⁰;
(ii) t(·|H) is a (finitely additive) probability on 𝒢 for any given H ∈ ℬ⁰;
(iii) t((E ∧ A)|H) = t(E|H) · t(A|(E ∧ H)), for every A ∈ 𝒢 and E, H ∈ ℬ⁰, with E ∧ H ≠ ∅.

Conditions (i), (ii), (iii) can be replaced by (i)', (ii), (iii), with

(i)' t(H|H) = 1, for every H ∈ ℬ⁰.
Suppose in fact that (ii), (iii) hold: to prove that (i) implies (i)' it is sufficient to observe that 1 = t(Ω|H) = t(H|H). To prove that (i)' implies (i), consider first the monotonicity of t(·|H) - so that t(E|H) ≥ t((E ∧ H)|H) - and the inequality

t(E|H) ≤ t((E ∧ H)|H) + t(H^c|H) = t((E ∧ H)|H) + t(Ω|H) − t(H|H) = t((E ∧ H)|H) .

Notice that condition (i) is essential, in the sense that it is not entailed by (iii), as the following example shows:
Example 11 - Let 𝒢 = ℬ = {A, A^c, Ω, ∅}, and for any E ∈ ℬ⁰ put

t(A|E) = t(∅|E) = 0 ,  t(A^c|E) = t(Ω|E) = 1 .

It is easily seen that this function t(·|·) satisfies (10.7), for example

t(A|Ω) t(A|(Ω ∧ A)) = t(A|Ω) ,  t(A^c|Ω) t(A^c|(Ω ∧ A^c)) = t(A^c|Ω) ,
t(Ω|A) t(A^c|(Ω ∧ A)) = t(A^c|A) ,  t(Ω|A^c) t(A|(Ω ∧ A^c)) = t(A|A^c) ,

but not (10.4), for example

1 = t(A^c|A) ≠ t((A ∧ A^c)|A) = t(∅|A) = 0 .
Given two events A and B, the inclusion relation A ⊆ B is clearly equivalent to the inequality I_A ≤ I_B between the corresponding indicators (which represent the truth-values of the two events). This remark suggests the following "natural" definition of inclusion ⊆₀ for conditional events:
Definition 4 - Given A|H, B|K ∈ C, we define

A|H ⊆₀ B|K  ⟺  T(A|H) ≤ T(B|K) ,   (10.10)

where T is the truth-value defined by (10.1) and the inequality refers to the numerical values corresponding to every atom generated by {A, B, H, K}, that is to every element of the partition obtained as intersection of the two partitions (A ∧ H, A^c ∧ H, H^c) and (B ∧ K, B^c ∧ K, K^c).
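In the single-outcome encoding sketched earlier, (10.10) is an atomwise comparison of truth values; a minimal illustrative sketch:

# A|H ⊆₀ B|K iff T(A|H) ≤ T(B|K) on every atom; tv_a[r] and tv_b[r] hold the
# two truth values (computed e.g. with truth_value above) on atom r.
def included(tv_a, tv_b) -> bool:
    return all(x <= y for x, y in zip(tv_a, tv_b))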
We show now that (10.10) is equivalent to the definition given in [77].

Theorem 3 - Let A|H, B|K ∈ C and let A be the algebra generated by {A, B, H, K}. If C contains also conditional events of the kind F|H, F|K, F|(H ∨ K) (for every F ∈ A), then

A|H ⊆₀ B|K  ⟺  A ∧ H ⊆ B ∧ K ,  B^c ∧ K ⊆ A^c ∧ H .
Proof - Assuming T(A|H) ≤ T(B|K), that is, by (10.1),

1 · I_{A∧H} + t(A|H) · I_{H^c} ≤ 1 · I_{B∧K} + t(B|K) · I_{K^c} ,

it is easy to check that this inequality fails if either of the two inclusions on the right-hand side does not hold. Conversely, we need to prove only that t(A|H) ≤ t(B|K), since it is easy to check - by a complete scrutiny of the truth-values of the relevant (unconditional) events - the validity of the right-hand side of (10.10) when the truth-values of the given conditional events are 0 or 1. Now, taking into account the inclusion relations, we have the following eleven (instead of sixteen) atoms generated by {A, B, H, K}:
A_1 = A ∧ H ∧ B ∧ K ,  A_2 = A^c ∧ H ∧ B ∧ K ,  A_3 = A ∧ H^c ∧ B ∧ K ,
A_4 = A^c ∧ H^c ∧ B ∧ K ,  A_5 = A^c ∧ H ∧ B^c ∧ K ,  A_6 = A^c ∧ H ∧ B ∧ K^c ,
A_7 = A ∧ H^c ∧ B ∧ K^c ,  A_8 = A^c ∧ H^c ∧ B ∧ K^c ,  A_9 = A ∧ H^c ∧ B^c ∧ K^c ,
A_10 = A^c ∧ H ∧ B^c ∧ K^c ,  A_11 = A^c ∧ H^c ∧ B^c ∧ K^c .

Putting t_r = t(A_r|Ω), from (10.5) we get

t(A ∧ H|Ω) = t_1 ,  t(H|Ω) = t_1 + t_2 + t_5 + t_6 + t_10 ,
t(B ∧ K|Ω) = t_1 + t_2 + t_3 + t_4 ,  t(K|Ω) = t_1 + t_2 + t_3 + t_4 + t_5 .
By considering eq. (10.7) relative to the triples A ∧ H|Ω, H|Ω, A|H and B ∧ K|Ω, K|Ω, B|K we obtain the conditions

t_1 = (t_1 + t_2 + t_5 + t_6 + t_10) t(A|H) ,   (o)

t_1 + t_2 + t_3 + t_4 = (t_1 + t_2 + t_3 + t_4 + t_5) t(B|K) .   (*)

Then, subtracting (o) from (*), we get

t_2 + t_3 + t_4 = [t(B|K) − t(A|H)](t_1 + t_2 + t_5) + (t_3 + t_4) t(B|K) − (t_6 + t_10) t(A|H) ;

so, if it were t(B|K) − t(A|H) < 0, it would follow, for t(H|Ω) > 0, that t_2 + t_3 + t_4 < (t_3 + t_4) t(B|K) (which is, taking into account (10.8) and (10.9), a contradiction). On the other hand, if t(H|Ω) = 0, then t_1 = t_2 = t_5 = t_6 = t_10 = 0, so that t_3 + t_4 = (t_3 + t_4) t(B|K), which implies either t(B|K) = 1 ≥ t(A|H) or t_3 = t_4 = 0, that is t(K|Ω) = 0. In the latter case, consider again eq. (10.7), but substitute, in the above two triples, H ∨ K in place of Ω; putting z_r = t(A_r|(H ∨ K)), by (10.5) we get again (*) and (o), with z_r in place of t_r. Then, arguing as above (with z_r in place of t_r) and assuming t(B|K) − t(A|H) < 0, we would get again, if t(H|(H ∨ K)) > 0, a contradiction; and when t(H|(H ∨ K)) = 0, arguing again as above we would get either t(B|K) = 1 ≥ t(A|H) or t(K|(H ∨ K)) = 0, but the latter is impossible, since
t(K|(H ∨ K)) = t(K|(H ∨ K)) + 0 = t(K|(H ∨ K)) + t(H|(H ∨ K)) ≥
≥ t(K|(H ∨ K)) + t((H ∧ K^c)|(H ∨ K)) = t((H ∨ K)|(H ∨ K)) = 1 . ∎

10.3 Toward conditional probability
To conclude this Chapter on conditional events, and to pave the way for the extension of the concept of coherence to conditional probability, some important remarks are now in order.
• The above conditions (i)', (ii), (iii) coincide exactly with the axioms given by de Finetti in 1949 (see [52]) to define a conditional probability (taking 𝒢⁰ as set of conditioning events). They will be reported again - introducing the usual symbol P(·|·) in place of t(·|·) - at the beginning of the next Chapter. Properties (i)' and (iii) are also in the definition of "generalized conditional probability" given by Renyi in [104], where condition (ii) is replaced by the stronger one of σ-additivity (obviously, the two conditions are equivalent if the algebra is finite). Actually, in [104] Renyi takes, as set of conditioning events, an arbitrary family ℬ (not required to be an additive one), and this choice may entail some "unpleasant" consequences (see Section 11.5). Popper also dealt with conditional probability from a general point of view (which includes its "direct" assignment, and the possibility of zero probability for the conditioning event) in a series of papers, starting from 1938 (in Mind, vol. 47), but in this paper - as Popper himself acknowledges in the new appendix *II of the book [101] - he did not succeed in finding the "right" set of axioms. These can be found (together with a discussion of many interesting foundational aspects) in the long new appendix *IV of the same book, where he claims: "I published the first system of this kind only in 1955" (in British Journal for the Philosophy of Science, vol. 6). Popper's definition (based essentially on the same axiom system as that given by de Finetti) is known in the relevant literature as "Popper measure". Other relevant references are Csaszar [43], Krauss [88] and Dubins [56]: see the discussion in Section 11.5.

• Conditions (i)-(iii) hold even for a family C which is not the cartesian product of an algebra and an additive set: this re-
mark will be the starting point for the introduction of the concept of coherent conditional probability. On the other hand, since the value p = t(E|H) is the amount paid to bet on E|H, it is obviously sensible to regard this function as the natural "candidate" to be called (on a suitable family of conditional events) conditional probability: recall in fact the particular case of an event E - corresponding to H = Ω - and the more general case of a random variable Y, in which the analogous amounts correspond, respectively, to probability and to prevision.

• From the point of view of any real application, it is important not to assume that the family C of conditional events has some specific algebraic structure. We stress that only partial operations have been introduced for conditional events: in fact we did neither refer to Boolean-like structures, nor try to define logical operations for every pair of conditional events. In the relevant literature (going back to the pioneering paper by de Finetti [49] and, more recently, to Schay [108], Bruno and Gilio [16], Calabrese [17], Dubois and Prade [59], Goodman and Nguyen [78]) there are different proposals on how to define, for example, conjunction and disjunction between any two conditional events. Many pros and cons concerning the "right" choice among these different possible definitions of operations for conditional events are discussed in [75].
Chapter 11

Coherent Conditional Probability

The "third" value t(E|H) of a conditional event has been interpreted (following the analogy with the probability of an event and the prevision of a random variable) as the amount paid to bet on E|H. As we have seen in the discussion in the last Section of the previous Chapter, this entails - in a sense, "automatically" - the axiomatic definition of conditional probability (in the sequel we will identify the set C of conditional events and the set C_B of their Boolean supports).
11.1 Axioms
Definition 5 - If the set C = 𝒢 × ℬ⁰ of conditional events E|H is such that 𝒢 is a Boolean algebra and ℬ ⊆ 𝒢 is closed with respect to (finite) disjunctions (additive set), then a conditional probability on 𝒢 × ℬ⁰ is a function P : C → [0, 1] satisfying the following axioms:

(i) P(H|H) = 1, for every H ∈ ℬ⁰;

(ii) P(·|H) is a (finitely additive) probability on 𝒢 for any given
H ∈ ℬ⁰;

(iii) P((E ∧ A)|H) = P(E|H) · P(A|(E ∧ H)), for every A ∈ 𝒢 and E, H ∈ ℬ⁰, with E ∧ H ≠ ∅.
Axiom (iii) can be replaced by

(iii)' P(A|C) = P(A|B)P(B|C) if A ⊆ B ⊆ C, with A ∈ 𝒢 and B, C ∈ ℬ⁰;

in fact, by (iii) with A ⊆ E = B ⊆ H = C, we get (iii)'. Conversely, from (iii)', taking in particular the three events E ∧ A ∧ H ⊆ E ∧ H ⊆ H, we have

P((E ∧ A ∧ H)|H) = P((E ∧ A ∧ H)|(H ∧ E)) P((E ∧ H)|H) ,

that is, taking into account that P((F ∧ K)|K) = P(F|K), axiom (iii). Putting P(·|H) = P_H(·), property (iii) can be written

P_H(E ∧ A) = P_H(E) · P_{E∧H}(A) .   (11.1)

This means that a conditional probability P_H(·) is not singled out by its conditioning event H alone, since its values are bound to suitable values of another conditional probability, i.e. P_{E∧H}(·). Then P_H(·) cannot be assigned (so to say) "autonomously". On the contrary, what is usually emphasized in the literature - when a conditional probability P(E|H) is taken into account - is only the fact that P(·|H) is a probability for any given H: this is a very restrictive (and misleading) view of conditional probability, corresponding trivially to just a modification of the "world" (or sample space) Ω.
11.2 Assumed or acquired conditioning?
It is essential to regard the conditioning event H as a "variable", i.e. the "status" of H in EIH is not just that of something repre-
senting a given fact, but that of an (uncertain) event (like E) for which the knowledge of its truth value is not required (this means, using Koopman's terminology, that H must be looked on as being contemplated, even if asserted: similar terms are, respectively, assumed versus acquired). So, even if beliefs may come from various sources, they can be treated in the same way and can be measured by (conditional) probability, since the relevant events (including statistical data!) can always be considered as assumed propositions (for example, the "statistical" concept of likelihood is nothing else than a conditional probability seen as a function of the conditioning event). An interesting aspect can be pointed out by referring to a situation concerning Bayesian inferential statistics: given any event H (seen as a hypothesis), with prior probability P(H), and a set of events E_1, ..., E_n representing the possible statistical observations, with likelihoods P(E_1|H), ..., P(E_n|H), all posterior probabilities P(H|E_1), ..., P(H|E_n) can be pre-assessed through Bayes' theorem (which, by the way, is a trivial consequence of the conditional probability rules); a small sketch of this pre-assessment is given at the end of this Section. In doing so, each E_k (k = 1, ..., n) is clearly regarded as "assumed". If an E_k occurs, P(H|E_k) is chosen - among the prearranged posteriors - as the updated probability of H: this is the only role played by the "acquired" information E_k (the sample space is not changed!). In other words, the above procedure corresponds (denoting the conditional probability P(H|E) by p) to regarding a conditional event H|E as a whole and interpreting p as (look at the position of the brackets!) "the probability of [H given E]" and not as "[the probability of H], given E". On the other hand, the latter interpretation is unsustainable, since it would literally mean "if E occurs, then the probability of H is p", which is actually a form of logical deduction leading to absurd conclusions (for a very simple situation, see Example
35 in Chapter 18). So we are able to challenge a claim of Schay (at the very beginning of the paper [108]) concerning ... the position of the brackets in the definition of the probability of H given E.
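The pre-assessed posteriors mentioned above amount to a routine application of Bayes' theorem; the following sketch additionally assumes that the likelihoods under H^c are available (so that each P(E_k) can be expanded by total probability); names are illustrative.

# Pre-assess all posteriors P(H|E_k) from the prior P(H) and the likelihoods,
# before knowing which E_k will actually occur.
def posteriors(prior_h, lik_h, lik_not_h):
    """lik_h[k] = P(E_k|H), lik_not_h[k] = P(E_k|H^c)."""
    out = []
    for p_e_h, p_e_nh in zip(lik_h, lik_not_h):
        num = prior_h * p_e_h
        out.append(num / (num + (1.0 - prior_h) * p_e_nh))
    return out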
11.3 Coherence
Let us now discuss the main problem of this Chapter: how to assess P on an arbitrary set C of conditional events. Similarly to the case of unconditional probabilities, we give the following

Definition 6 - The assessment P(·|·) on an arbitrary family C = C_1 × C_2 of conditional events is coherent if there exists C' ⊇ C, with C' = 𝒢 × ℬ⁰ (𝒢 a Boolean algebra, ℬ an additive set, with ℬ ⊆ 𝒢), such that P(·|·) can be extended from C to C' as a conditional probability.

Notice that a conditional probability on C' is also (obviously) a conditional probability on a subfamily C'' ⊆ C' with the same algebraic features: therefore Definition 6 can be formulated with reference to the minimal algebra 𝒢 generated by C_1 ∪ C_2 and to the minimal additive set ℬ generated by C_2. Among the peculiarities (which entail a large flexibility in the management of any kind of uncertainty) of this concept of coherent conditional probability versus the usual one, we mention the following two:

• due to its direct assignment as a whole, the knowledge (or the assessment) of the "joint" and "marginal" unconditional probabilities P(E ∧ H) and P(H) is not required;

• moreover, the conditioning event H (which must be a possible one) may have zero probability, but in the assignment of P(E|H) we are driven by coherence, contrary to what is done in those treatments (see, e.g., [65]) where the relevant
conditional probability is given an arbitrary value in the case of a conditioning event of zero probability. For a short discussion of the approach to conditioning with respect to null events in the framework of the so-called Radon-Nikodym derivative, see Chapter 18, Example 32.

On the other hand, if P_Ω(·) = P(·) is strictly positive on ℬ⁰, we can write, putting H = Ω in (11.1),

P(E ∧ A) = P(E) · P(A|E) .

Then - in this case - all conditional probabilities P(·|E), for any E, are uniquely determined by a single "unconditional" P (as in the usual Kolmogorov definition), while in general - see the next Theorem of this Chapter - we need a class of probabilities P_α to represent the "whole" conditional probability. Now, a question arises: is it possible, through a suitable alternative theorem, to look again (as in the case of a probability of events) at the above definition of coherence from the point of view of a "semantic" interpretation in terms of betting? Since a conditional event is a (particular) random quantity, it seems reasonable to go back to the interpretation of coherence given in terms of betting for random quantities. Given an arbitrary family C of conditional events, we can refer, as in the case of (unconditional) probabilities (see the discussion in the final part of Chapter 5), to each finite subset

ℱ = {E_1|H_1, ..., E_n|H_n} ⊆ C .
So, denote by p_i λ_i the amount paid to undertake a bet on each E_i|H_i to receive T(E_i|H_i) λ_i, where T(E|H) is given by (10.1), with p_i = t(E_i|H_i). It follows, by (7.3) and (10.1), the following expression for the relevant gain:

G = Σ_{i=1}^{n} λ_i (T(E_i|H_i) − p_i) = Σ_{i=1}^{n} λ_i (I_{E_i∧H_i} + p_i(1 − I_{H_i}) − p_i) = Σ_{i=1}^{n} λ_i I_{H_i} (I_{E_i} − p_i) .   (11.2)
Recall that the condition of coherence requires that, for any choice of the real numbers λ_i, the values of the random gain G are not all positive (or all negative), where "all" means for every possible outcome corresponding to the m atoms generated by the events E_1, ..., E_n, H_1, ..., H_n. Now, by a slight modification of the argument in Chapter 5, this requirement is equivalent to the compatibility - in place of system (4.1) or, equivalently, (5.3) - of the following system, whose unknowns x_r are the probabilities of the atoms A_r:
Σ_{A_r ⊆ E_i∧H_i} x_r − p_i Σ_{A_r ⊆ H_i} x_r = 0 ,  i = 1, 2, ..., n ,

Σ_{r=1}^{m} x_r = 1 ,  x_r ≥ 0 ,  r = 1, 2, ..., m .   (11.3)
But, since the first n equations are all homogeneous, many of them can be satisfied by giving zero to all the relevant unknowns, independently of the choice of the assessments p_i. In particular, if the disjunction H_0 of the H_i's is properly contained in Ω (that is, there is some atom contained in H_1^c ∧ ... ∧ H_n^c), then we can put equal to zero the probabilities of the remaining atoms, so that all the first n equations are trivially satisfied. Therefore we may have a solution of system (11.3) - and then a gain with values not all of the same sign - not because of a "good" (coherent) choice of the p_i's, but just because of a combination of n bets that have all been called off: in fact, each atom A_r singling out the outcome corresponding to this combination of bets is contained in all the H_i^c's, so that A_r ∧ H_i = ∅ for any i = 1, 2, ..., n; then, by (11.2), this outcome gives G = 0 for any choice of the p_i's. This arbitrariness is not allowed for those indices i (possibly all) such that

Σ_{A_r ⊆ H_i} x_r = P(H_i) > 0 ,   (11.4)
since in this case system (11.3) gives

p_i = P(E_i ∧ H_i) / P(H_i) ,   (11.5)

and each amount p_i plays the role of a "natural candidate" to extend the results of Chapters 4 and 5 from a probability P(E_i) to a conditional probability P(E_i|H_i). Moreover, with respect to the subfamily singled out by (11.4), the choice (11.5) is a coherent one, since, denoting by G_r
the value of G corresponding to the outcome A_r, where the sum is over the H_i's satisfying (11.4), we get

Σ_r x_r G_r = Σ_r x_r Σ_i λ_i I_{H_i}(A_r) (I_{E_i}(A_r) − p_i) =
= Σ_i λ_i ( Σ_{A_r ⊆ E_i∧H_i} x_r − p_i Σ_{A_r ⊆ H_i} x_r ) = 0 ;
then, since the real numbers x_r ≥ 0 are not all equal to zero, the possible values G_r of G are neither all positive nor all negative. In conclusion, in the general case (i.e., without any restrictive assumption of positivity), system (11.3) is not apt - contrary to the case of the analogous system (4.1) for unconditional events - to characterize coherence. It is only a necessary condition, entailing the relations

Σ_{A_r ⊆ E_i∧H_i} x_r = p_i Σ_{A_r ⊆ H_i} x_r ,  i = 1, 2, ..., n ,

which may possibly hold even as 0 = 0, and so in this case the choice of p_i is arbitrary (even negative or greater than 1 ...).
In order to cope with this situation, de Finetti [53] introduced the so-called "strengthening" of coherence, which amounts to requiring that, for any choice of the real numbers λ_i, the values (11.2) of the random gain G are not all positive (or all negative), where "all" means for every possible outcome corresponding to the m atoms generated by E_1, ..., E_n, H_1, ..., H_n and contained in

H_0 = ⋁ {H_i : λ_i ≠ 0} .

For details, see the discussion in [31], where it is shown that this strengthened form of coherence (we call it dF-coherence) is equivalent to that given by Definition 6 in terms of extension of the function P(·|·) as a conditional probability. Slightly different formulations of dF-coherence have been given, e.g., in [94], [84], [102]; in [69] coherence is characterized by resorting (not to betting, but) to the so-called "penalty" criterion.

Remark 8 - In terms of betting, dF-coherence of an assessment P(·|·) on an arbitrary family C of conditional events is equivalent - as in the case of (unconditional) probability - to dF-coherence of P(·|·) on every finite subset ℱ ⊆ C. Then, since on a finite set dF-coherence is equivalent to the formulation given in terms of extension of P(·|·) as a conditional probability, it follows easily that we can refer as well (for the formulation based on Definition 6) to every finite subset of C.
11.4 Characterization of a coherent conditional probability
We characterize coherence by the following fundamental theorem ([21], [22]), adopting an updated formulation, which is the result of successive deepenings and simplifications brought forward in a series of papers, starting with [25].
Theorem 4 - Let C be an arbitrary family of conditional events, and consider, for every n ∈ ℕ, a finite subfamily ℱ = {E_1|H_1, ..., E_n|H_n} ⊆ C; we denote by A_0 the set of atoms A_r generated by the (unconditional) events E_1, H_1, ..., E_n, H_n and by 𝒢 the algebra spanned by them. For an assessment on C given by a real function P, the following three statements are equivalent:

(a) P is a coherent conditional probability on C;

(b) for every n ∈ ℕ and for every finite subset ℱ ⊆ C there exists a sequence of compatible systems (S_α), with unknowns x_r^α ≥ 0:

Σ_{A_r ⊆ E_i∧H_i} x_r^α = P(E_i|H_i) Σ_{A_r ⊆ H_i} x_r^α  [if Σ_{A_r ⊆ H_i} x_r^{α−1} = 0, α ≥ 1]  (i = 1, 2, ..., n) ,
Σ_{A_r ⊆ H_0^α} x_r^α = 1 ,

with α = 0, 1, 2, ..., k ≤ n, where H_0^0 = H_0 = H_1 ∨ ... ∨ H_n and H_0^α denotes, for α ≥ 1, the union of the H_i's such that Σ_{A_r ⊆ H_i} x_r^{α−1} = 0;

(c) for every n ∈ ℕ and for every finite subset ℱ ⊆ C there exists (at least) a class of (coherent) probabilities {P_0^ℱ, P_1^ℱ, ..., P_k^ℱ}, each probability P_α^ℱ being defined on a suitable subset A_α ⊆ A_0 (with A_{α'} ⊂ A_{α''} for α' > α'' and P_{α''}(A_r) = 0 if A_r ∈ A_{α'}) such that for every G ∈ 𝒢, G ≠ ∅, there is a unique P_α^ℱ with

Σ_{A_r ⊆ G} P_α^ℱ(A_r) > 0 ;   (11.6)

moreover, for every E_i|H_i ∈ ℱ there is a unique P_β^ℱ satisfying (11.6) with G = H_i (i.e. α = β), and P(E_i|H_i) is represented in the form

P(E_i|H_i) = Σ_{A_r ⊆ E_i∧H_i} P_β^ℱ(A_r) / Σ_{A_r ⊆ H_i} P_β^ℱ(A_r) .   (11.7)
Proof - We prove that (a) ⇒ (b). Suppose that P (defined on C) is coherent, so that it is coherent on any finite subset ℱ ⊆ C; put ℱ = ℱ_1 × ℱ_2, and denote by the same symbol P the extension (not necessarily unique) of P, which is (according to Definition 6) a conditional probability on 𝒢 × ℬ, where ℬ is the additive class spanned by the events {H_1, ..., H_n} = ℱ_2 and 𝒢 the algebra spanned by the events {E_1, ..., E_n, H_1, ..., H_n} = ℱ_1 ∪ ℱ_2; so P satisfies axioms (i), (ii), (iii) of Definition 5. Put

P_0(·) = P(·|H_0^0) ,  with H_0^0 = ⋁_{i=1}^{n} H_i .   (11.8)

The probability P_0 is defined on 𝒢 and so, in particular, on all A_r ⊆ H_0^0; notice that for at least one H_i we have P_0(H_i) > 0, and that P_0(A_r) = 0 for A_r ⊄ H_0^0. Then define recursively, for α ≥ 1,
P_α(·) = P(·|H_0^α) ,  with H_0^α = ⋁ {H_i : P_{α−1}(H_i) = 0} .   (11.9)

Each probability P_α is defined on 𝒢 and so, in particular, on all A_r ⊆ H_0^α; notice that for at least one H_i ⊆ H_0^α we have P_α(H_i) > 0, and that P_α(A_r) = 0 for A_r ⊄ H_0^α. Obviously, by definition of H_0^α and P_α there exists k ≤ n such that α ≤ k for any α; moreover, for every H_i there exists β such that

Σ_{A_r ⊆ H_i} P_β(A_r) > 0
holds. On the other hand, for every K ∈ ℬ, the function P(·|K) is a probability, and if H_i ⊆ K we have, by (11.1) in which H, A, E are replaced, respectively, by K, E_i, H_i, that

P_K(E_i ∧ H_i) = P_K(H_i) · P(E_i|H_i) .
Since H_0^α ∈ ℬ, and the probabilities P_K(E_i ∧ H_i) and P_K(H_i) can be expressed as sums of the probabilities of the relevant atoms, condition (b) easily follows by putting

x_r^α = P_α(A_r)

(notice that in each system (S_α) the last equation does not refer to all atoms as in system (11.3) - which coincides with (S_0) - but only to the atoms contained in H_0^α).

To prove that (b) implies (a), i.e. that the assessment P is coherent on C, we show that P is coherent on each finite family ℱ ⊆ C (see Remark 8). Consider, as family C' ⊇ ℱ (recall Definition 6), the cartesian product 𝒢 × ℬ, and take any event F|K ∈ C' = 𝒢 × ℬ. Since ℬ is an additive set, K is a disjunction of some (possibly all) of the H_i's: let β be the maximum of the indexes α such that K ⊆ H_0^α (i.e., the corresponding system (S_β) contains all the equations relative to the H_i ⊆ K). Therefore the solution x_r^β = P_β(A_r) of this system is nontrivial for at least one of the aforementioned equations, and K ⊄ H_0^{β+1}; it follows

P_β(K) = Σ_{A_r ⊆ K} x_r^β

and P_α(K) = 0 for every α < β. Then the conditional probability P(F|K) can be defined, for every F|K ∈ 𝒢 × ℬ, as

P(F|K) = P_β(F ∧ K) / P_β(K) .
Now, recalling Definition 5 of conditional probability, it is easy to check the validity of conditions (i) and (ii). To prove (iii), consider a triplet of conditional events A|(E ∧ H), E|H, and (E ∧ A)|H of C': with an argument similar to that used above for the event K, there exists an index α such that H ⊆ H_0^α, H ⊄ H_0^{α+1}, and P_α(H) > 0. Then

P(E|H) = P_α(E ∧ H) / P_α(H)
and

P(E ∧ A|H) = P_α(E ∧ A ∧ H) / P_α(H) .
So, if P_α(E ∧ H) > 0, by writing also P(A|(E ∧ H)) in terms of P_α, we easily obtain property (iii); if P_α(E ∧ H) = 0 (but notice that there exists β > α such that P_β(E ∧ H) > 0), we have P(E|H) = 0, which implies P(E ∧ A|H) = 0, and so (iii) is trivially satisfied for any value (between 0 and 1) of P(A|(E ∧ H)).

We prove now that (b) ⇒ (c) (to simplify notation, in the sequel we shall omit the apex ℱ in the probabilities P_α^ℱ). Consider any finite subset ℱ of C and the relevant set of atoms A_0. Let (S_α), α = 0, 1, ..., k, be the sequence of compatible systems, and denote by x^α = (x_1^α, ..., x_{m_α}^α) (where m_α is the number of atoms contained in H_0^α) the relevant solution. Then we can define on A_0 a function P_0 by putting P_0(A_r) = x_r^0 for A_r ⊆ H_0^0, and P_0(A_r) = 0 for A_r ⊄ H_0^0. The function P_0 is a probability distribution on the atoms, assuming positive value on some event H_i (all those relating to the equations with nontrivial solution). Let A_1 ⊂ A_0 be the set of atoms A_r such that P_0(A_r) = 0 and define on A_1 the function P_1 by putting P_1(A_r) = x_r^1 for A_r ⊆ H_0^1, and P_1(A_r) = 0 for A_r ∉ H_0^1. Going on by the same procedure we can define A_α and P_α (for α = 0, 1, ..., k). Notice that, for α ≥ 1, the atoms contained in H_0^α belong to A_α. This family of probabilities is such that for every E_i|H_i ∈ ℱ there exists a unique P_α with

Σ_{A_r ⊆ H_i} P_α(A_r) > 0

and also

P(E_i|H_i) = Σ_{A_r ⊆ E_i∧H_i} P_α(A_r) / Σ_{A_r ⊆ H_i} P_α(A_r) .
Let now A_{k+1} be the set of atoms A_r such that P_k(A_r) = 0. For these atoms the assessment P(E_i|H_i) (i = 1, ..., n) gives no constraints, and then we can arbitrarily choose the distribution on the
atoms (in particular, we can give these atoms positive numbers summing to 1), and so the class {P_0, ..., P_{k+1}} satisfies condition (c). Finally, to prove that (c) implies (b), consider a class {P_α}, with each element defined on the relevant set of atoms A_α. Let β be the maximum of the indexes α such that the atoms of H_0^0 belong to A_α, and let
P_β(H_0^0) = Σ_{A_r ⊆ H_0^0} P_β(A_r) = m_0 .

Then x_r^0 = P_β(A_r)/m_0, for A_r ⊆ H_0^0, is a solution of system (S_0). By the same procedure, let β' > β be the maximum of the α's such that the atoms of H_0^1 belong to A_α, and let

P_{β'}(H_0^1) = Σ_{A_r ⊆ H_0^1} P_{β'}(A_r) = m_1 ;

then x_r^1 = P_{β'}(A_r)/m_1, for A_r ⊆ H_0^1, is a solution of system (S_1), and so on. ∎

In the sequel, to simplify notation, we shall always omit (as we did in the proof of the theorem) the apex ℱ in the probabilities P_α^ℱ.
Any class {Pa} singled-out by condition (c) of Theorem 4 is said to agree with the conditional probability P. In general there are infinite classes of probabilities {Pa:} (in particular we have only one agreeing class in the case that C is a product of Boolean algebras, since no extension - according to Definition 6 -is needed). Remark 9 - The condition with is essential. In fact, consider the family
c=
{EjH, FIH}
a' >a" '
86
CHAPTER 11
with P C E C H and let P on C be defined as follows:
P(EIH)
=~,
P(PIH)
= 1.
The set A of all atoms is
with
A1 = P
A3
1\
E
1\
H = P , A2 = pc 1\ E
= pc 1\ Ec 1\ H
, A4
1\
H,
= pc 1\ Ec 1\ He .
Clearly, P is not coherent (since it is nonmonotone), nevertheless there exists a subset of atoms
and two probabilities P0 and P 1 on H, with
Po(AI) = Po(A2) = and P1(Ar)
= 1, P 1 (A 3 ) = 0, P(EIH) =
P(FIH)
=
1
1
4 , Po(A3) = 2
such that
Po(AI) + Po(A2) _ ~, Po(AI) + Po(A2) + Po(A3) 2 P 1 (Ar) P1(Ar) + Pr(A3) = 1 '
Remark 10 - As we have seen in the final part of the proof of (b)::::} (c) of Theorem 4, for those conditional events PIKE AxA0 , where A is the algebra generated by the atoms of Ak+ 1 , we can give to F 1\ K and K an arbitrary (coherent) positive probability Pk+ 1 , so that
COHERENT CONDITIONAL PROBABILITY
87
In fact there are no constraints for the choice of the values of this "new" probability Pk+ 1 on the atoms. On the contrary, the assignment of conditional probability to each further conditional event GIH E A x A 0 must {obviously) take into account possible logical relations with the previous ones. In this way, the number of probabilities constituting the class {Pa} is equal to k + 1 , where k is the number of systems of the chosen sequence satisfying {b). On the other hand, the lack of constraints given by the assessments P(EiiHi) allows to assign also zero probability to some atoms Ar E Ak+l, for example to all except one. In this case it is (obviously) necessary to consider a new probability Pk+ 2 defined in Ak+ 2 ~ Ak+l, and so on, so that the class {Pa} may have now more than k + 1 elements. Finally, concerning this {partial) "freedom" in the choice of P(FIK) when FIK E A x A 0 , notice that this is no more true for FIK E g x go: this will be thoroughly discussed in Chapter 13, where also the extension to conditional events FIK f/. g x go will be considered. Example 12 -Given three conditional events E1IH1, E2IH2, E3IH3 such that Ao ={AI, ... , As}, with H1 = A1 V A2 V A3, E1 1\ H1 = A3 ,
H2 = A3 V A4, E2 1\ H2 = A4 ,
H3 = A1 V As, E3 1\ H3 = As ,
consider the assessment PI= P(EIIHI) = 1'
P3 = P(E3IH3) = 1.
The system (Sa) with a= 0 has (for every,\ such that 0 ::::; A ::::; ~) the solutions
where Xr = P0 ( Ar). Therefore the assessment is coherent, since the solution P0 corresponding to A=/=- 0 and A=/=- ~ satisfies {11.6}, with
CHAPTER 11
88
G = Hi , for all Hi 's (and so H~ = 0 for this class, whose only element is P0 ) . For A = 0, the solution P~ is such that P~(H1 ) = P~(H2 ) = 0, so that now H; = A 1 V A 2 V As V A 4. Solving the system (Sa) for a= 1 gives Y1
= Y2 = 0 ,
Ys
1
= Y4 = 2 '
with Yr = P{(Ar)· Notice that the unique element of this class satisfying {11.6}, with G = Hi, is P{, with P{(H1) = ~ > 0, P{(H2 ) = 1 > 0. For A = ~ we have P~'(Hs) = 0, so that H~ = A1 V As. Solving (Sa) for a = 1 gives u1
= 0,
us= 1,
with Ur = P{' (Ar). Notice that for the unique element of this class satisfying (11.6}, with G = Hs, we have P{'(Hs) = 1 > 0. In conclusion, we have found three classes - those defined under (c)- i.e.: {P0 } , {P~, P{}, {P~', P{'}; the corresponding representations (11. 7) for p 1 = P(E1 IHI) = 1 are
A 0 + 0 +A l2
- 0+0+ ~
=
-
P{(As) P{(A1 V A2 V As)
P"(A 0 s) P~'(A1 V A2 V As)'
and similar expressions can be easily obtained to represent P2 and P3·
Remark 11 - As we have seen in the previous Chapter (Theorem 3), a conditional probability can be, in a sense, regarded as a sort of monotonic function, that is
AIH ~0 BIK ~ T(AIH) ~ T(BIK) '
(11.10)
COHERENT CONDITIONAL PROBABILITY
89
where T is the truth-value defined by ( 10.1) and the inequality (obviously) refers to the numerical values corresponding to every element of the partition obtained as intersection of the two partitions {A 1\ H, Ac 1\ H, ne} and {B 1\ K, BC 1\ K, Kc}. Recalling that the present notation for t( ·I·) is P( ·I·) and that it is easy to check by a complete scrutiny of the truth-values of the relevant (unconditional) events - the validity of {11.10) when the truth-values of the given conditional events are 0 or 1, we can easily show that {11.10) characterizes coherence. The relevant system (So) is
(So)
I
+ x2 + Xs + x6 + Xg + x10) Xt + X2 + Xa +~4 = P(BIK)(xl + X2 + x 3 + X4 + x 5 ) X1 + ... + Xu -1 x1
= P(AIH)(xl
Xr
2:: 0
where the unknowns Xr 's are the probabilities of the eleven atoms introduced in the proof of Theorem 3. Notice that, to take into account of the possibility that P(H) = 0 or P(K) = 0, we need to go on by considering also system (SI). The computations are {"mutatis mutandis") essentially those already done in the just mentioned proof. The following theorem shows that a coherent assignment of P(·l·) to a family of conditional events whose conditioning ones are a partition of n is essentially unbound.
Theorem 5 - Let C be a family of conditional events {EiiHihEh where card(!) is arbitrary and the events Hi's are a partition of n. Then any function p: C -t [0, 1] such that
is a coherent conditional probability. Proof- Coherence follows easily from Theorem 4 (the characterization theorem of a coherent conditional probability); in fact, for any finite subset :F ~ C we must consider the relevant systems
CHAPTER 11
90
(Sa): each equation is "independent" from the others, since the events Hi's have no atoms in common, and so for any choice of P(EiiHi) each equation (and then the corresponding system) has trivially a solution (actually, many solutions). •
11.5
Related results
As already (briefly) discussed in Section 10.3, in (104] Renyi considers axioms (i}-(iii} for a (countably additive) function P(·l·) defined on g x 8°, where g is an algebra of subsets of nand Ban arbitrary subset of g (let us call such a P(·l·) a weak conditional probability). While a conditional probability - as defined in Section 11.1, Definition 5- is (trivially) coherent, a weak conditional probability may not be extendible as a conditional probability, i.e. it is not necessarily coherent (in spite of the fact that g is an algebra, and even if we take g and B finite), as shown by the following
Example 13 -Let A, B, C, D events such that A= B 1\C 1\D and B ~ C V D , B ~ C , B ~ D . Denote by g the algebra generated by the four given events, and take B = {B, C, D}. Among the assessments constituting a weak conditional probability P(·l·) on g x 8° , we may consider the one which takes, in particular, for the restrictions (unconditional probabilities) P(·IB), P(·IC), P(·ID), the following values : P(AIB) = 0 , P(AIC) = P(AID) =
~;
it satisfies (trivially} axioms (i}-(iii), but P is not coherent: in fact, extending it to the additive class generated by B, we must necessarily have
P(AIC V D)= P(AiC)P(CIC V D)= P(AID)P(DIC V D), (*)
COHERENT CONDITIONAL PROBABILITY
91
which implies P(CICV D)= P(DICV D). So at least one of these two conditional probabilities is positive, since
P(CICV D) +P(DICV D)~ 1, and then, by (*}, P(AIC V D) > 0. But
P(AIC V D)
= P(AIB)P(BIC V D) = 0
(contradiction). Renyi proves that a weak conditional probability can be obtained by means of a measure m defined in g (possibly assuming the value +oo) by putting, for every B E 8° such that 0 < m(B) < +oo and for A E g, (11.12) Conversely, he finds also a sufficient condition for a weak conditional probability P( ·I·) to be represented by a measure m in the sense of (11.12). Renyi poses also the problem of finding conditions for the existence of a class of measures {ma} (possibly assuming the value +oo) that allows- for every BE 8° such that 0 < ma(B) < +oo for some o: -a representation such as (11.12), with m= ma. Moreover (in the same year - 1955 - and in the same issue of the journal containing Renyi's paper), Csaszar (43] searches for a weak conditional probability P on g x 8° such that there exists a dimensionally ordered class of measures Ma defined in g , apt to represent, for any AIB E g x 8°, the function P. This means that, if A E g and P,-y(A) < +oo for an index-y, then p,p (A) = 0 for f3 < 1' ; moreover, if for every B E 8° there exists an o: such that 0 < Ma(B) < +oo, then (11.12) holds with m= /1-a. He proves that a necessary and sufficient condition for P to admit such a representation is the validity of the following condition
(C):
CHAPTER 11
92
{C) If Ai ~ Bi 1\ Bi+ 1 (with Ai Bn+l = B1 ), then
E
n
n
i=l
i=l
g, Bi
E
B 0 , i = 1, ... , n, and
II P(AiiBi) = II P(AiiBi+l) . Notice also that this condition was obtained by Renyi as a consequence of axioms (i)-(iii) in the case in which the family B is an additive set (that is, when the weak conditional probability is a conditional probability according to Definition 5); and Csaszar proves that (C) implies that P can be extended in such a way that the family B is an additive set. On the other hand, in 1968 Krauss [88] goes on by considering (in a finitely additive setting) a function P( ·I·) satisfying axioms (i)-(iii) on g x A 0 , with g and A Boolean algebras and A~ g (let us call this P(·l·) a strong conditional probability, which is, obviously, a conditional probability). In particular, P is called a full conditional probability when A= g. (We recall also that Dubins [56] proves that a strong conditional probability can always be extended as a full conditional probability, while Rigo [105] proves that a weak conditional probability can be extended as a full conditional probability if and only if condition (C) of Renyi-Csaszar holds). Krauss characterizes strong conditional probabilities in terms of a class of (nontrivial) finitely additive measures ma (not necessarily bounded), each defined on an ideal Ia of A, with If3 ~ Ia for (3 > a: for every B E A 0 there exist an ideal Ia such that
B E Ia \ U{.T,. : I,. ~ Ia} r
and for every A E Ia one has ma (A)
A
E
Uf4 : I,.
= 0 if and only if
~ Ia} U {0} ;
r
then, for any Ia and A, B E Ia ,
ma(A 1\ B)
= P(AIB) ma(B) .
COHERENT CONDITIONAL PROBABILITY
93
Notice that, if in our Theorem 4 (characterizing coherence) we take the set C = g x .A0 , with g and .A finite Boolean algebras and .A ~ g (in this case coherence of P is obviously equivalent to satisfiability of axioms {i}-{iii}), Krauss' theorem corresponds to the equivalence between conditions (a) and (c), with ma(·) = ma(Hg) Pa(-), and the family {Pa} is unique (as already observed after the proof of characterization theorem). We stress that none of the existing "similar" results on conditional probability (including those concerning weak and strong conditional probabilities) covers our framework based on partial assessments. In fact, for both Csaszar and Krauss (and Renyi), given a P(·l·) on g x 8° , the circumstance that g (and, for Krauss, also B ) are algebras plays a crucial role, as well as the requirement for P to satisfy condition {C): notice that both the subsets Ia and the measures ma need (to be defined) values already given for P, and the same is true for checking the validity of (C). In particular, to build the family {ma} Krauss starts by introducing, for any given event BE 8°, :F(B)
= {Bi
E ~: P(BIB V Bi)
> 0}
(so to say, B has not zero probability with respect each event Bi ), showing that :F(C) ~ :F(B) {::} P(CIC v B) = 0;
then for any B E 8° a relevant measure is defined in :F(B), by putting, for A E :F(B), P(AIAV B) mB(A) = P(BIA V B) '
and he proves that the set of :F(B)'s (and so that of the corresponding measures) is linearly ordered.
CHAPTER 11
94
In conclusion, all these results constitute just a way - so to say - to "contemplate" and ri-organize existing "data", while in our approach we must search for the values which are necessary to define the classes {Pa} ruling coherence. Then condition (b) of Theorem 4 becomes essential to build such classes (in the next Chapter we will come back to them, showing their important role also for the concept of zerolayer).
11.6
The role of probabilities 0 and 1
The following example shows that ignoring the possible existence of null events restricts the class of admissible conditional probability assessments. Example 14 -Given three conditional events E1IH1, E2IH2, EaiHa such that Ao = {A 1 , ... , A 5 }, with
H1 = A1 V A2 V Aa V A4 , H2 = A1 V A2, E1 A H1 = A1 , E2 A H2 = A2 , Ea
A
Ha = Aa V A4 , Ha = Aa ,
consider the assessment
P1
3
= P(EdH1) = 4,
If we require positivity of the probability of conditioning events, we must adjoin to the system (Sa) with a = 0 also the conditions
and this enlarged system (as it is easily seen) has no solutions. Instead the given assessment is coherent, since the system (Sa) has the solution 3
xl
= 4'
COHERENT CONDITIONAL PROBABILITY where Xr = P0 (Ar)· Then, solving now the system (Sa) foro: (notice that H~ = A 3 V A4) gives
95
=
1
1
Y3
with Yr
= Y4 = 2,
= PI(Ar)· In conclusion
are the representations {11. 7} of the given assessment. As far as conditioning events of zero probability are concerned, let us go back to Example 6 (Chapter 3) to show that what has been called a "natural" and "intuitive" assessment is a coherent one.
Example 15 (Example 6 revisited) - Given the assessment
consider the atoms generated by the events SI, H2, T2, s2:
A4
= H2 1\ sr, As= T2 1\ sr, A6 = 821\ sr
so that, putting Xr = P0 (Ar), to check coherence of the above assessment we should start by studying the compatibility of the following system XI+ X2 + X3 = O(xi + X2 + X3 + X4 + X5 + X6) xi= Hxi + x2 + x3) x2 = Hxi + x2 + x3) X3 = O(xi + x2 + x3) XI + X2 + X3 + X4 + X5 + X6 = 1 Xr;::: 0
96
CHAPTER 11
which has the solution X1 = x2 = Xg = Xs = 0, X4 = x 5 = ~ . So we can represent P(S1 If2) as ¥ = 0. Going on with the second system (SI), we get
(SI)
Y1 = !(y1 + Y2 + yg) Y2 = !(y1 + Y2 + yg) Y3 = O(y1 + Y2 + yg) Y1 + Y2 + Y3 = 1 Yr ~ 0
whose solution Y1 = Y2 = ~ , y3 = 0 allows to represent, by the probabilities P 1 (Ar) = Yr defined on A 11 also the three remaining given conditional probabilities.
Remark 12 - A sensible use of events whose probability is 0 (or 1) can be a more general tool in revising beliefs when new information comes to the fore. So we can challenge a claim contained in [118} that probability is inadequate for revising plain belief, expressed as follows: "'I believe A is true' cannot be represented by P(A) = 1 because a probability equal to 1 is incorrigible, that is, P(AIB) = 1 for all B such that P(AIB) is well defined. However, plain belief is clearly corrigible. I may believe it is snowing outside but when I look out the window and observe that it has stopped snowing, I now believe that it is not snowing outside". In the usual framework, the above reasoning is correct, since P(A) = 1 and P(B) > 0 imply that there are no logical relations between B and A (in particular, it is A AB I 0) and P(AIB) = 1. Taking instead P(B) = 0, we may have A AB = 0 and so also P(AIB) = 0. On the other hand, taking B= "looking out the window, one observes that it is not snowing" (again assuming P(B) = 0), and putting A="it is snowing outside", we can put P(A) = 1 to express
COHERENT CONDITIONAL PROBABILITY
97
a strong belief in A, and it is clearly possible (as it can be seen by a simple application of Theorem 4) to assess coherently P(AIB) = p for every value p E [0, 1]. So, contrary to the aforementioned claim, a probability equal to 1 can be, in our framework, updated.
Chapter 12 Zero-Layers We introduce now the important concept of zero-layer [29], which naturally arises from the nontrivial structure of coherent conditional probability brought out by Theorem 4.
12.1
Zero-layers induced by a coherent conditional probability
Definition 7 - Let
c = cl X c2
be a finite family of conditional events and P a coherent conditional probability on C. If P = { Pa} a=O,l,2, ... ,k is a relevant agreeing class, for any event E -=/=- 0 belonging to the algebra generated by C1 U C2 we call zerolayer of E, with respect to the class P, the (nonnegative) number /3 such that PfJ(E) > 0: in symbols, o(E) = j3.
Zero-layers single-out a partition of the algebra generated by the events of the family C1 u C2 . Obviously, for the certain event n and for any event E with positive probability, the zero-layers are o(O) = o(E) = 0, so that, if the class P contains only an everywhere positive probability Po, there is only one (trivial) zero-layer with a= 0. 99
G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
100
CHAPTER 12
As far as the impossible event 0 is concerned, since Pa(0) = 0 for any o , we adopt the convention of resorting to the symbol +oo to denote its zero layer, i.e. o(0) = +oo. Moreover, it is easy to check that zero-layers satisfy the relations o(A V B) = min{ o(A), o(B)},
and o(A 1\ B) ~ max{ o(A), o(B)}.
Notice that zero-layers (a concept which is obviously significant mainly for events of zero probability) are a tool to detect "how much" a null event is ... null. In fact, if o(A) > o(B) (that is, roughly speaking, the probability of A is a "stronger" zero than the probability of B), then P(AI(A v B))= 0 (and so P(BI(A V B))= 1), since, by Theorem 4,
P(AI(A V B))
Pa(A)
= Pa(A V B) ,
where o is the zero-layer of the disjunction A VB (and so of B); it follows Pa(A) = 0. On the other hand, we have o(A) = o(B) if and only if P(AI(A v B))· P(BI(A V B))
> 0.
Two events A, B satisfying the above formula were called commensurable in a pioneering paper by B. de Finetti [50]. Definition 8 - Under the same conditions of Definition 7, consider a conditional event EIH E C: we call zero-layer of EIH, with respect to a class P = {Pa} of probabilities agreeing with P, the (nonnegative} number
o(EIH)
= o(E 1\ H) -
o(H) .
101
ZERO-LAYERS
Notice that P(EIH) > 0 if and only if o(E A H) = o(H), i.e. o(EjH) = 0. So, also for conditional events, positive conditional probability corresponds to the zero-layer equal to 0. Moreover, by the convention adopted for the zero-layer of 0, we have EA H = 0 =? o(EjH) = +oo.
Example 16 - Revisiting Example 15, it is easy to check that the zero-layers of the null events 8 1 and 8 1 A 8 2 are, respectively, 1 and 2; so the zero-layer of the conditional event 82 181 is 2- 1 = 1. Other examples of zero-layers can be easily obtained by revisiting the other examples given in the final part of the previous Chapter and resorting to the corresponding agreeing classes.
12.2
Spohn's ranking function
Spohn (see, for example, [121], [122]) considers degrees of plausibility defined via a ranking function, that is a map "" that assigns to each possible proposition of a finite "world" W a natural number (its rank) such that (a) either K(A) = 0 or K(Ac) = 0, or both; (b) K(A V B)= min{K(A), K(B)}; (c) for all A A B =j:.
K(BIA)
0, the conditional rank of B given A is
= K(A A B) -
K(A).
Ranks represent (according to Spohn terminology) degrees of "disbelief". For example, A is not disbelieved iff K(A) = 0, and it is disbelieved iff K(A) > 0. They have the same formal properties of zero-layers; the set of not disbelieved events is called the core E of K, that is
E
= {w
E
W : K( { w})
= 0}.
102
CHAPTER 12
It corresponds (in our setting) to the set of events whose zero-layer is a = 0 , i.e. events of positive probability P( ·IHg) (possibly,
Hg =
n).
Ranking functions are seen by Spohn as a tool to manage plain belief and belief revision, since he maintains that probability is inadequate for this purpose. But in our framework this claim can be challenged, as it has been discussed in Remark 12 of the previous Chapter (a simple computation shows that the zero-layer of the null event B considered in that Remark is equal to 1). See also the paper [39].
12.3
Discussion
Even if ranking functions have the same formal properties of zerolayers, notice that - contrary to Spohn - we do not need an "autonomous" definition, since zero-layers are - so to say "incorporated" into the structure of a coherent conditional probability : so our tool for belief revision is in fact coherent conditional probabilities and the ensuing concept of zero-layer. Moreover, ranking functions need to be defined on all subsets of a given "world" W, since otherwise their (axiomatic) properties could be, in some cases, trivially satisfied without capturing their intended meaning (compare this remark with the discussion of the axioms for probability, at the beginning of Section 3.2). The starting point of our theory is instead an arbitrary family cl u c2 of events (see Definition 7), from which zero-layers come out.
Example 17 - Let E, F, C be events such that E V F VC= n, E 1\ F 1\ C = 0 , Ec 1\ Fe = Fe 1\ cc = Ec 1\ cc = 0 . The following rank assignment
x:(E) = 1 , x:(F) = 2 , x:(C) = 0 satisfies the axioms, nevertheless it is not extendible to the algebra generated by the three given events.
103
ZERO-LAYERS
There are in fact three atoms
and we have
now, since
then ii;(A2 ) = 0 or ii;(A 3 ) = 0 (or both}. But the values of the rank assigned to E, F, G clearly imply ii;(A 2 ) ;::::: 2 and ii;(A3 ) ;::::: 1. Now, a brief discussion concerning further differences between zerolayers and ranking functions follows. In our framework, the assignment (and updating) of a zero-layer of an event through conditioning is ruled by coherence, and can give rise both to events remaining "inside" the same layer or changing the layer (this aspect will be deepened also in the last section of Chapter 16 on Inference, concerning the problem of updating probabilities 0 and 1); on the other hand, the definition of condizionalization given by Spohn [122] is, in a sense, free from any syntactic rule. In fact, to make inference a ranking function ii; is updated by a function ii;A,n (where A is an event of Wand n a natural number) given by
I
/i;(BIA) = /i;(B A A) - /i;(A) , if B ~ A
ii;A,n(B) =
ii;(BIAc)
+ n, if B
~
Ac
min{ii;A,n(B A A), ii;A,n(B A Ac)}, for all other B.
The "parameter" n is a measure of the "shifting" of ii; restricted to A with respect to ii; restricted to Ac, and Spohn himself ascribes
104
CHAPTER 12
to the value n a wholly subjective meaning (he claims: "there is no objective measure of how large the shift should be") ; but the value of n plays a crucial role in the new assessment of r;,, which is influenced by n also in the third case (B g A and B g A c ) • Anyway, what comes out is a new "scenario" relative only to the situation A . So it is not possible, with a ranking function, to consider at the same time many different conditioning events Hi in the same context, as we do in our setting; moreover, there is no needin the approach based on coherence- of the (arbitrary) number n, since coherent conditional probabilities allow "automatic" assignment of both probability values and zero-layers. The following example may help in making clearer this issue : Example 18 - Consider five conditional events Ei!Hi, obtained from the square E = [0, 1] x [0, 1] c JR? in this way: take the (unconditional) events
Ea with x1
= {(x, y) E E
:x
= y} ,
= ~ , Y1 = Y2 = ~ , x2 = ~ , and
Then (assuming a uniform distribution on E) consider the assessment: P(E1IH1)
= P(E2IH2) = P(Ea!Ha) = 0,
P(E4IH4) The relevant atoms are
1
= 2, P{E5IH5) = 0.
105
ZERO-LAYERS
and system (80 ) is X1 = 0 ·(XI+ X2 + X3 + X2 = 0 ·(xi+ X2 + X3 +
X4)
X3 = 0 ·(xi+ X2 + X3 +
X4)
X4)
X1 = ~ · (x1 + x2) X1 = 0 • (xi+ X3) X1 + X2 + X3 + Xr ~
X4
= 1
0.
Its only solution is
and then o(A4 ) = 0. Going on with system (SI), we get Y1 = ~ · (YI + Y2) Y1 = 0 · (YI + Y3) Y1 + Y2 + Y3 = 1 Yr ~ 0,
whose only solution is Y1 = Y2
=0,
Y3 = 1 ,
so that o(E3 ) = 1 . Finally, the system (S2 ) gives z1 = ~ · (z1 + z2) { z1 + z2 = 1 Zr ~
0,
that is z1 = z2 = ~, so that o(EI) = o(E2 ) = 2 (and since we have E 4 = E 5 = E 1 , then also E 4 and E 5 are on the same layer). Then
CHAPTER 12
106
o(Hs) = o(EI V E3) = min{ o(EI), o(E3)} = 1, so that, in conclusion,
o(E4IH4) = o(E4)- o(H4) = 2- 2 = 0 (in fact P(E4IH4) > 0 ), while
o(EsiHs)
= o(Es)- o(Hs) =
2- 1 = 1,
z.e. conditioning on H 5 makes E 5 a "weaker" zero (a picture of the unit square with the relevant events may be helpful to appreciate the intuitive meaning of these conclusions!) In this example we have also another instance of the possibility of updating (coherently!) a probability equal to 1: consider in fact, for example, P(E4) = 1, and notice that P(E4IH4) = ~. In conclusion, coherent conditional probability complies, in a sense, with Spohn's requirements; he claims in Section 7 of [120]: "... Popper measures are insufficient for a dynamic theory of epistemic states . . . the probabilistic story calls for continuation. It is quite obvious what this should look like: just define probabilistic counterparts to ranks which would be something like functions from propositions to ordered pairs consisting of an ordinal and a real number between 0 and 1 . . . the advantage of such probabilified ranks over Popper measures is quite clear". We have shown that we do not need to distinguish between the two elements of the ordered pair that Spohn associates to each proposition, since all the job is done by just one number. In this more general (partial assessment allowed!) setting, the same tool is used to update both probabilities (and zero-layers) of the events initially taken into account (or else of those belonging to the same context, i.e. logically dependent on them), and probabilities (and zero-layers) of "new" events "come to the fore" later. In fact updating is nothing else than a problem of extension (see the next
ZERO-LAYERS
107
Chapter and Chapter 16 on Inference), so that a Popper measure (which is the "nearest" counterpart to de Finetti's conditional probability: see Section 10.3) is certainly apt to do the job, since it is a particular coherent conditional probability, whose updating is always possible (see also the remarks at the end of Section 12.2). Notice also that the set of events belonging to the same zerolayer is not necessarily an algebra, so the role of coherence is crucial to assign a probability to them. On the other hand, it is unclear, starting from the assignment of ranks, how to get a "probabilified" rank without conditioning to the union of events of the same rank (regarding this conditional probability as a restriction of the whole assessment on W ), but this is a matter of conditioning- except for the rank 0 - with respect to events of zero probability; then, since a tool like coherent conditional probability (or Popper measure) is anyway inevitable, why not introducing it from the very beginning instead of letting it "come back through the back-door"? Another issue raised by Spohn in [120] is to resort to nonstandard numbers (i.e., the elements of the iperreal field R * , a totally ordered and nonarchimedean field, with R* :::) R) as values of the relevant conditional probabilities. We deem that a (ticklish) tool as the iperreal field is not at all easily manageable, for example when we need considering both reals and iperreals (as it may happen, e.g., in Bayes' theorem). Moreover, it is well known (see [88]) that an iperreal probability P* gives rise to a conditional probability P(EIH) = R [P*(E A H)] e P*(H) '
where Re denotes the function mapping any iperreal to its real part (see, e.g., [106]); conversely, given a conditional probability, it is possible to define (not uniquely) an iperreal one. Then, if the above ratio is infinitesimal, we get P(EIH) = 0. Anyway, in our coherent setting the process of defining autonomously ranks to be afterwards "probabilified", or of introducing iperreal probabilities, is not needed (not to mention- again-
108
CHAPTER 12
the further advantage of being allowed to manage those real situations in which partial assessments are crucial). The role of zero-layers for the concept of stochastic independence is discussed in Chapter 17, where also the "unpleasant" consequences coming out from resorting (only) to ranking functions to define independence are shown (see, in particular, Remark 16).
Chapter 13 Coherent Extensions of Conditional Probability A coherent assessment P, defined on a finite set C of conditional events, can be extended in a natural way (through the introduction of the relevant atoms) to all conditional events EIH logically dependent on g, i.e. such that E 1\ H is an element of the algebra g spanned by the (unconditional) events Ei, Hi (i = 1, 2, ... , n) taken from the elements of C, and H is an element of the additive class spanned by the Hi's. Obviously, this extension is not unique, since there is no uniqueness in the choice of the class {P01 } related to condition (c) of Theorem 4. In general, we have the following extension theorem (essentially due to B. de Finetti [52] and deepened in its various aspects in [94], [126], [84], [102]).
Theorem 6 - If C is a given family of conditional events and P a corresponding assessment, then there exists a {possibly not unique) coherent extension of P to an arbitrary family K of conditional events, with /C 2 C, if and only if P is coherent on C. Notice that if P is coherent on a family C, it is coherent also on 109 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 13
110
E c_; C. In order to have a complete picture of the problems related to the extension to a new conditional event EIH of a coherent conditional assessment P on a finite family C, we will refer to the following two points: (i) finding all possible coherent extensions of the conditional probability P(EIH) when EIH E g x go; (ii) extending this result to any conditional event FIK (i.e., possibly with FIK rt g x go).
Consider (i). First of all, notice that, given two coherent assessments relative to n + 1 conditional events
IT' = {P(EiiHi) =Pi, i = 1, ... , n; P(EIH) = p'} and
with p'
~ p",
then also the assessment
IIa ={Pi ,i = 1, ... ,n; ap' + (1- a)p"} is coherent for every a E (0, 1]: this plainly follows by looking at the relevant gain n
G
= L AilH;(IE;- Pi)+ AolH(IE- (ap' + (1- a)p")) i=l
and noting that, for A0 > 0, n
G 2 L AifH; (/E, -Pi)
+ AolH(JE -
P11 )
i=l
and
n
G ~ LAifH;(/E;- Pi)+ AofH(JE- P1 ) , i=l
CONDITIONAL PROBABILITY EXTENSIONS
111
so that the conclusion follows from the coherence of IT' and IT". Therefore the values of the possible coherent extensions of P to EIH constitute a closed interval [p', p''] (possibly reducing to a single point). Now, denote by pi the set of all classes {Pa}; related to the often mentioned characterization theorem. For the conditioning event H E go there are the following two situations: • (A) there exists, for every class {Pa} E pi, an element PfJ (defined on the subset AfJ of the set Ao of atoms: cf. condition (c) of Theorem 4) such that Hi ~ AfJ for some i, with PtJ(H) > 0;
• (B) there exists a class {Pa} E pi such that for every a one has Hi~ Aa for some i, and Pa(H) = 0. In the case (A), we evaluate by means of formula (11.7) all the corresponding values P(EIH), and then we take infimum and supremum of them with respect to the set pi. By writing down the relevant programming problem, we get
L: ArCEI\H
11
p = sup 1'3
"" L.....J
y~ a
Yr
'
Ar~H
where y~ = Pa(Ar) and Ar E Ao, the set of atoms of the algebra g (we denote by the same letter a all the indices ai corresponding to the class containing the probability Pa such that Pa(H) > 0).
CHAPTER 13
112
It is easily seen that this problem is equivalent to the following linear one
p'=inf
L
z~
p" = sup
'PJ Arr:;_EAH
where
z~ = y~ / L y~
L
z~ ,
p; Arr:;_EAH
.
Arr:;_H
Clearly, infimum and supremum will be reached in correspondence to those classes such that the relevant systems have the minimum number of constraints. In the next Chapter we will expound a strategy to make easier the computation of the solution of the above programming problem by suitably "exploiting" zero probabilities : this means that we search for classes pi in which Pa(H) = 0 for the maximum number of indices a.
In the case (B) we are in the situation discussed in Remark 10 of the previous Chapter: we can assign to the conditional events EIH arbitrary values, so that p' = 0 and p" = 1. Consider now point (ii): we must take a conditional event FIK ~ Q x go, so that the events F 1\ K and K are not both in Q ; we show now that we can find suitable conditional events F.IK. and F*IK* such that the events F., K., F*, K* are union of atoms, proving then that a coherent assessment of the conditional probability P(FIK) is any value in the closed interval p. ~ P(FIK) ~ p*,
where p. = inf P(F.IK.), 'PJ
p* = supP(F*IK*). pj
(13.1)
CONDITIONAL PROBABILITY EXTENSIONS
113
Obviously, if 1l is the algebra spanned by g U {F, K}, there is (by Theorem 6) a coherent extension of P to 1l x 1l0 • Now, let p a possible value of a coherent extension of P(·l·) (initially given on C) to the conditional event FIK fj. g x go, and consider the set Bo of the atoms Br generated by Ei, Hi (i = 1, 2, ... , n), F, K (that is, Ar 1\ (F 1\ K), Ar 1\ (Fe 1\ K), Ar 1\ Kc for any Ar E Ao ). Since p is coherent, there exists (at least) a class {Pa} containing a probability Pa (to simplify notation, for the index singling-out this probability we use the same symbol which denotes the generic element of the class) such that
P.a (F 1\ K)
p-
-
Pa(K)
Er
Pa(Br)
BrCFI\K
- ----=--==-------=----:-=---:--
-
Er Pa(Br)
Br~K
>
X
Er
Br~FI\K
Pa(Br)
+
Er
Pa(Ar)
Ari\Fci\K-:j=0
X+ a
Since a ~ 0, the latter function (of x) is increasing for any x, so that, taking into account that, for the atoms (of the two classes Ao and 8 0 ) contained in F 1\ K we have Vr Ar ~ Vr Br , we get
Er p
2::
Er
Ar~FI\K
Pa(Ar)
ArCFI\K
Pa(Ar)
+
Er
Pa(Ar)
= P(F*IK*)'
Ari\Fci\K-:j=0
where each probability Pa assures - according to condition (c) of the characterization theorem - also the coherence of the initial assessment on C, and
CHAPTER 13
114 Moreover, clearly,
F.
1\
K. =
F; 1\ K. =
V Ar
Ar~FI\K
V
~
F
1\
Ar 2 pc
K, 1\
K.
ArAFCI\K-:f;0
Notice that F. IK. is the "largest" conditional event belonging to g x go and "included" in FIK, according to the definition of inclusion ~o for conditional events recalled in Theorem 3 of Chapter 10 and in Remark 11 of Chapter 11. Now, letting Pa. vary on all different classes of pJ assuring the coherence of the initial assessment on C, we get the left-hand side of (13.1). For the right-hand side the proof is similar, once two events F* and K* are suitably introduced through the obvious modifications of their "duals" F. and K • . In conclusion, we can summarize the results of this Chapter in the following Theorem 7 - Given a coherent conditional probability P on a finite set
c =cl X c2 = {EliHl, ... 'EniHn} of conditional events, let pJ = { Pa.} be the set of classes agreeing with P, and let g be the algebra generated by C = C1 UC2 • Consider a further conditional event FIK fj C, and put F.IK.
= AIBg>FIK sup { AIB} , AIBEQxgo
F* IK*
= FIKg>AIB inf {AIB} AIBEC/XC/
0
.
Then a coherent assessment of P(FIK) is any value of the interval [p.,p*], wherep. = 0 andp* = 1 if F.IK. or F*IK* satisfy condition (B), while, if both satisfy condition (A), p. = infP(F.IK.), p1
p* = supP(F*IK*). Pi
CONDITIONAL PROBABILITY EXTENSIONS
115
Remark 13 - When condition (A) holds for both F* IK* and F* IK* , we may have p* = 0 and p* = 1 as well: it is easily seen that this occurs when there exists a class {Pn} such that A,e 2 Hg and A,e R. K* (or K*) for an index f3 . This is equivalent to the existence of a solution of system (To) under (3.2) of Section 14.1 of the next Chapter.
Chapter 14 Exploiting Zero Probabilities The previous results related to the coherence principle and to coherent extensions can be set out as an algorithm for handling partial conditional probability assessments, the corner-stone of all the procedure being the characterization Theorem 4 of Chapter 11.
14.1
The algorithm
If C is an arbitrary family of conditional events EiiHi (i = 1, ... , n), suitably chosen as those referring to a "minimum" state of information relative to the given problem, supply all the known logical relations among the relevant events Ei, Hi , and give a "probabilistic" assessment P = {Pi = P(EiiHi)}. The procedure to check coherence can be implemented along the following steps:
• (1): build the family of atoms generated by the events Ei, Hi (taking into account all the existing logical relations); • (2): test the coherence of P. 117 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 14
118
The second step is really a subprocedure implemented by the following algorithm: • (2.1): introduce the system (Sa) with na unknowns,
• (2.2): put a = 0 in (Sa) ; • (2.3): if (Sa) has solutions, go to (2.4); otherwise the assessment is not coherent and must be anyhow revised (in the latter case, go to step (2.3') to get suggestions for the revising process); • (2.3'- a): introduce subsystems (Sa,k) of (Sa) obtained by deleting, in all possible ways, any k equations; • (2.3'- b): put k = 1 in (2.3'- a) and go to (2.3'- c); • (2.3'- c): if there exist compatible subsystems (Sa,k), then for each of them choose, among the conditional events EiiHi appearing in (Sa) and not in (Sa,k), a conditional event EjiHi: to find its interval of coherence, go to step (3.2), putting there F*
= Ei , K* = Hi ;
• (2.4): if (Sa) has a solution Pa(Ar) such that
Pa(Hi) =
L
Pa(Ar) > 0
Ar<;Hi
for every Hi specified in the first lines of (Sa), the assessment is coherent; if Pa(Hi) = 0 for some Hi, go to (2.5) until the exhaustion of the Hi's; • (2.5): put a+ 1 in place of a and go to step (2.3).
EXPLOITING ZERO PROBABILITIES
119
In the above sequence of linear systems, the first one has n + 1 equations (where n is the cardinality of the set C of conditional events) and m unknowns (the number of relevant atoms). From a theoretical point of view, it could seem that the "nicest" situation should correspond to find at the first step a solution satisfying (11.4) for any Hi, i.e. to solve only one system. Nevertheless, notice that to make easier the computational procedure it is more suitable not searching for such a solution, but trying instead to solve many "smaller" systems, in the sense of having a smaller number of equations (possibly only two), but more important - a smaller number of unknowns, since the main computational problem is that of building the atoms. In fact, we should choose, at each step, solutions in which there are many suitably chosen unknowns P0 (Ar) equal to zero relative to those atoms Ar 's contained in as many as possible conditioning events Hi's (cf. [26]). The best situation would be when the Ar 's are contained in all Hi's except one: then each system would reduce to a system having only a few equations (possibly two) - that is, only those which refer to the remaining conditioning events H;'s and which express P(E;IH;) by means of the relevant probability Po - plus the last one requiring that the sum of the probabilities of the atoms must be equal to 1. So at each step we would be able to verify coherence of the assessed conditional probabilities. This is the "nicest" situation: in fact, a careful scrutiny of the possible cases shows that it could instead happen, for example, that each Ei 1\ Hi is contained in the conjunction of all Hi's (a very . .. pessimistic and may be unrealistic situation); then it would be impossible to put equal to zero (acting on the probabilities of the relevant atoms) all P(Hi)'s except one, because even putting equal to zero the probability of just one Hi would entail that all conditioning events have zero probability (as it is easily seen by looking at the relevant system). On the other
120
CHAPTER 14
hand, another extreme situation is that in which the events H/s are mutually incompatible: in this case we can arbitrarily choose n - 1 of them to put equal to zero their probability. In all other cases the algorithm must proceed by a careful choice of the conditioning events whose probability should be put equal to zero. For example, if we consider an event Hi such that the disjunction D of some of the remaining Hi's does not contain all atoms Ar which are contained in Ei A Hi , nor all atoms As which are contained in Ej A Hi , then this is sufficient to guarantee that putting equal to zero the probabilities of all the Hi ~ D does not render equal to zero the probability of the event Hi; then there is a solution of the relevant system such that
In conclusion, since solving a system by giving some unknowns the value zero (in such a way that some equations are trivially satisfied) is the same as solving a system with only the remaining equations, each ensuing nontrivial solution may be clearly related only to the "bigger" atoms generated by some of the events: so we can adopt a strategy able to reduce (drastically, in the most common cases) the number of atoms needed to check coherence. Now, let us consider the third step of our algorithm, concerning the extension problem: • (3): extend (in a coherent way) to a new conditional event FIK (possibly to many such conditional events, but the procedure acts step by step) the assessment P, finding the interval of coherence of P(FIK). Recalling that the two extreme values are reached in correspondence to those classes such that the relevant systems have the minimum number of constraints, the procedure of exploiting zero probabilities aims at singling-out these classes, and can be set out along the following steps:
EXPLOITING ZERO PROBABILITIES
121
tt
• (3.1): given the event FIK C, supply all the significant logical relations between F, K and the Ei's and Hi's; if F and K are, in particular, logically dependent on the Ei's and H/s, go to (3.2) putting K. = K; on the contrary, introduce the conditional event F.!K. (and its obvious "dual" F*IK*) defined in the final part of the previous Chapter. • (3.2): given F.!K., consider the following system (T0 ), with unknowns Yr = P0 (Ar), Ar E Ao, giving positive probability to K.,
L L
Yr >0 Yr=O
L
Yr
= P(Ei!Hi) L
~~~A~
Yr
(i
= 1, ... ,n)
~~~
(notice that the last n equations are trivially satisfied); • (3.3): if (To) has a solution (cf. also Remark 13 at the end of previous Chapter), go to step (3.8); • (3.4): if (T0 ) has no solutions, introduce a system (S~) obtained by adding in the following equation to (Sa) :
E
X~=O;
Ar~K.
• (3.5): put a= 0 in (S~) • (3.6): if (S~) has no solutions, go to (3.9); • (3.7): if (S~) has a solution, put a+ 1 in place of a and go to (3.4) until a not compatible system is found - in this case go to step (3.9)- or until the exhaustion of the Hi's- in this case go to (3.8);
122
CHAPTER 14
• (3.8): put p. = 0 and go to step (3.10); • (3.9): solve the following linear programming problem
L
min
x~,
A,.~F.AK.
with constraints
I
L
x~-Pi
A,.~E,I\Hi
L
X~
L
x~=O
A,.~Hi
=1,
X~ ~ 0,
Ar ~ H~
A,.~K.
• (3.10): consider the conditional event F*IK* (i.e., the "dual" of F.IK... as introduced at the end of the previous Chapter) and repeat the procedure from step (3.2) by replacing, in all steps, K,.. by K*, K.I\F; by K* 1\F*, p,.. = 0 by p* = 1 and, x~) by (max x~) . finally, replacing (min
L
L
A,.~F*I\K*
14.2
Locally strong coherence
In this Chapter we are showing how to exploit zero probabilities through the possibility of searching for conditioning events H such that Pa(H) = 0 for the maximum number of probabilities Pa. Furthermore, checking coherence "locally" to get "global" coherence is also strictly connected with the existence of logical relations among the given events, and it is then useful to find suitable subfamilies that may help to "decompose" the procedure: in other words, we need to build only the atoms generated by these subfamilies. This procedure has been deepened in all details (and implemented in XLISP-Stat language) by Capotorti and Vantaggi in [18] through the concept of locally strong coherence, which applies in fact to subfamilies of the given set of conditional events: checking
EXPLOITING ZERO PROBABILITIES
123
whether the assessment on a subfamily does not affect coherence of the whole assessment allows to neglect this subfamily. Hence, even if looking at subfamilies has, in a sense, a "local" character, their elimination has a global effect in the reduction of computational complexity. We start (all results that follow are contained in reference [18]) with the following
Definition 9 - Given the family
of conditional events, an assessment P in C is called strongly coherent with respect to B, where B is an event such that BI\Hi =f 0 for all i = 1, ... , n, if the assessment P' defined on C' = {Eii(Hi 1\ B), (Hi 1\ B)IO: 1 = 1, ... , n}
by putting P'(EiiHi 1\ B) coherent.
=
P(EiiHi) and P'(Hi 1\ B)
>
0 zs
Obviously, strong coherence (with respect to B) implies coherence, but the converse is not true. Moreover, strong coherence implies that it is possible to choose the coherent extension of P to the atoms (generated by the - unconditional - events of the family C) contained in ne by giving them zero probability.
Definition 10 - Let :F = :F1 x :F2 be a subfamily of C, and put v = c \ :F = V1 x v2 . If B:r- = (
V
Hit'
HiE'D2
then the assessment P is locally strong coherent in :F when the restriction of P to :F is strongly coherent with respect to B :F. It follows that B:r- 1\ Hi = 0 for every Hi E :F2. The following theorem points out the connections between coherence of the assessment on C and locally strong coherence in a suitable subset :F.
124
CHAPTER 14
Theorem 8 - Let P : C -+ [0, 1] be locally strong coherent on :F . Then P is coherent (on C) if and only if its restriction to V = C\:F is coherent. The proof of the theorem is based on the following observations : if p is locally strong coherent in :F ' then :F2 n v2 = 0 and the first system (Sa) (of the characterization theorem, Theorem 4 in Chapter 11) has a solution such that x~ = 0 for any atom Ar c; BJ=- and such that x~ > 0 for every Hi E V 2 ; therefore the second sys-
L
Art;Hi
tern (SI) contains only equations relative to the conditional events EiiHi E V, so that coherence on C depends only on coherence on V. The relevant aspect of the above theorem is that locally strong coherence on a subset :F of C makes this subset :F a sort of "separate body" that allows to ignore the relationships among conditional events in :F and those in V: as a consequence, the size of both the family of conditional events EiiHi and the set of atoms where coherence must be checked can be more and more strongly reduced by an iterative procedure, thanks also to necessary and sufficient logical conditions for locally strong coherence relative to specific subsets. For example, in [18] there is a complete characterization of locally strong coherence when :F is a singleton, and many sufficient conditions have been found when :F contains two or three conditional events. We report here only the characterization relative to a single conditional event EIH. If :F = {EIH}, then Pis locally strong coherent in :F if and only if one of the following conditions holds: (a)
P(EIH)
1\
= 1 and E /\ H
Hj
=f. 0 ;
Hj#H
(b)
P(EIH)
= 0 and Ec /\ H
1\ Hj"#H
Hj
=f. 0
EXPLOITING ZERO PROBABILITIES
125
(c)
Therefore, if a conditional event of C satisfies one of the conditions (a), (b), (c), then it is clearly enough to prove coherence only for the remaining n - 1 conditional events: but, before doing so, we can repeat the procedure, searching if there is among them another conditional event satisfying one of the three conditions, and so on, until this is possible. When none of the remaining conditional events verifies (a), (b), or (c), we can proceed by analyzing the (possible) locally strong coherence f~r subsets of C containing two conditional events, and so on. Finally, we meet with a subset of C which is not locally strong coherent with respect to any of its subsets, and here coherence must be checked in the usual way. ln [18] it is proved that the result does not depend on the "path" that has been followed to reach the subset of C where coherence must be checked. Here is a simple example (for more sophisticated ones, see the aforementioned paper).
Example 19 - Let
be such that
and consider the assessment
126
CHAPTER 14
Now, we search for locally strong coherence relative to singletons contained in C: with respect to :F1 = {E1IH1} locally strong coherence fails, since E 1 A H 1 A H~ = 0 , while P is locally strong coherent in :F2 = {E2IH2}, since E~ A H2 A Hf A H3-=/= 0. Then we need to check coherence of P on :F = {E1IH1, E3IH3}, and now P (or, better, its restriction to :F) is locally strong coherent on the set :F1 = {EdH1}, because E1 A H1 A H3 -=/= 0 and
Ef A H1 A H3 -=/= 0 . Therefore it is enough to check coherence only on the singleton :F3 = {E3IH3}, but this is assured by any value in [0, 1] of the relevant conditional probability. In conclusion, the given assessment is coherent. Notice that, by resorting to the usual procedure through the sequence of systems (Sa) , we would need to consider- in this example - eleven atoms.
Chapter 15 Lower and Upper Conditional Probabilities 15.1
Coherence intervals
The extension Theorem 6 (Chapter 13) is the starting point to face the problem of "updating" (conditional) probability evaluations. In particular, the extension to a single "new" conditional event F 1 IK1 (cf. Theorem 7) gives rise to an interval [p~, p~] of coherent values for P(FIIK1). Choosing then a value p E [p~, p~], we can go on with a further new conditional event F 2 IK2 , getting for it a coherence interval [p~, p~], which (besides depending on the choice of p) can obviously be smaller than the interval we could have obtained by extending directly (that is, by-passing F1IK1) the initial assessment to F2 IK2 • Therefore, given an initial assessment P( ·I·) on n conditional events E 1IH1, ... , En!Hn, and h "new" conditional events
if we do not proceed step-by-step by choosing a coherent value in each subsequent interval, we could make h "parallel" coherent extension [p~, p~], ... , [p~, p~], but in this way we are not warranted
127 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
128
CHAPTER15
that choosing then a value Pi in each [p~,p?], i = 1, ... , h, the ensuing global assessment (including the initial one) would be coherent (this particular - and unusual - circumstance is called "total coherence" in [73]). On the other hand, if we choose as values of the further assessment all the left extremes p~ of the above intervals we get, as global evaluation, what is called a lower probability (and we get analogously an upper probability if we choose the right extremes). In particular, we may obviously find that some values of the lower probability are equal to zero, so that the assumption of positivity which is usually done in many approaches to conditioning for "imprecise" probabilities is a very awkward and preposterous one. A thorough discussion of these aspects is in [32] and will be dealt with in the next Chapter in the framework of inferential problems. Moreover, most of the functions introduced in the literature as measures of uncertainty (aiming at extending probability) can be looked upon as particular lower and upper probabilities: so this seems a further argument which renders natural to undertake an alternative treatment of these functions.
15.2
Lower conditional probability
We will refer - here and in the sequel - only to lower conditional probabilities P; clearly, we can easily get corresponding results concerning upper probabilities: in fact an upper probability is a function P defined as in the subsequent formula (15.1) by replacing "inf' by "sup".
Definition 11 - Given an arbitrary set C of conditional events, a coherent lower conditional probability on C is a nonnegative function P such that there exists a non-empty dominating family
LOWER AND UPPER CONDITIONAL PROBABILITIES
129
P = {P( ·I·)} of coherent conditional probabilities on C whose lower envelope is P, that is, for every EIH E C, P(EIH) = i~f P(EIH) .
(15.1)
Example 20 - Given a partition {E1 , E 2 , Ea, E 4 } of n, consider the event H = Ea V E 4 and the assessment
To see that this is not a coherent conditional probability it is enough to refer to Theorem 6 (Chapter 13) : in fact there does not exist, for instance, a coherent extension to the conditional event HIH, since p(EaiH)
1
+ p(E4IH) = 2 i= 1 = p(HIH).
Nevertheless there exists a family P = {P', P"} of coherent conditional probabilities, with
P'(E1In) P"(E1In)
= ~, P'(E2In) = ~, P'(EaiH) = ~, =
l,
P"(E2In)
= ~, P"(EaiH) = ~,
P'(E4IH)
= ~,
P"(E41H)
=
l·
and p is its lower envelope.
We show now that, when C is finite, if P is a coherent lower conditional probability, then there exists a dominating family P' ;2 P such that
P(EIH)
= II.}!n P(EIH) .
Since any element of the dominating family must be a coherent conditional probability (then extendible to g x 8°, with g algebra and 8° additive class), we may argue by referring to C = g x 8°.
130
CHAPTER 15
Let EIH E C be such that P(EIH) = i~f P(EIH), but not the minimum; then for any € > 0 there exists a conditional probability PF. E P with
Define a new conditional probability P' = lim PF. ( P' is a condiF.-tO tional probability, since the limit operation keeps sum and product and also the equality PF.(HIH) = 1 ). Now P'(EIH)
= limPF.(EIH) = P(EIH) F.-tO
and for any other conditional event FIK E C we have lim0 PF.(FIK) = P'(FIK) ~ P(FIK), E-t
Definition 12 - Given a coherent lower conditional probability P on C and any conditional event FiiKi E C, the element P of the dominating family P such that P(FiiKi) = P(FiiKi) will be called i-minimal conditional probability. The following relevant theorem has been given and discussed in [27] and [32]. For simplicity, we prefer to formulate it for a finite family of conditional events, but it could obviously (see Remark 8, Chapter 11) be expressed in a form similar to that of Theorem 4 (the characterization theorem for "precise" conditional probabilities). Theorem 9 - Let C be an arbitrary finite family of conditional events FiiKi, and denote by Ao the usual relevant set of atoms. For a real function P on C the following two statements are equivalent: (a) the function P is a coherent lower conditional probability on C; {b) there exists, for any FiiKi E C (at least) a class of probabilities ITi = { P~, Pf, ... } , each probability P~ being defined on a
LOWER AND UPPER CONDITIONAL PROBABILITIES suitable subset A~ ~ unique P~ with
Ao,
131
such that for any F3 jK3 E C there is a
Lr P~(Ar) > 0 Ar~Kj
and
Er P~(Ar) P(F.·IK·) < Ar~Fji\Kj . 3 3 Er P~(Ar)
i= i
if
j
if
j=i
Ar~Kj
Er
P(PIK·) -
3
= Ar~Fji\Kj
P~(Ar)
Er P~(Ar)
3
Ar~Kj
and, moreover, A~, C A~, for a' > a" , while P~, ( Ar) = 0 if ArEA~,.
Proof- Let .FiiKi E C: there exists a coherent conditional probability pi (i-minimal) on C such that Pi(EiiHi) = P(EiiHi) and Pi(E3 jH3) ~ P(E3 jH3) for j i= i. Then this clearly amounts, by Theorem 4, to the validity of condition (b). • Actually, it is possible to build the classes {P~} as solutions of sequences of systems (one for each conditional event FiiKi E C) like the following one:
Lr
P~(Ar)
= P(FijKi)
~~~!\~
Lr P~(Ar) [if P~-l(Ki)
= 0],
~~~
Lr
P~(Ar) ~ P(F;IK;) Lr P~(Ar) [if P~-l(K;)
~~~!\~
~~~
= 0],
Lr P~(Ar) = 1 Ar~K;:,;
where the second line gives rise to many inequalities, one for each j i= i , and K~i is, for a ~ 0, the union of the Ki 's such that P~_ 1 (Ki) = 0 (and P~ 1 (Ki) = 0 for all K/s). This can give rise, as in the case of probabilities, to an actual algorithm to prove the consistency (coherence) of a lower
132
CHAPTER 15
probability assessment on a finite set of events. Clearly, for a partial lower (upper) probability assessment we have less stringent requirements, since systems with inequalities have more solutions than those with only equalities, i.e. there are better "chances" (with respect to a probability assessment) to fulfill the requirement of coherence. But the relevant check is computationally more burdensome (in fact we must repeat the same procedure n times, where n is the cardinality of the given set of conditional events). Example 21 - Given two (logically independent) events A and B, consider the following (unconditional) assessment P(A)
P(B)
=
=
1
1
4 , P(A 1\ B) =
16 , P(A V B) =
3
4.
To prove that Pis a lower probability, we resort to Theorem 9 (taking all conditioning events equal to n ): so we need to write down four systems, one for each event; the unknowns are the probabilities of the atoms
Consider the system referring to the event A :
i XI +x2 ~ i
XI+ X3 =
XI~ I~
i
XI+ X2 + X3
~
XI + X2 + X3
+ X4
Xi~
=
1
0
A solution is 1
XI= X3 =
S'
1
X2
=-, 2
X4
=
1
4.
LOWER AND UPPER CONDITIONAL PROBABILITIES
133
Solutions of the other three systems are the following: that corresponding to B is Y1 = Y2 =
1
1
B,
Y4 =
4,
that corresponding to A I\ B is 1
Zg
= 4'
Ug
= 16'
and that corresponding to A V B is 1
u -1-
16'
1
3 U4
= 4,
which easily follow from the relevant systems, that we did not (for the sake of brevity) write down explicitly. An algorithm to check coherence of a lower probability assessment (again based, as in the case of conditional probability, on the concept of locally strong coherence) has been set out in [19] (and implemented in XLISP-Stat Language), showing also how to solve some relevant inferential problems (as those dealt with in the following Chapter 16). A (seemingly) similar procedure refers to "imprecise" probabilities, for which some authors require a very weak form of coherence, that is : the existence, given a family of imprecise assessments [a~, a~'], i = 1, ... , n, of at least a set {p1 , ... ,Pn}, with PiE [a~, a~'], constituting a coherent conditional probability. This concept, called coherent generalized probabilistic assessment, has been introduced by Coletti in [22], and has been (independently) considered also by Gilio in [71] and later (under the name of g-coherence) by Biazzo and Gilio [8]. For a relevant algorithm, see [9]. Let us now face the problem of coherent eztensions of lower conditional probabilities. Taking into account Theorem 9 and
CHAPTER 15
134
the results of Chapter 13, it follows that the coherent enlargement to a "new" event FIK of a lower conditional probability P, defined on a finite family of conditional events {.FiiKi}, is given by mjnPi(F.IK.), I
where Pi is the infimum with respect to a class IJi characterizing P in the sense of the just mentioned theorem, and F.IK. is the conditional event introduced in the final part of Chapter 13. Notice that, if there exists an index i and a family IIi of probabilities P~ such that P~(K) = 0 for every a, then P(FIK) = 0; otherwise the value of P(FIK) is obtained in the general case as the minimum of the solutions of n linear programming problems. (Analogous considerations could be easily rephrased for upper probabilities).
15.3
Dempster's theory
It is well-known that any lower (unconditional) probability P is a superadditive (or 0-monotone) function, that is, given two events A andB,
A 1\ B
=0
=*
P(A V B) 2:: P(A)
+ P(B) .
Notice that this is true also if we consider a conditional lower probability P( ·IK) relative to the same conditioning event K: this can be easily seen by resorting to Theorem 9. In fact, given AIK, BIK, (A V B) IK E C, let i and a be the indices such that P~(K) > 0 and
now, since A and Bare incompatible events, the right-hand side is the sum of two similar expressions relative to AIK and BIK, and
LOWER AND UPPER CONDITIONAL PROBABILITIES
135
then we get, taking into account the inequalities corresponding to the case j =1- i of the theorem, P(A V BIK) 2: P(AIK) + P(BIK) . On the other hand, a lower probability may not be 2-monotone, i.e. may not satisfy P(A V B) 2: P(A)
+ P(B)- P(A 1\ B),
as shown by the following Example 22 - Given a partition {A, B, C, D} of probability assessments
Pr(A) = 0, Pr(B) = Pr(C) = P2 (A) =
1
4 , P2(B)
n,
consider two
1
2 , P1(D) = 0, 1
4,
= 0, P2(C) =
g(D) =
1
2,
on the algebra generated by these four events. The lower probability obtained as lower bound of the class {P1 , P2 } has, in particular, the following values
P(A)
1
1
1
= 0, P(A V B) = 4 , P(A V C) = 2 , P(A VB V C) = 2 ,
so that 1
2 = P(A VB V C) < P(A V C)+ P(A V B) -
P(A) =
1
1
2+ 4 -
0.
Obviously, all the more reason a lower probability is not necessarily an n-monotone function, that is n
P(Al V ... V An) 2:
L P(Ai)- L P(Ai 1\ Aj) +. ·. i=l
i<j
Lower and upper probabilities induced by multivalued mappings were introduced by Dempster in [55]. In particular, these lower
136
CHAPTER 15
probabilities, which are n-monotone for any n E IN have been called by Shafer [116] belief functions. We will not deal with the theory of belief functions in this book, except for the discussion in Chapter 18 of a classical example (cf. Example 38) which is claimed (in [117]) as not being solvable without resorting to belief functions. Our aim will be to show instead that lower and upper conditional probability are a useful tool to find a simple probabilistic solution for this kind of examples, in a way that fits and encompasses the solution obtained via the belief function approach. For the sake of interested (and not acquainted) reader, we will recall (in that Chapter) the main definitions (for details, see [116]) concerning belief functions and Dempster's rule of combination.
Chapter 16 Inference 16.1
The general problem
We refer to an arbitrary family 1£ = {HI! H 2 , ••• , Hn} of events (hypotheses), i.e. 1£ has neither any particular algebraic structure nor is a partition of the certain event n. We detect logical relations among the given events (the latter could represent, e.g., some possible diseases), and some further information is carried by probability assessments, relative to an event E (e.g., a symptom) conditionally to some of the Hi's ("partiallikelihood"). If we assess (prior) probabilities for the events Hi's, ensuing problems are: (i) Is this assessment coherent? (ii) Is the partial likelihood coherent "per se"?
(iii) Is the global assignment (the initial one together with the likelihood) coherent? If the relevant answers are all YES, we may try to "update" (coherently) the priors P(Hi) into the posteriors P(HilE). This is an 137 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
138
CHAPTER 16
instance of a more general issue, dealt with in Chapter 13: the problem of coherent eztensions. A very particular case is Bayes' updating for exhaustive and mutually exclusive hypotheses, in which this extension is unique. In the general case the lack of uniqueness gives rise to upper and lower updated probabilities, and we could now update again the latter, given a new event F and a corresponding (possibly partial) likelihood. In this Chapter we discuss many relevant features of this problem (keeping also an eye on the distinction between semantic and syntactic aspects). To start with, the first "natural" step is that of detecting all possible logical relations among the given events: in fact, as we have seen in Chapter 4, Remark 3, if there are no logical relations among the n events (that is, if the number of the relevant atoms equals 2n), any assessment (with values between 0 and 1) is coherent. This result has been extended to conditional events in [73]. Obviously, logical relations reduce the number of relevant atoms. Further information is carried by suitable conditional probability assessments, relative to an (observable or observed) event E (that could possibly be, with reference to the medical problem, a symptom or the evidence coming from a suitable test) conditionally to some of the Hi's, or to events obtained from them through elementary logical operations. We shall call "partial likelihood" this set of conditional probabilities. Going back to the previous questions (i}, (ii}, (iii}, some remarks are now in order: • is the initial assessment coherent? The syntactic counterpart of this concept is the requirement that the function P defined on the set 1l can be extended as a probability on the minimal algebra generated by 1l (see Chapter 4: we just need to check the compatibility of system (4.1)); • is the partial likelihood coherent "per se"? Notice that usu-
INFERENCE
139
ally likelihoods come from observed frequencies: if we refer just to a single conditional event, its probability can be assessed by an observed frequency in the past (since a frequency is a number between 0 and 1, and this is a necessary and sufficient condition for coherence when only a single conditional event is considered). But things are not so easy when further events (conditional or not) are involved, since consistency problems (coherence!) must then be taken into account, due to the circumstance that the relevant conditioning events are not, in this general case, incompatible (cf. Theorem 5). Coherence for conditional assessments has been the subject of Chapter 11; • is the global assignment (the initial one together with the likelihood) coherent? As it will be shown in the following two examples, the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment were not. So it is better, in these circumstances, to avoid a hasty Bayesian updating of probability assessments (since it can lead to wrong conclusions) and to resort instead to a direct check of coherence.
16.2
The procedure at work
Example 23 (continuing Examples 4 in Chapter 2, and Example 8 in Chapter 4) - We recall that a doctor had considered three possible diseases H 1 , H 2 , H 3 with the logical condition H 3 c Hf 1\ H 2 , giving the coherent assessment
(see the two aforementioned examples). The doctor considers now the event E = "pressing in particular points of the abdomen does not increase pain", and he gives the
140
CHAPTER 16
following relevant logical and probabilistic information
E
1\
Ha
= 0 , P(EIH2) = ~, P(EIH~) = ~.
Obviously, the latter assignment is coherent, since it refers to a {trivial) partition {with respect to the conditioning events). If we update the {prior} probability of H 2 by means of the above likelihood {through Bayes' theorem}, we get
This new probability of H 2 is coherent with the previous probabilities of H 1 and Ha. To prove that, consider the atoms obtained when we take into account also the new event E :
= A1 1\ Ec , B1 = A4 1\ Ec ,
= A2 1\ Ec, Bs = As 1\ E ,
B4
= Aa 1\ Ec , Bg = As 1\ Ec .
Bs
B6
To check coherence we consider the following system, with unknowns
Yi = P(Bi) Y1
+ Y2 + Y4 + Ys =
Y1 = Y1
i
+ Ya =
~ (YI
~
+ Y2 + Ya + Ys)
9
LYi= 1 i=l
Yi 2 0. It is easily seen that the system (So) has {infinite) solutions and, since there are also solutions such that
Y1
+ Y2 + Ya + Ys > 0 ,
INFERENCE
141
this is sufficient to ensure that the assessment is coherent. This is true even if we take into account the updating of the probability of H 3 , that is P(H3 IE) = 0: in fact this corresponds to ignoring the second equation of system (80 ). But to consider this assessment as an updating of the previous one can be a too hasty (and wrong) conclusion, since the value of P(H2 IE) has been obtained by considering in fact as "prior" the assessment
P(Hc) 2
= ~5'
and not that actually given by the doctor, which involves also the evaluation of P(H1 ) and P(H3 ). The updating of that assessment obviously requires that the "whole" prior and the likelihood must be jointly coherent. Instead in this case coherence does not hold: considering indeed the following system
+ Y2 + Y4 + Y5 = ~ Y1 + Y3 + Y4 + YB + Y1 =
Y1
*
Y1 = Y1 + Y3 Y2
k
= ~ (YI + Y3 + Y4 + YB + Y1)
+ Ys =
HY2
+ Y5 + Ys + Y9)
9
LYi
=1
i=l
Yi 2': 0 ,
simple computations {solving for Y1 +y3 the fourth and the second eq. and inserting this and the third eq. into the second one) show that it does not admit solutions, so that the assessment is not coherent. The following example shows that even the "local" coherence of prior and "pseudoposterior" obtained in the previous example was just accidental.
142
CHAPTER 16
Example 24 - A patient feels a severe back-ache together with lack of sensitiveness and pain in the left leg; he had two years before a lung cancer that was removed by a surgical operation. The doctor considers the following exhaustive hypotheses concerning the patient situation: H 1 =crushing of L5 and Sl vertebrae, H 2 =rupture of the disc, H 3 =inflammation of nerve- endings, H 4 =bone tumor. The doctor does not regard them as mutually exclusive; moreover, he assumes some logical relations:
H1 Hf
A A
H4 A (H1 V H2 V H3) = 0, H2 A H3 = 0 , H1 A H~ A H~ = 0 , H2 A H~ = 0 , Hf A H~ A H3 = 0 .
Correspondingly, we have only the four atoms
A3 = Hf
A
H2
A
H3
A H~
,
A4 = Hf
A H~ A H~ A
H4 .
The doctor makes the following probabilistic assessments
Its coherence is easily checked by referring to the usual system with unknowns Xr = P(Ar), which has a unique solution 1 Xl
= 12 '
X4
=
1
2.
Let now E be the event E = an X-ray test is sufficient for a reliable and decisive diagnosis so that
INFERENCE
143
The doctor assigns the likelihood P(EIHI) =
~,
P(EIHf) =
~.
If we update the (prior) probability P(H1) by the above likelihood through Bayes' theorem, we get P(H1IE) = ~- But now (contrary to the situation of Example 23) this updated probability of H 1 is not coherent with the given probabilities of H 2 and H3 • Notice in fact that the atoms obtained when we take into account the new event E are exactly those generated by the events Hi, so that to check coherence we need to study the solvability of the system, with unknowns Xr = P(Ar), XI = X4
~(x1
+ x4)
= ~
=~ X2 + X3 = I52
XI+ X3
4
LXi = 1 i=l
But the first two equations give system is inconsistent.
X1
~ , hence x 3
< 0, so this
The circumstance that the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment are not, cannot occur in the usual case where Bayes' theorem is applied to a set of ezhaustive and mutually ezclusive hypotheses: this is clear by Theorem 5 (Chapter 11). In fact, looking at the systems (Sa) introduced in Chapter 11 to characterize coherence, each equation (corresponding to the "product rule" of probability) is "independent" from the others, since the events Hi's have no atoms in common, and so each equation (and then the system) has trivially a solution.
144
CHAPTER 16
When the answers to the previous questions (i), (ii), (iii) are all YES, the next aim is to face the problem of "updating" the priors P(Hi) into the posteriors P(HiiE). In general, the problem of coherent extensions can be handled by Theorem 6 of Chapter 13: if C is a given family of conditional events and P a corresponding assessment on C, then there exists a (possibly not unique) coherent extension of P to an arbitrary family Q of conditional events, with g 2 C, if and only if the assessment P is coherent on C. Since the lack of uniqueness gives rise to upper and lower updated probabilities, to go on in the updating process (i.e., to update again the "new" conditional probabilities - possibly upper and lower - given a new event F and a corresponding - possibly partial-likelihood) we must resort to the general Theorem 9 (Chapter 15) characterizing upper and lower probabilities. This will be shown in the final part of the following example (which shows also that, if coherence of the "global" - i.e. prior and likelihood together - assessment holds, it is possible to update (prior) probability by Bayes' rule - also in situations in which the given events are not mutually exclusive - by resorting to the partitions given by the relevant atoms). Example 25 - A patient arrives at the hospital showing symptoms of choking. The doctor considers the following hypotheses concerning the patient situation:
H1 =cardiac insufficiency, H 2 =asthma attack, H 3 = H 2 A H, where H = cardiac lesion. The doctor does not regard them as mutually exclusive; moreover, he assumes the following natural logical relation: Correspondingly, we have the atoms
INFERENCE
145
A4 = H 1 1\ H~ 1\ H~ , As = Hf 1\ H~ 1\ H~ . The doctor makes the probability assessments
P(H1) =
1
1
2 , P(H2) = 3 , P(H3)
=
1
5 , P(H1 V H2)
=
3
5.
(16.1)
Its coherence is easily checked by referring to the usual system with unknowns Xr = P(Ar), which has a unique solution 1
XI
= 5'
X2
1 30 '
=
X3
=
1 10 '
2
4 X4
= 15 '
Xs
= 5.
Let now E be the event E = taking medicine M against asthma does not reduce choking symptoms . Since the fact E is incompatible with having asthma attack (H2 ), unless the patient has cardiac insufficiency or lesion (recall that H 3 implies both H 1 and H 2 }, then
H2
1\
Hf
1\
E = 0.
The doctor now draws out from his database the ''partial likelihood" P(E!HI)
=
3 10 ,
(16.2)
Then the process of updating starts by building the new atoms
B4 = A4 1\ Ec , B7
= A2 1\ E
,
Bs = As 1\ Ec , Bs
= A4 1\ E
,
B6 = A1
1\ E ,
Bg = As 1\ E ,
and to check coherence we need to consider the usual system (Sa) with unknowns Yi = P(Bi) and whose first six equations come from {16.1} and {16.2). Given .X and J1, with 7
7
-<11.<60- !'-"'- 30'
7 4 --1/.<.X<--11. 30 !'-"'- - 15 !'-"''
146
CHAPTER 16
a solution is 19 Y1 = 60 - J-t , Y2
Ys =
2
5- J-t ,
= ,\ + J-t 7
7
30 , Y3
=
1
10 , Y4
4
Y6 = J-t- 60 ,
Y1 = 15 - ,\- J-t ,
=
4
15 - ,\ ,
Ys = ,\ ,
Y9 = J-t •
Then "global" coherence {i.e., referring to prior and likelihood jointly) allows Bayesian updating of P(HilE) , i = 1, 2, 3. Notice that the event E must be looked on as an assumption, not as an observation {recalling Koopman's terminology concerning assumed and acquired propositions: see Chapter 2}, since we could, of course, have considered Ec. Going back to the event E, we lack the usual representation ("disintegration" formula) for its probability 3
P(E) =
L P(Hi)P(EIHi) ' i=l
since the hypotheses do not constitute a partition. On the other hand P(E) can be obtained by summing the probabilities of the atoms contained in E, that is
3
P(E) = P(B6)+P(B7)+P(B8 )+P(Bg) = YB+Y1+Ys+Y9 = 20 +p,, and so it ranges in the interval [ 1~, ~~ ]. Therefore, by straightforward computations, we have
9
9
-23 -< P(H1JE) < - -16 , while the evaluation of P(H2IE) requires (since the likelihood P(EIH2) is not given) resorting to
P(H2 A E)= P(B6)
+ P(B1) =
3 20 -,\,
to be divided by P(H2 ), so that, finally, we obtain 9 0 :::; P(H2IE) :::; 20 .
INFERENCE
147
Clearly,
E= {
:3' o, 0}'
P= {
:6' :o' ;3}
are, respectively, a {coherent) lower and a (coherent} upper conditional probability of Hi lE (i = 1, 2, 3). Now we could go on again by updating the upper and lower conditional probabilities P(Hi!E) and P(Hi!E), assuming that is given, for example, the new event F = taking the medicine M against asthma increases tachycardia with
0 =/= E 1\ F , H2 1\ Hf 1\ F = 0 . Precisely, we are looking for a coherent assessment of P(Hi!E 1\ F) and P(HiiE 1\ F) (i = 1, 2, 3). Obviously, the doctor has in the data base some relevant likelihood: however (as it has been previously discussed} the checking of coherence cannot proceed as before, since we start now with an upper and lower probability . So we must apply now Theorem 9. The doctor draws out from his database the likelihood
The process of updating goes on by singling out the new atoms: in fact we need building only those contained in E, that is
~=~1\P,~=~I\P,~=~I\P,~=~I\P.
To check coherence of the whole assignment (as lower probabilities) we start by considering the system (S~) with unknowns x~ = PJ-(Cr) and related to the conditional events
CHAPTER 16
148
in the given order (so that, by referring to the notation of the just mentioned theorem, the first conditional event corresponds to FdK11 and so on},
8 1 > 0"" x~ x1 - ~ z
i=1 8
x~ + x~ +
xg + xA ~ 0 L xi i=1
xi1 + xi2 +xi3-- 8'!..(xiI + xi2 +xi3 +xi5 +xi6 +xi) 7 x1I +xi2-- 6.!! (xi1 + x12 + xi5 + xi) 6 8
ExJ = i=I
xf
1
~ 0.
Among all possible solutions there is, given J-t with 14 0< u<- ,.-23'
the following one 1
x3 = I
x7
1
9
= 8 · 23
I
'
x8
7
9
8 . 23
I
' x4 = J1 '
14
= 23
- J-t.
Therefore to prove coherence it is sufficient to check the compatibility of the next system, with unknowns Yi = Pf (Cr),
y} + y~ = ~ (y} + y~ + yJ + y~) { y} + y~ + yJ + y~ = 1
y}
~ 0.
149
INFERENCE
In fact, due to the above choice of the solution of (S~), there is no need to consider the systems (S~) with i > 1. In conclusion, we proved that, given the family
the corresponding assessment
{9 0 0 7 23 ' ' ' 8'
~6}
is a coherent lower probability. We are now going to prove that the assessment 7 9 7 5} {9 16 ' 23 ' 20 ' 8 ' 6 is a coherent upper conditional probability for the same family of conditional events. We need to consider a system which is ("mutatis mutandis") the analogue of (s~) for lower probabilities, that is
8
u 11
< -
.l 23 ""u~ L..J ~ i=1 8 "" 1 + u51 + u61-< 9 20 L..J ui
1+ 1 u1 u2
i=1
u~ + u~ + u~ = ~ (u~ + u~ + u~ + u~ u 11 + u 2 1 -- 6 §. (u 11 + u 21 + u 51 + u 61)
+ u~ + u~)
8
L:ui = 1 i=1
uf
~ 0,
which has a solution such that 1 u1
7
= 23
'
1 u2
2
= 69
'
1 u3
61
= 384
1
' u5
1
+ u6 =
1 15 '
CHAPTER 16
150 1
u4
1
+ Us =
7
7
1
16 ' u7 = 1920 .
This is a solution also of the system obtained from the previous one changing the second inequality into an equality. Then it remains to be proved that has a solution the system
s 3 u1
3
3 u5
3_9""'3 u6 - 20 ~ ui i=1
+ u2 + + u31 + u32 + u33_- sI (u31 + u32 + u33 + u35 + u36 + u3) 7 u~ + u~ = ~ (u~ + u~ + u~ + u~) s
:Eu~ = 1 i=1
u~
2: 0.
It is easily seen that there is a solution such that 3 u1
=
0
'
u3 _ ~
3
8 '
2 -
3
u4
u3 3
= 2
3
20 '
+ Us = 5 '
3 u5
3_0
u7 -
3
+ u6 =
3
40 '
.
Now it is possible to go on by introducing a new conditional event and checking its coherence (as briefly discussed at the end of Section 15.2): the relevant range is a suitable closed interval.
Remark 14 - In the previous example, two among the values of the updated lower probability P of the Hi's were equal to zero. To go on in the updating process, these values have been taken as new "prior" assignments: then it is of paramount importance (also from a practical point of view) to have a theory privileging the possibility of managing conditioning events of zero probability (since they may appear in the relevant likelihood}.
INFERENCE
16.3
151
Discussion
Notice that an important syntactic consequence of our choice (to deal only with those "imprecise" probabilities 'P and 'P' arising as coherent extensions, so that they are lower and upper probabilities) is the following: since the relevant enveloping probability distributions (those singling-out lower and upper probabilities) are unique, there is no ambiguity concerning the information "carried" by 'P and 'P' (see the discussion at the end of Chapter 6). On the other hand, we prefer to rely more on the syntactic aspects than on the semantic ones, so avoiding any deepening of vague statements such as "losing" or "carrying" information, which are not clearly and unambiguously interpretable, especially in the framework of the so-called "imprecise" probabilities. For example, does it carry more information a precise assessment p, with p = .5, or an imprecise one [p', p"], with p' = .8 and p" = .95? If this question had any essential significance, we would prefer - in this case - an "imprecise" conclusion (since it looks more "informative"). Summing up, the procedure applied to the previous specific examples (to handle uncertainty in the process of automatic medical diagnosis) can be put forth in general, as expressed in the next Theorem 10. First of all, we need to consider the following starting points: • consider a family of hypotheses (that is, events Hi (with i = 1, 2, ... , n) represented by suitable propositions) supplied by You: they could explain a given initial piece of information referring to the specific situation. No structure and no simplifying and unrealistic assumption (such as mutual exclusiveness and exhaustivity) is required for this family of events; • detect all logical relations between these hypotheses, either
152
CHAPTER 16
already included in the knowledge base, or given by You on the basis of the specific situation; • assess probability of the given hypotheses. Clearly, this is not a complete assessment, since these events have been chosen by You as the most natural according to your experience: they do not constitute, in general, a partition of the certain event 0, and so the extension to other events of these probability evaluations is not necessarily unique. • refer to a data base consisting of conditional events ElK and their relevant probabilities P(EIK), where each event K may represent a possible information which is in some way related to the given hypotheses Hi, while each evidence E (regarded as assumed) is an event coming as the result of a suitable evidential test. These probabilities could have been obtained by means of relevant frequencies and should be recorded in some files. Then, once this preliminary preparation has been done, the first step of our procedure consists in building the family of atoms (generated by the hypotheses H 1 , H 2 , ... , Hn): they are a partition of the certain event, but they are not the "natural" events to which You are willing to assign probabilities. Nevertheless these atoms are the main tool for checking the coherence of the relevant assessment: in fact coherence amounts to finding on the set of atoms (by solving a linear system) a probability distribution (not necessarily unique) compatible with the given assignment. If the assessment turns out not being coherent, You can be driven to a different assignment based on the relevant mathematical relations contained in the corresponding linear system. Another way-out is to look for suitable subfamilies of the set {H1, H2, ... , Hn} for which the assignment is coherent, and then proceed by resorting to the extension theorem. On the contrary, coherence of the probabilities P(Hi) allows to
INFERENCE
153
go on by checking. now the coherence of the whole assessment including also the probabilities P(EIK). This requires the introduction of new atoms, possibly taking into account all logical relations involving the evidences E and the hypotheses Hi. In particular, some of the latter may coincide with some K. As the previous examples have shown, the whole assignment (prior probabilities and likelihood) can be incoherent even if the two separate assessment were not. On the basis of the results obtained by means of the evidential tests, You can now update the probabilities of the hypotheses Hi, i.e. You assess the conditional probabilities P(HiiE). Then You need to check again coherence of the whole assessment including the latter and the former probability evaluations. When prior probabilities and likelihood are jointly coherent, You can get formulas representing each posterior probability (of an hypothesis Hi given an evidence E) by Bayes' theorem P(H·IE) = P(Hi)P(EIHi) ' P(E) ' but the denominator P(E), with P(E) > 0, cannot by computed by the usual "disintegration" formula n
P(E) =
L P(Hi)P(EIHi) ,
(16.3)
i=l
since the Hi's are not a partition. Nevertheless we can express P(E) in terms of the atoms, but this representation is not unique, since the corresponding linear system may have more than just one solution: computing upper and lower bounds of P(E) we get, respectively, lower and upper bounds for the posterior probabilities P(HiiE). In conclusion Theorem 10 - Let 1l = {H11 ••• , Hn} be an arbitrary set of events {"hypotheses") and {P(H1), ... , P(Hn)} a coherent assessment ("prior" probabilities). Given any event E ("evidence"), a
CHAPTER 16
154
set of events IC = {K 11 ••• , Km} {possibly Ki = Hi for some j and i) and the relevant coherent assessment {P(EIKI), ... , P(EIKm)} {"likelihood"), then there exists a (not necessarily unique) assessment {P(HdE), ... , P(HniE)} (''posterior" probabilities) if and only if the global assessment
is coherent as well. In particular, if Ki = Hi for some j and i , denote by Ar the atoms generated by 1lUICU{E} and by P the family of conditional probabilities extending the global assessment also to the events Hi lE (i = 1, ... , n); if inf P(Ar) > 0, then P(HiiE) E [p',p"],
L
'P Ar<;;E
where
If inf
L
P(Ar)
= 0, thenp' = 0 and p" = 1.
'P Ar<;;E
The latter assertion corresponds to condition (B) of Chapter 13 (see Theorem 7). In the next Section we discuss also the case P(E) = 0 (possibly allowing also P(H) = 0 ). Now, given a new event F and a corresponding (possibly partial) likelihood, the checking of coherence proceeds with priors (the "old" posteriors) that are possibly upper and lower probabilities. The relevant algorithm and implementation (including significant examples) are discussed with all details in (19]. Here also a significant role is played by the concept of locally strong coherence (cf. Section 14.2).
155
INFERENCE
16.4
Updating probabilities 0 and 1
A commonplace in the literature on Bayesian inference is the one stating that, if a prior discrete distribution (for example, the probability on an event H) is equal to zero, this should inevitably be true (by Bayes' theorem) also for the posterior (for example, for P(HIE), if E is an event representing the result of an experiment): so any updating is considered impossible in this case. On the other hand, the consideration (and the comparison) of null probabilities should be the usual situation in statistical inference, because most things that happen had zero probability. We believe that the role of null probabilities is one of the most subtle and most neglected among all the problems of statistical inference. Nevertheless, even in the case of absolutely continuous distributions, the use of standard mathematical notions not conveying a proper statistical meaning may be questionable : in fact the main tool is - in the usual (countably additive) setting - that of density, despite its dependence (as a Radon-Nikodym derivative: see also our discussion in Section 18.3) on the knowledge of the whole distribution, entailing also a violation of the so-called likelihood principle. On the other hand, we have seen (for example, in Remark 12 of Chapter 11 and at the end of Example 18 of Chapter 12) that probabilities equal to 1 can be updated, and so the same must (obviously) be true for probabilities equal to 0. In this Section we will deepen some of the aforementioned aspects (preliminary results have been presented in 1997 at the ISBA conference [37]), with references to the simplest form of Bayes' theorem, involving only two (unconditional) events, the "hypothesis" Hand the "evidence" E, which can be written, when P(E) > 0, in the form P(HIE)
= P(H)P(EIH) P(E)
.
CHAPTER 16
156
So we have a family C with four conditional events and the relevant assessments P(HIO) = P(H) (the prior), P(EIO) = P(E) (the probability of the evidence, looked on as "assumed", even if "acquired"), P(EIH) (the likelihood), and P(HIE) (the posterior). Correspondingly, we have the four atoms
A1
= E 1\ He,
A2 = E 1\ H,
= Ee 1\ H,
A3
A4 = Ee 1\ He.
To study all possible coherent assessments, we resort to Theorem 4: the first system is
x2 + X3 = P(H)(x1 + x2 + X3 + x4) x1 + x2 = P(E)(x1 + x2 + X3 + x4) x2 = P(EIH)(x2 + x3) x2 = P(HIE)(xl X1
+ x2)
+ X2 + X3 + X4 = 1
Xr;:::
0.
Of course, we will not deal with the trivial case P(E)P(H) > 0, so that we will consider the following three situations
= 0;
• (1)
P(H) > 0, P(E)
• (2)
P(H) = 0 , P(E) = 0;
• (3)
P(H) = 0, P(E) > 0.
(1) Evidence has zero probability (and P(H) > 0) Since P(H) > 0, then P(E) = 0 if and only if P(EIH) system (So) becomes
x2 + X3 = P(H)(x1 + x2 + X3 + x4) X1 + X2 = 0 · (xl + X2 + X3 + X4) x2 x2 X1
= 0 · (x2 + x3) =
P(HIE)(xl
+ x2)
+ X2 + X3 + X4 = 1
Xr;:::
0,
= 0; so
157
INFERENCE and we get XI = x 2 second system is
(SI)'
= 0,
X3
= P(H),
x3
Y2 = P(HIE)(yi { YI + Y2 = 1 Yr ~ 0;
+ x4 =
1, so that the
+ Y2)
it follows easily that the posterior P(HIE) can take any value Y2 E [0, 1]. A noticeable consequence of this result concerns the so-called Jeffreys-Lindley paradox, which refers to the Bayesian approach to the classical problem of testing a "sharp" null hypothesis: it goes back to the pioneering work of H. Jeffreys [85] and D. Lindley [95], and it is regarded as a controversial issue, since a sharp null hypothesis may be rejected by a sampling-theory test of significance, and yet a Bayesian analysis may yield high odds in favor of it. (A simple resolution in terms of "vague" -qualitative- distributions through the concept of pseudodensity [110] has been given in [74]). The problem is the following : suppose that the hypothesis H0 = {0 = 00 } (concerning the value of an unknown parameter) is singled-out to be tested, since it is in some way special, against the alternative HI = {0 =/:. 00 } , on the basis of a measurement x of a random variable X (usually a Gaussian density, with unknown mean 0). In the usual Bayesian approach, the prior distribution 1r for 0 assigns a "lump" of probability 7r0 > 0 to the null hypothesis Ho, while the "remainder" 7ri ( 0) of the prior distribution on HI is given a suitable absolutely continuous distribution. A straightforward use of Bayes' theorem leads to a posterior ratio
P(Holx)
P(Hiix) (for details, see [74]) which can take on, for a sufficiently large prior variance, any arbitrary large value, whatever the data and whatever
CHAPTER 16
158
small is 1f0 > 0. We have already pointed out (at the beginning of this Section) the objections that can be raised against arguments based on "improper" mathematical tools, so it is not surprising that they may lead to paradoxical conclusions. Nevertheless, the previous computations in terms of coherence show that it does not make sense to give Ho a positive probability, pretending - on the basis of the evidence { E = x} - to draw conclusions (by Bayes' theorem) on the posterior P(Ho!E), since the latter can take - coherently - any value in [0, 1] , independently of the distribution of all other hypotheses (constituting the event HI). Notice that we have anyway, for the relevant "partial" likelihood, the value P(E!Ho) = 0. A further understanding can be reached by the study of the second case
(2) Prior and evidence both have zero probability The first system, (So)'', in this case gives easily XI + x 2 = 0 , i.e. the solution XI = x 2 = x 3 = 0 , x 4 second system becomes
XI
+ x3
= 0,
= 1 , and the
= P(H!E)(YI + Y2) Y2 = P(E!H)(Y2 + Y3) YI + Y2 + Y3 = 1
Y2
(SI)''
Yr
~
0.
If P(EIH) = 0 (so that y2 = PI(E/\H) = 0 ), we may have different solutions of (SI)": in fact, recalling that P(H) = 0, and hence that for the zero-layer of H we have o(H) > 0, the different solutions may correspond to different choices of o (H) . Take o(H) = 1: this means y2 +y3 > 0 (but recall that y2 = 0 ), and we have a solution with YI + y 2 = 0 (and so o(E) = 2) and
159
INFERENCE y3 = 1 ; then the third system is
(S2)"
z2 = P(HIE)(zl + z2) { Z1 + Z2 = 1 Zr ~
0,
that is P(HIE) = z2 : the posterior can take any value in
[0, 1]. ·Again: the evidence E, with o(E) = 2, has no influence on H, with o(H) = 1 (notice that in (1) we found, analogously, that E, with o(E) = 1, has no influence on H, with o(H) = 0 ). Still assuming o(H) = 1 (i.e. Y2 + y3 > 0 ), another solution of (SI)" is clearly, for 0 < A < 1, Y1 = A, Y2 = 0, y3 = 1- A, which gives P(HIE) = 0 (notice that now o(E) = 1, and the posterior is no more arbitrary). Is this zero posterior "more believable" than the zero prior? We have
o(HIE) = o(H A E) - o(E) = 2 - 1 = o(H) = 1 , that is prior and posterior are on the same layer (but a further comparison could be done "inside" it: see below, under (3)). Consider now the case o(H) = 2: this means, by (S1 )", that y2 = Ya = 0, and so Y1 = 1; it follows P(HIE) = 0 and o(E) = 1. The third system is
(S2 )"'
z2 = 0 · (z2 + za) { Z2 + Z3 = 1 Zr ~
0,
then z2 = 0, z 3 = 1 (it follows o(H A E)= 3 ). Now, a reasonable prior assumption, to distinguish a "sharp" null hypothesis Ho to be tested against the alternative H 1 ::/=Ho, is to choose o(H0 ) = 1 and o(HI) = 2. As we have just seen, we get P(HoiE) = P(H1 IE) = 0, and to compare these two zeros consider
o(HoiE) = 2 - 1 < o(H1IE) = 3 - 1;
160
CHAPTER 16
then the zero posterior P(HoiE) is "more believable" than the zero posterior P(H1IE) . Going on in taking into account all possible combinations of the probability values, we consider now the case P(EIH) > 0: the system (SI)" gives easily, putting a = P(EIH) and b = P(HIE) , with a + b > 0 , a unique solution YI
=
a(1 -b) a(1- b)+ b ' y 2
=
ab a(1- b)+ b' Ya
=
b(1 -a) a(1- b)+ b ·
Since P(H!E)
= y2 + Ya P(E!H) , Y1
+ Y2
the latter equality can be written (if (y2 + y3 )(y1 + y2 ) > 0, that is, if o(H) = o(E) = 1) as P(H!E) = P1(H) P(E!H) P1(E) .
It follows that P(HIE) > 0 if and only if P(EIH) > 0 (even if P(H) = 0 ). In conclusion, since P1(H) P(H!HV E) g(E) - P(EIH V E) '
the values of the posterior P(HIE) fall in a range which depends on the "ratio of the two zero probabilities P(H) = P(E) = 0".
(3) Prior has zero probability (and P(E) > 0) The system (So) gives easily: x 2 + x 3 = 0, x 1 = P(E), and x 2 = P(EIH) (x2 + x 3 ) • It follows P(HIE = 0, and the second system is (S1 )'"
Y2 { Y2
= P(E!H)(Y2 + Ya) + Ya
Yr ~ 0,
= 1
INFERENCE
161
so that P(EIH) = y2 , with o(H) = 1, o(E) = 0, while
Y2
arbitrary in [0, 1]. Notice that
1 if P(EIH) > 0 o(H 1\ E)= { 2 if P(EIH) = 0. It follows 1 if P(EIH) > 0 o(HIE) = o(H 1\ E) - o(E) = { 2 if P(EIH) = 0 . This means that, if the likelihood is zero, the posterior is a "stronger" zero than the zero prior; if the likelihood is positive, prior and posterior lie in the same zero-layer, and they can be compared through their ratio, since Bayes' theorem can be given the form P(HIE) P(H)
P(EIH) P(E) .
Among the results discussed in this Section, we emphasize that priors which belong to different zero-layers produce posteriors still belonging to different layers, independently of the likelihood.
Chapter 17 Stochastic Independence a Coherent Setting
• Ill
As far as stochastic independence is concerned, in a series of papers ([28], (29], (33], (36]) we pointed out (not only for probabilities, but also for their "natural" generalizations, lower and upper probabilities) the shortcomings of classic definitions, which give rise to counterintuitive situations, in particular when the given events have probability equal to 0 or 1. We propose a definition of stochastic independence between two events (which agrees with the classic one and its variations when the probabilities of the relevant events are both different from 0 and 1), but our results can be extended to families of events and to random variables (see (123]). We stress that we have been able to avoid the situations - as those in the framework of classic definitions - where logical dependence does not (contrary to intuition) imply stochastic dependence. Notice that also conditional independence can be framed in our theory, giving rise to an axiomatic characterization in terms of graphoids ; and this can be the starting point leading to graphical models able to represent both conditional (stochastic) independence 163 G. Coletti et al., Probabilistic Logic in a Coherent Setting © Kluwer Academic Publishers 2002
CHAPTER 17
164
and logical dependence relations. This issue has been thoroughly addressed in (124], and so we will not deal here with conditional independence. Finally, we maintain that stochastic independence is a concept that must be clearly kept distinct from any (putative} formalization of the faint concept of "causality". For the sake of brevity, we shall use in this Chapter the loose terminology "precise" and "imprecise" when referring, respectively, to probabilities or to lower (upper} probabilities.
17.1
"Precise" probabilities
We start by discussing stochastic independence for precise probabilities. The classic definition of stochastic independence of two events A, B, that is P(A 1\ B)
= P(A)P(B) ,
may give rise to strange conclusions: for example, an event A with P(A) = 0 or 1 is stochastically independent of itself, while, due to the intuitive meaning of independence (a concept that should catch the idea that being A independent of B entails that assuming the occurrence of B would not make You change the assessment of the probability of A), it is natural to require for any event E to be dependent on itself. Other formulations of the classic definition are P(AIB) = P(A) and that are equivalent to the previous one for events of probability different from 0 and 1: actually, without this assumption the latter
CS-STOCHASTIC INDEPENDENCE
165
two formulations may even lack meaning, due to the usual definition of conditional probability P(EIH), which requires the knowledge (or the assessment) of the "joint" and "marginal" probabilities P(E A H) and P(H) , and the ensuing positivity of the latter. As widely discussed in previous Chapters, in our· approach conditional probability is instead directly introduced as a function whose domain is an arbitrary set of conditional events, bounded to satisfy only the requirement of coherence, so that P(EIH) can be assessed and makes sense for any pair of events E, H, with H ::/= 0; moreover, the given conditional probability can be extended (possibly not uniquely) to any larger set of conditional events preserving coherence. We recall a notation introduced in Chapter 2: given an event E, the symbol E* denotes both E and its contrary Ec; so the notation, for example, A*IB* is a short-cut to denote four conditional events: AIB' AIBC' ACIB' ACIBC.
Here is the definition of stochastic independence between two events:
Definition 13 - Given a set£ of events containing A, B, Ac, Be, with B ::/= n, B ::/= 0, and a coherent conditional probability P, defined on a family C (of conditional events) containing the set V= {A*IB*,B*IA*} and contained in£ x £ 0 , we say that A is cs-stochastically independent of B with respect to P {in symbols AJLcsB, that is: independence in a coherent setting) if both the following conditions hold:
{i) P(AIB) = P(A!Bc) ; {ii) there exists a class P = {Pa} of probabilities agreeing with the restriction of P to the family V, such that
where the symbol o(·l·) denotes the zero-layer of the relevant conditional event.
166
CHAPTER 17
Remark 15 - Notice that if 0 < P(AIB) < 1 {these inequalities imply also 0 < P(AciB) < 1} and if condition {i} holds (so that also 0 < P(AIBc) < 1 and 0 < P(AciBc) < 1}, then both equalities in condition {ii} are trivially (as 0 = 0) satisfied. Therefore in this case condition AJLcsB should coincide with the classic one : nevertheless notice that the latter would require the assumption 0 < P(B) < 1, so that our approach actually covers a wider ambit, since to give sense to the two probabilities under {i} the aforementioned assumption is not needed in our framework. If condition (i) holds with P(AIB) = 0, then the second equality under {ii} is trivially satisfied, so that stochastic independence is ruled by the first one. In other words, equality {i} is not enough to assure independence when both sides are null: it needs to be "reinforced" by the requirement that also their zero-layers (singled-out by the class {Pa}) must be equal. Analogously, if condition {i} holds with P(AIB) = 1 (so that P(AciB) = 0}, independence is ruled by the second equality under {ii). Example 26 - Going back to Example 6 (re-visited also as Examples 15 and 16}, consider A = H 2 1\ S 1 and B = H 2 • Clearly, P(AIB) = P(AIBc) = 0; we seek now for the relevant zero-layers. Since the atoms generated by A and Bare A1 = AI\B, A 2 = Aci\B, A 3 = A c A Be , it is easy to check that every agreeing class gives the zero-layers the following values o(AIB)
=1,
o(AIBc)
= +oo.
Therefore A is not cs-independent of B, a circumstance that makes clear the important role played by the equality of the two zero-layers: in fact A and B are even logically dependent/ The previous example points out the inability of probability (alone) to "detect" logical dependence, which parallels the inability of zerolayers (alone) to "detect" stochastic dependence (in fact, when 0 <
CS-STOCHASTIC INDEPENDENCE
167
P(AIB) =/= P(AIBc) < 1 these two conditional events are both on the zero-layer corresponding to a = 0 ) . Remark 16 - Spohn defines an event A stochastic independent of B {or B irrelevant to A) by means of a ranking function K, {see Section 12.2}, requiring
As a consequence, since events (conditional or not) of positive probability have rank 0 {except conditional events EIH with EAH = 0 ), all pair of events that are logically independent are also stochastically independent! Notice that this "unpleasant" circumstance may appear even if the considered events are not all of positive probability: take A and B such that P(AIB) = P(AIBc) = ~ ,
l,
with P(B) = 0. The usual procedure {based on the sequence of systems (Sa) introduced in Theorem 4) leads easily to
o(AIB) = o(A A B) - o(B) = 1- 1 = 0 =
= 0- 0 = o(A A Be) - o(Bc) = o(AIBc) . Since ranking functions have the same formal properties of zerolayers, it is possible to assign the same values obtained for the zerolayers to the ranks of the relevant events, so that A should be considered (according to Spohn's definition) stochastically independent of B. This is clearly counterintuitive: consider three tosses of a coin and take the events (cf. also Example 6) B A
= the coin stands in the first toss ,
= the coin shows heads in the tosses in which it does not stand ;
it is immediately seen that they satisfy the previous numerical assessment, while it is at all "natural" to regard them as being not independent.
CHAPTER 17
168
The following theorem shows that, given a coherent conditional probability satisfying condition (i} of Definition 13, condition (ii} either holds for all agreeing classes relative to the family V, or for none of them. This means that cs-stochastic independence is invariant with respect to the choice of the agreeing class. Theorem 11 - Given a set e of events containing A, B, Ac, Be, with B =I= n, B =I= 0, and a coherent conditional probability P, defined on a family C containing V= {A*IB*, B*IA*} and contained in ex let P(AIB) = P(AIBc) . If there exists a class P = { Pa} of probabilities agreeing with the restriction of P to the family V, such that
eo,
then this holds true for any other agreeing class. Proof - Consider first the case that A and B are not logically independent. This corresponds to one (at least) of the following three situations: A 1\ B = 0; A ~ B or B ~ A ; A V B = 0. So we split this first part of the proof in three steps, showing that, for any agreeing class, condition (ii} does not hold. - Suppose A 1\ B = 0: then P(AIB) = 0 and o(AIB) = +oo, while o(AIBc) =I +oo for any agreeing class. -If A~ B, then P(AIBc) = 0, and so also P(AIB) = 0. On the other hand, for any agreeing class, o(AIB) = o(A 1\ B) - o(B) =I o(AIBc) = o(AI\Bc) -o(Bc) = +oo. If B ~A, then P(AIB) = 1 = P(AIBc), but, for any agreeing class, o(AciB) = o(Aci\B) -o(B) = +oo =/= o(Ac 1\ Be) - o(Bc) . -Let A VB= 0: then Be~ A, so that P(AIBc) = 1 = P(AIB), but, for any agreeing class, o(Aci\B)-o(B) =/= o(Aci\Bc)-o(Bc) = +oo, since Ac 1\ Be = 0. Consider now the case that A and B are logically independent, and let
cl= A 1\ B' c2 = Ac 1\ B' c3 =A 1\ BC' c4 = Ac 1\ BC
169
CS-STOCHASTIC INDEPENDENCE
be the relevant atoms. If 0 < P(AIB) < 1, we already know (see Remark 15) that {ii) holds trivially (as 0 = 0) for any agreeing class. Putting now P(A!B) = P(A!Bc) = a (with a = 0 or 1), P(BIA) = /3, P(B!Ac) = 'Y, we have the following system, with Xr = P0 (Cr) 2:: 0 (r = 1, .. .4),
(So)
x1 = a(x1 X3 = a(x3 X1 = f3(x1
+ x2) + x4)
+ X3) X2 = 'Y(X2 + X4) X1 + X2 + X3 + X4
= 1
Let us consider the case a = 0 (the proof is similar for a = 1). We get, from the first two equations and the last one, x 1 = x 3 = 0, so that, substituting into the fourth equation:
X2
= 'Y
X4
,
=1-
'Y .
So, ifO < 'Y < 1, system (SI), with Yr = H(Cr), is
(St) {
Y1 Y1
= f3(YI
+ Y3)
+ Y3 = 1
which has the solution y 1 if 0 < f3 < 1 we have
o(A!B)
= {3,
y3
= 1- {3,
for any
= o(A 1\ B) - o(B) = 1 -
/3 E
(0, 1]. Now,
0=
= o(A!Bc) = o(A 1\ Be) - o(Bc) = 1 -
0.
This means that, if
0 < P(B) < 1,
0 < P(B!A) < 1,
then we have only one agreeing class, and this satisfies (ii). On the other hand, if f3 = 0 or f3 = 1, we have, respectively,
o(A!B)
=2-
0 =/= o(A!Bc) = 1 - 0
CHAPTER 17
170 or
o(AIB) = 1 - 0 -=/= o(AIBc) = 2 - 0, so that, if 0 < P(B) < 1,
P(BIA) = 0
0 < P(B) < 1,
P(BIA) = 1,
or then the respective (unique) agreeing classes do not satisfy {ii). Take now 'Y = 0: in this case x 2 = 0 (and x 4 = 1) and then system (81) becomes Y1 = O(yl + Y2) { (81) Y1 = f3(Yl + Y3) Y1 + Y2 + Y3 = 1 .
It has (for 0 < (3 :::;; 1) the solution y1 = 0, y2 the system (82) is, putting Zr = P2(Cr),
(82 ) {
= 1,
y3
= 0,
and so
= (3(z1 + z3) Z1 + Z3 = 1 Z1
whose solution is z1 = (3, z3 = 1 - (3. Then, if (3 < 1
o(AIB)
= 2-1-=/= o(AIBc) = 2-0.
Moreover, for (3 = 1 we have
o(AIB)
= 2-1-=/= o(AIBc) = 3-0,
so that, if
P(B) = 0,
0 < P(BIA) :::;; 1,
then the respective (unique) agreeing classes do not satisfy (ii). Now, for (3 = 0 we have
o(AIB)
= 3- 1 = o(AIBc) = 2 -
0,
and, going back to system (SI), we find also (if 0 :::;; .A < 1) the solution y1 = 0, Y2 = .A, Y3 = 1 - .A, so that, for 0 < .A < 1,
o(AIB)
= 2-1 = o(AIBc) = 1-0,
171
CS-STOCHASTIC INDEPENDENCE
.A= 0 we have the system (82 ) { z1 = O(z1 + z2) Z1 + Z2 = 1
while for
(S2 )
whose solution is z1 = 0, z 2 = 1. Then
o(AIB)
= 3-2 = o(AIBc) = 1-0.
Therefore, if
= 0,
P(B)
P(BIA)
= 0,
then all three agreeing classes satisfy {ii). Take now 1 = 1: so x 4 = 0 and the system (SI) is Y3 = O(y3
+ Y4)
(8t) { Y1 = f3(Yl + Y3) Y1 + Y3 + Y4 = 1 For 0 ~ {3 < 1 one solution is y1 need to consider the system (82 ),
(82 ) { z1 = {3(z1 Z1
+ Z3 =
whose solution is z1
=
0, y3
= 0,
Y4
=
1, and so we
+ z3) 1
= {3,
o(AIB)
z3
= 1- {3.
=2-
If 0 < {3 < 1 we have
0 =f. o(AIBc)
= 2- 1,
while for {3 = 0 we have
o(AIB) = 3 - 0 i= o(AIBc) = 2 - 1 . Then, if
P(B)
= 1,
0 ~ P(BIA) < 1 ,
then the relevant agreeing classes do not satisfy {ii}. Instead, for {3 = 1 we have
o(AIB)
= 2-
0 = o(AIBc)
=3-
1.
CHAPTER 17
172
Moreover, going back to system (81 ), for {3 = 1 we have also (for any,\ with 0 < ,\ ~ 1) the solution y 1 = ..\, Y3 = 0, Y4 = 1- ..\, so that, for 0 < ,\ < 1, o(AIB) = 1 - 0 = o(AIBc) = 2 - 1,
while for,\= 1 we have the system (82 )
(SB) { Z3 = O(z3 + z4) z3+z4=1 whose solution is z3 = 0, Z4 = 1. Then o(AIB) = 1 - 0 = o(AIBc) = 3 - 2.
In conclusion, if P(B)
= 1,
P(BIA)
= 1,
then all three agreeing classes satisfy (ii). • The following theorem puts under the right perspective how our definition avoids the shortcomings of classic definitions: stochastic cs-independence is stronger than logical independence.
Theorem 12 - Let A, B two possible events. If AJLc8 B, then A and B are logically independent. Proof- It is a simple consequence of the first part of the proof of the previous theorem. •
The next theorem characterizes stochastic cs-independence of two logically independent events A and B in terms of the probabilities P(B), P(BIA) and P(BIAc), giving up any direct reference to the zero-layers.
Theorem 13 - Let A and B be two logically independent events. If Pis a coherent conditional probability such that P(AIB) = P(AIBc), then AJLcsB if and only if one (and only one) of the following (a), (b), (c) holds:
173
CS-STOCHASTIC INDEPENDENCE (a) 0 < P(AIB) < 1;
{b) P(AIB) = 0 and the extension of P to B and BIA satisfies one of the three following conditions: 1. P(B) = 0, P(BIA) = 0, 2. P(B) = 1, P(BIA) = 1 , 9. 0 < P(B) < 1, 0 < P(BIA) < 1 ,
{c) P(AIB) = 1 and the extension of P to B and BIAc satisfies one of the three following conditions: 1. P(B) = 0,
P(B!Ac) = 0, 2. P(B) = 1, P(B!Ac) = 1, 9. 0 < P(B) < 1, 0 < P(BIAc) < 1,
Proof- Parts (a) and {b) follow easily from the second part of the proof of Theorem 11: refer in particular to (*), (**), (* * *). Part (c) can be proved in a similar way (it refers to the case a= 1, while part {b) corresponds to a = 0). A direct proof (without resorting to Theorem 11) is in [29]. •
Example 27 - To show the "operativity" of Definition 19, we discuss the following situation: consider two events A, B, with P(A) = 0 , P(B) =
1
4;
for instance, B = { 4n} nE IN , A = { 97, 44, 402} , which are disjunctions of possible answers that can be obtained by asking a mathematician to choose at his will and tell us a natural number n (cf. Remark 2 in Chapter 3). Assessing P(BIA)
= ~ , P(BIAc) = ~ ,
it turns out that BJLcsA does not hold, while, assessing P(AIB) = 0 = P(A!Bc) ,
174
CHAPTER 17
we have AJLcsB : in fact, we need to find the relevant zero-layers, and system (So) gives easily
while system (81 ) gives
so that o(AIB)
= o(AAB)-o(B) = 1-0 = o(AABc)-o(Bc) = 1 = o(AIBc).
Therefore AIB and AIBc are on the same zero-layer.
This lack of symmetry means (roughly speaking) that the occurrence of the event B of positive probability does not "influence" the probability of A; but this circumstance does not entail, conversely, that the occurrence of the "unexpected" (zero probability) event A does not "influence" the (positive) probability of B (see also the following Theorem 16). Since P(AIA) = 1 and P(AIAc) = 0 for any (possible) event A (even if P(A) = 0 or P(A) = 1), we have the following
Proposition 1 - For any coherent P and for any possible event A, one has •(AJLcsA), i.e. the relation Jlcs is irreflexive (any event is stochastically dependent on itself). Proposition 2 - For any coherent P and for any possible event B, one has nJLcsB and 0JLcsB. Proof- Trivial. •
Remark 17 - The conclusion of the previous Proposition is very natural, since the probabilities {1 and 0, respectively) of n and 0 cannot be changed by assuming the occurrence of any other possible event B.
CS-STOCHASTIC INDEPENDENCE
175
Conversely, we recall that Definition 13 of AlLcsB requires the "natural" conditions B =f:. and B =f:. 0 (since a conditioning event cannot be impossible): in fact n and 0 correspond to a situation of complete information (since the former is always true and the latter always false), and so it does not make sense asking whether they could "influence" the probability of any other event A. We point out that this is another instance (even if in a limiting case) of a lack of symmetry in the concept of independence.
n
Proposition 3 - Let P be a coherent conditional probability, and A, B two possible events. If AJLc8 B, then Ac Jlc 8 B, AJLcsBc, and Ac lLcsBc. Proof- Trivial. •
The following two theorems study the connections between our definition of stochastic independence and others known in the literature: Theorem 14 - If AJLc8 B, then P(AlB) = P(A). Conversely, assuming that P(B) < 1 and 0 < P(A) < 1, if P(AlB) = P(A), then AJLc8 B. Proof- Assume AJLcsB: clearly, the conclusion holds trivially, 1. Now, if P(B) < 1 (including also for any A, when P(B) P(B) = 0), P(AlB)
= P(AlBC) = P(A A BC) = P(A)[1- P(BlA)] . P(Bc)
1 - P(B)
it follows P(AlB) - P(A A B) = P(A) - P(A A B)
and finally P(AlB) = P(A).
'
CHAPTER 17
176
Conversely, if P(B) < 1 (possibly P(B) = 0) and 0 < P(A) < 1, assuming P(AIB) = P(A), we have
P(AIBC) = P(A A BC) = P(A)P(BCIA) = P(A)- P(BIA)P(A) = P(Bc) P(Bc) 1 - P(B)
= P(A)- P(A A B) = P(A)- P(AIB)P(B) = P(A) 1 - P(B)
1 - P(B)
so that P(AIBc) = P(AIB). Moreover, 0 < P(AIB) < 1 (and so also condition (ii) of independence is trivially satisfied: see Remark 15) .•
Remark 18 - When P(B) = 1, so that, trivially, P(AIB) = P(A), the relation AJLcsB may not hold. In fact, when P(B) = 1, the probability P(AIBc) can take any value of the interval {0,1}, as we are going to show (the proof has been given in {28}). For any assessment of this probability, putting, for the four atoms Ar,
the following system is compatible x2
= P(A)(x2 + x3)
x1
= P(AIBc)(xi
X2
+ X3 = 1
XI+ XI
+ x4)
X4 = 0
+ X2 + X3 + X4 = 1
Xi ;::: 0 since it has the solution xi = X4 = 0, x 2 = P(A), x 3 = 1- P(A); going on as usual, also the next system, with Yr = g (Ar), Y1 { Y1
= P(AIBc) (YI + Y4) + Y4 = 1
Yi;::: 0 is satisfied for any YI = P(AIBc) ~ 0.
CS-STOCHASTIC INDEPENDENCE
177
Theorem 15 - If AJLcsB, then P(AB) = P(A)P(B). Conversely, assuming that 0 < P(A) < 1 and 0 < P(B) < 1, if P(AB) = P(A)P(B), then AJlc8 B. Proof- The initial statement follows immediately from the first part of Theorem 14. Conversely, since the product rule implies P(AIB) = P(A) and P(BIA) = P(B), one has P(AIBc) = P(A)P(BciA) = P(A)(1 - P(BIA) = P(A) = P(AIB) P(Bc) 1 - P(B) .
Finally, argue as in the last two lines of the proof of Theorem 14. •
Remark 19 - When P(B) = 0, the equality P(A/\B) = P(A)P(B) holds for any P(A), but this does not imply AJLcsB. If P(B) = 1, both equalities P(A 1\ B) = P(A)P(B) and P(AIB) = P(A) hold for any A, but (as it has been already noticed in Remark 18} this does not imply AJlc8 B. If P(A) = 0, the product rule is satisfied for any B, and we may have also P(AIB) = P(AIBc) = 0, but it does not follow AJlc8 B, since condition o(AIB) = o(AIBc) may not hold. Finally, if P(A) = 1, both equalities hold, but it is not necessarily true that o(AciB) = o(AciBc). Concerning the possible symmetry of the independence relation, we have the following result:
Theorem 16 - Let A ⊥⊥cs B. We have:
(i) if P(B) = 0 then B ⊥⊥cs A ;
(ii) if P(B) = 1 then B ⊥⊥cs A ;
(iii) if 0 < P(B) < 1 and 0 < P(A|B) < 1, then B ⊥⊥cs A.
Proof - Prove (i): let A ⊥⊥cs B and P(B) = 0, and suppose that 0 < P(A|B) = P(A|Bc) < 1 (which implies 0 < P(A) < 1). Then we have P(B|A) = P(B|Ac) = 0 and the conclusion follows by condition (b)3 of Theorem 13. Suppose now P(A|B) = P(A|Bc) = 0, and so P(A) = 0 and P(B|Ac) = 0. By (b)1 of Theorem 13 it follows also P(B|A) = 0. Then, by using the same condition (b)1 with the roles of A and B interchanged, we have B ⊥⊥cs A. Finally, suppose P(A|B) = P(A|Bc) = 1, and so P(A) = 1 and P(B|A) = 0. By (c)1 of Theorem 13 it follows also P(B|Ac) = 0. Then, by using condition (b)2 (again interchanging the roles of A and B), we get B ⊥⊥cs A.
The proof of (ii) is analogous. In the last case (iii), the condition A ⊥⊥cs B coincides with the classic one, and so the result is known. •

Remark 20 - We note that if 0 < P(B) < 1 and P(A|B) = P(A|Bc) = 0 (and so P(A) = 0), then A ⊥⊥cs B does not assure that B ⊥⊥cs A. In fact, even if we have 0 < P(B|A) < 1 by condition (b)3 of Theorem 13, P(B|A) does not necessarily equal P(B|Ac) = P(B): see Example 27, where we have shown that this possible lack of symmetry is not counterintuitive.
For "lovers" of a symmetric concept of independence, Definition 13 can be strengthened by adjoining only (with A and B interchanged) condition {i), that is P(BIA) = P(B!Ac). In fact condition {ii) rules, in a sense, logical independence (which is, obviously, symmetric), as can be inferred from the following Theorem 17 - If A and B are logically independent, then
(17.1)
if and only if P(AIB) = P(AIBc)
and
P(BIA) = P(B!Ac).
(17.2)
Proof - Obviously (17.1) implies (17.2). Conversely, assuming (17.2) and distinguishing, for the given conditional probabilities p1 = P(A|B) = P(A) and p2 = P(B|A) = P(B), the three cases 0 < pi < 1, pi = 0, pi = 1 (i = 1, 2), the conclusion follows by resorting to the different situations expressed in Theorem 13; for example, if 0 < p1 < 1 and p2 = 0, we have A ⊥⊥cs B from
condition (a), while B ⊥⊥cs A follows from condition (b)3 (with A and B interchanged). •
In conclusion, the above theorem points out that requiring symmetry, i.e. condition (17.2), is a strong assumption absorbing all the different cases considered by Theorem 13.
17.2 "Imprecise" probabilities
Let us now turn to the problem of introducing stochastic independence for imprecise probabilities. The difficulties that arise in the classical framework are well known (see, for example, De Campos and Moral [44], Couso, Moral and Walley [41]). In fact, in the usual approaches problems are due to the introduction of marginals when one tries to extend to "imprecise" probabilities the product rule for "precise" probabilities

P(A ∧ B) = P(A)P(B) .   (17.3)

In fact, considering for example the lower probability P, we may have

P(A ∧ B) = P(A)P(B) ,   (17.4)

while no element P of the enveloping class satisfies (17.3), and, conversely, we may have families of probabilities such that each element P of the family satisfies (17.3), but (17.4) does not hold for the corresponding lower probability.

Example 28 - Given two (logically independent) events A and B, consider the following assessment

P(A) = P(B) = 1/4 , P(A ∧ B) = 1/16 , P(A ∨ B) = 3/4 .

We have shown in Example 21 of Chapter 15 that this is a lower probability.
Now, notice that P(A ∧ B) = P(A)P(B); nevertheless, there is no dominating class P such that the product rule holds for all probabilities P ∈ P. In fact, consider (for instance) the system (S1) (cf. Theorem 9 of Chapter 15) corresponding to the event A ∧ B, that we write in the form (for a, b, c ≥ 0)

z1 = 1/16
z1 + z3 = 1/4 + a
z1 + z2 = 1/4 + b
z1 + z2 + z3 = 3/4 + c
z1 + z2 + z3 + z4 = 1
zi ≥ 0

and assume that the corresponding "minimal" probability satisfies the product rule, that is

1/16 = (1/4 + a)(1/4 + b) .

Since the latter equation holds only for a = b = 0, we easily obtain the contradiction

z1 + z2 + z3 = 1/4 + 1/4 − 1/16 = 7/16 < 3/4

with the fourth equation of the above system. Conversely, we exhibit now a family of probabilities such that each element P satisfies the product rule, but the latter does not hold for the dominated lower probability. Consider in fact the following
Example 29 - Take the simple family with two elements P1 and P2:

P1(A) = 1/4 , P1(B) = 1/2 , P1(A ∧ B) = 1/8 ;
P2(A) = 1/2 , P2(B) = 1/4 , P2(A ∧ B) = 1/8 .

They satisfy the product rule, while the corresponding lower probability

P(A) = P(B) = 1/4 , P(A ∧ B) = 1/8

does not.
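Both examples can be verified with a few lines of code. The sketch below (Python; the dictionary encoding is ours) checks that in Example 29 each element of the family satisfies the product rule while the lower envelope does not, and that in Example 28 the product rule forces a = b = 0 and hence contradicts P(A ∨ B) = 3/4:

    # Example 29: each element of the family satisfies the product rule ...
    P1 = {"A": 1/4, "B": 1/2, "AB": 1/8}
    P2 = {"A": 1/2, "B": 1/4, "AB": 1/8}
    for P in (P1, P2):
        assert abs(P["AB"] - P["A"] * P["B"]) < 1e-12
    # ... but the lower envelope does not: (1/4)*(1/4) = 1/16 != 1/8.
    low = {k: min(P1[k], P2[k]) for k in P1}
    assert abs(low["AB"] - low["A"] * low["B"]) > 1e-12

    # Example 28: the product rule 1/16 = (1/4 + a)(1/4 + b) forces a = b = 0,
    # hence z1 + z2 + z3 = 1/4 + 1/4 - 1/16 = 7/16 < 3/4 = P(A v B),
    # contradicting the fourth equation of the system (S1).
    z_sum = 1/4 + 1/4 - 1/16
    assert z_sum < 3/4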
Other formulations of independence for imprecise probabilities, such as P(A|B) = P(A), give rise to similar difficulties, even if based on conditional probability, since its "Kolmogorovian-like" definition requires anyway resorting to marginals. These difficulties lead to many different ways of introducing conditioning for imprecise probabilities, while our approach to conditioning in this context is the most natural. In fact, we recall that its starting point refers (again) to the direct definition (through coherence) of the conditional "precise" probabilities "enveloping" the imprecise one (see Chapter 15): a further advantage is to avoid for lower probabilities the very strong (even if quite usual) assumption P(B) > 0, where B is the conditioning event.

We introduce now a definition of stochastic independence for lower probabilities that avoids (as can be easily checked) the situations pointed out in the latter two examples (where, given a lower probability and an enveloping class, the independence property may hold for the former but not for the latter, or conversely).

Definition 14 - Given a set E of events containing A, B, Ac, Bc, with B ≠ Ω, B ≠ ∅, and a coherent lower conditional probability P, defined on a family C containing D = {A*|B*, B*|A*} and contained in E × E0, we say that A is cs-stochastically independent of B with respect to P (in symbols A ⊥⊥cs B) if there exists a dominating class P such that A ⊥⊥cs B holds true for every P ∈ P.

Notice that the latter requirement of Definition 14 can be limited only to i-minimal probabilities (see Chapter 15), for any Fi|Ki ∈ C.
Remark 21 - According to our definition, the events A, B of Example 28 are not stochastically independent with respect to P: in fact, if they were independent, by Theorem 15 the product rule would hold for a dominating class, while we proved that such a class does not exist. The events of Example 29 are independent (with respect to P), as can easily be checked by expressing the relevant conditional probabilities by means of the given (unconditional, and with values strictly greater than 0 and less than 1) probabilities of A, B and A ∧ B.

Definition 15 - Let C be a family of conditional events and P a coherent lower conditional probability on C. For any conditional event Fi|Ki ∈ C, we call zero-layer of Fi|Ki, with respect to a set Π = {Pi} of minimal classes, the (nonnegative) number o(Fi|Ki), i.e. the zero-layer with respect to the i-minimal class Pi.
The following Proposition essentially states that a lower probability inherits the independence properties of the dominating class.
Proposition 4 - If A ⊥⊥cs B, then both the following conditions hold:
(i) P(A|B) = P(A|Bc) = P(A) ;
(ii) there exists a class Π = {Pi} as those of Definition 15, such that o(A|B) = o(A|Bc).
Proof - The proof is an easy consequence of the properties of cs-stochastic independence for "precise" probabilities. •
Remark 22 - Clearly, cs-stochastic independence for lower probabilities implies logical independence: this conclusion follows easily from Theorem 12 by referring to just one of the dominating probabilities.
Theorem 18 - Let A and B be two logically independent events. If A ⊥⊥cs B, then P(A|B) = P(A|Bc) and one (and only one) of the following conditions (a), (b), (c) holds:
(a) 0 < P(A|B) < 1 ;
(b) P(A|B) = 0 and the extension of P to B and B|A satisfies one of the three following conditions:
1. P(B) = 0 , P(B|A) = 0 ,
2. P(B) = 1 , P(B|A) = 1 ,
3. 0 < P(B) < 1 , 0 < P(B|A) < 1 ;
(c) P(A|B) = 1 and the extension of P to B and B|Ac satisfies one of the three following conditions:
1. P(B) = 0 , P(B|Ac) = 0 ,
2. P(B) = 1 , P(B|Ac) = 1 ,
3. 0 < P(B) < 1 , 0 < P(B|Ac) < 1 .
Proof - It follows easily recalling the "necessary" part of Theorem 13. •

Differently from the case of conditional (precise) probability, here the converse is not true. In fact, for (a), (b), (c) to imply independence it is necessary (and sufficient) that the restriction of the imprecise conditional probability to the relevant events be a precise conditional probability, as stated in the following

Theorem 19 - Let A and B be two logically independent events, E a set of events containing A, B, Ac, Bc, with B ≠ Ω, B ≠ ∅, and P a coherent lower conditional probability, defined on a family C containing D = {A*|B*, B*|A*} and contained in E × E0. Suppose P(A|B) = P(A|Bc), with P satisfying one of the conditions (a), (b) or (c) of Theorem 18. Then A ⊥⊥cs B if and only if the restriction of P to the set D' = D ∪ {A, B} is a coherent conditional (precise) probability.
Proof - Consider the system (S1)

Σ_{Ar ⊆ Fi∧Ki} P'(Ar) ≥ P(Fi|Ki) Σ_{Ar ⊆ Ki} P'(Ar)   [if i > 6] ,
Σ_{Ar ⊆ A∧B} P'(Ar) ≥ α Σ_{Ar ⊆ B} P'(Ar) ,
Σ_{Ar ⊆ A∧Bc} P'(Ar) ≥ α Σ_{Ar ⊆ Bc} P'(Ar) ,
Σ_{Ar ⊆ A∧B} P'(Ar) ≥ β Σ_{Ar ⊆ A} P'(Ar) ,
Σ_{Ar ⊆ B∧Ac} P'(Ar) ≥ β Σ_{Ar ⊆ Ac} P'(Ar) ,
Σ_{Ar ⊆ A} P'(Ar) = α ,
Σ_{Ar ⊆ B} P'(Ar) ≥ β ,
P'(Ar) ≥ 0 ,
Σ_{Ar ⊆ K0} P'(Ar) = 1 ,
where we refer to the notation of Theorem 9 of Chapter 15, with F1|K1 = A|Ω, F2|K2 = B|Ω, F3|K3 = A|B, F4|K4 = A|Bc, F5|K5 = B|A, F6|K6 = B|Ac. Moreover, we put

P(A|B) = P(A|Bc) = P(A) = α ,   P(B|A) = P(B|Ac) = P(B) = β

(the same values for the lower and the upper probability on these conditional events).
Firstly, suppose that the restriction of P to D' is a coherent conditional (precise) probability P, and consider the case (a), that is 0 < P(A|B) < 1. It follows that the system (S1) has a solution satisfying (as equalities) the inequalities related to the conditional events Fi|Ki, with i ≤ 6. Then this solution gives rise to i-minimal coherent conditional probabilities Pi for every i ≤ 6. Moreover Pi satisfies the condition of independence A ⊥⊥cs B.
To prove that it is possible to find i-minimal probabilities (for i > 6), notice that the equality related to a conditional event Fi|Ki for i > 6 does not give to Fi|Ki (i ≤ 6) different constraints from those induced by the coherence of the lower probability. Therefore it is possible to find two values λ and μ of the corresponding probabilities, with λ ≥ P(A) and μ ≥ P(B). For the cases (b) and (c) the proof is similar.
Conversely, suppose now that A ⊥⊥cs B; then there exists an i-minimal class {Pi} such that A ⊥⊥cs B for any i (that is, Pi(A|B) = Pi(A|Bc) = Pi(A) and one of the above conditions (a), (b), (c) of Theorem 13 holds). Therefore such a Pi, for i = 1, is necessarily a solution of the system (S1) that satisfies (as equalities) the inequalities related to Fi|Ki, with i ≤ 6. Then it follows that all the i-minimal probabilities with i ≤ 6 coincide, and this remark ends the proof. •
Remark 23 - When the restriction of P on the relevant events is a (precise) probability P, the sufficiency of the conditions (a), (b), (c) of Theorem 18 for the stochastic independence of A and B (see the first part of the above proof) is not (as it could appear) a trivial consequence of Theorem 13 (that refers to precise probabilities). In fact, since stochastic independence for imprecise probabilities is a property that involves all the probabilities of a dominating class, it may happen that the values taken by the coherent lower probability P on events not in D' inhibit the existence of a dominating class of precise probabilities satisfying the aforementioned conditions.

Remark 24 - For other usual definitions of stochastic independence for imprecise probabilities (based on the product rule) a result similar to that of Theorem 19 does not hold. Consider in fact Example 28: even if the restriction of P to {A, B, A ∧ B} is a coherent (precise) probability and (17.4) holds,
nevertheless this form of independence does not imply the existence of a class of dominating probabilities satisfying the product rule.
17.3 Discussion
Although it might seem obvious that, given a measure of uncertainty (such as a lower probability) involving a family of (precise) probabilities, a "natural" definition of stochastic independence can be obtained by requiring the independence condition for each element of the family (for example, the dominating class, as in Definition 14), we had better be careful in other contexts. In other words, while a concept of independence for a lower probability not referring to the dominating family might not be significant (see Examples 28 and 29) - in the sense that it might be (so to say) too "weak" or too "strong" - on the other hand the conclusion of Theorem 19 may suggest the "philosophical" remark that the intuitive aspects of stochastic independence (i.e., a new information on the event B should not change our belief in the event A) are better captured by referring to just one probability.
For example, do convex combinations of probabilities inherit stochastic independence relations that hold for all the elements of the relevant family? The following classic situation shows that the answer is "NO". It refers to an urn of unknown composition, which, by considering successive drawings with replacement, supplies a natural example of non-independent (even if exchangeable) events
A = the first ball drawn is white,
B = the second ball drawn is white
(the intuitive content is that at each drawing we learn "something").
Suppose that each ball in the urn is white or black, but we do not know the number of balls with a given colour, so that we need to consider also the events

Hr = the number of white balls is r ,   (17.5)

with r = 0, 1, 2, ..., N, if N is the total number of balls in the urn. Clearly, if the composition of the urn were known (that is, r takes only a given value r0), A and B would be stochastically independent. For definiteness, take N = 3 and suppose that we know that only two situations are possible (even if we do not know which is the true one), i.e. the number of white balls (out of the 3) is 1 or 2.
If P1 and P2 are the corresponding probability assessments, we have

P1(A) = P1(B) = 1/3 , P1(A ∧ B) = 1/9 , P1(B|A) = P1(B|Ac) = 1/3 ,

and

P2(A) = P2(B) = 2/3 , P2(A ∧ B) = 4/9 , P2(B|A) = P2(B|Ac) = 2/3 ,

so that A ⊥⊥cs B with respect to both P1 and P2 (here our definition coincides with the classic one, since all probabilities are positive and less than 1). Consider now the convex combination

P(·) = α1 P1(·) + α2 P2(·) ,   (17.6)

with α1 = P(H1) and α2 = P(H2). Since P1(·) = P(·|H1) and P2(·) = P(·|H2), it follows that (17.6) represents the probability distribution for the drawings from the given urn of unknown composition.
We show that, for any choice of α1 and α2, with 0 < α1, α2 < 1, the events A and B are not independent: in fact

P(A) = P(B) = (1/3)α1 + (2/3)α2 = (1/3)α1 + (2/3)(1 − α1) = (1/3)(2 − α1) ,

P(A ∧ B) = (1/9)α1 + (4/9)α2 = (1/3)(4/3 − α1) ,

P(A)P(B) = (1/9)(2 − α1)² ≠ (1/3)(4/3 − α1) ,

so that P(A ∧ B) ≠ P(A)P(B), as can be easily checked. Notice that the same negative conclusion is reached also if we consider P(B|A) = P(B|Ac), even if the latter is equal to P(B): in fact the common value is not a conditional probability (i.e., it is not coherent), since the last equality implies

P(A ∧ B)/P(A) = P(Ac ∧ B)/P(Ac) ,

i.e.

(4 − 3α1)/(2 − α1) = 2/(1 + α1) ,

which has no solution (for 0 < α1 < 1).
Given an arbitrary natural number N, it is not difficult to prove the same result for the convex combination

P(·) = Σ_{r=1}^{N} P(Hr) P(·|Hr) ,

i.e. by referring to all possible compositions (17.5) of the urn.
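The computation above is easily reproduced numerically; a minimal sketch (Python, with N = 3 and α1 ranging over a few values in (0, 1)):

    # Urn with 3 balls, 1 or 2 white: mixture P = a1*P1 + (1-a1)*P2.
    # P1, P2 each make A and B independent; the mixture never does.
    for a1 in (0.1, 0.25, 0.5, 0.75, 0.9):
        pA = a1 * (1/3) + (1 - a1) * (2/3)    # = P(A) = P(B) by symmetry
        pAB = a1 * (1/9) + (1 - a1) * (4/9)   # = P(A & B)
        assert abs(pAB - pA * pA) > 1e-9      # product rule fails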
Urns of unknown composition are useful also to discuss some further aspects concerning the so-called "dilation" (a precise probability may give rise - under conditioning - to an imprecise one), a phenomenon which is considered counterintuitive: we challenge this view in [33] (not to mention that many instances of dilation can be found in Chapter 16 on inference!). So the fact that a "precise" probability P(A) may give rise - under conditioning with respect to B - to an interval whose extremes are a lower and an upper probability is well known in Bayesian statistics, and it is not at all strange in terms of coherence. Our opinion is that dilation looks counterintuitive since what is usually emphasized in the literature - when a conditional probability P(A|B) is taken into account - is only the fact that P(·|B) is a probability for any given B: this is a very restrictive (and misleading) view of conditional probability, corresponding trivially to just a modification of the "world" Ω (see our discussion following Definition 5 in Chapter 11). It is instead essential to regard the conditioning event B as a "variable", i.e. the "status" of B in A|B is not just that of something representing a given fact, but that of an (uncertain) event (like A) for which the knowledge of its truth value may not be required: so P(B|A), which plays an essential role - through Bayes' theorem - in the updating P(A|B) of P(A), may be naturally "imprecise". This can be easily checked by considering, for example, two urns with N balls - one of them, U, being of unknown composition - and the events
A = an urn chosen at random out of the two is U ,
B = a ball drawn from the selected urn is white .
In fact, denote by r the (unknown) number of white balls in the urn U (so r takes values r = 0, 1, ..., N), and by r0 the (given) number of white balls in the second urn: since P(A) = 1/2, we get,
by a straightforward application of Bayes' theorem,

P(A|B) = r/(r + r0) , with r = 0, 1, ..., N ,

so that P(A|B) can take N + 1 values (between 0 and N/(N + r0)).
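A short computation (Python; the values N = 10 and r0 = 4 are our illustrative choices) makes the dilation explicit:

    # Two urns: U has r white balls out of N (r unknown), the other has r0.
    # With P(A) = 1/2, Bayes gives P(A|B) = r / (r + r0).
    def posterior_values(N, r0):
        return [r / (r + r0) for r in range(N + 1)]

    vals = posterior_values(N=10, r0=4)
    assert min(vals) == 0.0 and max(vals) == 10 / 14   # i.e. N/(N + r0)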
17.4 Concluding remarks
The concept of stochastic independence is usually based on the factorization of a joint probability distribution as the product of the marginal distributions. This leads to difficulties and inconsistencies, in particular when events of probability zero or one are involved. Concerning imprecise probabilities, in the literature there are several different definitions of independence that give rise as well to similar and further difficulties. We are able to cope with these problems by extending to imprecise probabilities (in the rigorous sense of lower and upper probabilities) our approach to independence based solely on conditional probability: a "direct" definition of the latter through coherence avoids all the difficulties connected with the use of marginals and the presence of conditioning events of probability zero. Moreover, stochastic independence (for either precise or imprecise probabilities) implies logical independence. Nevertheless, we believe that the intuitive aspects of stochastic independence A ⊥⊥cs B (that is, a new information on the event B should not change our belief in the event A) are better captured by referring to just one (precise) probability.
Chapter 18

A Random Walk in the Midst of Paradigmatic Examples

In this Chapter we consider a bunch of "randomly chosen" examples, with the aim of further clarifying or complementing many aspects that have already been dealt with (more or less extensively) in the previous Chapters. Each example is preceded by relevant comments that should put it in the right perspective.
18.1 Finite additivity
In Chapter 3, Remark 2, we discussed briefly the suitability of assuming for probability the weaker axiom of finite additivity. The subsequent theory of coherence confirms that this is a "natural" choice. The following concrete example illustrates a statistical phenomenon (the so-called first digit problem [109], [110]) which cannot be properly modeled by a countably additive probability: an attempt to do so leads to a conclusion which is empirically untenable.
Example 30 - It has been observed that empirical results concerning the distribution of the first significant digit of a large body of statistical data (in a wide sense: physical and chemical constants, partial and general census and election results, etc.) show a peculiarity that has been considered paradoxical, i.e. there are more "constants" with low order first significant digits than high. In fact, the observed frequency of the digit k (1 ≤ k ≤ 9) is not 1/9, but is given by

P(Ek) = log10(1 + 1/k) ,   (18.1)

where Ek is the event "the first significant digit of the observed constant is k", that can be written as

Ek = ∨_{n=0}^{∞} Ikn ,

with (for short, with an abuse of notation)

Ikn = [k · 10^n, (k + 1) · 10^n) ,

since Ikn is in fact the proposition "the first significant digit of the observed constant belongs to the interval Ikn". Assuming countable additivity, these intervals, in spite of their increasing (with n) cardinality, might obviously have, to guarantee the summability of the relevant series

P(Ek) = Σ_{n=0}^{∞} P(Ikn) ,   (18.2)

a probability converging to zero. On the other hand, since any kind of "regularity" in a statistical table should be apparent also in every table obtained from it by any change of units, it follows that the sought probability P should be "scale-invariant", i.e.

P(I) = P(λI)

for any interval I and real λ. By choosing as λ a power of 10, it follows that, for any integer k between 1 and 9 and for any natural number n, P(Ikn) = 0, so (18.2) cannot hold. Instead, in a finitely additive setting, these equalities are compatible with the above value of P(Ek), since, by superadditivity (an elementary property of finitely additive measures on a countable partition), we have

P(Ek) ≥ Σ_{n=0}^{∞} P(Ikn) .

How to find a suitable (finitely additive) probability distribution satisfying (18.1) is shown, e.g., in [110].
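The distribution (18.1) is immediate to tabulate; the sketch below (Python) also checks that the nine values sum to 1 and decrease from k = 1 to k = 9:

    from math import log10

    # First-digit law (18.1): P(Ek) = log10(1 + 1/k), k = 1, ..., 9.
    p = {k: log10(1 + 1 / k) for k in range(1, 10)}
    assert abs(sum(p.values()) - 1.0) < 1e-12          # a probability distribution
    assert all(p[k] > p[k + 1] for k in range(1, 9))   # low digits more frequent
    print(round(p[1], 3), round(p[9], 3))              # 0.301 ... 0.046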
18.2 Stochastic independence
The first-digit problem is apt also to discuss situations concerning stochastic independence in our coherent setting versus the classic one.

Example 31 - With the same notation of the previous Example, for any given natural number n, we have

P(Ek ∧ Ikn) = P(Ikn) = 0 = P(Ek)P(Ikn) ,

while Ek and Ikn are clearly not independent (neither logically nor stochastically). In fact, for any given natural number n we have

P(Ek|Ikn) = 1 ,

which is different (referring now to Definition 13, Chapter 17) from P(Ek|Ikn^c).
18.3 A not coherent "Radon-Nikodym" conditional probability
Consider the conditional probability P(E|H): even if allowing the conditioning event H to have zero probability gives rise to subtle problems, nevertheless this conditional probability has, in our framework, all the ... "civil rights", since it can be directly assessed through the concept of coherence. On the other hand, in Kolmogorov's axiomatic approach, in which the formula

P(E|H) = P(E ∧ H)/P(H)
(assuming P(H) > 0) is taken as the definition of conditional probability, a difficulty immediately arises when absolutely continuous distributions are considered, since in this case zero probabilities are unavoidable. In order to recall in the shortest and most elementary way the procedure followed (in the usual approach) to cope with these difficulties, we will adopt an informal exposition, sketching the main ideas and avoiding any detailed and rigorous specification. Nor shall we recall each time explicitly that all the probability distributions of the classical framework must satisfy countable additivity (and not only finite additivity, which is the natural requirement in a framework based on coherence).
Let (X, Y) be a random vector and P the relevant probability distribution. Given two Borel sets Ax and By contained respectively in the range of X and in that of Y, by the same symbols we shall denote also the events {X ∈ Ax} and {Y ∈ By}. For any given value x of X, the conditional probability p(By|x) is defined (see, e.g., [10]) as a function of x such that

P(Ax ∩ By) = ∫_{Ax} p(By|x) μ(dx)   (18.3)
where μ is the marginal distribution of X. The existence of such a function p(By|x) is warranted (under usual regularity conditions) by the Radon-Nikodym theorem: in fact

P(Ax ∩ By) ≤ μ(Ax) ,

so that, putting

β(Ax, By) = P(Ax ∩ By) ,

it follows that the probability measure β(·, By) is absolutely continuous with respect to μ; therefore β can be represented as a Lebesgue-Stieltjes integral (with respect to μ) of a density, i.e. of the function (of the argument x) denoted by p(By|x) in eq. (18.3). Is this function entitled to be called "conditional probability"? Of course, in order to interpret p(By|x) as the conditional probability of By given {X = x} it is necessary that p(·|x) be a (countably additive) probability measure: this is true under suitable regularity conditions (that hold in the most common situations). Notice that p(·|x), being a density with respect to x, could be arbitrarily modified on a set of zero measure. Moreover, in the particular case that Ax reduces to a singleton {x} with probability μ({x}) > 0, we must have

P({x} ∩ By) = μ({x}) p(By|x) ;   (18.4)

and in fact in this case eq. (18.3) becomes eq. (18.4). For the sake of simplicity, we have considered a random vector (X, Y), but it should be clear that the previous results could have been expressed by referring to two suitable partitions of the certain event Ω and by relying on the relevant extensions of the concept of integral and of the related measure-theoretic tools. Now, the main question is the following: is the above function p(·|·) a coherent conditional probability? Let us consider the following
Example 32 - Given a ∈ ℝ, let Hn = [a, a + 1/n] for every n ∈ ℕ, with μ(Hn) > 0, and let H = {a}, with μ(H) = 0. Given an event E, by Kolmogorov's definition we have

p(E|Hn) = P(Hn ∧ E)/μ(Hn) ;

then take a density p(E|x) defined by means of (18.3):

P(Hn ∧ E) = ∫_{Hn} p(E|x) μ(dx) .

Under usual regularity and continuity conditions we can write (by the "mean-value" theorem) p(E|Hn) = p(E|x0) for a suitable x0 ∈ Hn, so that

lim_{n→∞} p(E|Hn) = lim_{n→∞} p(E|x0) = p(E|H) .   (18.5)

Now, if we consider the events E' = E ∨ H and E'' = E ∧ Hc (recall that the probability of H is zero), we get

P(E' ∧ Hn) = P(E'' ∧ Hn) = P(E ∧ Hn)

and then p(E'|Hn) = p(E''|Hn) = p(E|Hn) for every n ∈ ℕ; it follows that also the three corresponding limits (18.5) are equal, so, in particular, p(E'|H) = p(E''|H). But notice that coherence requires P(E'|H) = 1 and P(E''|H) = 0.
The conclusion is that the adoption of the classical Radon-Nikodym procedure to define conditional probability, while syntactically correct from a pure mathematical point of view, can (easily) give rise to assessments which are not coherent (since they do not satisfy all relevant axioms of a conditional probability). Not to mention that it requires referring not just to the given elementary conditioning event, but rather it needs the knowledge of the whole conditioning distribution: this circumstance is clearly unsound, especially from an inferential point of view, since p(E|x)
comes out to depend not only on x, but on the whole σ-algebra to which x belongs. A rigorous measure-theoretic approach to the relevant problems concerning a comparison between de Finetti's and Kolmogorov's settings in dealing with null conditioning events is in [11]; for an elementary exposition, see [112]. A complete and exhaustive expository paper (in particular, see its Section 4) is [7].
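The clash described in Example 32 can be made concrete with a toy computation (Python; taking μ uniform on [0, 1], E = [0, 1/2] and a = 1/4 is our choice for illustration): the density-based conditional is the same for E, E' = E ∨ H and E'' = E ∧ Hc, although coherence requires the values 1 and 0 for the last two.

    # mu uniform on [0,1]; E = [0, 1/2]; H = {1/4}; Hn = [1/4, 1/4 + 1/n].
    # Since H has mu-measure zero, P(.&Hn)/mu(Hn) is identical for
    # E, E' = E v H and E'' = E & Hc; yet coherence requires
    # P(E'|H) = 1 and P(E''|H) = 0.
    def cond_prob(n):
        a, length = 0.25, 1.0 / n
        overlap = max(0.0, min(0.5, a + length) - a)   # mu(E & Hn)
        return overlap / length                        # = p(E|Hn) = p(E'|Hn) = p(E''|Hn)

    for n in (10, 100, 1000):
        assert abs(cond_prob(n) - 1.0) < 1e-9   # the common limit is 1 for all three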
18.4 A changing "world"
The situation described in the next example concerns the problem of how to assess "new" conditional probabilities when the set of (conditional) events changes. It has already been discussed (but only from the "logical" point of view concerning the possibility of a "finer" subdivision into atomic events) in Chapter 2, Example 3.

Example 33 - Given an election with three candidates A, B, C, we learn (or we assume) that C withdraws and that then all his votes will go to B: according to Schay [108], this situation involves probabilities for which the product rule

P(B ∧ H) = P(H)P(B|H) ,   (18.6)

with H = (A ∨ B), does not hold. Assuming that the (initial) probability of either one winning is 1/3, and denoting by the same symbols also the corresponding events, so that

P(A) = P(B) = P(C) = 1/3 ,

Schay argues as follows: since one has P(A ∨ B) = 2/3 and P(B|H) = 2/3 (but notice that the only coherent choice for the latter conditional probability is 1/2, since both B and H have positive probability!), then, taking into account that B ∧ H = B, one gets for the left-hand side of (18.6) the value P(B) = 1/3, while the right-hand side of the product rule is (2/3)(2/3) = 4/9.
Actually, a careful singling-out of the "right" conditioning event (as has been discussed in Example 3) shows that it is not the event H = A ∨ B, but the event, outside the initial "space" {A, B, C},

E = C withdraws and all his votes go to B ,

with E ⊂ H; so giving P(B|E) the value 2/3 looks like a more "convincing" assignment than giving P(B|H) this (incoherent) value. It is not difficult to prove that the assignment P(B|E) = 2/3 is not only convincing, but also coherent if P(E) ≤ 1/2: more precisely, a cumbersome (but simple) computation shows that coherent assignments of P(B|E) are those in the interval

1 − 1/(3P(E)) ≤ P(B|E) ≤ 1/(3P(E)) ;

in particular, if P(E) ≤ 1/3 any value (between 0 and 1) is coherent. So we cannot agree with Schay's conclusion that "it may along these lines be possible to incorporate the probabilities of quantum mechanics in our theory". On the contrary, certain paradoxes concerning probabilities that (putatively) do not satisfy the product rule and arise in the statistical description of quantum theory may depend on the fact that observed frequencies, relative to different (and possibly incompatible) experiments, are arbitrarily identified with the values of a conditional probability on the same given space. Before discussing these aspects in the next example, some remarks are now in order, recalling also what has been discussed at the end of Chapter 2 and in Chapter 8 about the careful distinction that is needed between the meaning of probability and its methods of evaluation.
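The interval above can be checked by a direct computation (a sketch in Python; the bounds use only P(B ∧ E) ≤ P(B) = 1/3 and P(B ∧ E) ≥ P(E) − P(A) = P(E) − 1/3, since C ∧ E = ∅):

    # P(A) = P(B) = P(C) = 1/3; E is a subset of A v B (C cannot win if
    # C withdraws). Coherent values of P(B|E) = P(B & E)/P(E).
    def coherent_interval(pE):
        lo = max(0.0, 1.0 - 1.0 / (3.0 * pE))   # from P(B & E) >= pE - 1/3
        hi = min(1.0, 1.0 / (3.0 * pE))         # from P(B & E) <= 1/3
        return lo, hi

    assert coherent_interval(1/3) == (0.0, 1.0)   # any value is coherent
    lo, hi = coherent_interval(1/2)
    assert lo <= 2/3 <= hi                        # P(B|E) = 2/3 is coherent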
18.5 Frequency vs. probability
Even if it is true that in "many" cases the value of a probability is "very near" to a suitable frequency, in every situation in which
something "very probable" is looked on as "practically certain", there are "small" probabilities that are actually ignored, so making illegitimate also any probabilistic interpretation of physical laws. For example, a probabilistic explanation of the diffusion of heat must take into account the fact that the heat could accidentally move from a cold body to a warmer one, making the former even colder and the latter even warmer. This fact is very improbable only because the "unordered" configurations (i.e., heat equally diffused) are far more numerous than the "ordered" ones (i.e., all the heat in one direction), and not because unordered configurations enjoy some special status. Analogously, when pressing "at random" 18 keys on a typewriter and forecasting the occurrence of any sequence different from ''to be or not to be", we cannot consider it impossible that that piece of "Hamlet" could come out: in fact, if we were arguing in this way, it would mean also denying the possibility of explaining why we got just that sequence which we actually got, since it had the same probability as "to be or not to be" of being typed. So, why it is so difficult to see that piece by Shakespeare coming out - or else: to see water freezing on a fire - even in a long series of repetitions of the relevant procedure? It is just because their (expected) "waiting times" (inversely proportional to the corresponding probabilities) are extremely large (it has been computed that they are much larger that the expected life of our universe!) Notice that the difference between an impossible fact and a possible one - also with a very small probability, or even zero (it is well-known that we may have "many" possible events with zero probability) - is really enormous, since it is not a matter of a numerical difference, but of a qualitative (i.e., logical) one. Going back to the connections between probability and observed frequency, the classical two-slit experiment, discussed from a probabilistic point of view by Feynman [62], is an interesting illustration
of the quantum mechanical way of computing the relevant probabilities (an interpretation in terms of coherent probability has been given in [113]).
Example 34 - A source emits "identically prepared" particles (in the jargon of the quantum community, preparation is the physical counterpart of the notion of "conditioning") toward a screen with two narrow openings, denoted S1 and S2. Behind the screen there is a film which registers the relative frequency of particles hitting a small given region A of the film. Measurements are performed in three different physical situations: both slits open, only slit S1 open, only slit S2 open. We introduce, for a given particle, the following event, denoted (by abusing notation) by the same symbol as the corresponding physical device:
A = the particle reaches the region A ,
and, for i = 1, 2, the following two events:
Si = the particle goes through slit Si .
Moreover, since all the particles are identically prepared, we may omit the further symbol H (referring to preparation) in all conditioning events. The experimentally measured frequencies are usually identified, respectively, with the three probabilities P(A), P(A|S1) and P(A|S2). Repeated experiments can be performed letting a particle start from the source, and then measuring its final position on the film, to determine whether it is in the region A or not; moreover we could "measure" P(A|S1) or P(A|S2) by putting in function an experimental device allowing the particle to hit the region A only through the slit S1 or only through the slit S2. The corresponding frequencies (of going through the relevant slit) are also identified with the probabilities P(S1) and P(S2). Now, irrespective of whether the device has been activated or not, and of what was the issue in case of activation, we may obviously
write, by the disintegration formula (see (16.3), Chapter 16),

P(A) = P(A|S1)P(S1) + P(A|S2)P(S2) ,   (18.7)

since this is an elementary property of conditional probability, an easy consequence of the relevant axioms. Instead physical experiments give an inequality between the left and right hand sides of (18.7). Well, this circumstance cannot be used to "falsify" anything or to introduce a sort of "new kind of probability", since it refers in fact only to observed frequencies. Actually, observed frequencies (pertaining to different experiments) may not necessarily be identified with (and so used to compute) probabilities, and the previous discussion can be seen as an instance of the problem of finding a coherent extension of some beforehand given (conditional) probabilities (see Chapter 13). Interpreting A as A|Ω and Si as Si|Ω, the value P(A) given by (18.7) is a coherent extension of the conditional probabilities P(A|Si) and P(Si|Ω), while in general a value of P(A) obtained by measuring a relevant frequency may not be. In other words: while a convex combination (a sort of "weighted average") of conditional probabilities can be - as in eq. (18.7) - a probability, there is no guarantee that it could be expressed as a convex combination of conditional frequencies (corresponding to different and incompatible experiments). In the previous example, the two incompatible experiments are not (so to say) "mentally" incompatible if we argue in terms of the general meaning of probability (for example, P(A|S1) is the degree of belief in A under the assumption - not necessarily an observation, but just an assumed state of information - "S1 is true"): then, for a coherent evaluation of P(A) we must necessarily rely only on the above value obtained by resorting to eq. (18.7), even if such probability does not express any sort of "physical property" of the given event.
18.6 Acquired or assumed (again)
The previous remarks pave the way for another important aspect involving the concepts of event and conditioning, and the ensuing "right" interpretation of the conditional probability P(E|H): we refer to the necessity of regarding an event always as an assumed and not asserted proposition, as discussed in Chapter 2 and at the beginning of Chapter 11. The following example has been discussed by de Finetti in Chapter 9 of the book cited under [52].
Example 35 - Consider a set of five balls {1, 2, 3, 4, 5} and the probability of the event E that a number drawn from this set at random is even (which is obviously 2/5): this probability could instead be erroneously assessed (for instance) equal to 1/3, if we interpret P(E|H) = p as "[the probability of E], given H" (which would literally mean "if H occurs, then the probability of E is p"), and not as a whole, i.e. as "the probability of [E given H]". In fact, putting H1 = {1, 2, 3} and H2 = {3, 4, 5}, the probability of E conditionally on the occurrence of each one of the events H1 and H2 is 1/3, and one (possibly both) of them will certainly occur.
18.7 Choosing the conditioning event
Another illuminating example, concerning the "right" choice of the conditioning event, is the following.
Example 36 - Three balls are given: two of them are white and distinguishable (marked 1 and 2), the third one is black. One out of the three corresponding events W1, W2, B is the possible outcome of the following experiment: a referee tosses a dice and puts in a box the black ball or the two white ones, according to whether the result is "even" (event E) or "odd" (event O). In the former case the final outcome of the experiment is B, whereas in
the latter the referee chooses (as the final outcome of the experiment) one of the two white balls (and we do not know how the choice is made). Then we learn that, if W1 was not the final outcome, "the referee shows 1 as one of the two remaining balls" (denote by A the event expressed by this statement). Actually, the referee shows indeed that one of the two remaining balls is 1: what is the probability that B was the final outcome of the experiment?
This example is an "abstract" version of a classical one, expressed in various similar forms in the relevant literature (the three prisoners, the two boys in a family with two children one of which is a boy, the car and the goats, the puzzle of the two aces, etc.). Here also the problem is to correctly express the available evidence, which is the event A = "the referee shows 1 as one of the two remaining balls" and not "1 is one of the two remaining balls". Obviously, the conditional probability of B is affected by one or the other choice of the conditioning event. Now, since A = E ∨ (O ∧ W2) (in words: either the result of the dice tossing is E, i.e. the final outcome is B, or the result is O and the referee has chosen, as final outcome, the ball 2; in both cases, W1 is not the final outcome, and so the referee shows 1 as one of the two remaining balls), it follows that

P(A) = P(E) + P(O)P(W2|O) = (1/2)[1 + P(W2|O)] = (1/2)(1 + x) ,

where x is the probability that the referee chooses the ball 2 when the result of the dice tossing is O. Notice that, even if the number x is not (or not yet) determined, it always makes sense, since (in our general framework in which probability is a degree of belief in a proposition) it refers to the statement "the referee chooses the ball 2", which is a logical entity that can be either true or false.
Then we get

P(B|A) = P(B ∧ A)/P(A) = P(B)/P(A) = 1/(1 + x) ,

and, since x can be any number between 0 and 1, it follows that a coherent choice of P(B|A) is any number such that

1/2 ≤ P(B|A) ≤ 1 .

In conclusion, for a sound interpretation of a conditional event and of conditional probability, also a careful examination of subtleties of this kind is essential. For example, if the referee is "deterministic", in the sense that he always takes the same ball (2 or 1) when the result of the dice tossing is O, then P(B|A) = 1/2 or 1 (respectively), while if he chooses between the two balls by tossing a coin (x = 1/2), then P(B|A) = 2/3.
18.8 Simpson's paradox
To study the effect of alternative treatments T and Tc on the recovery R from a given illness, usually a comparison is made between the two conditional probabilities P(R|T) and P(R|Tc) (evaluated by means of relevant frequencies): then, considering two different subpopulations M and Mc (for example, males and females) and the corresponding pairs of conditional probabilities P(R|T ∧ M) and P(R|Tc ∧ M), or P(R|T ∧ Mc) and P(R|Tc ∧ Mc), situations may occur where one gets

P(R|T ∧ M) < P(R|Tc ∧ M)

and

P(R|T ∧ Mc) < P(R|Tc ∧ Mc)

for both subpopulations, while

P(R|T) > P(R|Tc) .

This phenomenon is called Simpson's paradox or "confounding effect" (and M is the confounding event). If a confounding event (e.g., M) has been detected, then Simpson's paradox can be ignored by taking as frame of reference either the whole population or the two separate subpopulations, but there are no guiding lines for this choice and, anyway, this process may be endless, since there may exist, besides M, many other confounding events not yet detected. A resolution has been given in [4], and it is discussed in the following example.
Example 37 - Referring to the symbols introduced above, the consideration of the conditional events R|T and R|Tc corresponds to conditioning on given (and incompatible) facts (see also the discussion of Example 34); in other words, they try to answer the question "given the treatment T (or Tc), did the patient recover?". Then it appears more sensible to refer instead to the conditional events T|R and Tc|R (by the way, the first one is enough), which correspond to the question "given the recovery, has the patient been treated by T or by Tc?". Moreover, with this choice Simpson's paradox is avoided. In fact, suppose we agree that the inequality

P(T|R) > P(Tc|R)   (18.8)

means that the treatment T is more beneficial than Tc (with respect to the recovery R). Then, starting from the analogous inequalities referring to any (even unknown) confounding event C, that is

P(T|R ∧ C) > P(Tc|R ∧ C) and P(T|R ∧ Cc) > P(Tc|R ∧ Cc) ,

we get easily

P(T|R) = P(C|R)P(T|R ∧ C) + P(Cc|R)P(T|R ∧ Cc) >
> P(C|R)P(Tc|R ∧ C) + P(Cc|R)P(Tc|R ∧ Cc) = P(Tc|R) ,

that is formula (18.8).
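A numerical instance (the counts below are our own, chosen only to exhibit the reversal) shows both the paradox and the fact that the comparison (18.8) is immune to it:

    # Counts (illustrative): (number treated, number recovered) per group.
    data = {
        ("T", "M"): (100, 60), ("Tc", "M"): (10, 7),     # 0.60 < 0.70
        ("T", "Mc"): (10, 1),  ("Tc", "Mc"): (100, 20),  # 0.10 < 0.20
    }

    def rate(treatment):
        n = sum(data[(treatment, s)][0] for s in ("M", "Mc"))
        r = sum(data[(treatment, s)][1] for s in ("M", "Mc"))
        return r / n

    # Tc looks better in each subpopulation ...
    for s in ("M", "Mc"):
        assert (data[("T", s)][1] / data[("T", s)][0]
                < data[("Tc", s)][1] / data[("Tc", s)][0])
    # ... yet T looks better on the whole population: Simpson's reversal.
    assert rate("T") > rate("Tc")
    # The comparison advocated in Example 37 is unambiguous:
    # P(T|R) = 61/88 > P(Tc|R) = 27/88.
    assert 61 / 88 > 27 / 88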
18.9 Belief functions
Finally, we discuss a classical example that is claimed (by Shafer, see [117]) to be not solvable without resorting to belief functions. We show instead that it is possible to find a simple probabilistic solution by means of conditional lower and upper probabilities (for the sake of brevity, we will deal only with lower probability). We start by recalling only the main definitions concerning belief functions and Dempster's rule of combination, making use (as much as possible) of our terminology.
A Dempster's space D is a four-tuple D = (S, T, Γ, μ), where S and T are two different sets of atoms (i.e., two different finite partitions of Ω) and to each element s ∈ S there corresponds an element Γ(s) belonging to the algebra A generated by the elements of T; moreover, μ is a probability distribution on S such that μ(S0) > 0, where S0 is the set of regular points s ∈ S, i.e. those such that Γ(s) ≠ ∅ (while an element s ∈ S is called singular if Γ(s) = ∅). For the sake of simplicity, assume that S0 = S, otherwise the regularization of μ is defined as
μ0(s) = μ(s) / Σ_{s∈S0} μ(s) .
Starting from D, a function m : A → [0, 1] is a basic probability assignment if

m(A) = μ({s : Γ(s) = A}) = μ(Γ⁻¹(A)) , if A ≠ ∅ ,
m(A) = 0 , if A = ∅ .
Notice that this function is not a probability: for example, it is not monotone with respect to implication between events ("inclusion", in terms of the corresponding sets); in fact, since different elements of A cannot be images (through Γ) of the same element of S, then A ⊆ B does not necessarily imply m(A) ≤ m(B). Nevertheless, if the elements of the algebra A are looked on as "points" of a new "space", then m is a probability on A, since for any A ∈ A

m(A) ≥ 0 and Σ_{A∈A} m(A) = 1 .
Then a belief function is defined as

Bel(A) = Σ_{B⊆A} m(B) ,   (*)

and this function turns out to be n-monotone for every n ∈ ℕ (see the last Section of Chapter 15), satisfying Bel(∅) = 0 and Bel(Ω) = 1. In particular, the function Bel reduces to a probability if and only if the function m is different from 0 only on the atoms of A (that is, on the atoms of T).
Consider now two Dempster's spaces D1 and D2 relative to the same set T (and so to the same algebra A). The purpose is to find a "common framework" for D1 and D2, so the function Γ on the product space S = S1 × S2 (with range in T) is defined by putting

Γ(s1, s2) = Γ1(s1) ∧ Γ2(s2) .
Then the following condition concerning the probability distributions on S1 and S2 is assumed:

μ(s1, s2) = μ1(s1) μ2(s2) ,

that is, stochastic independence (in the classic sense) of any pair of elements s1 ∈ S1 and s2 ∈ S2. In conclusion, we get the space

D1 × D2 = (S = S1 × S2 , T , Γ = Γ1 ∧ Γ2 , μ = μ1 · μ2) ,

but there is no guarantee that Γ1(s1) ∧ Γ2(s2) ≠ ∅ for some pair (s1, s2) ∈ S. This requires the introduction of the regularization of the space D = D1 × D2, called Dempster's sum

D1 ⊕ D2 ,

where the measure μ0 is defined as above, i.e.

μ0(s) = μ(s) / Σ_{s∈S0} μ(s) , s ∈ S0 .
The corresponding basic probability assignment m = m1 ⊕ m2 is given, for A ∈ A, by

m(A) = [ Σ_{B∧C=A} m1(B) m2(C) ] / [ Σ_{B∧C≠∅} m1(B) m2(C) ] , with B, C ∈ A , if A ≠ ∅ ,

and m(A) = 0 if A = ∅.
Finally, the function Bel relative to D can be deduced from the function m = m1 ⊕ m2 in the same way as done above for (*).

Example 38 - In [117] the following example is considered: "Is Fred, who is about to speak to me, going to speak truthfully, or is
he, as he sometimes does, going to speak carelessly, saying whatever comes into his mind?". Shafer denotes "truthful" and "careless" as the possible answers to the above question: since he knows from previous experience that Fred's announcements are truthful reports on what he knows 80% of the time and are careless statements the other 20% of the time, he writes

P(truthful) = 0.8 , P(careless) = 0.2 .   (18.9)
If we introduce the event E = the streets outside are slippery and Fred announces that E is true (let us denote by A the latter event, i.e. Fred's announcement), the usual belief function argument gives

Bel(E) = 0.8 , Bel(Ec) = 0 .   (18.10)

In fact T is - in this example - {E, Ec}, while S1 = {truthful, careless} and μ1 is the P given by (18.9); moreover

Γ1(truthful) = E , Γ1(careless) = E ∨ Ec = Ω .

It follows m1(∅) = 0, m1(E) = μ1(truthful) = 0.8, m1(Ec) = 0, m1(Ω) = μ1(careless) = 0.2. Then in [117] the putative merits of (18.10) are discussed with respect to what is called "a Bayesian argument" and to its presumed "inability to model Fred when he is being careless" and "to fit him into a chance picture at all". Subsequently, further evidence about whether the streets are slippery is considered, that is the event T = a thermometer shows a temperature of 31°F. It is known that streets are not slippery at this temperature, and there is a 99% chance that the thermometer is working properly;
moreover, Fred's behavior is independent of whether it is working properly or not. In this case we have S2 = {working, not working}, while μ2(working) = 0.99 and μ2(not working) = 0.01; moreover Γ2(working) = Ec, Γ2(not working) = E ∨ Ec = Ω. It follows m2(Ω) = μ2(not working) = 0.01, m2(Ec) = μ2(working) = 0.99. Then a belief function is obtained through the procedure of "combination of independent items of evidence" (Dempster's sum), getting a result which should reflect the fact that more trust is put in the thermometer than in Fred, i.e.

Bel(E) = 0.0384 ≈ 0.04 , Bel(Ec) = 0.9519 ≈ 0.95 .   (18.11)
In fact

Γ(truthful ∧ working) = ∅ , Γ(truthful ∧ not working) = E ,
Γ(careless ∧ working) = Ec , Γ(careless ∧ not working) = E ∨ Ec ,

so that μ(E) = 0.008, μ(Ec) = 0.198, μ(Ω) = 0.002, and

Bel(E) = 0.008 / (0.008 + 0.198 + 0.002) = 0.008/0.208 = 0.0384 ,

Bel(Ec) = 0.198/0.208 = 0.9519 .
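Dempster's sum as recalled above takes only a few lines to implement; the sketch below (Python; encoding the elements of A as frozensets is our choice) reproduces the values (18.11):

    from itertools import product

    # Basic assignments on the algebra over T = {E, Ec}; empty set gets 0.
    E, Ec = frozenset({"E"}), frozenset({"Ec"})
    OMEGA = E | Ec
    m1 = {E: 0.8, OMEGA: 0.2}        # Fred, from (18.9)
    m2 = {Ec: 0.99, OMEGA: 0.01}     # thermometer

    def dempster_sum(m1, m2):
        raw = {}
        for (B, x), (C, y) in product(m1.items(), m2.items()):
            A = B & C
            raw[A] = raw.get(A, 0.0) + x * y
        k = sum(v for A, v in raw.items() if A)   # mass on non-empty sets
        return {A: v / k for A, v in raw.items() if A}

    def bel(m, A):
        return sum(v for B, v in m.items() if B <= A)

    m = dempster_sum(m1, m2)
    print(round(bel(m, E), 3), round(bel(m, Ec), 3))   # 0.038 0.952, cf. (18.11)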
Finally, other computations follow concerning the case of the so-called "dependent evidence".
Our probabilistic solution of the above example is very simple and fits and encompasses the solution obtained via the belief function approach. First of all, we challenge the possibility of defining the probabilities in (18.9), since "truthful" and "careless" cannot be considered events: in fact, their truth or falsity cannot be directly verified, while we can instead ascertain, recalling that A = Fred announces that streets outside are slippery, whether the conditional events
E|A and Ec|A are true or false (assuming A); moreover, the equalities in (18.9) must be replaced by inequalities, since E may be true also in the case that Fred's announcement is a careless statement. So we have

P(E|A) ≥ 0.8 , P(Ec|A) ≤ 0.2 .   (18.12)
The belief values (18.10) can be seen, referring to the conditional events considered in (18.12), as lower and upper conditional probability assessments; on the other hand, as far as the belief values (18.11) are concerned, in a probabilistic context we are actually interested in coherently assessing, for example, the lower conditional probability of E|A ∧ T and Ec|A ∧ T, which should be consistent with the lower conditional probability assigned to E|A, Ec|A, E|T, and Ec|T. Notice that, since there is a 99% chance that the thermometer is working properly, we have P(E|T) ≤ 1/100. Actually, by a simple application of the general theorem concerning lower conditional probabilities (Theorem 9, Chapter 15), we can prove that

P(E|A) = 0.8 , P(Ec|A) = 0 , P(E|T) = 0 , P(Ec|T) = 0.99 ,
P(E|A ∧ T) = 0.04 , P(Ec|A ∧ T) = 0.95

is a coherent lower conditional probability assessment. We need to write down six systems, one for each conditional event; the unknowns are the probabilities of the atoms (contained in A ∨ T)

A1 = E ∧ T ∧ Ac , A2 = E ∧ T ∧ A , A3 = E ∧ Tc ∧ A ,
A4 = Ec ∧ T ∧ Ac , A5 = Ec ∧ T ∧ A , A6 = Ec ∧ Tc ∧ A .
Consider the system (S1) referring to the conditional event E|A:

x2 + x3 = (8/10)(x2 + x3 + x5 + x6)
x5 + x6 ≥ 0 · (x2 + x3 + x5 + x6)
x1 + x2 ≥ 0 · (x1 + x2 + x4 + x5)
x4 + x5 ≥ (99/100)(x1 + x2 + x4 + x5)
x2 ≥ (4/100)(x2 + x5)
x5 ≥ (95/100)(x2 + x5)
x1 + x2 + x3 + x4 + x5 + x6 = 1
xi ≥ 0 ;

a solution is

x1 = x2 = x5 = 0 , x3 = 2/3 , x4 = x6 = 1/6 ;
so, since all the atoms contained in the conditioning event A ∧ T have zero probability, we need to consider, for the conditional event E|A, the second system (S'1)

x'2 ≥ (4/100)(x'2 + x'5)
x'5 ≥ (95/100)(x'2 + x'5)
x'2 + x'5 = 1
x'i ≥ 0 ,

which has, e.g., the solution x'2 = 1/25 , x'5 = 24/25.
Solutions of the systems (Si), i = 2, ..., 6, relative to the other five conditional events are the following: one of (S2), corresponding to Ec|A, is, e.g.,

y1 = y2 = y5 = y6 = 0 , y3 = y4 = 1/2
(and the second system (S'2) has the solution y'2 = 1/25 , y'5 = 24/25). A solution of the system (S3) corresponding to E|T is, e.g.,

z1 = z2 = z5 = 0 , z3 = 1/2 , z4 = 3/8 , z6 = 1/8
(and a solution z'2, z'5 of the second system (S'3) is again the same as above); one corresponding to Ec|T is, e.g.,

t2 = t5 = t6 = 0 , t1 = 1/200 , t3 = 1/2 , t4 = 99/200

(and the second system has the solution t'2 = 1/25 , t'5 = 24/25); one corresponding to E|A ∧ T is, e.g.,

u1 = u2 = u5 = 0 , u3 = 2/3 , u4 = u6 = 1/6

(and the second system has again the solution u'2 = 1/25 , u'5 = 24/25). Finally, a solution of the system (S6) corresponding to Ec|A ∧ T is, e.g.,

v1 = v2 = v5 = 0 , v3 = 1/2 , v4 = 2/5 , v6 = 1/10

(while the second system has the solution v'2 = 1/20 , v'5 = 19/20). For the sake of brevity, we did not write down explicitly all the relevant systems (except the one corresponding to the first conditional event). In conclusion, not only do the chosen values of P(E|A), P(Ec|A), P(E|T), P(Ec|T), P(E|A ∧ T), P(Ec|A ∧ T) constitute a coherent lower conditional probability assessment, but, since the above systems clearly have many other solutions, we might find other coherent evaluations.
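The feasibility of a system like (S1) can also be delegated to a linear-programming solver. A sketch (Python, assuming numpy and scipy are available; rows of the form "... ≥ 0" are omitted, being trivially satisfied):

    import numpy as np
    from scipy.optimize import linprog

    # System (S1) for E|A with atoms x1..x6 as listed above;
    # each ">= r*(...)" row is rewritten as "r*(...) - lhs <= 0".
    A_eq = np.array([
        [0, 0.2, 0.2, 0, -0.8, -0.8],   # x2+x3 = (8/10)(x2+x3+x5+x6)
        [1, 1, 1, 1, 1, 1],             # total mass 1
    ])
    b_eq = np.array([0.0, 1.0])
    A_ub = np.array([
        [0.99, 0.99, 0, -0.01, -0.01, 0],  # x4+x5 >= (99/100)(x1+x2+x4+x5)
        [0, -0.96, 0, 0, 0.04, 0],         # x2 >= (4/100)(x2+x5)
        [0, 0.95, 0, 0, -0.05, 0],         # x5 >= (95/100)(x2+x5)
    ])
    b_ub = np.zeros(3)

    res = linprog(c=np.zeros(6), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6)
    assert res.status == 0   # feasible: the assessment is not contradicted
    # e.g. x3 = 2/3, x4 = x6 = 1/6 (as in the text) is one feasible point.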
Chapter 19

Fuzzy Sets and Possibility as Coherent Conditional Probabilities

Our aim is to expound an interpretation (introduced in [30] and [34]) of fuzzy set theory (both from a semantic and a syntactic point of view) in terms of conditional events and coherent conditional probabilities: a complete account is in [38]. During past years, a large number of papers have been devoted to supporting either the thesis that probability theory is all that is required for reasoning about uncertainty, or the negative view maintaining that probability is inadequate to capture what is usually treated by fuzzy theory. In this Chapter we emphasize the role of coherent conditional probabilities to get rid of many controversial aspects. Moreover, we introduce the operations between fuzzy subsets, looked on as corresponding operations between conditional events endowed with the relevant conditional probability. Finally, we show how the concept of possibility function naturally arises as a coherent conditional probability.
19.1 Fuzzy sets: main definitions
The concept of fuzzy subset goes back to the pioneering work of Zadeh [128]. On this subject there is a vast literature (for an elementary exposition, see [96]; another relevant reference is [86]); so we recall here only the main definitions.
Given a set (universe) Y, any of its ("crisp") subsets A is singled out either by a "well-defined" property or by its characteristic function cA : Y → {0, 1}, with cA(x) = 1 for x ∈ A and cA(x) = 0 for x ∉ A. A fuzzy subset B of Y is defined through a membership function

μB : Y → [0, 1] ,

that is, a function that gives any element x ∈ Y a "measure of its belonging" to B: in particular, the values μB(x) = 0 and μB(x) = 1 correspond, respectively, to x ∉ B and x ∈ B in the sense of crisp sets. So the role of a membership function is that of interpreting (not uniquely) a property not representable by a (Boolean) proposition.

Example 39 - Let Y = ℝ, and consider the two statements A = "x is greater than or equal to 3" and B = "x is about 10". Clearly, A is a crisp set, singled out by its characteristic function, while the fuzzy subset B can be represented by many different membership functions, according to the different subjective numerical interpretations of the property B.

Remark 25 - Even if it is true - from a syntactic point of view - that membership functions, in a sense, generalize characteristic functions, allowing infinitely many values in [0, 1] and not only a two-valued range {0, 1}, nevertheless there is a strong qualitative jump from an "objective world" to another one in which
a semantic (and "subjective"!) component plays a fundamental role. Now, before introducing the operations between fuzzy subsets, let us recall that for crisp sets the operations U (union), n (intersection), (f (complement) can be defined through characteristic functions by putting
so it appears at all natural to define similarly the analogous operations for fuzzy subsets using membership functions in place of characteristic functions, that is
The first and most significant difference with respect to crisp sets is that the previous definitions entail
while for characteristic functions the same operation gives (obviously) the function identically equal to 1. A further generalization for defining composition rules between μA and μB in order to get μA∪B and μA∩B is that of introducing suitable binary operations ("triangular" norms, in short T-norms) from [0, 1]² to [0, 1] endowed with significant properties similar to those of max and min.

Definition 16 - A T-norm is a function T : [0, 1]² → [0, 1] satisfying the following properties:
(1) a T b = b T a (symmetry)
(2) (a T b) T c = a T (b T c) (associativity)
(3) a ≤ x, b ≤ y ⇒ a T b ≤ x T y (monotonicity)
(4) a T 1 = a (1 is neutral element)
Examples of T-norms widely used in the relevant literature are the following:

TM (the minimum): x TM y = min{x, y} ,
TP (the product): x TP y = xy ,
TL (Lukasiewicz): x TL y = max{x + y − 1, 0} ,
T0 (the weakest): x T0 y = min{x, y} if max{x, y} = 1, and 0 otherwise.

The T-norm T0 is the minimum and the T-norm TM is the maximum in the pointwise ordering (even if the class of T-norms is not linearly ordered by this pointwise relation, since some of them are not comparable). The notion of T-norm plays the role of the intersection, by defining μA∩B = μA T μB.
The role of the union is played by the concept of T-conorm:

Definition 17 - A T-conorm is a function S : [0, 1]² → [0, 1] satisfying properties (1), (2), (3) of Definition 16, with S in place of T, and
(4) a S 0 = a (0 is neutral element).

Then we define μA∪B = μA S μB.
BM
(the maximum):
x BM y
Bp
(probabilistic sum):
BL
(Lukasiewicz):
= max{x, y},
x Bp y = x
x BL y
+ y- xy,
= min{x + y, 1}.
We recall now a generalization of the concept of complement (or negation), given by the following
Definition 18 - A strong negation is a map η : [0, 1] → [0, 1] satisfying the following properties:
(1) η(0) = 1, η(1) = 0,
(2) η is decreasing,
(3) η(η(x)) = x.

Finally, we recall the notion of dual T-norm and T-conorm: T and S are called dual when

x T y = η(η(x) S η(y)),

or vice versa (exchanging T and S). Taking η(x) = 1 − x, the pairs (T_X, S_X), with X equal, respectively, to M, P, L, are pairs of dual T-norm and T-conorm.
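As a quick numerical illustration (again a sketch of ours), the duality relation x T y = η(η(x) S η(y)) with η(x) = 1 − x can be spot-checked for the three pairs (T_M, S_M), (T_P, S_P), (T_L, S_L):

```python
# Sketch (ours): spot-check the duality x T y = eta(eta(x) S eta(y)),
# with eta(x) = 1 - x, for the pairs (T_M, S_M), (T_P, S_P), (T_L, S_L).
eta = lambda x: 1.0 - x

pairs = [
    (lambda x, y: min(x, y),             lambda x, y: max(x, y)),        # M
    (lambda x, y: x * y,                 lambda x, y: x + y - x * y),    # P
    (lambda x, y: max(x + y - 1.0, 0.0), lambda x, y: min(x + y, 1.0)),  # L
]

grid = [i / 10 for i in range(11)]
for T, S in pairs:
    for x in grid:
        for y in grid:
            assert abs(T(x, y) - eta(S(eta(x), eta(y)))) < 1e-9
```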
19.2 Fuzziness and uncertainty
In the literature on fuzzy sets, the suitability of interpreting a statement such as E = "Mary is young" as an event, and the values of the membership function corresponding to the relevant fuzzy set as probabilities, is usually challenged. In fact E is a vague statement, and vagueness is looked on as referring to the intended meaning (i.e. a sort of "linguistic" uncertainty) and not as an uncertainty about facts. The arguments usually brought forward to distinguish grades of membership from probabilities often refer to a restrictive interpretation of event and probability, while the probabilistic approach adopted in this book differs radically from the usual theory based on a measure-theoretic framework, which assumes that a unique probability measure is defined on an algebra (or σ-algebra) of events constituting the so-called sample space Ω. It has been widely discussed that directing attention to events as subsets of the sample space (and to algebras of events) may be unsuitable for many real world situations, which make instead
very significant both giving events a more general meaning and not assuming any specific structure for the set where probability is assessed. Another usual argument against any kind of probabilistic interpretation of fuzzy theory is based on the (putative) non-compositional character of probability. Apart from the fact that in Chapter 9 (with a "relaxed" interpretation of the concept of truth-functional belief) we challenged this view (at least with respect to our approach, based on coherent probability), we underline anyway that our definition of membership function in probabilistic terms will refer to a suitable conditional probability, looked on as a function of the conditioning event, and the relevant operations (which will correspond in a very natural way to the basic T-norms and T-conorms, bound by coherence) turn out to be truth-functional in the strict sense. In fact, in our view an essential role is played by conditioning, a concept that is not always sufficiently and properly emphasized, even in those articles (we mention here just Cheeseman [20], Giles [68], Hisdal [83], Dubois, Moral and Prade [57]) based on ideas somehow similar to those expressed here (they refer to terms such as label, context, information, state of mind, likelihood, ...): in fact a clear and precise mathematical frame is often lacking. On the other hand, our approach cannot be compared to those that deal with fuzzy reasoning versus traditional probabilistic reasoning without referring to conditioning: in fact the very concept of conditional probability is deeper than the usual restrictive view emphasizing P(E|H) only as a probability for each given H (looked on as a given fact). Regarding instead also the conditioning event H as a "variable", we get something which is not just a probability: see the (often mentioned) discussion in Chapter 11. We can refer to an event H corresponding to the "crisp part" of a fuzzy property; in this way a conditional event E|H can be seen also as a three-valued logical entity, which reduces to a "crisp"
event when H is true. So the "fuzziness" is driven by suitably interpreting the situation corresponding to the case when it is not known whether H is true. The role of a conditioning event is that of setting out clearly (and in a rigorous way) the pragmatic view that "everything" depends on the relevant state of information, so overcoming loose concepts such as "label", "context", etc. Let us go back to the intuitive idea of fuzzy subset: where does it come from and what is its "operational" meaning? We start by recalling two examples; the first is a classical one and has already been discussed (mainly from a semantic point of view) in [115] and [30], while the next Example 41 has been the starting point (see [114]) for the interpretation of fuzzy sets as presented in this book.

Example 40 - Is Mary young? From a pragmatic point of view, it is natural to think that You have some information about possible values of Mary's age, which allows You to refer to a suitable membership function of the fuzzy subset of "young people" (or, equivalently, of "young ages"). For example, for You the membership function may be put equal to 1 for values of the age less than 25, while it is put equal to 0 for values greater than 40; then it is taken as decreasing from 1 to 0 in the interval from 25 to 40. One of the merits of the fuzzy approach is that, given the range of values from 0 to 1, there is no restriction for the assignment of a membership function, in contrast to probability, which obeys certain rules such as, for example, the axiom of additivity: it follows that, when You assign a subjective probability of (say) 0.2 to the statement that Mary's age is between 35 and 36, You inescapably must assign a degree of belief of 0.8 to the contrary, and You may not have for the latter fact any justification apart from the consistency argument represented by the additivity rule. In our probabilistic framework the way out is indeed (through conditioning) very simple. Notice that the above choice of the membership function implies that, for You, women whose age is less than
25 are "young" , while those with an age greater than 40 are not. So the real problem is that You are uncertain on being or not "young" those women having an age between 25 and 40: then the interest is in fact directed toward conditional events such as EIAx, with E = You claim that Mary is young, Ax = the age of Mary is x, where x ranges over the interval from 25 to 40. It follows that You may assign a subjective probability P(EIAx) equal to 0.2 without any need to assign a degree of belief of 0.8 to the event E under the assumption A~ {i.e., the age of M ary is not x}, since an additivity rule with respect to the conditioning events does not hold. In other words, it seems sensible to identify the values of the membership function with suitable conditional probabilities: in particular, putting Ho = Mary's age is greater than 40, H 1 = Mary's age is less than 25, then we may assume that E and Ho are incompatible and that H 1 implies E, so that, by the properties of a conditional probability, P(EIHo) = 0 and P(EIHI) = 1. Notice that the conditional probability P(EIAx) has been directly introduced as a function on the set of conditional events (without assuming any given algebraic structure), bound to satisfy only the requirement of coherence, so that it can be assessed and makes sense for any pair of events. Now, given the event E, the value P(EIAx) is then a function J.L(x) of x, that could be taken as membership function. In the usual (Kolmorogovian) approach to conditional probability, the introduction of P(EIAx) would require the consideration (and the assessment) of P(E 1\ Ax) and P(Ax) (assuming positivity of the latter), that is a very difficult task!
Remark 26 - Putting H = H_0 ∨ H_1, the conditional probability P(E|H^c) is a measure of how much You are willing to claim or not that Mary is young, if the only fact You know is that her age is between 25 and 40. And this willingness is "independent" of Your beliefs corresponding to the single ages x: in fact, even if the falsity of H corresponds to the truth of the disjunction of the events A_x with x between 25 and 40, nevertheless there is no additivity requirement, since conditional probability (as already noticed a few lines above) is not additive with respect to the disjunction of conditioning events. These remarks will pave the way for the introduction in our context (in Section 4 of this Chapter) of possibility functions.
Example 41 - This example, taken from [97], concerns the long term safety assessment of a radioactive waste repository in salt. After the disposal of waste has been finished, "almost impermeable" dams are built at strategic positions within an underground gallery system in order to prevent the transport of fluid possibly intruding at later times. The problem is to predict the future development of the permeability of these dams for time periods of hundreds or thousands of years. Available information about possible values of the dams' permeability is used to construct a subjective membership function of a fuzzy set (of "almost impermeable" dams): for values of the permeability between 10^{-21} and 10^{-17} the membership function is put equal to 1, while it is put equal to 0 for values greater than 10^{-15}; finally, the membership function is decreasing from 1 to 0 in the interval from 10^{-17} to 10^{-15}. The motivation given by the authors rests on the usual argument that, given the range of values from 0 to 1, there is no restriction for the assignment of a membership function, in contrast to probability: in fact, as soon as You assign a probability of (say) 0.4 to the statement that in the future the permeability of the dam will be between 10^{-17} and 10^{-16}, You must assign a degree of belief of 0.6 to the contrary.
The way out from this putative difficulty is (again, as in Example 40) very simple, since the above choice of the membership function implies that dams whose permeability is less than 10^{-17} are "almost impermeable", while those with a permeability greater than 10^{-15} are not. So the real problem is that You are uncertain whether those dams having a permeability between 10^{-17} and 10^{-15} are "almost impermeable" or not: then the interest is in fact directed toward the conditional event E|H, with

E = You claim that the dam is "almost impermeable",
H = the permeability of the dam is between 10^{-17} and 10^{-15}.

It follows that You may assign a subjective probability P(E|H) equal to (say) 0.25 without any need to assign a degree of belief of 0.75 to the event E under the assumption H^c (i.e., the permeability of the dam is not between 10^{-17} and 10^{-15}). In [114] it is shown that also a second argument brought forward in [97] to contrast probabilistic methods with the fuzzy approach can be overcome: it concerns the merits of the rules according to which the possibility of an object belonging to two fuzzy sets is obtained as the minimum of the possibilities that it belongs to either fuzzy set. The issue is the computation of the probability that the value of a safety parameter belongs to a given (dangerous) interval for all four components (grouped according to the similarity of their physicochemical conditions) of the repository section. For each component this probability is computed as equal to 1/5, and the conclusion is that "in terms of a safety assessment, the fuzzy calculus is more conservative", since in the fuzzy calculus (interpreting those values as values of a membership function) the possibility of a value of the parameter in the given interval for all components is still 1/5 (which is the minimum taken over numbers all equal to 1/5), while the same event is given (under the assumption of independence) the small probability (1/5)^4. Anyway, we do not report here the way out to this problem suggested in [114], since its general (and rigorous) solution is a trivial consequence of the formal definitions and operations between fuzzy subsets (also in the form we are going to define in the next Section).
19.3 Fuzzy subsets and coherent conditional probability
Before undertaking the task of introducing (from the point of view of our framework) the definitions concerning fuzzy set theory, we need to deepen some further aspects of coherent conditional probabilities. First of all, among the peculiarities (which entail a large flexibility in the management of any kind of uncertainty) of the concept of coherent conditional probability versus the usual one, we recall the interpretation of the extreme values 0 and 1 of P(A|B) for situations which are different, respectively, from the trivial ones A ∧ B = ∅ and B ⊆ A; moreover, we underline the "natural" way of looking at the conditional event A|B "as a whole", and not separately at the two events A and B. Nevertheless, notice the following corollary to Theorem 5 (Chapter 11).
Theorem 20 - Let C be a family of conditional events {E|H_i}_{i∈I}, where card(I) is arbitrary and the events H_i's are a partition of Ω, and let P(·|·) be a coherent conditional probability such that P(E|H_i) ∈ {0, 1}. Then the following two statements are equivalent:
(i) P(·|·) is the only coherent assessment on C;
(ii) H_i ∧ E = ∅ for every H_i ∈ H_0 and H_i ⊆ E for every H_i ∈ H_1, where H_r = {H_i : P(E|H_i) = r}, r = 0, 1.
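Condition (ii) of Theorem 20 is purely logical, so it lends itself to a direct mechanical check; a small sketch (ours; events coded as sets of atoms, and the toy data are made up):

```python
# Sketch (ours): condition (ii) of Theorem 20, with events coded as
# frozensets of atoms and assessment[i] = P(E|H_i) in {0, 1}.
def is_unique_assessment(E, partition, assessment):
    for H, v in zip(partition, assessment):
        if v == 0 and H & E:          # requires H_i AND E = empty
            return False
        if v == 1 and not H <= E:     # requires H_i contained in E
            return False
    return True

partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
print(is_unique_assessment(frozenset({1, 2, 3, 4}), partition, [1, 1, 0]))  # True
print(is_unique_assessment(frozenset({1, 2, 3}),    partition, [1, 1, 0]))  # False
```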
We are ready now to re-read fuzzy set theory by resorting to our framework. Let X be a (not necessarily numerical) random quantity with range C_X, and, for any x ∈ C_X, let A_x be the event {X = x}. The family {A_x}_{x∈C_X} is obviously a partition of the certain event Ω. If φ is any property related to the random quantity X, consider the event

E_φ = You claim φ,

and a coherent conditional probability P(E_φ|A_x), looked on as a real function defined on C_X. Since the events A_x are incompatible, then (by Theorem 5) every μ_{E_φ}(x) with values in [0, 1] is a coherent conditional probability.
Remark 27 - Given x_1, x_2 ∈ C_X and the corresponding conditional probabilities μ_{E_φ}(x_1) and μ_{E_φ}(x_2), a coherent extension of P to the conditional event E_φ|(A_{x_1} ∨ A_{x_2}) is not necessarily additive with respect to the conditioning events.

Definition 19 - Given a random quantity X with range C_X and a related property φ, a fuzzy subset E*_φ of C_X is the pair

E*_φ = {E_φ, μ_{E_φ}},

with μ_{E_φ}(x) = P(E_φ|A_x) for every x ∈ C_X.
So a coherent conditional probability P(E_φ|A_x) is a measure of how much You, given the event A_x = {X = x}, are willing to claim or not the property φ, and it plays the role of the membership function of the fuzzy subset E*_φ. Notice also that (as already remarked above) the significance of the conditional event E_φ|A_x is reinforced by looking on it "as a whole", avoiding a separate consideration of the two propositions E_φ and A_x.
Remark 28 - It is important to stress that our interpretation of the membership function as a conditional probability P(E_φ|A_x) has little to do both with the "frequentist's temptation" discussed in the Introduction of the book [80] by Hajek, and with the usual distinction made between uncertainty (which could be reduced by "learning") and vagueness (which is, in a sense, unrevisable: there is nothing to be learned about whether, e.g., a 35-year-old person is old or not). In fact, when we put, for example, P(E_φ|A_x) = 0.70, this means that the degree of membership of the element x in the fuzzy subset defined by the property φ is identified with the conditional probability (relative to the conditional event E_φ|A_x seen as a whole) that You claim φ. Then, concerning the first "frequentist" issue, we are not willing, e.g., to ask n people (knowing her age) "Is Mary young?" and to allow them to answer "yes" or "no", imagining that 70% of them say "yes"; in fact "Mary is young" is not an event, while "You claim that Mary is young" (notice that its negation is not "You claim that Mary is not young") is an event. Rather we could ask: "Knowing that Mary's age is x, how much are You willing to claim that Mary is young?". As far as the second issue ("learning") is concerned, notice that in probability theory "learning" is usually ruled by conditional probability, so "learning" about the age x would require conditioning with respect to x itself: in our definition, instead, x (that is, A_x) is the conditioning event. Moreover, the assignment 0.70 is given - so to say - once and for all, so that the "updating" (of E_φ!) has already been done at the very moment (knowing x) the conditional event E_φ|A_x has been taken into account together with the relevant evaluation P(E_φ|A_x); then it makes no sense, with respect to the event E_φ, to learn "more" (i.e., to consider a further conditioning).
Example 42 - Let X = gain in a financial transaction, with range ℝ, and let φ = very convenient. The corresponding fuzzy subset of C_X = ℝ is the set of those gains (in the financial transaction) that You judge to be very convenient, according to the membership function P(E_φ|A_x), with x ∈ ℝ, assigned by You.

Definition 20 - A fuzzy subset E*_φ is a crisp set when the only coherent assessment μ_{E_φ}(x) = P(E_φ|A_x) has range {0, 1}.

By Theorem 20, the following is obvious.
Proposition 5 - A fuzzy subset E*_φ is a crisp set when the property φ is such that, for every x ∈ C_X, either E_φ ∧ A_x = ∅ or A_x ⊆ E_φ.

Example 43 - Consider the same X of the previous example and the property ψ = between 10 and 30 million. In this case, clearly, P(E_ψ|A_x) (necessarily) assumes only values in {0, 1}, and in fact the subset singled out by ψ is a crisp one.

Given two fuzzy subsets E*_φ, E*_ψ, corresponding to the random quantities X and Y (possibly X = Y), assume that, for every x ∈ C_X and y ∈ C_Y, both the following equalities hold:
P(E_φ|A_x ∧ A_y) = P(E_φ|A_x),   (19.1)

P(E_ψ|A_x ∧ A_y) = P(E_ψ|A_y),   (19.2)
with A_y = {Y = y}. These conditions (which are trivially satisfied for X = Y, which entails x = y: a conditioning event cannot be equal to ∅) are quite natural, since they require, for X ≠ Y, that an event E_φ related to a fuzzy subset E*_φ of C_X is stochastically independent (conditionally on any element of the partition {A_x}_{x∈C_X}) of every element of the partition {A_y}_{y∈C_Y} relative to a fuzzy subset E*_ψ of C_Y. We introduce now a general definition of the binary operations of union and intersection and that of complementation.
Definition 21 - Given two fuzzy subsets E*_φ and E*_ψ (respectively, of C_X and C_Y), define

E*_φ ∪ E*_ψ = {E_φ ∨ E_ψ, μ_{E_φ∨E_ψ}},
E*_φ ∩ E*_ψ = {E_φ ∧ E_ψ, μ_{E_φ∧E_ψ}},
(E*_φ)' = {E_¬φ, μ_{E_¬φ}},

where the functions

μ_{E_φ∨E_ψ}(x, y) = P(E_φ ∨ E_ψ|A_x ∧ A_y),
μ_{E_φ∧E_ψ}(x, y) = P(E_φ ∧ E_ψ|A_x ∧ A_y)

have domain
C_{XY} = C_X × C_Y.

Remark 29 - Notice the following implication:

E_¬φ ⊆ (E_φ)^c,

where (E_φ)^c denotes the contrary of the event E_φ (and the equality holds only for a crisp set); for example, the proposition "You claim not young" implies "You do not claim young", but not conversely. Then, while E_φ ∨ (E_φ)^c = C_X, we have instead E_φ ∨ E_¬φ ⊆ C_X. Therefore, if we consider the union of a fuzzy subset and its complement,

E*_φ ∪ (E*_φ)' = {E_φ ∨ E_¬φ, μ_{E_φ∨E_¬φ}},

we do not in general obtain the whole C_X (the "law of excluded middle" does not hold for fuzzy subsets).
On the other hand, it is easy to check that the complement of a crisp set is also a crisp set: in fact, from E_φ ∧ A_x = ∅ it follows A_x ⊆ (E_φ)^c = E_¬φ, and from A_x ⊆ E_φ it follows (E_φ)^c ∧ A_x = ∅, that is E_¬φ ∧ A_x = ∅.
Remark 30 - In the above definitions, the "set-theoretic" component belongs to the domain of what is objective. On the other hand, the function μ (that is, the core of a fuzzy subset) is the "probabilistic" part (cf. Remark 25) and represents only a formal assignment, since the rules of a coherent conditional probability do not single out a unique value for μ_{E_φ∧E_ψ}(x, y) and μ_{E_φ∨E_ψ}(x, y), leaving the choice (as we shall see) in a "large" range of possible values (even if coherence will not allow an independent choice of the two values).
Consider now two fuzzy subsets E*_φ and E*_ψ: the rules of conditional probability give, taking into account (19.1) and (19.2),

P(E_φ ∨ E_ψ|A_x ∧ A_y) = P(E_φ|A_x) + P(E_ψ|A_y) − P(E_φ ∧ E_ψ|A_x ∧ A_y).   (19.3)
Therefore, to evaluate P(E_φ ∨ E_ψ|A_x ∧ A_y) it is necessary (and sufficient) to know also the value p of the conditional probability P(E_φ ∧ E_ψ|A_x ∧ A_y), and vice versa. By resorting to Theorem 4 (characterizing coherent conditional probability assessments) and to the relevant linear systems, it is not difficult to prove (see also Chapter 9) that the only constraints for this value are the following:

max{P(E_φ|A_x) + P(E_ψ|A_y) − 1, 0} ≤ p ≤ min{P(E_φ|A_x), P(E_ψ|A_y)}.   (19.4)

Let us now discuss three possible choices for the value of this conditional probability p:
(a) give p the maximum possible value, that is

p = min{P(E_φ|A_x), P(E_ψ|A_y)};

then in this case we necessarily obtain, by (19.3), that

P(E_φ ∨ E_ψ|A_x ∧ A_y) = max{P(E_φ|A_x), P(E_ψ|A_y)}.

This assignment corresponds to the choice of T_M and S_M as T-norm and T-conorm.

(b) give p the minimum possible value, that is

p = max{P(E_φ|A_x) + P(E_ψ|A_y) − 1, 0},

i.e. the Lukasiewicz T-norm. In this case we necessarily obtain, again by (19.3), that

P(E_φ ∨ E_ψ|A_x ∧ A_y) = min{P(E_φ|A_x) + P(E_ψ|A_y), 1},

i.e. the Lukasiewicz T-conorm.

(c) give p the value

p = P(E_φ|A_x) · P(E_ψ|A_y),

that is, assume that E_φ is stochastically independent of E_ψ given A_x ∧ A_y. In this case we necessarily obtain

P(E_φ ∨ E_ψ|A_x ∧ A_y) = P(E_φ|A_x) + P(E_ψ|A_y) − P(E_φ|A_x)P(E_ψ|A_y),

i.e. the probabilistic sum S_P and the product T_P.

Notice that any combination of the above choices is coherent. On the other hand, if we consider the weakest T-norm T_0, we can prove, again by Theorem 4 or directly by (19.4), that the choice of p agreeing with T_0 is not coherent.
19.4 Possibility functions and coherent conditional probability
We now briefly recall the connections between fuzzy theory and possibility functions (also called possibility measures).
Definition 22 - Given a Boolean algebra A, a possibility (measure) is a function Π : A → ℝ such that Π(Ω) = 1, Π(∅) = 0 and, for every A, B ∈ A,

Π(A ∨ B) = max{Π(A), Π(B)}.

The restriction of Π to the atoms of A is called a possibility distribution. Clearly, there are essentially no constraints for a possibility distribution, except in the case of a finite A, in which necessarily Π(A_r) = 1 for some atom A_r. Given A ∈ A, if the cardinality of the set of atoms A_i ⊆ A is finite, then

Π(A) = max_{A_i ⊆ A} Π(A_i).
So, if the algebra A is finite, the knowledge of the possibility distribution Π is enough to determine the whole possibility measure on A. On the other hand, if A is an infinite disjunction of atoms, it is not necessarily true that the possibility measure is the supremum of the Π(A_i). Notice that every membership function can be regarded as a possibility distribution. If Ω is the relevant universe and A an algebra of subsets of Ω, the ensuing possibility measure can be interpreted in the following way: it is a sort of "global" membership (relative to each finite element A of A) which takes (among all the possible choices as its values on A) the maximum of the membership in A.
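For a finite algebra, the passage from a possibility distribution to the induced possibility measure is a simple maximum; a minimal sketch (ours; atoms and values are made up for illustration):

```python
# Sketch (ours): a possibility measure on a finite algebra from a
# possibility distribution pi on the atoms (events as frozensets of atoms).
def possibility(event, pi):
    """Pi(A) = max of pi over the atoms contained in A; Pi(empty) = 0."""
    return max((pi[atom] for atom in event), default=0.0)

pi = {"a1": 0.3, "a2": 1.0, "a3": 0.6}    # normalized: some atom has value 1
A, B = frozenset({"a1", "a3"}), frozenset({"a2"})

# Max-decomposability: Pi(A OR B) = max{Pi(A), Pi(B)}
assert possibility(A | B, pi) == max(possibility(A, pi), possibility(B, pi))
print(possibility(A, pi), possibility(B, pi), possibility(A | B, pi))  # 0.6 1.0 1.0
```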
Let us now examine some relevant aspects of possibility theory in the framework of our approach. Let X be a random quantity with range C_X, so that in this context the certain event can be expressed as Ω = {X ∈ C_X}. The following definition introduces (autonomously) a possibility distribution as a suitable coherent conditional probability:

Definition 23 - Let E be an arbitrary event and P any coherent conditional probability on the family G = {E} × {A_x}_{x∈C_X}, admitting P(E|Ω) = 1 as a (coherent) extension. A possibility distribution on C_X is the real function π defined by π(x) = P(E|A_x).

Remark 31 - When C_X is finite, since every extension (see Chapter 13) of P(E|·) must satisfy axioms (i), (ii) and (iii) of a conditional probability (Chapter 11), we necessarily have

1 = P(E|Ω) = Σ_{x∈C_X} P(A_x|Ω) P(E|A_x),   with   Σ_{x∈C_X} P(A_x|Ω) = 1.

It follows that 1 = P(E|Ω) ≤ max_{x∈C_X} P(E|A_x).
On the other hand, we notice that in our framework (where null probabilities for possible conditioning events are allowed) it does not necessarily follow that P(E|A_x) = 1 for every x; in fact we may well have P(E|A_y) = 0 (or equal to any other number between 0 and 1) for some y ∈ C_X. Obviously, the constraint P(E|A_x) = 1 for some x is not necessary when the cardinality of C_X is infinite. Now, taking into account Theorem 5 (Chapter 11) and Remark 31, we are entitled to claim that a possibility distribution on a set Ω is in fact nothing else than any function π : H → [0, 1], where H is a partition of Ω, such that, when H is finite, there exists an element x ∈ H with π(x) = 1. The following theorem is the main tool to introduce possibility measures in our context referring to coherent conditional probabilities.
Theorem 21 - Let E be an arbitrary event and C be a family of conditional events {E|H_i}_{i∈I}, where card(I) is arbitrary and the events H_i's are a partition of Ω. Denote by H the algebra spanned by the H_i's and let p : C → [0, 1] be any function such that (11.11) holds (with E_i = E for every i ∈ I). Then any P extending p on K = {E} × H^0 and such that

P(E|H ∨ K) = max{P(E|H), P(E|K)}, for H, K ∈ H^0,   (19.5)

is a coherent conditional probability.

Proof - We start by assuming card(I) < ∞. If H_1, ..., H_n are the (exhaustive and mutually exclusive) conditioning events, then any A_r belonging to the set A_0 of atoms generated by the events {E, H_i : E|H_i ∈ C} is of the kind either A'_r = H_r ∧ E or A''_r = H_r ∧ E^c. To prove the theorem in the finite case, it is sufficient to prove that the function P obtained by putting, for every element K_j ≠ ∅ of the algebra H generated by A_0,

P(E|K_j) = max_{H_i ⊆ K_j} p(E|H_i),

is a coherent conditional probability. So, taking into account Theorem 4, we need to prove that there exists a class {P_α} (agreeing with P) satisfying the sequence of systems (S_α): the first system, with unknowns P_0(A_r) ≥ 0, is the following:
(S_0):   Σ_{A'_r ⊆ E∧K_j} P_0(A'_r) = P(E|K_j) [ Σ_{A'_r ⊆ E∧K_j} P_0(A'_r) + Σ_{A''_r ⊆ E^c∧K_j} P_0(A''_r) ],   where K_j ∈ H^0,

Σ_{A_r ⊆ H^0_0} P_0(A_r) = 1,

and H^0_0 = H_1 ∨ ... ∨ H_n = Ω. Put M_0 = max_{H_i ⊆ H^0_0} P(E|H_i). Notice that among the above equations we have (choosing, respectively, either K_j = H_r for r = 1, 2, ..., n, or K_j = H^0_0):

P_0(A'_r) = P(E|H_r)[P_0(A'_r) + P_0(A''_r)],   r = 1, ..., n,

Σ_{A'_r ⊆ E∧H^0_0} P_0(A'_r) = max_{H_i ⊆ H^0_0} P(E|H_i) · Σ^0 = M_0 Σ^0,

with

Σ^0 = Σ_{A'_r ⊆ E∧H^0_0} P_0(A'_r) + Σ_{A''_r ⊆ E^c∧H^0_0} P_0(A''_r).

Moreover, choosing K_j = H_{M_0} = ∨{H_r : P(E|H_r) = M_0}, the corresponding equation is

Σ_{A'_r ⊆ E∧H_{M_0}} P_0(A'_r) = M_0 [ Σ_{A'_r ⊆ E∧H_{M_0}} P_0(A'_r) + Σ_{A''_r ⊆ E^c∧H_{M_0}} P_0(A''_r) ].

Consider, among the first n equations of this subsystem of n + 2 equations, those with P(E|H_r) = M_0: clearly, the last equation is linearly dependent on them. We give the atoms corresponding to these equations nonnegative values summing up to 1, with the only constraints

P_0(A''_r) = P_0(A'_r) (1 − M_0)/M_0.

Then the other equations (those with P(E|H_r) < M_0) can be trivially satisfied (as 0 = 0) giving zero to all the relevant atoms A_r. Finally, it is now easy to check (by suitably grouping the values obtained in the left and right-hand sides of the first n equations of the subsystem) that this solution is also a solution of the system
(S_0). We need now to consider the next system (S_1), with unknowns P_1(A_r), relative to the atoms A_r contained in those K_j such that P_0(K_j) = 0 (for example, those which are disjunctions of H_i's with P(E|H_i) < M_0):

(S_1):   Σ_{A'_r ⊆ E∧K_j} P_1(A'_r) = max_{H_i ⊆ K_j} P(E|H_i) · Σ^1_j,

Σ_{A'_r ⊆ E∧H^1_0} P_1(A'_r) + Σ_{A''_r ⊆ E^c∧H^1_0} P_1(A''_r) = 1,

where

Σ^1_j = Σ_{A'_r ⊆ E∧K_j} P_1(A'_r) + Σ_{A''_r ⊆ E^c∧K_j} P_1(A''_r),

and H^1_0 denotes the disjunction of the H_i's such that P_0(H_i) = 0. Putting M_1 = max_{H_i ⊆ H^1_0} P(E|H_i), we can proceed along the same lines as for system (S_0), so proving the solvability of system (S_1); then we possibly need to consider the system (S_2), and so on. In a finite number of steps we obtain a class {P_α} agreeing with P, and so, by Theorem 4, P is coherent.

Consider now the case card(I) = ∞. Let F = {E|K_1, ..., E|K_n} be any finite subset of K. If A_0 is the (finite) set of atoms F_r spanned by the events {E, K_i : E|K_i ∈ F}, denote by F'_r and F''_r, respectively, the atoms contained in E and in E^c. To prove the coherence of P, it is enough to follow a procedure similar to that used for the finite case, where now the role of the H_i's is played by the events F'_r ∨ F''_r. ∎
Remark 32 - If card(I) is infinite, there exists a coherent extension of the function p as a conditional probability on K = {E} × H^0, satisfying (19.5), even if we put the further constraint P(E|Ω) = 1. (Recall that the ensuing assessment P(E) = 1 by no means implies E = Ω.) On the other hand, when card(I) is finite, this extension is possible only if P(E|H_i) = 1 for some i.
Remark 33 - If card(I) is finite, then for any H ∈ H^0,

P(E|H) = max_{H_i ⊆ H} P(E|H_i).

So the knowledge of the function P on the given partition is enough to determine the whole conditional probability on H^0. On the other hand, in the general case, if the event H is an infinite disjunction of elements of H^0, it is not necessarily true that the conditional probability P(E|H) is the supremum of the P(E|H_i)'s (see the following example).
Example 44 - A mathematician chooses a set A belonging to the family E of finite or cofinite subsets of ℕ, and You should guess a value n ∈ ℕ belonging to A. Let

A = the mathematician chooses A ∈ E,   E = You guess right, n ∈ A.

For A = A_x = {x}, with x ∈ ℕ, and A = ℕ, "natural" assignments are, respectively, P(E|A_x) = 0 and P(E|ℕ) = 1. Clearly, there exists an extension of P such that, for A, B ∈ E,

P(E|A ∨ B) = max{P(E|A), P(E|B)},

for example giving P(E|A) the value 1 if A is cofinite; on the other hand, no extension exists such that

P(E|A) = sup_{A_x ⊆ A} P(E|A_x).
Now we are able to introduce (from our point of view) a "convincing" definition of possibility measure.
Definition 24 - Let H be an algebra of subsets of C_X (the range of a random quantity X) and E an arbitrary event. If P is any coherent conditional probability on K = {E} × H^0, with P(E|Ω) = 1 and such that

P(E|H ∨ K) = max{P(E|H), P(E|K)}, for H, K ∈ H^0,   (19.5)

then a possibility measure on H is a real function Π defined by putting Π(H) = P(E|H) for H ∈ H^0 and Π(∅) = 0.
Remark 34 - Theorem 21 assures (in our context) that any possibility measure can be obtained as a coherent extension (unique, in the finite case) of a possibility distribution. Vice versa, given any possibility measure Π on an algebra H, there exists an event E and a coherent conditional probability P on K = {E} × H^0 agreeing with Π, i.e. whose extension to {E} × H, obtained by giving the value 0 at ∅, coincides with Π. Notice that our approach to possibility measures makes their meaning evident in the context of fuzzy theory. Consider in fact an arbitrary fuzzy subset {E_φ, μ_φ} of C_X (with the only condition that μ_φ(x) = P(E_φ|A_x) admits an extension to Ω such that μ_φ(Ω) = P(E_φ|Ω) = 1), and regard μ_φ as a possibility distribution. Then the possibility measure induced by μ_φ is, for every element A ⊆ C_X of the algebra H, the conditional probability that You assign to E_φ given A, when You choose, among the infinitely many possible coherent extensions of μ_φ, the one satisfying (19.5) (see also Remark 26). To explore the subtle implications of this choice among all possible extensions, we refer now to the concept of zero-layer introduced in Chapter 12. In fact, in the next theorem we will show that the coherent extension of a conditional probability P(E|A_x) that satisfies (19.5) gives rise to different zero-layers for the atoms A_x corresponding to different values P(E|A_x) (it is possible to show that, if P(E|A_x) < P(E|A_y), such an extension satisfies P(A_x|A_x ∨ A_y) = 0, and so P(A_y|A_x ∨ A_y) = 1). Therefore a coherent conditional probability P(E|·) satisfying (19.5) can be associated with a measure of Your "disbelief" in the events H ∈ H.
Theorem 22 - Let E be an arbitrary event and C be a family of conditional events {E|H_i}_{i∈I}, where card(I) is finite and the events H_i's are a partition of Ω. Denote by H the algebra spanned by the H_i's and let P be a coherent conditional probability on K = {E} × H^0 satisfying (19.5). Then, given E|A, E|B ∈ K, from P(E|A) < P(E|B) it follows o(A) > o(B).

Proof - To prove that P(E|A) < P(E|B) ⟹ o(A) > o(B), let us go back to the proof of Theorem 21 and note that, for each system (S_α), the only possible solutions are those which assign P_α(A'_r) = P_α(A''_r) = 0 to all the atoms such that P(E|A'_r ∨ A''_r) = P(E|H_r) < M_α. Considering in fact the same subsystem, and summing the equations P_α(A'_r) = P(E|H_r)[P_α(A'_r) + P_α(A''_r)], we obtain

Σ_{A_r ⊆ H^α_0} P_α(A'_r) = Σ_{A_r ⊆ H^α_0} P(E|H_r)[P_α(A'_r) + P_α(A''_r)] ≤ M_α Σ_{A_r ⊆ H^α_0} [P_α(A'_r) + P_α(A''_r)].

Then, denoting by H_{M_α} the event analogous to H_{M_0} introduced (for the first system) in the proof of Theorem 21, the latter inequality reduces to an equality if and only if

Σ_{A_r ⊆ H^α_0} [P_α(A'_r) + P_α(A''_r)] = Σ_{A_r ⊆ H_{M_α}} [P_α(A'_r) + P_α(A''_r)] = 1.

Therefore, the compulsory choice of null probabilities for the atoms A'_r, A''_r such that P(E|H_r) < M_α leads to the next system (S_{α+1}), and, possibly, to (S_{α+2}), and so on. On the other hand, since P(E|B) = M_β for a certain β (so P(E|H_s) = M_β for at least one H_s ⊆ B) and P(E|A) < P(E|B), then for all H_r ⊆ A we have P(E|H_r) < M_β: it follows that all H_r ⊆ A have a probability P_α equal to 0 for all α relative to systems (S_α) with an equation containing an H_r such that P(E|H_r) = M_α. In conclusion, taking also into account that for every event H of the algebra H we have o(H) = min_{H_r ⊆ H} o(H_r), the inequality o(A) > o(B) follows. ∎
Remark 35 - Under the same conditions of Theorem 22, when P(E|A) = P(E|B) we can have either o(A) ≠ o(B) or the equality o(A) = o(B).

Theorem 22 allows us to regard every possibility measure Π as a decreasing function of the zero-layer set {0, 1, 2, ..., k} generated by the coherent conditional probability P agreeing with Π (in the sense of Remark 34). Moreover, since any decreasing function of the zero-layer set obeys the same composition rules as Π, any coherent conditional probability satisfying (19.5) gives rise to an infinite number of possibility measures (that is, all those obtained through decreasing transformations of its zero-layer set).
19.5 Concluding remarks

By resorting to what we consider the most effective concept of conditional probability, we are able not only to define fuzzy subsets, but also to introduce in a very natural way the basic continuous T-norms and the relevant dual T-conorms, bound to the former by coherence. In fact, given a T-norm (which in this framework singles out the value P(E_φ ∧ E_ψ|A_x ∧ A_y) of the conjunction), the corresponding choice of the T-conorm (which determines the value of the disjunction) is uniquely driven by the coherence of the relevant conditional probability (and the dual operation is what is actually obtained). Moreover, we also get a sensible probabilistic interpretation of the choice of the usual continuous T-norms (such as min, product, Lukasiewicz). Finally, an interesting and fundamental by-product of our approach is a natural interpretation of possibility functions, both from a semantic and a syntactic point of view.
Chapter 20 Coherent Conditional Probability and Default Reasoning

In this Chapter we face the problem of putting forward a theory (just sketched in [39] and [40]) for representing default rules by means of a suitable coherent conditional probability, defined on a suitable family of conditional events. We recall that the concept of conditional event (as dealt with in this book) plays a central role in probabilistic reasoning: letting its third value suitably depend on the given ordered pair (E, H) (and not being just an undetermined common value for all pairs), it turns out (as explained in detail in Chapters 10 and 11) that this function can be seen as a measure of the degree of belief in the conditional event E|H, which under "natural" conditions reduces to the conditional probability P(E|H), in its most general sense related to the concept of coherence. A peculiarity of this concept of coherent conditional probability (versus the usual one) is, among others, the fact that a suitable interpretation of the extreme values 0 and 1 of P(E|H) (for situations which are different, respectively, from the trivial ones
E ∧ H = ∅ and H ⊆ E) leads to a "natural" treatment of default reasoning. We want to stress the simplicity of our direct approach to default logic (to show this simplicity is in fact the main aim of this Chapter) with respect to other well-known methodologies, such as those, e.g., by Adams [1], Benferhat, Dubois and Prade [5], Gilio [72], Goldszmidt and Pearl [76], Lehmann and Magidor [93]. We deem it interesting that our results are - more or less - not in contrast with those contained in the aforementioned papers. A brief comparison will be made in the final Section of the Chapter. At the end of Chapter 11 (see Remark 12) we showed that a sensible use of events whose probability is 0 (or 1) can be a more general tool in revising beliefs when new information comes to the fore, so that we have been able to challenge the claim (contained, e.g., in [118]) that probability is inadequate for revising full belief. Moreover, as recalled above, we may consider the extreme value P(E|H) = 1 also when it is not true that H ⊆ E. We will show how to handle, by means of a coherent conditional probability, some aspects of default reasoning (a general theory is, e.g., in [103], [107]): as is well known, a default rule is a sort of weak implication. Results that are usually obtained, in the relevant literature, through cumbersome calculations (sometimes even losing sight of the meaning) can instead - more or less "surprisingly" - be reduced to simple and synthetic considerations and methodology. In a sense, our inspiring principle is the so-called Ockham's razor¹, which reads (in its original Latin form): "pluralitas non est ponenda sine necessitate", and which could be interpreted, for scientists, as: "when you have two (or more) competing theories which make exactly the same prediction, the one that is simpler is the better".

¹ William of Ockham (a village in the English county of Surrey) was a 14th century logician (and Franciscan friar).
20.1 Default logic through conditional probability equal to 1
First of all, we discuss briefly some aspects of the classic example of Tweety. The usual logical implication (denoted by ⊆) can anyway be useful to express that a penguin (π) is certainly a bird (β), i.e.

π ⊆ β,

so that P(β|π) = 1; moreover we know that Tweety (τ) is a penguin, and also this fact can be represented by a conditional probability equal to 1, that is

P(π|τ) = 1.

But we can express as well the statement "a penguin usually does not fly" (we denote by φ^c the contrary of φ, the latter symbol denoting "flying") by writing

P(φ^c|π) = 1.

(For simplicity, we have avoided writing down explicitly a proposition such as "a given animal is a penguin", using the short-cut "penguin" and the symbol π to denote this event; similar considerations apply to β, τ and φ). The question "can Tweety fly?" can be faced through an assessment of the conditional probability P(φ|τ), which must be coherent with the already assessed ones: by Theorem 4 (Chapter 11), it can be shown that any value p ∈ [0, 1] is a coherent value for P(φ|τ), so that no conclusion can be reached - from the given premises - on Tweety's ability of flying.
In other words, interpreting an equality such as P(E|H) = 1 like a default rule (denoted by ↦), which in particular (when H ⊆ E) reduces to the usual implication, we have shown its nontransitivity: in fact we have τ ↦ π and π ↦ φ^c, while τ ↦ φ^c does not follow.

Definition 25 - Given a coherent conditional probability P on a family C of conditional events, a default rule, denoted by H ↦ E, is any conditional event E|H ∈ C such that P(E|H) = 1.

Clearly, any logical implication A ⊆ B (and so also any equality A = B) between events can be seen as a (trivial) default rule. Given a set C of conditional events, we denote by Δ ⊆ C the subset of default rules. So, given a conditional event E|H, to say (H ↦ E) ∉ Δ means that E|H belongs to the set C \ Δ.
Remark 36 - In terms of zero-layers, to assess P(E|H) = 1 is equivalent to requiring

o(E^c|H) > o(E|H);   (20.1)

in fact (cf. Chapter 12, Definition 8), if P(E|H) = 1 we have o(E|H) = 0 (since P(E|H) > 0), while, on the other hand, we have o(E^c|H) > 0 (since P(E^c|H) = 0). Conversely, if (20.1) holds, i.e. o(E^c ∧ H) > o(E ∧ H), consider the first system (S_0) (cf. Theorem 4, Chapter 11), which is, putting λ = P(E|H),

x_3 = λ(x_3 + x_4),
x_1 + x_2 + x_3 + x_4 = 1,
x_r ≥ 0,

where x_r = P_0(A_r), with

A_1 = E^c ∧ H^c,   A_2 = E ∧ H^c,   A_3 = E ∧ H,   A_4 = E^c ∧ H.

Then necessarily x_4 = 0, so the first equation of system (S_0) gives x_3 = λx_3. If x_3 > 0, then λ = 1; taking instead x_3 = 0, the second system (S_1)

y_3 = λ(y_3 + y_4),
y_3 + y_4 = 1,
y_r ≥ 0,

with y_r = P_1(A_r), gives y_3 = λ, y_4 = 1 − λ, and the assumption on the zero-layers forces y_4 = 0, and so λ = 1.

Given a set Δ ⊆ C of default rules H_i ↦ E_i (i = 1, ..., n), we need to check its consistency on C, that is the coherence of the "global" assessment P on C such that P(E_i|H_i) = 1 on Δ. We stress that, even if our definition involves a conditional probability, the condition given in the following theorem refers only to logical (in the sense of Boolean logic) relations.
Theorem 23 - Given a coherent conditional probability P on a family C of conditional events, a set Δ ⊆ C of default rules

H_i ↦ E_i,   i = 1, 2, ..., n,

represented by the assessment

P(E_i|H_i) = 1,   i = 1, 2, ..., n,

is consistent on Δ (i.e., the latter assessment is coherent) if for every subfamily {E_{i_k}|H_{i_k}}, k = 1, ..., s, of Δ we have

∨_{k=1}^{s} (E_{i_k} ∧ H_{i_k}) ⊈ ∨_{k=1}^{s} (E^c_{i_k} ∧ H_{i_k}).   (20.2)

Conversely, if Δ is consistent on C, then (20.2) holds.
Proof - We prove that, assuming the above logical relations (20.2), coherence of P implies the assessment P(E_i|H_i) = 1 (for i = 1, 2, ..., n) on Δ. We resort to the characterization Theorem 4: to begin with, put, for i = 1, 2, ..., n, P(E_i|H_i) = 1 in the system (S_0); the unconditional probability P_0 can be obtained by putting P_0(A_r) = 0 for all atoms A_r ⊆ ∨_{j=1}^{n}(E^c_j ∧ H_j). Then for any atom A_k ⊆ E_i ∧ H_i which is not contained in ∨_{j=1}^{n}(E^c_j ∧ H_j) - notice that condition (20.2) ensures that there is such an atom A_k, since ∨_{j=1}^{n}(E_j ∧ H_j) ⊈ ∨_{j=1}^{n}(E^c_j ∧ H_j) - we may put P_0(A_k) > 0 in such a way that these numbers sum up to 1, and we put P_0(A_r) = 0 for all remaining atoms. This clearly gives a solution of the first system (S_0). If, for some i, E_i ∧ H_i ⊆ ∨_{j=1}^{n}(E^c_j ∧ H_j), then P_0(E_i ∧ H_i) = 0. So we consider the second system (which refers to all H_i such that P_0(H_i) = 0), proceeding as above to construct the probability P_1; and so on. Condition (20.2) ensures that at each step we can give positive probability P_α to (at least) one of the remaining atoms.

Conversely, consider the (coherent) assignment P(E_i|H_i) = 1 (for i = 1, ..., n) on Δ. Then, for any index j ∈ {1, 2, ..., n} there exists a probability P_α such that P_α(E_j ∧ H_j) > 0 while one has P_α(E^c_j ∧ H_j) = 0. Notice that the restriction of P to some conditional events E_{i_1}|H_{i_1}, ..., E_{i_s}|H_{i_s} of Δ is coherent as well. Let P_0 be the first element of an agreeing class, and i_k an index such that P_0(H_{i_k}) > 0: then P_0(E_{i_k} ∧ H_{i_k}) > 0 and P_0(E^c_{i_k} ∧ H_{i_k}) = 0. Suppose that E_{i_k} ∧ H_{i_k} ⊆ ∨_{k=1}^{s}(E^c_{i_k} ∧ H_{i_k}): then P_0(E_{i_k} ∧ H_{i_k}) = 0. This contradiction shows that condition (20.2) holds. ∎
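Since condition (20.2) involves only Boolean relations, it can be checked mechanically; a sketch (ours; events coded as frozensets of atoms, and the toy rules are made up):

```python
# Sketch (ours): mechanical check of condition (20.2) of Theorem 23,
# with events coded as frozensets of atoms and rules as pairs (E_i, H_i).
from itertools import combinations

def is_consistent(rules):
    for s in range(1, len(rules) + 1):
        for sub in combinations(rules, s):
            pos = frozenset().union(*(E & H for E, H in sub))
            neg = frozenset().union(*(H - E for E, H in sub))  # E^c AND H_i
            if pos <= neg:            # the forbidden containment
                return False
    return True

# Toy rules on atoms {1,...,4} (made up for illustration):
rules = [(frozenset({1, 2}), frozenset({1, 2, 3})),   # H_1 |-> E_1
         (frozenset({3}),    frozenset({3, 4}))]      # H_2 |-> E_2
print(is_consistent(rules))                            # True
```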
Definition 26 - A set Δ ⊆ C of default rules entails the default rule H ↦ E if the only coherent value for P(E|H) is 1. In other words, the rule H ↦ E is entailed by Δ (or by a subset of Δ) if every possible extension (cf. Theorem 6, Chapter 13) of the probability assessment P on C such that P(E_i|H_i) = 1 (for i = 1, ..., n) on Δ assigns the value 1 also to P(E|H).
Remark 37 - Other approaches (see also the discussion in the last Section) do not refer to a "world" C containing the set Δ of default rules (this amounts to having no information outside Δ). We deem it instead important not to limit the analysis to the set Δ, since this may have consequences on what is entailed or not (if we have more information, why not use it?). Notice in fact that in the case C ⊃ Δ more conditional events can be entailed (out of C) than in the case C = Δ, since an extension P (which had caused the lack of entailment for a conditional event, say A|B) might not be present in the new family of probabilities, due to the constraint given by its value on C \ Δ. (In other words, when C \ Δ is not empty, there are fewer extensions to check, and therefore we have more "chances" to entail new defaults, for example to entail B ↦ A.)
20.2 Inferential rules
Several formalisms for default logic have been studied with the aim of discussing the minimal conditions that an entailment should satisfy. We show now how in our framework this "inferential" process obeys the usual rules considered in the relevant literature: we will refer essentially to those discussed in [93], with some possible slight variations to adapt them to our terminology (anyway, a widespread consensus about their "right" formulation is lacking). We lay stress on the simplicity of our proofs, due to the ease of just checking whether some probabilities are equal or not to 1.
Theorem 24 - Given a set Δ of consistent default rules, we have

(Reflexivity) Δ entails A ↦ A for any A ≠ ∅

(Left logical equivalence) (A = B), (A ↦ C) ∈ Δ entails B ↦ C

(Right weakening) (A ⊆ B), (C ↦ A) ∈ Δ entails C ↦ B

(Cut) (A ∧ B ↦ C), (A ↦ B) ∈ Δ entails A ↦ C

(Cautious Monotonicity) (A ↦ B), (A ↦ C) ∈ Δ entails A ∧ B ↦ C

(Equivalence) (A ↦ B), (B ↦ A), (A ↦ C) ∈ Δ entails B ↦ C

(And) (A ↦ B), (A ↦ C) ∈ Δ entails A ↦ B ∧ C

(Or) (A ↦ C), (B ↦ C) ∈ Δ entails A ∨ B ↦ C

Proof - Reflexivity amounts to P(A|A) = 1 for every possible event.
Left logical equivalence and Right weakening trivially follow from elementary properties of conditional probability.

Cut: from P(C|A ∧ B) = P(B|A) = 1 it follows P(C|A) = 1, since

P(C|A) = P(C|A ∧ B)P(B|A) + P(C|A ∧ B^c)P(B^c|A) = P(C|A ∧ B)P(B|A).

Cautious Monotonicity: since 1 = P(B|A) = P(C|A), we have

1 = P(C|A ∧ B)P(B|A) + P(C|A ∧ B^c)P(B^c|A) = P(C|A ∧ B)P(B|A),

hence P(C|A ∧ B) = 1.

Equivalence: put P*(·) = P(·|A ∨ B ∨ C); since at least one among the events A, B, C must have positive probability P*, it follows (due to the given values of the three conditional probabilities, which are obviously still equal to 1 also with P* in place of P) that A, B, C all three have positive probability P*; then

P*(A ∧ C) = P*(A) = P*(A ∧ B) = P*(B).

It follows that P*(A^c ∧ B) = P*(A ∧ C^c) = 0, so that

P*(B) = P*(A ∧ B ∧ C) + P*(A ∧ B ∧ C^c) = P*(A ∧ B ∧ C),
P*(B ∧ C) = P*(A ∧ B ∧ C) + P*(A^c ∧ B ∧ C) = P*(A ∧ B ∧ C),

and so P*(C|B) = 1. Since P*(C|B) = P(C|B), the conclusion follows.

And: since

1 ≥ P(B ∨ C|A) = P(B|A) + P(C|A) − P(B ∧ C|A) = 2 − P(B ∧ C|A),

it follows that P(B ∧ C|A) = 1.

Or: since

P(C|A ∨ B) = P(C|A)P(A|A ∨ B) + P(C|B)P(B|A ∨ B) − P(C|A ∧ B)P(A ∧ B|A ∨ B) = P(A|A ∨ B) + P(B|A ∨ B) − P(C|A ∧ B)P(A ∧ B|A ∨ B) ≥ 1,

we get P(C|A ∨ B) = 1. ∎

Let us now discuss some properties that in the relevant literature are regarded as "unpleasant"; in fact they do not necessarily hold in our framework:
(Monotonicity) (A ⊆ B), (B ↦ C) ∈ Δ entails A ↦ C

(Transitivity) (A ↦ B), (B ↦ C) ∈ Δ entails A ↦ C

(Contraposition) (A ↦ B) ∈ Δ entails B^c ↦ A^c

The previous example about Tweety shows that Transitivity can fail. In the same example, if we add the evaluation P(φ|β) = 1 (that is, a bird usually flies) to the initial ones, the assessment is still
coherent (even if P(φ|π) = 0 and π ⊆ β), but Monotonicity can fail. Now, consider a conditional probability P with P(B|A) = 1 and P(A^c|B^c) < 1: it is easy to check that such an assessment is coherent (a numerical sketch is given below), and so Contraposition can fail.
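One way to realize such an assessment is through a two-layer agreeing class, exploiting a conditioning event of zero probability; a minimal numerical sketch (ours, with numbers chosen here, not taken from the book):

```python
# Sketch (ours): a two-layer agreeing class {P0, P1} realizing P(B|A) = 1
# together with P(A^c|B^c) = 0.5 < 1, so Contraposition fails.
# Atoms: 0 = A&B, 1 = A&B^c, 2 = A^c&B, 3 = A^c&B^c.
P0 = [1.0, 0.0, 0.0, 0.0]        # layer 0: all mass on A&B
P1 = [0.0, 0.5, 0.0, 0.5]        # layer 1: handles B^c (P0-probability zero)

def cond(P, num, den):
    d = sum(P[i] for i in den)
    return sum(P[i] for i in num) / d if d > 0 else None

print(cond(P0, [0], [0, 1]))     # P(B|A) = 1.0, evaluated at layer 0
print(cond(P0, [3], [1, 3]))     # None: P0(B^c) = 0, defer to layer 1
print(cond(P1, [3], [1, 3]))     # P(A^c|B^c) = 0.5, evaluated at layer 1
```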
Many authors (cf., e.g., [93]) claim (and we agree) that the previous unpleasant properties should be replaced by others, which we express below in our own notation and interpretation: we show that these properties hold in our framework.

(Negation Rationality) If (A ∧ C ↦ B), (A ∧ C^c ↦ B) ∉ Δ, then Δ does not entail (A ↦ B).

Proof - Since (A ∧ C ↦ B) and (A ∧ C^c ↦ B) do not belong to Δ, that is P(B|A ∧ C) < 1 and P(B|A ∧ C^c) < 1, then

P(B|A) = P(B|A ∧ C)P(C|A) + P(B|A ∧ C^c)P(C^c|A) < 1.

(Disjunctive Rationality) If (A ↦ C), (B ↦ C) ∉ Δ, then Δ does not entail (A ∨ B ↦ C).

Proof - Starting from the equalities

P(C|A ∨ B) = P(C|A)P(A|A ∨ B) + P(C|A^c ∧ B)P(A^c ∧ B|A ∨ B) = P(C|B)P(B|A ∨ B) + P(C|A ∧ B^c)P(A ∧ B^c|A ∨ B),

since P(C|A) < 1 and P(C|B) < 1, assuming P(C|A ∨ B) = 1 would imply (by the first equality) P(A|A ∨ B) = 0 and (by the second one) P(B|A ∨ B) = 0 (contradiction).
(Rational Monotonicity) If (A ∧ B ↦ C), (A ↦ B^c) ∉ Δ, then Δ does not entail (A ↦ C).

Proof - If it were P(C|A) = 1, i.e.

1 = P(C|A ∧ B)P(B|A) + P(C|A ∧ B^c)P(B^c|A),

we would get either

P(C|A ∧ B) = P(C|A ∧ B^c) = 1,

or one of the following:

P(C|A ∧ B) = P(B|A) = 1,   P(C|A ∧ B^c) = P(B^c|A) = 1

(contradiction).
20.3 Discussion
Our methodology, based on the concept of coherent conditional probability, encompasses many existing theories, and our framework is clearly and rigorously settled: conditional events E|H are not three-valued entities whose third value is looked on as "undetermined" when H is false, but they have been defined instead in a way which entails "automatically" (so to say) the axioms of conditional probability, which are those ruling coherence. In other words ... "tout se tient". Moreover, we deem it interesting that our results are - more or less - not in contrast with those contained in the relevant literature. A brief comparison is now in order. The concept of consistency is usually based (cf., e.g., [1], [5], [72]) on that of quasi conjunction: we do not refer to this notion, since it is defined by means of a particular conditional event (and our concept of conditional event is different from those of the quoted papers). In Adams' framework [1] (where a default rule H ↦ E is defined by requiring the existence, for every ε > 0, of a probability P such that P(E|H) > 1 − ε), the probability is assumed to be proper (i.e., positive) on the given events, but (since the domain
of a probability P is an algebra) we need to extend P from the given events to other events (by the way, coherence is nothing but complying with this need). In particular, these "new" events may have zero probability: it follows, according to Adams' definition of conditional probability (which is put equal to 1 when the conditioning event has zero probability), that we can easily get incoherent assessments. Consider in fact the following
Example 45 - Given two (logically independent) events H_1 and H_2, put

E_1 = H_1 ∧ H_2,   E_2 = H_1^c ∧ H_2.

For any ε > 0, consider the assessment

P(E_1|H_1) = 1,   P(E_2|H_2) = 1 − ε,

so that {E_1|H_1, E_2|H_2} is consistent according to Adams, as can be easily checked giving the atoms the probabilities

P(H_1^c ∧ H_2^c) = 0,   P(H_1^c ∧ H_2) = 1 − ε,   P(H_1 ∧ H_2) = ε,   P(H_1 ∧ H_2^c) = 0

(notice that the assessment is proper). But, according to his definition of conditional probability, we can extend P, for any event A ⊆ H_1 ∧ H_2^c, by putting the conditional probability given A equal to 1 (since P(A) = 0): then both an event and its contrary get conditional probability 1 given A, which is not coherent.

A (partly) coherence-based approach to default reasoning (but in the framework of "imprecise probabilities" propagation) is that in [72], even if we point out (besides the utmost simplicity of our definitions and results) important semantic and syntactic differences. For example, our concept of entailment is certainly different, as shown by the following simple
Example 46 - Consider two (logically independent) events H_1 and H_2, and put

E_1 = H_1 ∧ H_2,   E_2 = H_1^c ∧ H_2,   E_3 = H_1^c ∧ H_2^c,   E = E_2,   H = H_3 = Ω.

Given α, with 0 < α < 1, the assessment

P(E_1|H_1) = 1,   P(E_2|H_2) = 1,   P(E_3|H_3) = α

on C = {E_1|H_1, E_2|H_2, E_3|H_3} is coherent; the relevant probabilities of the atoms are

P(H_1^c ∧ H_2^c) = α,   P(H_1^c ∧ H_2) = 1 − α,

so that the set Δ of default rules corresponding to {E_1|H_1, E_2|H_2} is consistent. Does Δ entail E|H? A simple check shows that the only coherent assessment for this conditional event is P(E|H) = 1 − α. Then the answer is NO, since we require (in the definition of entailment) that 1 is (the only) coherent extension. On the contrary, according to the characterization of entailment given in [72] - that is: Δ (our notation) entails E|H iff P(E^c|H) = 1 is not coherent - the answer to the previous question is YES, since the only coherent value of this conditional probability is P(E^c|H) = α (see the above computation).

The System Z proposed in [76] deals with the possibility of resorting to infinitesimals to manage default rules and entailment along the lines of Adams' approach. Also the concept of ranking function (introduced by Spohn) is taken into account; but, as has been widely discussed in Chapter 12, our concept of zero-layer encompasses that of ranking function, and it is already part (so to say) of the coherent conditional probability structure (so that its "autonomous" definition is not needed). Notice also that our definition
of default through the assessment P(E|H) = 1 coincides formally with that given in System Z through ranking functions, since (see Remark 36 in the previous Section) P(E|H) = 1 is equivalent to the relation o(E^c|H) > o(E|H) between zero-layers. Concerning the use of infinitesimals (in the framework of nonstandard analysis), see our discussion in Section 12.3. The following example, taken from [76], shows that those approaches which do not refer to (nontrivial) conditional probability equal to 1, or do not allow conditioning events of zero probability, cannot avoid some drawbacks.
Example 47 - Consider the following default rules (our notation):

(W ↦ R) and (L ↦ W)

- if the grass is Wet, then conclude it Rains;
- if the bottle Leaks, the grass will get Wet.

Finding the bottle leaking, we do not wish to conclude from these two rules that it rains, i.e. it would be counterintuitive if the two given default rules entailed (L ↦ R). In our framework, since P(R|W) = P(W|L) = 1, denoting by λ the conditional probability P(R|L), the system (S_0) is (in this case H^0_0 = W ∨ L)

x_1 + x_2 = x_1 + x_2 + x_4 + x_5,
x_1 + x_4 = x_1 + x_4 + x_3 + x_6,
x_1 + x_3 = λ(x_1 + x_4 + x_3 + x_6),
x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 1,
x_r ≥ 0,

with (as usual) x_r = P_0(A_r), where

A_1 = L ∧ R ∧ W,   A_2 = L^c ∧ R ∧ W,   A_3 = L ∧ R ∧ W^c,
A_4 = L ∧ R^c ∧ W,   A_5 = L^c ∧ R^c ∧ W,   A_6 = L ∧ R^c ∧ W^c.
The first two equations give x_3 = x_4 = x_5 = x_6 = 0, so that the third one becomes x_1 = λx_1. Now, in the classic approach to conditional probability we cannot consider the solution x_1 = 0 (otherwise the conditioning event L would have null probability), and so necessarily λ = 1 (i.e. the "undesirable" conclusion!). Instead, in our setting the solution x_1 = 0 (and then x_2 = 1) must be taken into account, and so λ = 1 = P(R|L) is not the unique extension of P to the conditional event R|L. This requires (to check coherence) the consideration of the second system (S_1), with y_r = P_1(A_r),

y_1 + y_4 = y_1 + y_4 + y_3 + y_6,
y_1 + y_3 = λ(y_1 + y_4 + y_3 + y_6),
y_1 + y_3 + y_4 + y_6 = 1,
y_r ≥ 0,

which gives y_3 = y_6 = 0, and so y_1 + y_4 = 1 and y_1 = λ; it follows that y_4 = 1 − λ. In conclusion, the assessment is coherent for any choice of λ = P(R|L) ∈ [0, 1], and so R|L cannot be entailed: the undesirable effects are avoided in the approach through coherence (again, "tout se tient"!).
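The two-layer solution just described can be verified numerically; a sketch (ours) that reproduces the computation of the example for an arbitrary λ:

```python
# Sketch (ours): numerical check of Example 47, for an arbitrary lambda.
# Atom indices (0-based): A1=L&R&W, A2=L^c&R&W, A3=L&R&W^c,
#                         A4=L&R^c&W, A5=L^c&R^c&W, A6=L&R^c&W^c.
def cond(P, num, den):
    d = sum(P[i] for i in den)
    return sum(P[i] for i in num) / d if d > 0 else None

W, L = [0, 1, 3, 4], [0, 2, 3, 5]
R_and_W, W_and_L, R_and_L = [0, 1], [0, 3], [0, 2]

lam = 0.42                          # any value in [0, 1] works
P0 = [0, 1, 0, 0, 0, 0]             # layer 0: x_2 = 1, all other atoms 0
P1 = [lam, 0, 0, 1 - lam, 0, 0]     # layer 1: y_1 = lambda, y_4 = 1 - lambda

print(cond(P0, R_and_W, W))         # 1.0 : P(R|W) = 1 at layer 0
print(cond(P0, W_and_L, L))         # None: P0(L) = 0, so defer to layer 1
print(cond(P1, W_and_L, L))         # 1.0 : P(W|L) = 1 at layer 1
print(cond(P1, R_and_L, L))         # 0.42: P(R|L) = lambda, not forced to 1
```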
Chapter 21 A Short Account of Decomposable Measures of Uncertainty

One of the main features of the theory expounded in the previous Chapters is that our approach to conditional probability through coherence renders the latter a very general and flexible tool: in particular, it is able to deal clearly and easily also with many known concepts (such as fuzzy subsets, ranking functions, possibility measures, default reasoning) in a unified and rigorous framework. In this Chapter our aim is to extend the methodology and rules on which our approach is based to more general uncertainty measures, starting again from our concept of conditional event (seen as a "numerical" entity, a particular random quantity), but introducing (in place of the ordinary sum and product) two operations (to be denoted ⊕ and ⊗) for which some of the fundamental properties of sum and product (commutativity, associativity, monotonicity, distributivity of ⊕ with respect to ⊗) are required. The interest of this theory (which has been set out in [35]) resides essentially in the following points:
• it is possible to introduce directly, also for these general measures, the concept of conditional measure as a primitive concept, that is a real function (on a set of conditional events) ruled by a set of axioms;

• these axioms are induced on the "third" value t(E|H) of a conditional event E|H, and come out "naturally" (as we did for conditional probability) by operating with ⊕ and ⊗ on the family of conditional events;

• a conditional measure φ(E|H) directly introduced in this way may not be deducible from (or expressed by) a unique (unconditional) measure φ(·) = φ(·|Ω) on the two events E ∧ H and H, but it is possible to find a class of unconditional measures that singles it out as the unique solution of a suitable equation. Notice that conditional probability P(E|H) is in fact (for each conditional event E|H of the given family) the unique solution x of equation (11.7), that is

P_α(E ∧ H) = x · P_α(H).

21.1 Operations with conditional events
To begin with, suppose that we have a set C of conditional events E|H, and consider the relevant set T of three-valued random variables T(E|H), where the "third" value is a suitable function t(E|H), as widely discussed in the first part of Chapter 10. Our aim is to find "reasonable" axioms for general (i.e., beyond classical probability theory) conditional measures of uncertainty. Given two commutative, associative, and increasing operations ⊕ and ⊙ from ℝ⁺ × ℝ⁺ to ℝ⁺, with ℝ⁺ = {x ∈ ℝ : x ≥ 0}, we are going to suitably define corresponding operations among the random variables of T: then different (decomposable) conditional
measures φ(·|·) can be obtained by particular choices of the two operations ⊕ and ⊙. For example, choosing ordinary sum and product, or max and min, we get, respectively, conditional probability or conditional possibility: this has already been done for these two particular cases in [31] (and set out in Chapter 10) and in [13], [14], respectively. One of the main features of this approach resides in the fact that the conditional measure φ(·|·) can be defined for any pair of events E, H, with H ≠ ∅, and knowledge (or assessment) of the "joint" and "marginal" unconditional measures φ(E ∧ H) and φ(H) is not required. Obviously, if the latter are already given, there must exist suitable rules that put them in relation with φ(E|H), but the converse is not necessarily true! In particular, there is no need, as in the usual approaches (where the conditional measure is introduced by definition as a suitable function of the two aforementioned unconditional measures), for any specific assumption (such as, referring to the case of probability, the requirement of positivity for the measure of the given conditioning event). Another important feature is the possibility of requiring commutativity, associativity, distributivity, etc. of the relevant operations only for certain values φ(E|H) coming from those conditional events E|H satisfying particular properties. The framework is enlarged even further, since we do not require any continuity axiom for the given operations, as, for example, in the classic results (for unconditional measures) by de Finetti [48] and (for conditional measures) by Cox [42]. Actually, the aim of these authors (and of many others: for a thorough exposition and bibliography, see [81]) is to find conditions on these operations that make it possible to represent a given (conditional) measure as a continuous one-to-one transformation of a (conditional) probability.
On the other hand, at least for finite domains, the aforementioned conditions do not warrant the possibility of this representation (a counterexample relative to ⊙ is in [81], and one relative to ⊕ is in [35]). In fact, to get this kind of representation more assumptions would be needed (see, e.g., the discussion in the book [100] by J. Paris and, in the framework of fuzzy measures, [6]). On the contrary, our interest is focused on searching for the minimal (necessary and sufficient) conditions on ⊕ and ⊙ which render a conditional measure φ(·|·) formally (or, better, essentially) "similar" to a conditional probability, in the sense that it can be represented in terms of classes of decomposable uncertainty measures.

Now, if ⊗ is any operation from ℝ × ℝ to ℝ, and

$$X = \sum_{h=1}^{v_1} x_h I_{E_h}\,, \qquad Y = \sum_{k=1}^{v_2} y_k I_{F_k}$$

are two discrete random variables in canonical form, we define X ⊗ Y as the random variable

$$Z = \sum_{j=1}^{v} z_j I_{G_j}\,,$$

where each event G_j is of the form G_j = E_h ∧ F_k and z_j = x_h ⊗ y_k (x_h and y_k are the coefficients, respectively, of I_{E_h} and I_{F_k}). Therefore an operation among random variables can be defined by means of a relevant operation among real numbers.
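As a small illustration (ours, with hypothetical names), this construction can be coded directly: an event is modelled as a frozen set of atoms, a random variable in canonical form as a list of (coefficient, event) pairs, and the conjunctions E_h ∧ F_k are set intersections.

```python
# Sketch (not from the book): lifting an operation on numbers to discrete
# random variables in canonical form X = sum_h x_h * I_{E_h}.

def combine(X, Y, op):
    """Z = X (op) Y: coefficient x_h (op) y_k on each event E_h & F_k."""
    Z = []
    for x_h, E_h in X:
        for y_k, F_k in Y:
            G = E_h & F_k            # conjunction of the two events
            if G:                    # keep only possible events
                Z.append((op(x_h, y_k), G))
    return Z

# Atoms 1..4; combining with max (the operation later used for possibility).
X = [(1.0, frozenset({1, 2})), (0.3, frozenset({3, 4}))]
Y = [(0.5, frozenset({1, 3})), (0.0, frozenset({2, 4}))]
print(combine(X, Y, max))
# coefficients: 1.0 on {1}, 1.0 on {2}, 0.5 on {3}, 0.3 on {4}
```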
Let us now consider ⊕ and ⊙, two commutative, associative and increasing operations from ℝ⁺ × ℝ⁺ to ℝ⁺, having 0 and 1, respectively, as neutral elements, and use the same symbols for the relevant operations on T × T. If we operate between two elements of T, in general we do not obtain an element of T. We have in fact, by (10.1) of Chapter 10, for any E|H, A|K ∈ C:

$$\begin{aligned}
T(E|H) \oplus T(A|K) ={}& [1 \oplus 1]\,I_{E \wedge H \wedge A \wedge K} + [1 \oplus 0]\,I_{E \wedge H \wedge A^c \wedge K}\\
&+ [0 \oplus 1]\,I_{E^c \wedge H \wedge A \wedge K} + [0 \oplus 0]\,I_{E^c \wedge H \wedge A^c \wedge K}\\
&+ [1 \oplus t(A|K)]\,I_{E \wedge H \wedge K^c} + [0 \oplus t(A|K)]\,I_{E^c \wedge H \wedge K^c}\\
&+ [1 \oplus t(E|H)]\,I_{A \wedge K \wedge H^c} + [0 \oplus t(E|H)]\,I_{A^c \wedge K \wedge H^c}\\
&+ [t(E|H) \oplus t(A|K)]\,I_{H^c \wedge K^c},
\end{aligned}$$

and

$$\begin{aligned}
T(E|H) \odot T(A|K) ={}& [1 \odot 1]\,I_{E \wedge H \wedge A \wedge K} + [1 \odot 0]\,I_{E \wedge H \wedge A^c \wedge K}\\
&+ [0 \odot 1]\,I_{E^c \wedge H \wedge A \wedge K} + [0 \odot 0]\,I_{E^c \wedge H \wedge A^c \wedge K}\\
&+ [1 \odot t(A|K)]\,I_{E \wedge H \wedge K^c} + [0 \odot t(A|K)]\,I_{E^c \wedge H \wedge K^c}\\
&+ [1 \odot t(E|H)]\,I_{A \wedge K \wedge H^c} + [0 \odot t(E|H)]\,I_{A^c \wedge K \wedge H^c}\\
&+ [t(E|H) \odot t(A|K)]\,I_{H^c \wedge K^c}.
\end{aligned}$$

Nevertheless we notice, as far as the operation ⊕ is concerned, that in the case H = K and E ∧ A ∧ H = ∅ we get, taking into account the properties of ⊕,

$$T(E|H) \oplus T(A|H) = 1 \cdot I_{(E \vee A) \wedge H} + 0 \cdot I_{(E \vee A)^c \wedge H} + [t(E|H) \oplus t(A|H)]\,I_{H^c}.$$

Therefore, if the family T containing E|H and A|H contains also (E ∨ A)|H, then we have necessarily, as its "third" value,

$$t((E \vee A)|H) = t(E|H) \oplus t(A|H).$$
Consider now the operation T(E|H) ⊙ T(A|K) in the particular case K = E ∧ H; for these events we obtain, taking into account the properties of ⊙,

$$T(E|H) \odot T(A|(E \wedge H)) = 1 \cdot I_{E \wedge A \wedge H} + 0 \cdot I_{(E \wedge A)^c \wedge H} + [t(E|H) \odot t(A|(E \wedge H))]\,I_{H^c}.$$

Therefore, if the family T containing E|H and A|(E ∧ H) contains also (E ∧ A)|H, then we necessarily have, as "third" value of the latter conditional event,

$$t((E \wedge A)|H) = t(E|H) \odot t(A|(E \wedge H)).$$
When we take as ⊕ and ⊙ the usual sum and product, we get the theory of coherent conditional probability, as expounded in Chapters 10 and 11; choosing instead as ⊕ and ⊙ the "operations" max and min, and referring to a family C of conditional events E|H such that the set of their Boolean supports (E, H) is C_B = ℰ × ℋ⁰, with ℰ a Boolean algebra, ℋ ⊆ ℰ an additive set and ℋ⁰ = ℋ \ {∅}, the third value t(·|·) satisfies the following rules:

(a) t(E|H) = t(E ∧ H|H), for every E ∈ ℰ and H ∈ ℋ⁰;

(b) t(·|H) is a possibility on ℰ for any given H ∈ ℋ⁰;

(c) t(E ∧ A|H) = min{t(E|H), t(A|E ∧ H)}, for every A ∈ ℰ and E, H, E ∧ H ∈ ℋ⁰.

These can be taken as "axioms" for conditional possibilities ([13], [14]).
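A quick numerical experiment (ours; it adopts, merely for illustration, the minimal-specificity solution of Π(E ∧ H) = min{x, Π(H)} as the conditional value, which is one standard choice, whereas [13], [14] treat conditional possibility in greater generality) confirms rules (a)-(c) on a toy algebra:

```python
# Sketch (not from the book): checking rules (a)-(c) for a conditional
# possibility built from a possibility distribution pi on three atoms.
from itertools import chain, combinations

pi = {"a": 1.0, "b": 0.7, "c": 0.4}                  # possibility distribution
events = [frozenset(s) for s in chain.from_iterable(
    combinations(pi, r) for r in range(4))]          # the whole algebra

def Pi(E):
    """(Unconditional) possibility of an event."""
    return max((pi[a] for a in E), default=0.0)

def t(E, H):
    """Minimal-specificity solution x of Pi(E & H) = min(x, Pi(H))."""
    return Pi(E & H) if Pi(E & H) < Pi(H) else 1.0

for H in (H for H in events if H):
    for E in events:
        assert t(E, H) == t(E & H, H)                          # rule (a)
        for A in events:
            assert t(E | A, H) == max(t(E, H), t(A, H))        # rule (b)
            if E & H:                                          # rule (c)
                assert t(E & A, H) == min(t(E, H), t(A, E & H))
print("rules (a)-(c) hold on this example")
```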
21.2 Decomposable measures

We go back now to the general problem, starting from the definition of ⊕-decomposable measure on a Boolean algebra ℰ of events: it is a function φ : ℰ → [0, 1], with φ(Ω) = 1 and φ(∅) = 0, such that

$$\varphi(E \vee A) = \varphi(E) \oplus \varphi(A) \quad \text{for every } E, A \in \mathcal{E} \text{ with } E \wedge A = \emptyset.$$
We give now the following general definition of decomposable conditional (uncertainty) measure:

Definition 27 - A real function φ defined on a family C = ℰ × ℋ⁰ of conditional events is a (⊕, ⊙)-decomposable conditional measure if there exist two commutative, associative and increasing operations ⊕, ⊙ from φ(C) × φ(C) to ℝ⁺, having, respectively, 0 and 1 as neutral elements, and with ⊙ distributive over ⊕, such that:

(C1) φ(E|H) = φ(E ∧ H|H), for every E ∈ ℰ and H ∈ ℋ⁰;

(C2) given H ∈ ℋ⁰, for any E, A ∈ ℰ, with A ∧ E ∧ H = ∅, we have

$$\varphi(E \vee A|H) = \varphi(E|H) \oplus \varphi(A|H)\,, \qquad \varphi(\Omega|H) = 1\,, \qquad \varphi(\emptyset|H) = 0\,;$$

(C3) for every A ∈ ℰ and E, H, E ∧ H ∈ ℋ⁰,

$$\varphi((E \wedge A)|H) = \varphi(E|H) \odot \varphi(A|(E \wedge H)).$$

Clearly, for any given H ∈ ℋ⁰, a conditional measure φ(·|H) is a ⊕-decomposable ("unconditional") measure. Nevertheless, it is not possible to construct a conditional measure taking as starting point just one decomposable measure, since a conditional measure is essentially a large class of ⊕-decomposable measures, linked by (C3). We are able to reduce the latter family to a suitable smaller subfamily, in the sense of the characterization theorem below (Theorem 25), which deals with the problem of constructing (univocally) a decomposable conditional measure on a finite set C = ℰ × ℋ⁰ starting from a class of particular ⊕-decomposable measures ("generating" measures), defined on ℰ. We need the following definition of almost generating measures.
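Before turning to that definition, the reader may wish to see (C1)-(C3) at work in the familiar case ⊕ = + and ⊙ = · ; here is a brute-force verification (ours; it assumes a strictly positive probability on three atoms, so that every conditioning event has positive measure):

```python
# Sketch (not from the book): brute-force check of (C1)-(C3) for
# conditional probability (i.e. with + as the operation ⊕ and * as ⊙).
from itertools import chain, combinations

atoms = {"a": 0.2, "b": 0.3, "c": 0.5}               # strictly positive P
events = [frozenset(s) for s in chain.from_iterable(
    combinations(atoms, r) for r in range(4))]       # the whole algebra

def P(E):
    return sum(atoms[a] for a in E)

def phi(E, H):
    """Conditional probability P(E & H) / P(H)."""
    return P(E & H) / P(H)

eps = 1e-12
for H in (H for H in events if H):
    for E in events:
        assert abs(phi(E, H) - phi(E & H, H)) < eps                # (C1)
        for A in events:
            if not (A & E & H):                                    # (C2)
                assert abs(phi(E | A, H) - (phi(E, H) + phi(A, H))) < eps
            if E & H:                                              # (C3)
                assert abs(phi(E & A, H) - phi(E, H) * phi(A, E & H)) < eps
print("(C1)-(C3) hold for every E, A, H in the algebra")
```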
Definition 28 - Let ℰ be a finite Boolean algebra, ℋ an additive set, with ℋ ⊆ ℰ and ℋ⁰ = ℋ \ {∅}, and let A = {A_r}, r = 1, 2, ..., m, be the set of atoms of ℰ. Let {A_α} be a class of subsets of atoms, with A_{α″} ⊂ A_{α′} for α″ > α′, A₀ = A, and, given two operations ⊕ and ⊙ from ℝ⁺ × ℝ⁺ to ℝ⁺, let {φ_α} be a class of ⊕-decomposable measures such that, for every E_i|H_i ∈ C, the equation

$$\bigoplus_{A_r \subseteq E_i \wedge H_i} \varphi_\alpha(A_r) \;=\; x \odot \bigoplus_{A_r \subseteq H_i} \varphi_\alpha(A_r) \tag{21.1}$$

has a solution x ∈ [0, 1]. Moreover, φ_{α″}(A_r) = 0 for every A_r ∈ A \ A_{α″}, and an atom A_r belongs to A_{α″}, with α″ ≥ 1, if and only if there exists H_i ∈ ℋ⁰, with A_r ⊆ H_i, such that, for every α < α″, there exists E_i ∈ ℰ for which there is not a unique solution of equation (21.1).
The elements of the class {φ₀, φ₁, ..., φ_k}, with k ≤ m, will be called almost generating measures. If ⊙ is distributive over ⊕, they will be called generating measures.

Remark 38 - To construct a class of almost generating measures, we start with a ⊕-decomposable measure φ₀, and search for those H_i's such that eq. (21.1), with α = α₀, has a unique solution for the corresponding E_i's: if this is true for all H_i ∈ ℋ⁰, the class contains only φ₀. Otherwise, consider the set A₁ of atoms contained in those H_i's that do not satisfy the aforementioned condition, and assign a ⊕-decomposable measure φ₁ on A, with φ₁ = 0 outside of A₁. Search again for those H_i's such that eq. (21.1), with α = α₁, has a unique solution for the corresponding events E_i's, and so on. The class {φ₀, φ₁, ..., φ_k} is an almost generating class if the inclusion A_{α″} ⊂ A_{α′} for α″ > α′ is proper, i.e. if for each α there exists an atom A such that, for any H_i (and the corresponding E_i), with A ⊆ H_i ⊆ A_α, there is a unique solution of eq. (21.1). Notice that, if the φ_α's are probabilities (recall that in this case ⊕ is the usual sum, and take ⊙ as the usual product), there is always an H_i ⊆ A_α such that φ_α(H_i) > 0 (so that (21.1) has a unique solution also for all H_i containing the atom A ⊆ H_i, with φ_α(A) > 0). On the other hand, if the φ_α's are possibilities (recall that in this case ⊕ is the "max", and take ⊙ as the "min"), a "good" atom (in the same sense as above for probabilities) is an atom A such that φ_α(A) = 1.
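In the probabilistic case the construction just described reduces to a simple mechanism: equation (21.1) has a unique solution for E_i|H_i exactly at the first level α with φ_α(H_i) > 0. A minimal sketch (ours, with hypothetical names; each measure is given as a dictionary on the atoms):

```python
# Sketch (not from the book): the construction of Remark 38 in the
# probabilistic case (⊕ = +, ⊙ = *). Each measure is a dict on the atoms.

def level_of(H, measures):
    """First level alpha with phi_alpha(H) > 0, i.e. where (21.1) becomes
    uniquely solvable for every E|H; None if no level works."""
    for alpha, phi in enumerate(measures):
        if sum(phi.get(a, 0.0) for a in H) > 0:
            return alpha
    return None

def conditional(E, H, measures):
    """phi(E|H) as the unique solution of (21.1) at the right level."""
    phi = measures[level_of(H, measures)]
    return (sum(phi.get(a, 0.0) for a in E & H)
            / sum(phi.get(a, 0.0) for a in H))

# phi0 concentrates all mass on atom c, so any H inside {a, b} is handled
# by the next measure phi1, which lives on the atoms a and b only.
measures = [{"c": 1.0}, {"a": 0.25, "b": 0.75}]
H = frozenset({"a", "b"})
E = frozenset({"a"})
print(level_of(H, measures))        # 1: phi0(H) = 0, phi1(H) = 1
print(conditional(E, H, measures))  # 0.25
```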
Nevertheless, there are ⊕-decomposable measures that do not satisfy, for some α, the aforementioned requirement of proper inclusion between the relevant classes of subsets of atoms. To prove this, consider the following simple example.
Example 48 - Let ℰ be the algebra spanned by the atoms A, B, C and let ℋ⁰ = {H₁ = A ∨ B, H₂ = A ∨ C, H₃ = A ∨ B ∨ C}. Consider now the following ⊕-decomposable measure, with ⊕ the Łukasiewicz T-conorm (that is, x ⊕ y = min{x + y, 1}), and let x ⊙ y = max{x + y − 1, 0} be the Łukasiewicz T-norm:

φ₀(A) = 0,      φ₀(A ∨ B) = 2/3,
φ₀(B) = 2/3,    φ₀(A ∨ C) = 1/2,
φ₀(C) = 1/2,    φ₀(B ∨ C) = φ₀(A ∨ B ∨ C) = 1.

It is easy to prove that the above assessment satisfies the properties of a decomposable measure, but it is not possible to construct a class of almost generating measures. In fact the equations

$$\varphi_0(A) = x \odot \big(\varphi_0(A) \oplus \varphi_0(B)\big) \tag{21.2}$$

and

$$\varphi_0(A) = x \odot \big(\varphi_0(A) \oplus \varphi_0(C)\big) \tag{21.3}$$

have an infinite set of solutions, and so there is no atom A* such that for every H_i ⊇ A* the equation (21.1) has a unique solution for every element of the algebra ℰ.
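The multiplicity of solutions of (21.2) and (21.3) is easy to verify numerically; the following sketch (ours) scans a grid of candidate values x under the Łukasiewicz operations:

```python
# Sketch (not from the book): equations (21.2) and (21.3) of Example 48
# have a whole interval of solutions under the Lukasiewicz operations.
def conorm(x, y):            # Lukasiewicz T-conorm, the operation ⊕
    return min(x + y, 1.0)

def norm(x, y):              # Lukasiewicz T-norm, the operation ⊙
    return max(x + y - 1.0, 0.0)

phi0 = {"A": 0.0, "B": 2 / 3, "C": 0.5}
grid = [i / 100 for i in range(101)]

sols_2 = [x for x in grid
          if abs(phi0["A"] - norm(x, conorm(phi0["A"], phi0["B"]))) < 1e-12]
sols_3 = [x for x in grid
          if abs(phi0["A"] - norm(x, conorm(phi0["A"], phi0["C"]))) < 1e-12]
print(min(sols_2), max(sols_2))  # every x <= 1/3 solves (21.2)
print(min(sols_3), max(sols_3))  # every x <= 1/2 solves (21.3)
```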
Definition 29 - A (⊕, ⊙)-decomposable conditional measure φ defined on C = ℰ × ℋ⁰ is reducible if for any H_i ∈ ℋ⁰ there exists an H_s ∈ ℋ⁰, with H_s ⊆ H_i, such that for any H_j ∈ ℋ⁰, with H_s ⊆ H_j ⊆ H_i, the equation

$$\varphi\big((E_j \wedge H_j)\,\big|\,H_i\big) = \varphi(H_j|H_i) \odot x$$

has the unique solution x = φ(E_j|H_j).

Remark 39 - It is easy to prove that the above condition is satisfied (for instance) by conditional probability, but also by a measure φ with ⊕ = max and any ⊙, or by a φ with any ⊕ and with ⊙ strictly increasing.
We state now the following important result (for the proof, see [35]), which extends the analogous characterization theorems for conditional probability (Theorem 4, Chapter 11) and for conditional possibility (proved in [13], [14]).
Theorem 25 - Given a finite family C = ℰ × ℋ⁰ of conditional events, with ℰ a Boolean algebra, ℋ an additive set, ℋ ⊆ ℰ and ℋ⁰ = ℋ \ {∅}, let A = {A_r} denote the set of atoms of ℰ. If φ is a real function defined on C, and ⊕, ⊙ are two operations from φ(C) × φ(C) to ℝ⁺, the following two statements are equivalent:

(a) φ is a reducible (⊕, ⊙)-decomposable conditional measure on C;

(b) there exists a (unique) class of generating ⊕-decomposable measures such that, for any E_i|H_i ∈ C, there is a unique α such that x = φ(E_i|H_i) is the unique solution of the equation

$$\bigoplus_{A_r \subseteq E_i \wedge H_i} \varphi_\alpha(A_r) \;=\; x \odot \bigoplus_{A_r \subseteq H_i} \varphi_\alpha(A_r). \tag{21.4}$$

Notice that the class {φ_α} of ⊕-decomposable measures considered under (b) has a unique element only in the case that, for every conditional event E_i|H_i ∈ C, equation (21.4) admits φ(E_i|H_i) as its unique solution: for instance, for conditional probability this occurs if there is no H_i with P(H_i|Ω) = 0; for conditional possibility (with the usual max-min operations), this occurs if there is no conditional event E|H such that φ(H|Ω) = φ((E ∧ H)|Ω) < φ(E|H).
21.3 Weakly decomposable measures

The results discussed until now suggest that the operations ⊕, ⊙ on φ(C) × φ(C), involved in the definition of a conditional measure, should satisfy specific conditions only on suitable subsets of the cartesian product φ(C) × φ(C).
Definition 30 - If ℰ is a Boolean algebra, a function φ : ℰ → [0, 1] is a weakly ⊕-decomposable measure if

φ(Ω) = 1, φ(∅) = 0,

and there exists an operation ⊕ from φ(ℰ) × φ(ℰ) to ℝ⁺ such that the following condition holds: for every E_i, E_j ∈ ℰ with E_i ∧ E_j = ∅,

$$\varphi(E_i \vee E_j) = \varphi(E_i) \oplus \varphi(E_j).$$

From the above condition it is easily seen that the restriction of ⊕ to the following subset of φ(ℰ) × φ(ℰ)

$$\{(\varphi(E_i), \varphi(E_j)) : E_i, E_j \in \mathcal{E},\ E_i \wedge E_j = \emptyset\}$$

is commutative, associative and increasing, and admits 0 as neutral element. Nevertheless, it need not be extensible to a function defined on the whole φ(ℰ) × φ(ℰ) (and so neither on [0, 1]²) satisfying the same properties. To deepen these aspects, see Examples 2 and 3 in [35], which show that a strictly increasing associative extension of ⊕ may fail to exist, and that even the existence of an increasing associative extension is not guaranteed. We introduce now what seems to be the most natural concept of conditional measure.
Definition 31 - Given a family C = ℰ × ℋ⁰ of conditional events, where ℰ is a Boolean algebra, ℋ an additive set, with ℋ ⊆ ℰ and ℋ⁰ = ℋ \ {∅}, a real function φ defined on C is a weakly (⊕, ⊙)-decomposable conditional measure if

(γ1) φ(E|H) = φ(E ∧ H|H), for every E ∈ ℰ and H ∈ ℋ⁰;

(γ2) there exists an operation ⊕ : φ(C) × φ(C) → φ(C) whose restriction to the set

$$\Delta = \{(\varphi(E|H), \varphi(A|H)) : E, A \in \mathcal{E},\ H \in \mathcal{H}^0,\ E \wedge A \wedge H = \emptyset\}$$

is commutative, associative and increasing, admits 0 as neutral element, and is such that, for any given H ∈ ℋ⁰, the function φ(·|H) is a weakly ⊕-decomposable measure;

(γ3) there exists an operation ⊙ : φ(C) × φ(C) → φ(C) whose restriction to the set

$$\Gamma = \{(\varphi(E|H), \varphi(A|E \wedge H)) : A \in \mathcal{E},\ E, H, E \wedge H \in \mathcal{H}^0\}$$

is commutative, associative and increasing, admits 1 as neutral element and is such that, for every A ∈ ℰ and E, H ∈ ℋ⁰ with E ∧ H ≠ ∅,

$$\varphi((E \wedge A)|H) = \varphi(E|H) \odot \varphi(A|(E \wedge H));$$

(γ4) the operation ⊙ is distributive over ⊕ only for relations of the kind

$$\varphi(H|K) \odot \big(\varphi(E|H \wedge K) \oplus \varphi(F|H \wedge K)\big),$$

with K, H ∧ K ∈ ℋ⁰ and E ∧ F ∧ H ∧ K = ∅.
Remark 40 - It is easily seen that, with respect to the elements of Δ and Γ, the operations ⊕ and ⊙, respectively, are commutative and associative. On the other hand, it is possible to show (see [81]) that ⊙ is not necessarily extensible as an operation defined on (φ(C))² (and so neither on [0, 1]²) satisfying all the usual properties.

Definition 32 - The elements of a class of almost generating measures will be called weak generating measures if distributivity of ⊙ over ⊕ is required only for relations of the kind (x ⊕ y) ⊙ φ_α(H_i), for all x and y that are unique solutions of the equations (21.1) relative, respectively, to E_i and E_j with E_i ∧ E_j ∧ H_i = ∅.

So we are able to extend the characterization theorem (Theorem 25) to weakly decomposable measures:
Theorem 26 - Let C = ℰ × ℋ⁰, with ℰ a Boolean algebra, ℋ an additive set, ℋ ⊆ ℰ and ℋ⁰ = ℋ \ {∅}, be a finite family of conditional events, and let A = {A_r} denote the set of atoms of ℰ. If φ is a real function defined on C, and ⊕, ⊙ are two operations from φ(C) × φ(C) to ℝ⁺, then the following two statements are equivalent:

(a) φ is a reducible weakly (⊕, ⊙)-decomposable conditional measure on the family C;

(b) there exists a (unique) class of weak ⊕-decomposable (generating) measures such that, for any E_i|H_i ∈ C, there is a unique α such that x = φ(E_i|H_i) is the unique solution of the equation

$$\bigoplus_{A_r \subseteq E_i \wedge H_i} \varphi_\alpha(A_r) \;=\; x \odot \bigoplus_{A_r \subseteq H_i} \varphi_\alpha(A_r).$$
For the proof, see [35]. Notice that, due to the relaxation of the associative and distributive properties, the class of weakly decomposable conditional measures is quite large. So the requirement of being reducible, in order to characterize them as in Theorem 26, appears all the more essential. In fact, for an arbitrary decomposable conditional measure, the existence of a class {A_α} and of the relevant class {φ_α} generating it in the sense of Definition 28 is not assured. To prove this, consider the following example (which is a suitable extension of Example 48):
Example 49 - Let ℰ and ℋ⁰ be defined as in Example 48. Consider now the following (⊕, ⊙)-decomposable conditional measure, where ⊕ and ⊙ are the Łukasiewicz T-conorm and T-norm:

φ(A|H₃) = 0,    φ(B|H₃) = 2/3,    φ(C|H₃) = 1/2,
φ(A ∨ B|H₃) = 2/3,    φ(A ∨ C|H₃) = 1/2,    φ(B ∨ C|H₃) = φ(A ∨ B ∨ C|H₃) = 1,
φ(A|H₂) = 1/4,    φ(C|H₂) = 1,    φ(A ∨ C|H₂) = 1,
φ(A|H₁) = 1/6,    φ(B|H₁) = 1,    φ(A ∨ B|H₁) = 1,
φ(E|H_i) = 0    (E ∈ ℰ, E ∧ H_i = ∅, i = 1, 2, 3).

It is easy to prove that the above assessment satisfies the properties of a decomposable conditional measure and that φ is not reducible. On the other hand, there is no class {A_α} and no relevant almost generating measures {φ_α}: in fact φ₀, defined on A₀ = {A, B, C}, coincides with φ(·|H₃), but, since the equations (21.2) and (21.3) have an infinite set of solutions (i.e., they do not have φ(A|H₁) and φ(A|H₂) as unique solutions, respectively), there is no atom A* such that for every H_i ⊇ A* the equation (21.1) has a unique solution for every element of ℰ.
21.4 Concluding remarks
The class of reducible weakly decomposable conditional measures (which can be "generated", as in the case of conditional probabilities, by a suitable family of weakly decomposable unconditional measures) is much larger than the class of measures which are continuous transformations of a conditional probability. This is due to the fact that we deal with operations that are not necessarily continuous or strictly increasing (so that also min, for instance, can be considered as the operation ⊙); moreover, we consider operations satisfying the commutative and associative properties only on specific subsets of [0, 1]², operations which are not necessarily extensible (preserving the same properties) to the whole set. Nevertheless, our results point out that it is not possible to escape any form (even weak) of distributivity of ⊙ over ⊕. Finally, we note that the approach based on a direct assignment of the conditional measure removes any difficulty related, e.g., to the problem of conditioning with respect to events of zero measure.
Bibliography

[1] E. Adams. The Logic of Conditionals. Reidel, Dordrecht, 1975.
[2] M. Baioletti, A. Capotorti, S. Tulipani, and B. Vantaggi. "Elimination of Boolean variables for probabilistic coherence", Soft Computing 4(2): 81-88, 2000.
[3] M. Baioletti, A. Capotorti, S. Tulipani, and B. Vantaggi. "Simplification rules for the coherent probability assessment problem", Annals of Mathematics and Artificial Intelligence, 35: 11-28, 2002.
[4] B. Barigelli. "Data Exploration and Conditional Probability", IEEE Transactions on Systems, Man, and Cybernetics 24(12): 1764-1766, 1994.
[5] S. Benferhat, D. Dubois and H. Prade. "Nonmonotonic Reasoning, Conditional Objects and Possibility Theory", Artificial Intelligence 92: 259-276, 1997.
[6] P. Benvenuti and R. Mesiar. "Pseudo-additive measures and triangular-norm-based conditioning", Annals of Mathematics and Artificial Intelligence, 35: 63-70, 2002.
[7] P. Berti and P. Rigo. "Conglomerabilità, disintegrabilità e coerenza", Serie Ricerche Teoriche, n. 11, Dip. Statistico Univ. Firenze, 1989.
[8] V. Biazzo, A. Gilio. "A generalization of the fundamental theorem of de Finetti for imprecise conditional probability assessments", International Journal of Approximate Reasoning, 24: 251-272, 2000.
[9] V. Biazzo, A. Gilio, and G. Sanfilippo. "Efficient checking of coherence and propagation of imprecise probability assessments", in: Proceedings IPMU 2000, Madrid, pp. 1973-1976, 2000.
[10] P. Billingsley. Probability and Measure, Wiley, New York, 1995.
[11] D. Blackwell and L.E. Dubins. "On existence and non existence of proper, regular, conditional distributions", The Annals of Probability, 3: 741-752, 1975.
[12] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probability, Macmillan, Cambridge, 1854.
[13] B. Bouchon-Meunier, G. Coletti and C. Marsala. "Possibilistic conditional events", in: Proceedings IPMU 2000, Madrid, pp. 1561-1566, 2000.
[14] B. Bouchon-Meunier, G. Coletti and C. Marsala. "Conditional Possibility and Necessity", in: Technologies for Constructing Intelligent Systems (eds. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena, and R.R. Yager), Springer, Berlin, 2001.
[15] G. Bruno and A. Gilio. "Applicazione del metodo del simplesso al teorema fondamentale per le probabilità nella concezione soggettivistica", Statistica 40: 337-344, 1980.
[16] G. Bruno and A. Gilio. "Confronto tra eventi condizionati di probabilità nulla nell'inferenza statistica bayesiana", Rivista Matem. Sci. Econ. Soc., 8: 141-152, 1985.
[17] P. Calabrese. "An algebraic synthesis of the foundations of logic and probability", Information Sciences, 42: 187-237, 1987.
[18] A. Capotorti and B. Vantaggi. "Locally Strong Coherence in Inference Processes", Annals of Mathematics and Artificial Intelligence, 35: 125-149, 2002.
[19] A. Capotorti, L. Galli and B. Vantaggi. "Locally Strong Coherence and Inference with Lower-Upper Probabilities", Soft Computing, in press.
[20] P. Cheeseman. "Probabilistic versus fuzzy reasoning", in: Uncertainty in Artificial Intelligence (eds. L.N. Kanal and J.F. Lemmer), pp. 85-102, North-Holland, 1986.
[21] G. Coletti. "Numerical and qualitative judgments in probabilistic expert systems", in: Proc. of the International Workshop on Probabilistic Methods in Expert Systems (ed. R. Scozzafava), SIS, Roma, pp. 37-55, 1993.
[22] G. Coletti. "Coherent Numerical and Ordinal Probabilistic Assessments", IEEE Transactions on Systems, Man, and Cybernetics, 24: 1747-1754, 1994.
[23] G. Coletti. "Coherence Principles for handling qualitative and quantitative partial probabilistic assessments", Mathware & Soft Computing, 3: 159-172, 1996.
[24] G. Coletti, A. Gilio, and R. Scozzafava. "Conditional events with vague information in expert systems", in: Lecture Notes in Computer Sciences n. 521 (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Springer-Verlag, Berlin, pp. 106-114, 1991.
[25] G. Coletti and R. Scozzafava. "Characterization of Coherent Conditional Probabilities as a Tool for their Assessment and Extension", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 4: 103-127, 1996.
[26] G. Coletti and R. Scozzafava. "Exploiting zero probabilities", in: Proc. EUFIT '97, Elite Foundation, Aachen, pp. 1499-1503, 1997.
[27] G. Coletti and R. Scozzafava. "Conditional measures: old and new", in: Proc. of "New Trends in Fuzzy Systems", Napoli 1996 (eds. D. Mancini, M. Squillante, and A. Ventre), World Scientific, Singapore, pp. 107-120, 1998.
[28] G. Coletti and R. Scozzafava. "Null events and stochastical independence", Kybernetika 34(1): 69-78, 1998.
[29] G. Coletti and R. Scozzafava. "Zero probabilities in stochastic independence", in: Information, Uncertainty, Fusion (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Kluwer, Dordrecht (Selected papers from IPMU 1998, Paris), pp. 185-196, 2000.
[30] G. Coletti and R. Scozzafava. "Conditional Subjective Probability and Fuzzy Theory", in: Proc. of 18th NAFIPS International Conference, New York, IEEE, pp. 77-80, 1999.
[31] G. Coletti and R. Scozzafava. "Conditioning and Inference in Intelligent Systems", Soft Computing, 3: 118-130, 1999.
[32] G. Coletti and R. Scozzafava. "The role of coherence in eliciting and handling "imprecise" probabilities and its application to medical diagnosis", Information Sciences, 130: 41-65, 2000.
[33] G. Coletti and R. Scozzafava. "Stochastic Independence for Upper and Lower Probabilities in a Coherent Setting", in: Technologies for Constructing Intelligent Systems (eds. B. Bouchon-Meunier, J. Gutierrez-Rios, L. Magdalena, and R.R. Yager), Springer, Berlin (Selected papers from IPMU 2000, Madrid), vol. 2, 2001.
[34] G. Coletti and R. Scozzafava. "Fuzzy sets as conditional probabilities: which meaningful operations can be defined?", in: Proc. of 20th NAFIPS International Conference, Vancouver, IEEE, pp. 1892-1895, 2001.
[35] G. Coletti and R. Scozzafava. "From conditional events to conditional measures: a new axiomatic approach", Annals of Mathematics and Artificial Intelligence, 32: 373-392, 2001.
[36] G. Coletti and R. Scozzafava. "Stochastic independence in a coherent setting", Annals of Mathematics and Artificial Intelligence, 35: 151-176, 2002.
[37] G. Coletti and R. Scozzafava. "Bayes' theorem in a coherent setting", in: Fifth World Meeting of the International Society for Bayesian Analysis (ISBA), Istanbul (Abstract), 1997.
[38] G. Coletti and R. Scozzafava. "Conditional probability, fuzzy sets and possibility: a unifying view", Fuzzy Sets and Systems, 2002, to appear.
[39] G. Coletti, R. Scozzafava and B. Vantaggi. "Probabilistic Reasoning as a General Unifying Tool", in: Lecture Notes in Computer Sciences (eds. S. Benferhat and P. Besnard), Vol. LNAI 2143, pp. 120-131, Springer-Verlag, Berlin, 2001.
[40] G. Coletti, R. Scozzafava and B. Vantaggi. "Coherent Conditional Probability as a Tool for Default Reasoning", in: Proceedings IPMU 2002, Annecy, France, pp. 1663-1670, 2002.
[41] I. Couso, S. Moral, and P. Walley. "Examples of independence for imprecise probabilities", in: Int. Symp. on Imprecise Probabilities and their Applications (ISIPTA '99), Ghent, Belgium, pp. 121-130, 1999.
[42] R.T. Cox. "Probability, frequency and reasonable expectation", American Journal of Physics, 14(1): 1-13, 1946.
[43] A. Csaszar. "Sur la structure des espaces de probabilité conditionnelle", Acta Mathematica Academiae Scientiarum Hungaricae, 6: 337-361, 1955.
[44] L.M. De Campos and S. Moral. "Independence concepts for convex sets of probabilities", in: Uncertainty in Artificial Intelligence (UAI '95), Morgan and Kaufmann, San Mateo, pp. 108-115, 1995.
[45] B. de Finetti. "Sui passaggi al limite nel calcolo delle probabilità", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 155-156, 1930.
[46] B. de Finetti. "A proposito dell'estensione del teorema delle probabilità totali alle classi numerabili", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 901-905, 1930.
[47] B. de Finetti. "Ancora sull'estensione alle classi numerabili del teorema delle probabilità totali", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 1063-1069, 1930.
[48] B. de Finetti. "Sul significato soggettivo della probabilità", Fundamenta Mathematicae, 17: 298-329, 1931 - Engl. transl. in: Induction and Probability (eds. P. Monari, D. Cocchi), CLUEB, Bologna: 291-321, 1993.
[49] B. de Finetti. "La logique de la probabilité", in: Actes du Congrès International de Philosophie Scientifique, Paris 1935, Hermann: IV, 1-9, 1936.
[50] B. de Finetti. "Les probabilités nulles", Bull. Sci. Math. 60: 275-288, 1936.
[51] B. de Finetti. "La prévision: ses lois logiques, ses sources subjectives", Ann. Institut H. Poincaré 7: 1-68, 1937.
[52] B. de Finetti. "Sull'impostazione assiomatica del calcolo delle probabilità", Annali Univ. Trieste, 19: 3-55, 1949 - Engl. transl. in: Ch. 5 of Probability, Induction, Statistics, Wiley, London, 1972.
[53] B. de Finetti. Teoria della probabilità, Einaudi, Torino, 1970 - Engl. transl.: Theory of Probability, Voll. 1 and 2, Wiley, Chichester, 1974.
[54] B. de Finetti. "Probability: beware of falsifications!", in: Studies in Subjective Probability (eds. H.E. Kyburg and H.E. Smokler), Krieger Publ., New York, pp. 193-224, 1980.
[55] A.P. Dempster. "Upper and Lower Probabilities Induced by a Multivalued Mapping", Annals of Mathematical Statistics 38: 325-339, 1967.
[56] L.E. Dubins. "Finitely Additive Conditional Probabilities, Conglomerability and Disintegration", The Annals of Probability 3: 89-99, 1975.
[57] D. Dubois, S. Moral and H. Prade. "A semantics for possibility theory based on likelihoods", Journal of Mathematical Analysis and Applications 205: 359-380, 1997.
[58] D. Dubois and H. Prade. "Conditioning in Possibility and Evidence Theories: a Logical Viewpoint", in: Lecture Notes in Computer Sciences (eds. B. Bouchon-Meunier, L. Saitta, R.R. Yager), n. 313, pp. 401-408, Springer-Verlag, Berlin, 1988.
[59] D. Dubois and H. Prade. "Conditional Objects as Nonmonotonic Consequence Relationships", IEEE Transactions on Systems, Man, and Cybernetics, 24: 1724-1740, 1994.
[60] D. Dubois and H. Prade. "Possibility theory, probability theory and multiple-valued logics: a clarification", Annals of Mathematics and Artificial Intelligence, 32: 35-66, 2001.
[61] K. Fan. "On Systems of Linear Inequalities", in: Linear Inequalities and Related Systems, Annals of Mathematical Studies, Vol. 38, Princeton University Press, 1956.
[62] R. Feynman. "The concept of probability in quantum mechanics", in: Proc. 2nd Berkeley Symp. on Mathematical Statistics and Probability, University of California Press, Berkeley, pp. 533-541, 1951.
[63] M. Fréchet. "Sur l'extension du théorème des probabilités totales au cas d'une suite infinie d'événements, I", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 899-900, 1930.
[64] M. Fréchet. "Sur l'extension du théorème des probabilités totales au cas d'une suite infinie d'événements, II", Rend. Reale Ist. Lombardo Scienze e Lettere, 63: 1059-1062, 1930.
[65] A.M. Frisch and P. Haddawy. "Anytime Deduction for Probabilistic Logic", Artificial Intelligence 69(1-2): 93-122, 1993.
[66] D. Gale. The Theory of Linear Economic Models, McGraw-Hill, New York, 1960.
[67] P. Gardenfors. Knowledge in Flux, MIT Press, Cambridge (Massachusetts), 1988.
[68] R. Giles. "The concept of grade of membership", Fuzzy Sets and Systems, 25: 297-323, 1988.
[69] A. Gilio. "Criterio di penalizzazione e condizioni di coerenza nella valutazione soggettiva della probabilità", Boll. Un. Mat. Ital. (7) 4-B: 645-660, 1990.
[70] A. Gilio. "Probabilistic consistency of knowledge bases in inference systems", in: Lecture Notes in Computer Science (eds. M. Clarke, R. Kruse, S. Moral), Vol. 747, pp. 160-167, Springer-Verlag, Berlin, 1993.
[71] A. Gilio. "Probabilistic consistency of conditional probability bounds", in: Advances in Intelligent Computing (eds. B. Bouchon-Meunier, R.R. Yager, and L.A. Zadeh), Lecture Notes in Computer Science, Vol. 945, Springer-Verlag, Berlin, 1995.
[72] A. Gilio. "Probabilistic reasoning under coherence in System P", Annals of Mathematics and Artificial Intelligence, 34: 5-34, 2002.
[73] A. Gilio and S. Ingrassia. "Totally coherent set-valued probability assessments", Kybernetika 34(1): 3-15, 1998.
[74] A. Gilio and R. Scozzafava. "Vague distributions in Bayesian testing of a null hypothesis", Metron 43: 167-174, 1985.
[75] A. Gilio and R. Scozzafava. "Conditional events in probability assessment and revision", IEEE Transactions on Systems, Man, and Cybernetics 24(12): 1741-1746, 1994.
[76] M. Goldszmidt and J. Pearl. "Qualitative probability for default reasoning, belief revision and causal modeling", Artificial Intelligence 84: 57-112, 1996.
[77] I.R. Goodman and H.T. Nguyen. "Conditional objects and the modeling of uncertainties", in: Fuzzy Computing (eds. M. Gupta, T. Yamakawa), pp. 119-138, North Holland, Amsterdam, 1988.
[78] I.R. Goodman and H.T. Nguyen. "Mathematical foundations of conditionals and their probabilistic assignments", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 3: 247-339, 1995.
[79] T. Hailperin. "Best possible inequalities for the probability of a logical function of events", Amer. Math. Monthly 72: 343-359, 1965.
[80] P. Hajek. Metamathematics of Fuzzy Logic, Kluwer, Dordrecht, 1998.
[81] J.Y. Halpern. "A counterexample to theorems of Cox and Fine", J. of Artificial Intelligence Research 10: 67-85, 1999.
[82] P. Hansen, B. Jaumard, and M. Poggi de Aragao. "Column generation methods for probabilistic logic", ORSA Journal on Computing 3: 135-148, 1991.
[83] E. Hisdal. "Are grades of membership probabilities?", Fuzzy Sets and Systems, 25: 325-348, 1988.
[84] S. Holzer. "On coherence and conditional prevision", Boll. Un. Mat. Ital. (6) 4: 441-460, 1985.
[85] H. Jeffreys. Theory of Probability, Oxford University Press, Oxford, 1948.
[86] E.P. Klement, R. Mesiar, E. Pap. Triangular Norms, Kluwer, Dordrecht, 2000.
[87] B.O. Koopman. "The Bases of Probability", Bulletin A.M.S., 46: 763-774, 1940.
[88] P.H. Krauss. "Representation of Conditional Probability Measures on Boolean Algebras", Acta Math. Acad. Scient. Hungar., 19: 229-241, 1968.
[89] F. Lad. Operational Subjective Statistical Methods, Wiley, New York, 1996.
[90] F. Lad, J.M. Dickey and M.A. Rahman. "The fundamental theorem of prevision", Statistica 50: 19-38, 1990.
[91] F. Lad, J.M. Dickey and M.A. Rahman. "Numerical application of the fundamental theorem of prevision", J. Statist. Comput. Simul. 40: 135-151, 1992.
[92] F. Lad and R. Scozzafava. "Distributions agreeing with exchangeable sequential forecasting", The American Statistician, 55: 131-139, 2001.
[93] D. Lehmann and M. Magidor. "What does a conditional knowledge base entail?", Artificial Intelligence 55: 1-60, 1992.
[94] R.S. Lehman. "On confirmation and rational betting", The J. of Symbolic Logic 20: 251-262, 1955.
[95] D.V. Lindley. "A statistical paradox", Biometrika 44: 187-192, 1957.
[96] H.T. Nguyen and E.A. Walker. A first course in fuzzy logic, CRC Press, Boca Raton, 1997.
[97] A. Nies and L. Camarinopoulos. "Application of fuzzy set and probability theory to data uncertainty in long term safety assessment of radioactive waste disposal systems", in: Probabilistic Safety Assessment and Management (G. Apostolakis, ed.), Vol. 2, pp. 1389-1394, Elsevier, N.Y., 1991.
[98] N.J. Nilsson. "Probabilistic Logic", Artificial Intelligence 28: 71-87, 1986.
[99] N.J. Nilsson. "Probabilistic Logic Revisited", Artificial Intelligence 59: 39-42, 1993.
[100] J.B. Paris. The Uncertain Reasoner's Companion, Cambridge University Press, Cambridge, 1994.
[101] K.R. Popper. The Logic of Scientific Discovery, Routledge, London, 1959.
[102] E. Regazzini. "Finitely additive conditional probabilities", Rend. Sem. Mat. Fis. Milano 55: 69-89, 1985.
[103] R. Reiter. "A Logic for Default Reasoning", Artificial Intelligence, 13(1-2): 81-132, 1980.
[104] A. Renyi. "On a New Axiomatic Theory of Probability", Acta Mathematica Academiae Scientiarum Hungaricae, 6: 285-335, 1955.
[105] P. Rigo. "Un teorema di estensione per probabilità condizionate finitamente additive", in: Atti XXXIV Riunione Scientifica S.I.S., Siena, Vol. 2, pp. 27-34, 1988.
[106] A. Robinson. Nonstandard Analysis, Princeton University Press, Princeton, 1996.
[107] S.J. Russell and P. Norvig. Artificial Intelligence. A Modern Approach, Prentice-Hall, New Jersey, 1995.
[108] G. Schay. "An Algebra of Conditional Events", Journal of Mathematical Analysis and Applications, 24: 334-344, 1968.
[109] R. Scozzafava. "Probabilità σ-additive e non", Boll. Un. Mat. Ital., (6) 1-A: 1-33, 1982.
[110] R. Scozzafava. "A survey of some common misunderstandings concerning the role and meaning of finitely additive probabilities in statistical inference", Statistica, 44: 21-45, 1984.
[111] R. Scozzafava. "A merged approach to stochastics in engineering curricula", European Journal of Engineering Education, 15(3): 241-248, 1990.
[112] R. Scozzafava. "Probabilità condizionate: de Finetti o Kolmogoroff?", in: Scritti in omaggio a L. Daboni, pp. 223-237, LINT, Trieste, 1990.
[113] R. Scozzafava. "The role of probability in statistical physics", Transport Theory and Statistical Physics, 29(1-2): 107-123, 2000.
[114] R. Scozzafava. "How to solve some critical examples by a proper use of coherent probability", in: Uncertainty in Intelligent Systems (eds. B. Bouchon-Meunier, L. Valverde, R.R. Yager), pp. 121-132, Elsevier, Amsterdam, 1993.
[115] R. Scozzafava. "Subjective conditional probability and coherence principles for handling partial information", Mathware & Soft Computing, 3: 183-192, 1996.
[116] G. Shafer. A mathematical theory of evidence, Princeton University Press, Princeton, 1976.
[117] G. Shafer. "Probability judgement in Artificial Intelligence and Expert Systems", Statistical Science, 2: 3-16, 1987.
[118] P.P. Shenoy. "On Spohn's Rule for Revision of Beliefs", International Journal of Approximate Reasoning, 5: 149-181, 1991.
[119] R. Sikorski. Boolean Algebras, Springer, Berlin, 1964.
[120] W. Spohn. "Ordinal conditional functions: a dynamic theory of epistemic states", in: Causation in Decision, Belief Change, and Statistics (eds. W.L. Harper, B. Skyrms), Vol. II, Dordrecht, pp. 105-134, 1988.
[121] W. Spohn. "On the Properties of Conditional Independence", in: Scientific Philosopher 1: Probability and Probabilistic Causality (eds. P. Humphreys and P. Suppes), Kluwer, Dordrecht, pp. 173-194, 1994.
[122] W. Spohn. "Ranking Functions, AGM Style", Research Group "Logic in Philosophy", 28, 1999.
[123] B. Vantaggi. "Conditional Independence in a Finite Coherent Setting", Annals of Mathematics and Artificial Intelligence, 32: 287-313, 2001.
[124] B. Vantaggi. "The L-separation Criterion for Description of cs-independence Models", International Journal of Approximate Reasoning, 29: 291-316, 2002.
[125] P. Walley. Statistical Reasoning with Imprecise Probabilities, Chapman and Hall, London, 1991.
[126] P.M. Williams. "Notes on conditional previsions", School of Mathematical and Physical Sciences, working paper, The University of Sussex, 1975.
[127] P.M. Williams. "Indeterminate probabilities", in: Formal Methods in the Methodology of Empirical Sciences (eds. M. Przelecki, K. Szaniawski, and R. Vojcicki), Reidel, Dordrecht, pp. 229-246, 1976.
[128] L.A. Zadeh. "Fuzzy sets", Information and Control, 8: 338-353, 1965.