Synthese (2011) 181:181–184 DOI 10.1007/s11229-010-9795-2
The 37th annual meeting of the Society for Exact Philosophy

Marc Moffett · Greg Ray
Received: 30 July 2010 / Accepted: 30 July 2010 / Published online: 17 September 2010 © Springer Science+Business Media B.V. 2010
1 Introduction

This special issue of Synthese is devoted to papers from the 37th annual meeting of the Society for Exact Philosophy (SEP). In 2009, thirty-nine years after its inaugural meeting held at McGill University in Montreal in 1970,1 the Society met in Edmonton, Alberta, and showed its continued vigor with a most excellent conference organized by Bernard Linsky and F. Jeffrey Pelletier with the generous support of the University of Alberta. The purpose of the Society is to “provide sustained discussion among researchers who believe that rigorous methods have a place in philosophical investigations.” To this end, the Society meets annually, alternating between meeting locations in Canada and the United States. According to the Encyclopedia of Associations, the group consists primarily of “American and Canadian academics in philosophy, computer science, linguistics and mathematics.” Properly understood, this is probably fair enough, but the diversity of interests of our members—who number over 400 and span over twenty countries—is not to be underestimated. More information on the Society—its activities and history—can be found at http://www.phil.ufl.edu/SEP.
1 The 37th annual meeting of the Society occurred in its 39th year owing to three occasions between 1970 and 1976 during which the Society did not meet. The reader is invited to check these figures for consonance. I am indebted in all historical matters regarding the SEP to Jeff Pelletier’s historical accounting given in “The Society for Exact Philosophy” Ruch Filozoficzny 48 (1991): 107–118.
M. Moffett
University of Wyoming, Laramie, WY, USA

G. Ray (B)
Department of Philosophy, University of Florida, Gainesville, FL 32611-8545, USA
Proceedings of SEP meetings have been published in a variety of venues, including the Journal of Philosophical Logic, Philosophia, Philosophical Studies, Synthese, Topoi, as well as in various special conference volumes. We are pleased to present to you in this special issue of Synthese yet another selection of fine papers from a very fine meeting of the Society for Exact Philosophy.

2 Other papers

The 2009 SEP meeting was full of good and interesting presentations. Of course, not all of these could be brought to you in the pages of this journal. Accordingly, I would like to make mention of the many fine papers which are not represented here, but are deserving of your interest and attention as they are published elsewhere. The following scholars spoke at the 2009 SEP meeting:

Marshall Abrams (Alabama-Birmingham): Toward a Mechanistic Interpretation of Probability.
Ken Akiba (Virginia Commonwealth): The Coherence and Significance of Conjunctive, Disjunctive and Negative Objects.
Martin W. Allen (Massachusetts): One View of Logical Preservation and Negation.
Roberta Ballarin (British Columbia): Disjunctive Effects.
Prasanta S. Bandyopadhyay, Davin Nelson, Gordon Brittan, Mark Greenwood and Jesse Berwald (Montana State): The Logic of Simpson’s Paradox.
William Bauer (Nebraska-Lincoln): The Being of Pure Powers.
Katalin Bimbó (Alberta): FOL Simplified.
Tomas Bogardus (Austin): A New Argument for Dualism.
David Boutillier (Western Ontario): Rationalism and Logic.
Ingo Brigandt (Alberta): Scientific Reasoning is Material Inference.
Derek Brown (Brandon): On a Rational Reconstruction of Intensionalism Debates.
Bryson Brown (Lethbridge): From Inference to Connectives.
Ben Caplan, Chris Tillman, and Patrick Reeder (Ohio State & Manitoba): Parts of Singletons.
Charles Chihara (UC Berkeley): Two Nominalistic Accounts of Mathematics.
Roger Clarke (British Columbia): ‘The Ravens Paradox’ is a Misnomer.
Murray Clarke (Concordia): Does Error Evolve?
Anthony Coleman (Willamette): An Analysis of the Obvious.
Kenneth Easwaran (Southern California): Infinitesimal Probabilities.
Yvon Gauthier (Montreal): The Foundational Significance of Applied Proof Theory.
Eric Hiddleston (Wayne State): Reductionism and the Micro-Macro Mirroring Thesis.
Sam Hillier (Alberta): Mathematics in Science? Carnap vs. Quine.
Christopher Hitchcock (Cal Tech): Trumping and Contrastive Causation.
Glen Hoffman (Ryerson): Rationalist Infallibilism.
Robert Hudson (Saskatchewan): Realism and the Bullet Cluster.
Octavian Ion (Alberta): Two Problems with the Direct Reference Theory of Belief Reports.
Ray E. Jennings and Yue Chen (Simon Fraser): Articular Logic.
Richard Johns (British Columbia): Self-Organisation in Dynamical Systems: A Limiting Result.
Nicholaos Jones (Alabama-Huntsville): Approximation and Inconsistency.
Michael Katz and Olga Semyonov (Haifa): Everything You Always Wanted to Know About the Feminist Bank Teller Linda—But Didn’t Know You Could Ask.
Herbert Korté (Regina): Naturalizing Natural Deduction.
François Lepage (Montreal): Lesniewski’s Ontology and Naïve Set Theory.
Aaron Lercher (Louisiana State): Integration, Ergodic Theory and Inference to the Best Explanation.
Kirk Ludwig (Florida): Ontology and Collective Action.
Genoveva Martí and Jose Martinez (Barcelona): General Terms, Rigidity and the Trivialization Problem.
Paul McNamara (New Hampshire): Praise, Blame, Obligation and Beyond: Toward a Framework for Classical Supererogation.
Marc A. Moffett (Wyoming): Purposive Knowledge.
Michael Morreau (Maryland): Comparative Overall Similarity.
Adam Morton (Alberta): Visibility Logic.
Seyed Mousavian (Alberta & Iranian Institute of Philosophy): Neo-Meinongian Neo-Russellians.
Ioan Muntean (UC San Diego): A Nascent Work on Universals: T. Maudlin’s Fiber Bundle Metaphysics.
Kent Peacock (Lethbridge): Quantum Mechanics and the de-Booleanization of Time.
Brian Pickel (Austin): Generalizing Soames’ Argument Against Rigidified Descriptivism.
John L. Pollock (Arizona): Probable Probabilities.
Greg Ray (Florida): The Problem of Negative Existentials Solved Inadvertently.
Peter Schotch & Gillman Payette (Dalhousie & Calgary): Worlds and Times.
Gila Sher (UC San Diego): Is Logic in the Mind or in the World?
Giacomo Sillari (Pennsylvania): Rule Following as Coordination: A Game-Theoretic Approach.
Barry Hartley Slater (Western Australia): A Perfect Language.
Rachel Sterken (St. Andrews & Oslo): Generics, Semantic Blindness and Mosquitoes.
Alessandro Torza (Boston): ‘Identity’ Without Identity.
Alasdair Urquhart (Toronto): Counting Types of Boolean Functions – from Jevons to Polya.
Susan Vineberg (Wayne State): Two Kinds of Mathematical Explanation.
Gregory R. Wheeler (CENTRIA): Counterfactual Evidential Probability.
Michel-Antoine Xhignesse (Queen’s): Stuff and Coincident Objects.
Byeong-Uk Yi (Toronto): Two Envelope Paradox and Causal Dependence.
Edward N. Zalta (CSLI & Stanford Ency of Philosophy): A (Computational) System of the World.

As this list will make plain, there are many topics of special and perennial interest in the Society which are not represented by the papers that follow—in fact, more omitted than can be represented here, but that is as it must be. Perhaps the only way to get a clear grasp of the interests of this thriving group of researchers is to attend an annual meeting of the Society.
Synthese (2011) 181:185–208 DOI 10.1007/s11229-010-9797-0
The logic of Simpson’s paradox

Prasanta S. Bandyopadhyay · Davin Nelson · Mark Greenwood · Gordon Brittan · Jesse Berwald
Received: 4 December 2009 / Revised: 3 May 2010 / Accepted: 26 July 2010 / Published online: 28 September 2010 © Springer Science+Business Media B.V. 2010
Abstract There are three distinct questions associated with Simpson’s paradox. (i) Why, or in what sense, is Simpson’s paradox a paradox? (ii) What is the proper analysis of the paradox? (iii) How should one proceed when confronted with a typical case of the paradox? We propose a “formal” answer to the first two questions which, among other things, includes deductive proofs for important theorems regarding Simpson’s paradox. Our account contrasts sharply with Pearl’s causal (and questionable) account of the first two questions. We argue that the “how to proceed” question does not have a unique response, and that the response depends on the context of the problem. We evaluate an objection to our account by comparing ours with Blyth’s account of the paradox. Our research on the paradox suggests that the “how to proceed” question needs to be divorced from what makes Simpson’s paradox “paradoxical.”
Davin Nelson is a former student of Montana State University.

P. S. Bandyopadhyay (B) · G. Brittan
Department of History & Philosophy & Affiliate to Astrobiology Biogeocatalysis Research Center, Montana State University, Bozeman, MT, USA
e-mail: [email protected]

G. Brittan
e-mail: [email protected]

D. Nelson
Department of History & Philosophy, Montana State University, Bozeman, MT, USA

M. Greenwood · J. Berwald
Department of Mathematical Sciences, Montana State University, Bozeman, MT, USA
e-mail: [email protected]

J. Berwald
e-mail: [email protected]
Keywords Three questions · Conflation of three questions · Two experiments · Collapsibility principle · Confounding · What to do questions

1 Overview

Simpson’s Paradox (SP) involves the reversal of the direction of a comparison or the cessation of an association when data from several groups are combined to form a single whole. There are three distinct questions associated with SP: (i) why, or in what sense, is SP a paradox? (ii) what is the proper analysis of this paradox? (iii) how should one proceed when confronted with a typical case of the paradox? We propose a “formal” account of the first two questions. We argue that there is no unique answer to the “how to proceed” question; rather, what we should do varies as a function of the available background information. One needs to be careful about the scope of the paper: we offer no novelty with regard to the treatment of the “how to proceed” question, except by way of making clear the assumptions involved in addressing it and distinguishing it from what makes Simpson’s paradox “paradoxical.” Our analysis of the paradox, however, differs sharply from the causal account offered by Judea Pearl (Pearl 2000; Greenland et al. 1999).1 In our view, his account is not persuasive. The premises that generate the paradox are non-causal in character, and a genuine logical inconsistency is at stake when a full reconstruction of the paradox is carried out.

2 Simpson’s paradox and its logical analysis

Consider an example of the paradox (hereafter called the type I version) (Table 1). Here, “CV” marks the categorical variable, with values “F” for females and “M” for males; “Accept” and “Reject” record acceptance and rejection for two departments, D1 and D2. This is a formulation of the paradox in which the association in the subpopulations is reversed in the combined population: although the acceptance rates for females are higher than for males in each department, in the combined population ignoring sex the rates are reversed.

Table 1 Simpson’s paradox (Type I)

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F  | 180            | 20             | 100            | 200            | 90          | 33          | 56
M  | 480            | 120            | 10             | 90             | 80          | 10          | 70
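The arithmetic behind Table 1 is easy to verify. Here is a minimal sketch (ours, not part of the paper) that recomputes the six rates from the raw counts:

```python
# A minimal check of the Table 1 arithmetic (our sketch, not the paper's).
# Each department maps to (accepted, rejected) counts.
females = {"D1": (180, 20), "D2": (100, 200)}
males = {"D1": (480, 120), "D2": (10, 90)}

def rate(accepted, rejected):
    """Acceptance rate as a fraction of all applicants."""
    return accepted / (accepted + rejected)

for dept in ("D1", "D2"):
    print(dept, f"{rate(*females[dept]):.0%} females vs {rate(*males[dept]):.0%} males")
# D1: 90% vs 80%; D2: 33% vs 10% -- females ahead in each department

total_f = rate(sum(a for a, _ in females.values()), sum(r for _, r in females.values()))
total_m = rate(sum(a for a, _ in males.values()), sum(r for _, r in males.values()))
print(f"Overall: {total_f:.0%} females vs {total_m:.0%} males")  # 56% vs 70% -- reversed
```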
1 The other influential work on causal inference is due to Spirtes, Glymour and Scheines and their colleagues (Spirtes et al. 2000). They are interested in representing systems of causal relationships as well as inferring causal relationships from purely observational data with the help of certain assumptions. How to address/eliminate situations like Simpson’s paradox in observational data while making causal inferences is a key feature of their work. We have evaluated their research in Bandyopadhyay et al. (unpublished).
We now propose an analysis of the paradox. Consider two populations, [A, B], taken to be mutually exclusive and jointly exhaustive. The measured overall rates for the two populations are called [α, β], respectively. Each population is partitioned into categories called [1, 2], and the measured rates within each partition are called [A1, A2, B1, B2]. Let f1 = the number of females accepted in D1 and F1 = the total number of females who applied to D1; let m1 = the number of males accepted in D1 and M1 = the total number of males who applied to D1. Then A1 = f1/F1 and B1 = m1/M1. Similarly we define A2 and B2: with f2 = the number of females accepted in D2, F2 = the total number of females who applied to D2, m2 = the number of males accepted in D2, and M2 = the total number of males who applied to D2, we have A2 = f2/F2 and B2 = m2/M2. Likewise, α and β represent the overall rates for the female and male populations, respectively:

α = (f1 + f2)/(F1 + F2) and β = (m1 + m2)/(M1 + M2).

Because α, β, A1, A2, B1 and B2 are rates of some form, they range between 0 and 1 inclusive. We stipulate the following definitions:

C1 ≡ A1 ≥ B1
C2 ≡ A2 ≥ B2
C3 ≡ β ≥ α
C ≡ (C1 & C2 & C3)

We define a term θ, which connects the acceptance rates within each partition (A1, B1, A2 and B2) to the overall acceptance rates (α and β):

θ = (A1 − B1) + (A2 − B2) + (β − α).

A situation is a Simpson’s paradox (SP) if and only if (i) C ≡ (C1 & C2 & C3) and (ii) C4 ≡ θ > 0. Each condition (i or ii) is necessary, and jointly they constitute a sufficient condition for generating SP. Consider why condition (i) alone is not sufficient. If C is true, then we get Box I, because C generates the latter.

Box I
(1) A1 = B1 & A2 = B2 & β = α
(2) A1 = B1 & A2 = B2 & β > α
(3) A1 > B1 & A2 = B2 & β = α
(4) A1 = B1 & A2 > B2 & β = α
(5) A1 > B1 & A2 > B2 & β = α
(6) A1 > B1 & A2 = B2 & β > α
(7) A1 = B1 & A2 > B2 & β > α
(8) A1 > B1 & A2 > B2 & β > α
Case 1 of Box I, i.e., A1 = B1 & A2 = B2 & β = α, shows that condition (i) (i.e., C) taken alone is not sufficient, because case 1 implies neither the cessation of the association nor a reversal in the overall population. Therefore, we need (ii) θ > 0 to eliminate case 1. Hence, C alone cannot be sufficient.

We use Table 2 to argue why (ii) alone is not sufficient.

Table 2 No Simpson’s paradox

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F  | 40             | 60             | 100            | 100            | 40          | 50          | 46.6
M  | 50             | 50             | 80             | 120            | 50          | 40          | 43.3

In Table 2 (reading the male row as population A and the female row as population B, as the stated rates require), A1 > B1, B2 > A2 and β > α. This example satisfies (ii), since θ = (0.50 − 0.40) + (0.40 − 0.50) + (0.466 − 0.433) > 0. However, this is not a case of Simpson’s paradox: it violates C, because one conjunct of C, A2 ≥ B2, remains unsatisfied. Hence, (ii) alone cannot be sufficient to generate Simpson’s paradox. If we have both C and C4 (i.e., θ > 0), then we have a sufficient condition for generating the paradox.

Consider why (i) is necessary. To answer this, we need to show that if C is not satisfied, then we cannot derive the paradox. If we deny C, then we get seven combinations (Box II).
Box II
1. ∼C1
2. ∼C2
3. ∼C3
4. ∼C1 & ∼C2
5. ∼C2 & ∼C3
6. ∼C1 & ∼C3
7. ∼C1 & ∼C2 & ∼C3
We will show that if ∼C1 is the case (i.e., case 1 in Box II), then we get a combination where B1 > A1, A2 ≥ B2 and β ≥ α. This manifests no reversal, and hence cannot be a case of Simpson’s paradox, as shown in Table 3 (again reading the male row as population A).
Table 3 No Simpson’s paradox

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F  | 70             | 30             | 30             | 70             | 70          | 30          | 50
M  | 40             | 60             | 50             | 50             | 40          | 50          | 45
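The two conditions are mechanical enough to check by computation. Below is a sketch of ours (the function name and the rate encodings are our own, not the paper’s) that tests condition (i) and condition (ii) against the rates from Tables 1–3; for Tables 2 and 3 the male row is passed as population A, following the labeling those tables use.

```python
# A sketch of the paper's two conditions: C = (C1 & C2 & C3) and theta > 0.
# Inputs are acceptance rates: (A1, A2, alpha) for population A and
# (B1, B2, beta) for population B.
def is_simpsons_paradox(A1, A2, alpha, B1, B2, beta):
    C = (A1 >= B1) and (A2 >= B2) and (beta >= alpha)   # condition (i)
    theta = (A1 - B1) + (A2 - B2) + (beta - alpha)      # condition (ii): theta > 0
    return C and theta > 0

print(is_simpsons_paradox(0.90, 0.33, 0.56, 0.80, 0.10, 0.70))    # Table 1: True
print(is_simpsons_paradox(0.50, 0.40, 0.433, 0.40, 0.50, 0.466))  # Table 2: False (C2 fails)
print(is_simpsons_paradox(0.40, 0.50, 0.45, 0.70, 0.30, 0.50))    # Table 3: False (C1 fails)
```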
Given Box II, if we take the other cases in which C is false, it likewise follows that there is no paradox. Since the proofs for those negations are similar to the one just given for ∼C1, we do not repeat them. Hence, C is necessary.

Why is (ii), that is, C4: θ > 0, necessary? If C is true, then each of the three summands of θ is non-negative, so θ ≥ 0; the denial of θ > 0 therefore leaves only θ = 0. But θ = 0 with non-negative summands forces A1 = B1 & A2 = B2 & β = α, i.e., case 1 of Box I, which does not represent an instance of reversal. Hence, (ii) θ > 0 is necessary.

Two preliminary points are worth mentioning: the characterization of the puzzle in terms of our two conditions captures the central intuitions at stake in the example given, and the conditions are in no way ad hoc. The central intuitions are, once again, the reversal or the cessation of an association in the overall population. Three more points follow. First, the paradox is “structural” in character, in the sense that the reasoning that leads to it is deductive. Consider our example, which involves simple arithmetic: the overall rates of acceptance for both females and males follow from their rates of acceptance in the two departments taken separately. Second, a probabilistic (in the sense of a statistical or inductive) solution is not available.2 Third, unless someone uses the notion of causation trivially, for example, believes that 2 + 2 “causes” 4, there is no reason to assume that there are causal intuitions lurking in the background.3

So far we have discussed SP in general. We are now interested in the specific relationships between the two rates of acceptance in each sub-population for both populations. We have proved two theorems, Theorems 1 and 2, to address these relationships, and a result derived from them, Theorem 3, showing the connection between them. In addition, once we know the interrelationships between the two rates of acceptance in each sub-population, we want to know whether there are logical relationships between those rates and the overall acceptance rate in each population. A set of lemmas settles these questions.
2 This account seems to go against the view held by Freedman et al. (1999). However, this depends on how we construe the following passage along with an email communication with David Freedman. These authors write in their celebrated textbook, “[t]he statistical lesson: relationships between percentages in subgroups (for instance, admissions rates for men and women in each department separately) can be reversed when the subgroups are combined. This is called Simpson’s paradox.” If we construe “statistical lesson” in terms of non-monotonic reasoning, then it does not seem that there is any statistical lesson hidden in the paradox. However, it is possible that in this passage the mathematical reasoning involved in the paradox has been taken broadly to stand for statistical reasoning. If the second construal is the intended meaning, then there is no difference between their argument and ours. The latter meaning is what Freedman hinted at in his response to one author’s query. Freedman wrote, “[t]he issue [concerning the paradox] is not uncertainty. It has little to do with the distinction between inductive and deductive reasoning, as far as I understand these terms. Simpson’s paradox is a surprising fact about weighted averages, i.e., it’s a math fact. It has big implications for applied statistics (18 January, 2004).” We agree with Freedman that it has nothing to do with uncertainty and is a mathematical fact about ratios, but we disagree with him about the nature of the reasoning which, according to us, is purely deductive. Readers are invited to compare this footnote and our comments on Freedman’s email with footnotes 12 and 13 and our comments in the corresponding body of the paper. We are very thankful to Freedman for this communication.
3 This goes against the view held by Pearl (2000, 2009).
Table 4 No Simpson’s paradox

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F  | 75             | 225            | 75             | 225            | 25          | 25          | 25
M  | 10             | 90             | 20             | 80             | 10          | 20          | 15
Table 5 No Simpson’s paradox

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F  | 10             | 90             | 20             | 80             | 10          | 20          | 15
M  | 75             | 225            | 75             | 225            | 25          | 25          | 25
First, we motivate each theorem and the lemmas using examples; their proofs are provided in the appendix. Theorems 1, 2, and 3 are given below:

TH1: SP results only if A1 ≠ A2.
TH2: SP arises only if B1 ≠ B2.
TH3: SP arises only if (A1 ≠ A2) if and only if (B1 ≠ B2).

The following example, based on Table 4, shows why the condition for Theorem 1 needs to hold: since A1 = A2, i.e., 25% = 25%, no paradox results. Table 5 explains why the condition laid down in Theorem 2 needs to hold: since B1 = B2, i.e., 25% = 25%, a paradox does not result in Table 5. The first two theorems tell us that neither A1 = A2 nor B1 = B2 can hold in cases of SP. We can see from our examples, based on Tables 4 and 5, that there might be some relationship between A1 ≠ A2 and B1 ≠ B2. However, those examples cannot tell us exactly what that relationship is. Theorem 3 states it: Simpson’s paradox arises only if (A1 ≠ A2) if and only if (B1 ≠ B2).

Given our characterization of Simpson’s paradox, α, being the overall rate of acceptance for population A, is a weighted average of A1 and A2; hence α lies between A1 and A2. Similarly, β, being a weighted average of B1 and B2, lies between B1 and B2. Our four lemmas (LM1, LM2, LM3, and LM4) provide more specific information about the relationships between the rates of acceptance in the sub-populations and the overall rate of acceptance in each population. They tell us what bridge can be built between the rates of acceptance in each subpopulation (e.g., when A1 ≠ A2, or B1 ≠ B2) and the overall acceptance rates in each population. Two lemmas (LM1 and LM2) show the inter-connections among A1, A2, and α:

LM1: If A1 > A2, then A1 > α > A2.
LM2: If A2 > A1, then A2 > α > A1.
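The lemmas are instances of a general fact about weighted averages. Here is a one-line gloss of ours for LM1 (the paper’s official proofs are in the appendix):

```latex
% alpha is a strict convex combination of A_1 and A_2; the weight w is
% department 1's share of the female applicants (our gloss on LM1).
\alpha \;=\; \frac{f_1+f_2}{F_1+F_2}
       \;=\; w\,A_1 + (1-w)\,A_2 ,
\qquad w = \frac{F_1}{F_1+F_2},\quad 0 < w < 1 .
```

With 0 < w < 1 (both departments receive female applicants), A1 > A2 immediately gives A1 > α > A2, which is LM1; LM2, LM3 and LM4 follow by swapping indices or populations.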
Table 6 Simpson’s paradox

CV | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | D1 (%) | D2 (%) | Overall (%)
M  | 25             | 75             | 100            | 100            | 25     | 50     | 42
F  | 75             | 225            | 100            | 100            | 25     | 50     | 35
Table 1, the type I version of Simpson’s paradox, obeys LM1: with A1 = 90% > A2 = 33%, we have A1 = 90% > α = 56% > A2 = 33%. Table 6 obeys LM2 (here the male row is population A): with A2 = 50% > A1 = 25%, we have A2 = 50% > α = 42% > A1 = 25%. The proof of LM1 is furnished in the appendix. LM3 and LM4 are given below:

LM3: If B1 > B2, then B1 > β > B2.
LM4: If B2 > B1, then B2 > β > B1.

The type I version of Simpson’s paradox gives an example in which the condition for LM3 holds: with B1 = 80% > B2 = 10%, we find B1 = 80% > β = 70% > B2 = 10%. Likewise, Table 6 satisfies the condition for LM4: with B2 = 50% > B1 = 25%, B2 = 50% > β = 35% > B1 = 25%. The four lemmas are symmetric with respect to the indices; therefore, we have proved only LM1 and LM3. The other cases, LM2 and LM4, are handled identically by swapping variables and indices.

This straightforward “formal” analysis might not, however, alleviate the suspicions of those who are not familiar with the literature but who, when confronted with SP examples, find them “perplexing” and want a “deeper” explanation of their puzzlement. We provide an explanation of how the paradox arises in our type I version and why people find it perplexing. To explain each, we have reconstructed our type I version of SP in terms of its premises and conclusion; the point of the reconstruction is sufficiently general to apply to all types of SP. Before the reconstruction, we introduce a principle, the collapsibility principle (CP), which plays a crucial role in it. We call a dataset collapsible if and only if [A1 ≥ B1 and A2 ≥ B2] → α ≥ β, and we call the principle that underlies this dataset the CP. Recall that A1 and A2 stand for the rates of acceptance for population A in departments 1 and 2, respectively; B1 and B2 stand for the rates of acceptance for population B in departments 1 and 2, respectively; and α and β are the rates of acceptance for the A and B populations in the overall school. More explicitly, in our earlier notation (f1, F1, m1, M1, etc.), the CP implies

[(f1/F1) > (m1/M1) & (f2/F2) > (m2/M2)] → ((f1 + f2)/(F1 + F2)) > ((m1 + m2)/(M1 + M2)).

In the type I version, even though the data set satisfies the antecedent, that is, A1 (i.e., f1/F1) > B1 (i.e., m1/M1) and A2 (i.e., f2/F2) > B2 (i.e., m2/M2), the consequent remains unsatisfied: α, i.e., ((f1 + f2)/(F1 + F2)), is less than β, i.e., ((m1 + m2)/(M1 + M2)). Here is the reconstruction of the type I version of SP.4

4 We by and large agree with those who think that SP is not really a paradox; that is, the reversal or cessation follows from arithmetic premises. However, this does not explain away why most people not previously acquainted with SP find it perplexing. Our logical reconstruction pinpoints the perplexing premise in the reconstruction of the paradox.
P1: The female and male populations are mutually exclusive and jointly exhaustive, no one is a student of both departments, and the two conditions (i & ii) in our characterization of SP are satisfied.
P2: The acceptance rate of females is higher than that of males in department 1.
P3: The acceptance rate of females is higher than that of males in department 2.
P4: If P2 & P3 are true, then the acceptance rate for females is higher than that of males overall.
P5: However, fewer females are admitted overall. (That is, the consequent of P4 is false.)
Conclusion: The deductive consequences of P2, P3, P4 and P5 contradict one another. There is a genuine paradox involved.

In our derivation of the paradox, premise 4 plays a crucial role. It rests on the CP. In our type I version, the rates of acceptance for females are greater than those of males in each department; that is, A1 > B1 and A2 > B2, but α < β. In fact, that the CP is not generally true is shown by our derivation of a contradiction. So our answer to the question “why do so many individuals find the paradox startling?” is simply that humans tend to invoke the CP uncritically, as a rule of thumb, and thereby make mistakes in certain cases about proportions and ratios;5 they find it paradoxical when their usual expectation, i.e., that the CP is applicable across the board (captured in premise 4), turns out to be incorrect.6

There is, however, an alternative account of the paradox. The fact that it is so well known justifies its brief consideration.

3 Pearl’s causal account of the paradox

Pearl argues that the correct diagnosis of the paradox lies in understanding it in causal terms. In his view, the arithmetical conclusions reached seem counter-intuitive only because we commonly make two incompatible assumptions: that causal relationships are governed by the laws of probability, and that there are certain (non-probabilistic) causal assumptions we share among ourselves about the world. The operative causal assumption to which he refers is that where there is correlation, there must be an underlying cause. The source of our perplexity here is that there cannot be a cause that would simultaneously account for incompatible correlations, the lower and higher rates of male/female acceptance.7

5 Our empirical research on students over the last 4 years has vindicated our claim. Interestingly, in the last year 83% of 106 responses consistently committed the same type of mistake in the story-driven SP-type situation, whereas 57% of them committed the error in the formula-driven SP situation (see Sect. 7 for more on our experiments on SP). Although it is an empirical finding, it can be explained within our analysis of the paradox. Details of the protocol for the experiment can be provided on request. Caleb Galloway first suggested to us the idea of running the experiments on the paradox.
6 One might be tempted to contend that the invocation of the CP could cut either way, causal or non-causal. According to this objection, the reader is yet to be convinced that the CP is a non-causal principle. In Sects. 3 and 7, we address this point and argue that it is an entirely non-causal principle which underlies all versions of the paradox.
Once we reject either of these assumptions, and he opts for rejecting the first, the “paradox” is no longer paradoxical. On the other hand, when we don’t distinguish causal from statistical hypotheses, we are confronted with the paradox.

Pearl’s resolution of the paradox emerges from his general approach to causal hypotheses as distinguished from statistical hypotheses. He makes two basic points. One is that SP arises from mixing explicit statistical and implicit causal considerations together. The notion he uses to explain the paradox is “confounding” which, he argues, is a causal notion. In the type I version, for example, the effect on acceptance (A) of the explanatory variable, sex (S), is hopelessly mixed up (or “confounded”) with the effects on A of the other variable, department (D). According to him, we are interested in the direct effect of sex on acceptance and not an indirect effect by way of another variable such as department. The effect of S on A is confounded with the effect on A of a third variable D.

Pearl’s other point is that causal relationships are more stable than statistical relationships and therefore causal hypotheses often cannot be analyzed in statistical terms (Pearl 2000, p. 25, 2009, p. 25). Suppose we would like to know Bill Clinton’s place in US history had he not met Monica Lewinsky (Pearl 2000, p. 34, 2009, p. 34). Most people now agree that it would be very different. However, there is no statistical model one could construct that would provide the joint occurrence of Clinton and no Lewinsky. There simply are no appropriate data, as there are, for instance, in a fair coin-flipping experiment. Whereas we have a good understanding of the joint probability of two fair coins, we lack such an understanding in the Clinton case because we do not know the joint probability of Clinton and no Lewinsky.8

In his paper with Greenland and Robins, Pearl notes that confounding is sometimes confused with a non-causal notion, non-collapsibility. In fact, in any version of SP, the data set will be non-collapsible. For example, as we have argued before, A1 > B1 and A2 > B2, but α < β. This leads some to conclude that non-collapsibility is synonymous with confounding. Although in SP examples it seems that non-collapsibility and confounding go hand in hand, Pearl thinks, rightly, that they are conceptually different notions (Pearl 2000, p. 193). A simple example should make this clear. Assume that we have observed that clouds are often followed by a good crop. Based on this statistical information, we could make a causal inference connecting “X”, which stands for clouds, to “Y,” which stands for good crops. It could be the case that there is another variable “Z,” which is in fact highly correlated with X.

7 See Pearl (2000, p. 180) and (2009) at the same page number. However, he there uses the example of the effect of drugs on males and females.
8 One could provide a response to Pearl by contending that we (might) have data on US presidents being womanizers as well as on how much they are respected in US presidential history. According to this response, based on these data, we might be able to retrieve a statistic concerning how much Bill Clinton would be remembered even with Lewinsky in later-day presidential history, without deploying any counterfactual strategy at the core of a causal account of the paradox. One seeming drawback with this response is that the reference class problem for the frequency interpretation might also arise here, regarding whether the data on US presidents, being womanizers as well as how much they are revered, could be applied to a single-case event like “Clinton and Lewinsky”. We thank Elliott Sober for calling our attention to this way of thinking about this issue.
Z stands for farmers ploughing their land when there are clouds in the sky. If Z is unaccounted for in the model when the effect of X on Y is estimated, the apparent effect of X could be due to the influence of Z. Under this scenario, Z is said to have confounded the effect of X. Given our characterization of collapsibility, we do not know whether the data set is collapsible, although we can tell for sure that there is confounding. Pearl insists that without a proper model specification—one where possible confounding factors are accounted for—it is not possible to parcel out the unique effect of X on Y. In other words, we need either to eliminate Z as a causal mechanism or to estimate more accurately the effect it (and X) have on Y.
4 An evaluation of Pearl’s account

Pearl’s analysis is ingenious. But that SP need not rest on mixing causal and statistical considerations follows at once from the fact that our derivation of it involves neither. It is not easy to come up with an example which precludes invoking some sort of appeal to “causal intuitions.” But what follows is, we think, such a case.9 It tests in a crucial way the persuasiveness of Pearl’s account.

Suppose we have two bags of marbles, all of which are either big or small, and red or blue. Suppose in each bag the proportion of big marbles that are red is greater than the proportion of small marbles that are red. Now suppose we pour all the marbles from both bags into a box. Would we expect the proportion of big marbles in the box that are red to be greater than the proportion of small marbles in the box that are red? Most of us would be surprised to find that our usual expectation is incorrect. The big marbles in bag 1 have a higher ratio of red to blue than do the small marbles; the same is true of the ratio in bag 2. But considering all the marbles together, the small marbles have a higher ratio of red to blue than the big marbles do. We argue that this is a case of SP, since it has the same mathematical structure as the type I version of Simpson’s paradox. There are no causal assumptions made in this case, no possible “confounding.” But it still seems surprising.10 That is the point of the test case. The proponents of a causal analysis of the paradox must argue either that this is not surprising or that it engages in causal reasoning even when the question presents us with nothing causal. We find neither of these replies tenable. We believe the test case shows that at least sometimes there is a purely mathematical mistake about ratios that people customarily make.

9 This counterexample was suggested to us by John G. Bennett.
10 One might wonder whether Pearl could maintain his causal stance toward the paradox while conceding this case as a case of Simpson’s paradox which is non-causal. In fact, we don’t think that Pearl could adopt this (weaker) position. First, his book (Pearl 2000, 2009) and papers on this issue do not allow any such endorsement. Second, if he were to adopt this weaker position about the paradox, then this would imply that there are at least two types of SP, one non-causal and one causal; we would advocate the former and Pearl the latter, giving the impression that both positions could nicely co-exist with regard to the paradox. In fact, contra this position, we argue that there is only one type of Simpson’s paradox, which is non-causal, and on which both its paradoxical nature and its correct analysis can be explained in terms of mathematical notions like ratios/proportions, which are non-causal. (For more on this point, see the end of this section along with Sect. 7.)
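To make the test case concrete, here is one instantiation of it (the numbers are ours, deliberately borrowed from Table 1 so that the structure is identical: size plays the role of sex, and bags play the role of departments):

```python
# One concrete instantiation of the marble test case. Counts are (red, blue);
# the numbers are ours, reusing Table 1 so the structure matches type I SP.
bags = {
    "bag 1": {"big": (180, 20), "small": (480, 120)},
    "bag 2": {"big": (100, 200), "small": (10, 90)},
}

def red_ratio(red, blue):
    return red / (red + blue)

for bag, sizes in bags.items():
    big, small = red_ratio(*sizes["big"]), red_ratio(*sizes["small"])
    print(f"{bag}: big {big:.2f} > small {small:.2f}")   # big marbles redder in each bag

# Pour everything into the box and compare again.
pooled = {
    size: tuple(sum(c) for c in zip(bags["bag 1"][size], bags["bag 2"][size]))
    for size in ("big", "small")
}
big, small = red_ratio(*pooled["big"]), red_ratio(*pooled["small"])
print(f"box: big {big:.2f} < small {small:.2f}")         # 0.56 < 0.70: reversal
```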
It must be admitted that there are all sorts of complexities about going from correlation to causation. Correlations are not causes, though correlations are (part of the) evidence for causes. But what is paradoxical in the SP case has little to do with these complexities; there is simply a mistaken inference about correlations, which are really just ratios. Of course, when there are different correlations available which may seem to support conflicting causal inferences, the inference from correlations to cause becomes much more difficult; no one could reasonably deny that. But the paradoxical nature of the examples really lies in the part that involves the mistaken assumptions about the correlations (ratios) themselves.

In our reconstruction of the paradox, we suggested that human beings are not good at reasoning concerning ratios. We are not the first, of course, to point out that human beings are not very good at these sorts of computations. What we have done is to isolate the mistaken assumption usually made and to provide empirical support for our claim.11 Pearl himself talks about mistaken numerical assumptions, but proceeds at once to interject causal considerations. He writes:

“The conclusions we may draw from these observations are that humans are generally oblivious to rates and proportions (which are transitory) and that they constantly search for causal relations (which are invariant). Once people interpret proportions as causal relations, they continue to process those relations by causal calculus and not by the calculus of proportions. Were our mind governed by the calculus of proportions, Fig. 6.3 [i.e., an example of Simpson’s paradox] would have evoked no surprise at all and Simpson’s paradox would never have generated the attention that it did.” (Causality, 2000, p. 182 and 2009, p. 182)

Pearl’s point may be reconstructed as follows. Human beings are not good at (transitory) ratios and proportions. To remedy this defect, they import (invariant) causal notions, in the process confusing collapsibility with confounding. If there were no confusion between the two, there would be no paradox (or rather, there would be no “perplexity”). Although one can sympathize with the claim that humans often tend to see causes where they should not, it is enough here to point out, once again, that mistaken numerical assumptions suffice to demonstrate the paradox; jumping to conclusions does not necessarily require that we are pushed by our causal intuitions. We certainly admit that surprising facts about proportions come up frequently when we infer causes from proportions. This is when our mistakes about proportions seem most troubling to us. In this respect, the test case we contrived is rather unusual. But it proves our point.

5 How to proceed questions

In the case of SP, “how to proceed?” questions arise when investigators are confronted with choosing between two conflicting statistics, for example, in Table 1, (i) the two departments’ uncombined statistics and (ii) their combined statistics. Which one should they use to recommend action?

11 See footnote 5 for the confirmation of this point.
Table 7 Simpson’s paradox (medical example)

CV | M: R | M: ∼R | ∼M: R | ∼M: ∼R | M (%) | ∼M (%) | Overall recovery rate (%)
T  | 18   | 12    | 2     | 8      | 60    | 20     | 50
∼T | 7    | 3     | 9     | 21     | 70    | 30     | 40

Table 8 Simpson’s paradox (agricultural example)

CV | T: Y | T: ∼Y | ∼T: Y | ∼T: ∼Y | T (%) | ∼T (%) | Overall yield rate (%)
W  | 18   | 12    | 2     | 8      | 60    | 20     | 50
∼W | 7    | 3     | 9     | 21     | 70    | 30     | 40
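A quick computation over the Table 7 counts (a sketch in our own encoding, not the paper’s) confirms the two conflicting statistics: control beats treatment within each sex, while treatment beats control overall.

```python
# Recovery rates from Table 7. Each cell is (recovered, not_recovered),
# keyed by (arm, sex); the encoding is ours.
cells = {
    ("T", "M"):  (18, 12), ("T", "F"):  (2, 8),
    ("~T", "M"): (7, 3),   ("~T", "F"): (9, 21),
}

def recovery_rate(*groups):
    recovered = sum(g[0] for g in groups)
    total = sum(g[0] + g[1] for g in groups)
    return recovered / total

for sex in ("M", "F"):   # sub-population statistics: control looks better
    print(sex, recovery_rate(cells[("T", sex)]), recovery_rate(cells[("~T", sex)]))
# M: 0.6 vs 0.7;  F: 0.2 vs 0.3

# Combined statistics: treatment looks better, Pr(R|T) = 0.5 > Pr(R|~T) = 0.4
print(recovery_rate(cells[("T", "M")], cells[("T", "F")]),
      recovery_rate(cells[("~T", "M")], cells[("~T", "F")]))
```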
Our reply is that there is no unique response for all versions of SP; the response depends on the specific nature of the problem at issue. However, one can give stable recommendations for certain versions of the paradox when we assume certain features of these versions to be correct over and above the data at hand. Note also that not all versions of SP necessarily involve “how to proceed?” questions. The test case, for example, asks “would we expect the proportion of big marbles in the box that are red to be greater than the proportion of small marbles in the box that are red?” No straightforward question of whether to recommend an action from the sub-groups or from the whole seems to be at stake. However, it is evident that many interesting cases of recommending actions arise when we infer causes or patterns from proportions. The standard examples (Lindley and Novick 1981; Meek and Glymour 1994; Pearl 2000, 2009) deal with cases in which “how to proceed” questions become preeminent. But it should be clear in what follows that there is no unique response to this sort of question for all varieties of the paradox.

Consider Table 7, based on data about 80 patients: 40 patients were given the treatment, T, and 40 assigned to a control, ∼T. Patients either recovered, R, or did not recover, ∼R. There were two types of patients, (i) males (M) and (ii) females (∼M). One would think from the combined statistics that treatment is preferable to control, whereas the sub-population statistics give the impression that control is better for both men and women. Given a person of unknown sex, would one recommend the control? The standard response is clear: the control (Pr(R|∼T) > Pr(R|T) within each sex) is better for a person of unknown sex. Call this first example the medical example.

In the second example, call it the agricultural example, we are asked to consider the same data, but now we replace T and ∼T by varieties of plants (white [W] or black [∼W]), R and ∼R by yield (high [Y] or low [∼Y]), and M and ∼M by tall and short plants ([T] or [∼T]) (Table 8). Given this new interpretation, the overall yield rate suggests that planting the white variety is preferable, since it is 10% better overall, although the white variety is 10% worse among both tall and short plants (sub-population statistics).
Which statistics should one follow in choosing which variety to plant in the future? The standard recommendation is that in this case one should take the combined statistics and thus recommend the white variety for planting (Pr(Y|W) > Pr(Y|∼W)), which is in stark contrast with the recommendation given in the medical case. In short, the medical and agricultural examples provide different responses to the “how to proceed” question. There is no unique response regarding which statistics, subpopulation or whole, to follow in every case of SP.

We agree with the standard recommendations with a proviso: we need to use substantial background information to answer “how to proceed” questions. These recommendations are standard because they are agreed upon by philosophers (e.g., Meek and Glymour), statisticians (e.g., Lindley and Novick), and computer scientists (e.g., Pearl). To explain which assumptions are at stake in these two examples, we will confine ourselves to Pearl’s causal analyses of them. According to Pearl, those background assumptions are primarily causal assumptions that go beyond the data presented in the two tables, together with the assumptions that underlie the probabilistic calculations capturing our preference for one course of action over the other.

Consider his view of the medical example, in which we are confronted with the question “should treatment or control be recommended to a patient of unknown sex?” We think that there are three assumptions at work in his analysis: two of them causal, and one ethical. The first causal assumption is that what should be administered to the new individual of unknown sex is contingent on whether this individual shares the same causal conditions with the group we have studied. Whether the conditions for the two groups, the group we have studied with the help of the tables and the group from which the individual of unknown sex comes, are the same, and whether probabilities will remain invariant across the groups, depends on causal assumptions. He also thinks that there might be a difference between these two groups in terms of their causal conditions: the group studied made their own decisions whether to receive treatment or control, whereas the individual of unknown sex will be given the treatment or control randomly, without deliberate choice. At any rate, his rationale for recommending control to the individual of unknown sex also has an ethical dimension. Since whether to recommend treatment or control depends on substantial causal assumptions about the population from which that individual has been taken, and since which assumptions are operative for that individual is unknown, it is safer to recommend control.

In addition, there is another causal assumption behind his analysis: there is significant confounding among the three variables, “treatment,” “sex/gender,” and “recovery.” Interestingly, Pearl is mostly interested in this causal consideration of confounding when recommending control in the medical example. Here the effect of treatment on recovery is confounded with the effect of sex on recovery. We are interested in knowing whether to recommend treatment or control to a subject of unknown sex. Hoping that the combined statistics would provide us with the required information, we looked at the combined table.
However, since there is a significant effect of confounding in the combined table, the sub-population statistics are the right statistics to consult, and based on those statistics we should recommend control, because the sub-populations clearly show the confounding effect of gender on recovery.
Therefore, in the case of the medical example, recommending control is based on taking the sub-population statistics as the guide to our action.

Pearl argues, however, that in the case of the agricultural example, the operative causal conditions are vastly different. No rational deliberation regarding whether to receive treatment is present in the agricultural example. In addition, there is no significant confounding going on. In fact, he could offer the following causal account of the agricultural example. Both yield and height are consequences of the variety. The white variety causes plants to grow tall, which in turn causes high yield. Being tall increases the chance of high yield and is correlated with exposure (i.e., the white variety), since being tall could be a result of the white variety. It could even be a causal factor for high yield among non-exposed plants, because high yield could result from a cause other than being white; for example, taller plants might receive more sunlight than shorter plants, leading to high yield. Being tall cannot be considered a purely confounding factor, since the effect of being white is mediated through the effect of being tall. Any factor that represents a step in the causal chain between the exposure (white variety) and the outcome (yield) should not be treated as an extraneous confounding factor, but instead requires special treatment as an intermediate factor. In the agricultural example, we were interested in knowing whether to recommend the white or the black variety for future planting. Hoping that the combined table would give us the correct guidelines, we looked at it. Since there is no significant confounding effect of the height of the plant on the variety, we use the combined statistics to support our decision that one should plant the white variety. Although there might be a normative element embedded in any such recommendation (because a wrong decision could result in economic losses in terms of bad crops), the magnitude of this normative consideration is far outweighed by the ethical considerations in the medical example.

Three points need to be mentioned. First, there is no point in denying that there are causal considerations involved in both the medical and the agricultural examples. They have no doubt contributed to our understanding of how to address the “what to do” question. Second, this does not imply that the notion of utility is irrelevant to that question. In the medical example, the utility of recommending control to a person of unknown gender has to be taken into consideration. What if that particular individual has certain physiological conditions which might react badly with the treatment? In that case we would be making a wrong decision in recommending treatment, so we need to take into account the disutility associated with this possible scenario. Likewise, in the agricultural example we might run some risk in recommending the white variety, as stated above. Nonetheless, making a wrong decision is not as terrible in the agricultural example as in the medical example.
Although it is evident what sorts of utilities are involved in addressing “what to do” questions in both SP-type situations, and this is likely why we don’t bring them to the table when confronted with such questions, we cannot afford to overlook those implicit considerations of utility in making a decision. The third point has to do with the distinction between what makes SP paradoxical and what to do when confronted with SP-type situations, or how to infer a cause from a correlation.
Table 9 A comparison between two sets of conditions

Blyth’s conditions | Our conditions
C1: A1 ≥ B1        | C1: A1 ≥ B1
C2: A2 ≥ B2        | C2: A2 ≥ B2
C3: β > α          | C3: β ≥ α
                   | C4: θ > 0
In our account, we have primarily addressed two questions: (i) why is SP paradoxical? and (ii) what is the correct analysis of the paradox? Providing an explanation of what makes SP paradoxical does not, however, provide information specific to each version of the paradox. Even with the same data, two different sets of assumptions have led to different recommendations regarding the “what to do” question. Nor does providing a correct analysis of the paradox entitle one to address the “what to do” question. Because our account is mainly concerned with the mathematical structure of SP and with how that structure provides its correct analysis, it does not directly tell us anything significant about the “what to do” question, in contrast to Pearl and other causal theorists, who offer illuminating recommendations on this question. To repeat ourselves, Pearl’s account gains plausibility by blurring the difference between our three questions, and loses plausibility once they are distinguished, as we have done here.

6 A comparison with Blyth’s account of the paradox

It is alleged that there is a striking similarity between Blyth’s (1972) celebrated paper on SP and our account, as both are formally motivated, beginning with some initial conditions for SP and then defining it in terms of those conditions.12 Consequently, according to this objection, although our account is correct, it fails to deliver any new information about the paradox beyond what is already contained in Blyth’s paper.13 To address this charge, we will first discuss Blyth’s treatment of SP and then evaluate the charge on several grounds. To make an easy transition between our notation and his, we write his conditions in our notation, as shown in Table 9. As one can see, Blyth’s first two conditions are the same as ours, but his third condition is expressed as a strict inequality, whereas ours is weaker. Most importantly, his conditions imply two features which deserve mention. First, he does not allow the cessation of an association between variables in the overall population, when there is a strict inequality in the subpopulations, as a case of SP, although we do. From our first three conditions it follows that we could have a case of SP where A1 > B1 & A2 > B2, but β = α (see Table 10 below for this possibility). Second, from our three conditions we are able to derive a case in which A1 = B1 & A2 = B2 as well as β = α, which is clearly not a case of SP.

12 One of the referees of a leading journal has raised this objection.
13 To be faithful to this objection, the referee admits that the only difference between our account and Blyth’s is that we have provided two experiments about the paradox which, according to the referee, constitute the only novelty of the paper (see Sect. 7 for those two experiments).
Table 10 Simpson’s paradox

CV        | Dept. 1 Accept | Dept. 1 Reject | Dept. 2 Accept | Dept. 2 Reject | Dept. 1 (%) | Dept. 2 (%) | Overall (%)
F         | 90             | 1410           | 110            | 390            | 6           | 22          | 10
M         | 20             | 980            | 380            | 2620           | 2           | 12          | 10
Diff: F-M |                |                |                |                | 4           | 10          | 0
To eliminate this possibility, we introduced C4, which states that θ > 0, where θ is defined as (A1 − B1) + (A2 − B2) + (β − α). Blyth does not need a condition like C4, because his third condition automatically prevents this case (i.e., A1 = B1 & A2 = B2 as well as β = α) from occurring.

Blyth prefers to construe SP in terms of the interaction effects of two variables. An interaction effect is one in which the combined effect of two variables is not a simple sum of their separate effects. Since the purpose of our account is to analyze the possible ways SP might be generated, the theme of interaction effects has no direct bearing on our account; this is one fundamental difference between his account and ours. But since this section is intended to compare his account with ours, we will adopt his interaction locutions to see what follows from his argument. Blyth writes, “[t]he paradox (i.e., SP) can be said to result from... interaction of B and C.” It cannot be that he takes the interaction effect as a sufficient condition for the emergence of SP. In Table 2, we came across “interactions” between gender and departments; yet those interactions do not result in SP. Therefore, interaction effects cannot be sufficient for generating SP. An interaction effect means that the effect of one variable differs depending on which group of the other variable one is considering; for this example, it means that the difference in acceptance rates between the sexes differs between the departments. This does not, however, address whether there is a reversal or a cessation which, as we know, is at the core of the paradox. If instead we take the interaction effect as necessary, and incidentally this is perhaps the intention of Blyth’s quote, then Table 10 seems consistent with his claim, but the table is not endorsed by his three conditions as a case of SP.

Table 10 shows that even though there is no association between gender and acceptance rates in the overall school (the last column of Table 10), there remains an association between gender and acceptance rates when we divide the student population into two departments. In fact, we observe a clear interaction effect between gender and departments when the population is partitioned into two departments. Based on the information from the table together with his three conditions, we find that there is a tension between Blyth’s conditions for SP and his understanding of SP in terms of interaction effects between variables.
If we accept his three conditions for SP, then the example showing the cessation of the association between gender and acceptance rates is eliminated as a possible case of SP. However, if we accept his interaction explanation of SP, then the cessation of association turns out to be compatible with his analysis of SP. This tension does not arise for our account, since the cessation of association between two variables shown in Table 10 can be subsumed as a case of SP.

So, primarily, there are three fundamental differences between Blyth’s account and ours. First, Blyth does not endorse the cessation of an association in the overall population as a case of SP. Second, we have already noted that there is a tension between the consequences of his account and his preference for construing SP in terms of the interaction effects of variables. Our third and final comment has to do with two key theorems of SP we have proved: (i) SP arises only if A1 ≠ A2, and (ii) SP arises only if B1 ≠ B2. Since they are theorems of SP, they must hold for his version of SP; however, we are the ones who have pointed them out. Based on these considerations, we reject the charge that our account of SP is no different from Blyth’s. In fact, we have argued that they are different accounts of SP.
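The difference shows up mechanically on Table 10. Here is a small check of ours (the encoding is our own, with population A as the female row):

```python
# Table 10 under Blyth's conditions and under ours (our encoding).
A1, A2, alpha = 0.06, 0.22, 0.10   # female rates: Dept. 1, Dept. 2, overall
B1, B2, beta  = 0.02, 0.12, 0.10   # male rates

theta = (A1 - B1) + (A2 - B2) + (beta - alpha)

ours  = A1 >= B1 and A2 >= B2 and beta >= alpha and theta > 0
blyth = A1 >= B1 and A2 >= B2 and beta > alpha   # strict third condition

print(ours, blyth)   # True False: the cessation case is SP for us, not for Blyth
```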
7 Larger significance of our research
We began our paper by distinguishing among three types of questions. (i) Why, or in what sense, is Simpson’s paradox a paradox? (ii) What is the correct analysis of the paradox? (iii) How should one proceed when confronted with a typical case of the paradox? Although these questions are no doubt distinct, our formal reconstruction of the paradox provides a unified account of them, which the empirical studies we have carried out illustrate and amplify. We showed that Simpson’s paradox can be generated in a straightforward deductive way. Among its premises, there is concealed a distinctly human dimension. In recent years, there has been a great deal of discussion of human frailty in connection with individuals’ assessment of probabilistic statements. Our resolution of the paradox has illuminated another aspect of human frailty. We explained its apparently paradoxical nature by invoking the failure of our widespread intuitions about numerical inference. The failure of collapsibility in Simpson’s paradox-type cases is what makes them puzzling, and the latter is what paints a human face onto the rather abstract structure of “Simpson’s paradox.” Below we discuss the results of two experiments based on Simpson’s paradox. One involves a version of the paradox in non-mathematical language, and the second is in mathematical language. The purpose of these experiments was to determine student responses to the following questions.
The first experiment involves a non-mathematically explained case of the paradox:
Consider the following information to be correct. There are only two high schools in a certain school district. Given that the graduation rate for girls in School #1 is higher than the graduation rate for boys in School #1, and that the graduation rate for girls in School #2 is higher than the graduation rate for boys in School #2, does it follow that the graduation rate for girls in the district is higher than the graduation rate for boys in the district?
Which one of the following is true?
a. Yes, the graduation rate for girls is greater than it is for boys in the district.
b. No, the graduation rate for girls is less than it is for boys in the district.
c. No; the graduation rates for girls and boys are equal in the district.
d. No inference could be made about the truth or falsity of the above because there is not enough information.
The second experiment involves a mathematically described case of the paradox:
Consider the following mathematical expressions to be correct.
1. (f1/F1) > (m1/M1).
2. (f2/F2) > (m2/M2).
Does it follow that ((f1 + f2)/(F1 + F2)) > ((m1 + m2)/(M1 + M2))?
Which one of the following is true?
(a) Yes, the first expression is greater than the second.
(b) No, the first expression is less than the second.
(c) No, the first and second expressions are equal.
(d) No inference could be made about the truth or falsity of the above because there is not enough information.
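To make option (d) vivid, here is a minimal computational sketch, using hypothetical enrollment figures of our own choosing (not the survey’s), of a model in which both premises hold while the amalgamated inequality reverses:

# A minimal sketch with hypothetical enrollment figures (our own
# illustration): both school-level premises hold, yet the district-level
# conclusion fails, so no inference is licensed.
from fractions import Fraction

f1, F1 = 9, 10        # School 1 girls: 9 of 10 graduate (0.90)
m1, M1 = 800, 1000    # School 1 boys: 800 of 1000 graduate (0.80)
f2, F2 = 200, 1000    # School 2 girls: 200 of 1000 graduate (0.20)
m2, M2 = 1, 10        # School 2 boys: 1 of 10 graduates (0.10)

assert Fraction(f1, F1) > Fraction(m1, M1)    # premise 1
assert Fraction(f2, F2) > Fraction(m2, M2)    # premise 2

girls = Fraction(f1 + f2, F1 + F2)            # 209/1010, about 0.21
boys = Fraction(m1 + m2, M1 + M2)             # 801/1010, about 0.79
assert girls < boys                           # the "collapsed" claim fails

The premises are consistent both with the district-level conclusion and with its reversal, which is why collapsibility is invalid as a rule of inference.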
In an experiment with 106 student responses to both questions, we found that for the first, non-mathematical question, students chose response (a), which involves the mistaken use of the collapsibility principle, 83% of the time. They responded correctly, choosing (d), only 12% of the time. For the mathematical question, they were right 29% of the time and committed the error 57% of the time.14 Similar surveys over many years of students in philosophy classes have manifested the same patterns of responses.15 The math version of the paradox exactly mirrors our test case, which does not involve any causal intuition whatsoever. In turn, the former also has the same structure as the non-math version of our experiment on the paradox. Consequently, it would be a mistake to think that the subjects’ responses exploited a causal intuition underlying the different versions of the paradox, for there is no difference between the two experiments: they exhibit the same mathematical structure. Most subjects, as we have noticed, erroneously took the inference in the Simpson’s paradox-type experiment to be inductive, since they mistakenly applied the collapsibility principle (CP). In Simpson’s paradox-type situations, subjects were confronted with choices subject to real-time constraints. The CP allows them to reach conclusions on the basis of the data quickly. That is to say, if subjects were to look at the data in the light of all possible information, they would perhaps never reach a conclusion. In a broader perspective, we human beings are like
14 Students chose the “correct” response in the formula-driven version of the question at a higher rate than in the story version. This may be due more to students’ lack of certainty when presented with formulas than to an ability to detect SP when provided with equations.
15 This error is not unique to philosophy students. Two of our super-string theorist friends committed the
same error when given these two experiments.
those subjects who are confronted with choices in their evolutionary history as well as in their day-to-day life. If there is a trade-off between speed and error, reaching quick conclusions (on which our survival depends) will sometimes lead to error. The experimental results show clearly that, when confronted with a Simpson’s paradox-type situation, almost all subjects jump to an erroneous conclusion, even though one of the options given to them in our experiment is to make no inference. Why is this error committed so consistently when subjects are clearly given a choice not to make any inference in that situation? One plausible suggestion is that, confronted with Simpson’s paradox-type situations, the pressing issue for subjects is to make a decision rather than to suspend inference. On this suggestion, the “what to do” question seems more pressing under uncertainty in at least many situations. According to our analysis, however, this error is the misuse of the CP across the board and has nothing to do with the “what to do” question. The result of our analysis is to divorce the question of the paradox, and the reason it seems paradoxical, from the question of the solution to the “what to do” question. Most causal theorists, including Pearl, think that the latter has a resolution in some sort of causal analysis. Our account does not say that this is impossible. But a causal analysis of the “what to do” question should relate to cases where correlations are confused with causation, whereas the discussion of the paradoxical nature of Simpson’s data sets should be related to other mathematical mistakes that people are prone to make and that lead them into trouble. It would be a mistake to assume that scholars in general completely agree with our account of the paradox. Steven Sloman, a cognitive scientist, wrote (in an email communication, April 24, 2009): “I [Sloman] believe that this paper is addressing a fundamental psychological question that I generally frame in terms of outside vs. inside perspectives (closely related to extensional vs. intentional perspectives). Contra Pearl and I, you are arguing that people’s reasoning is from the outside, in terms of proportions, whereas we argue that it’s from the inside, in terms of causal structure. . . . I note that these are not mutually exclusive perspectives and that each could capture different aspects of human reasoning.”16 There is no reason to disagree that people’s reasoning is often from the ‘inside.’ Our point is simply that the reasoning does not have to be from the ‘inside’ for the aura of paradox to be generated. In our day-to-day situations, we humans look for deeper ‘causal structures,’ and are puzzled by our inability, as in Simpson-type cases, to find them. But we are more often puzzled, or so we have argued here, by the fact that such deeply held inductive habits as the principle of collapsibility lead to paradoxical conclusions. We are Humeans to this extent: we prefer explanations of untoward results in terms of habits rather than of ‘causality.’ The latter tend to substitute one mystery for another. We believe, however, that it is better to stay on the ‘outside,’ where the sun shines.
Acknowledgments We would like to thank John G.
Bennett, John Borkowski, Robert Boik, Martin Curd, Dan Flory, Debzani Deb, David Freedman, Caleb Galloway, Jack Gilchrist, Sander Greenland, Martin Hamilton, Joseph Hanna, Daniel Hausman, Christopher Hitchcock, Jayanta Ghosh, Autumn Laughbaum, Adam Morton, Dibyendu Nandi, Michael Neeley, Daniel Osherson, James Robison-Cox, Prasun Roy, Federica
16 For Sloman’s view on the distinction between the inside and outside perspectives, see Sloman (2005).
Russo, Tasneem Sattar, Billy Smith, Elliott Sober, Steve Sloman, Steve Swinford, Olga Vsevolozhskaya, and Paul Weirich for comments on an earlier version of the paper. We also thank Paul Humphreys for calling our attention to some references relevant to the paper, and Donald Gillies for his encouragement to consider the paradox the way we have done in this paper. The paper also benefited from comments received at two workshops at the University of Konstanz, a conference at the University of Alabama at Birmingham, the Montana Chapter of the American Statistical Association meetings in Butte, the Society for Exact Philosophy meetings at the University of Alberta, the Indian Institute of Science Education and Research in Mohanpur, the Centre for Philosophy and Foundations of Science in New Delhi, the Statistics Colloquium at Montana State University, Bhairab Ganguly College in Kolkata, an international conference on “Scientific Methodology” at Visva-Bharati University, and the Ecology Seminar at Montana State University, where the paper was presented. We are also thankful to several referees of various journals, including the referees of this journal, for their useful comments regarding the paper. Special thanks are due to John G. Bennett for his numerous insightful suggestions regarding the content of the paper. The research for this paper has been supported both by NASA’s Astrobiology Research Center (grant #4w1781) and by a Research and Creativity grant from Montana State University.
Appendix
For proving Theorems 1 to 3 and Lemmas 1 to 4, we have used two assumptions (call them a and b, respectively) and two definitions (“α” and “β”). We define α and β here differently than we have done previously in the text.
1. Let a = (members of A in partition 1)/(total members of A).
2. Let b = (members of B in partition 1)/(total members of B).
To give an intuitive feeling for what a and b stand for, we could use Simpson’s paradox type I, in which a = 200/500 and b = 600/700. Let the quantities A1, A2, B1, B2, a, b be in [0,1], where A1, A2, B1, and B2 are as before; for example, A1 is the ratio of the number of females accepted in Department 1 to the total number of females that applied to Department 1, and B2 is the ratio of the number of males accepted in Department 2 to the total number of males that applied to Department 2. Define α := aA1 + (1 − a)A2 and β := bB1 + (1 − b)B2. Simpson’s paradox results from the following conditions being imposed on the above quantities:
Condition 1 (C1): A1 ≥ B1;
Condition 2 (C2): A2 ≥ B2;
Condition 3 (C3): β ≥ α;
Condition 4 (C4): θ = (A1 − B1) + (A2 − B2) + (β − α) > 0.
(Under C1–C3 each summand of θ is nonnegative, so C4 rules out the degenerate case in which A1 = B1, A2 = B2, and β = α.)
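As a quick sanity check, the sketch below verifies C1–C4 on hypothetical admission figures of our own choosing (they are not the numbers from our tables):

# Check C1-C4 on hypothetical admission figures (our illustration,
# not the numbers from the paper's tables).
A1, B1 = 90/100, 320/400    # Dept 1 acceptance rates: females 0.90, males 0.80
A2, B2 = 120/400, 20/100    # Dept 2 acceptance rates: females 0.30, males 0.20
a = 100/500                 # share of female applicants who applied to Dept 1
b = 400/500                 # share of male applicants who applied to Dept 1

alpha = a * A1 + (1 - a) * A2   # overall female acceptance rate: 0.42
beta = b * B1 + (1 - b) * B2    # overall male acceptance rate: 0.68
theta = (A1 - B1) + (A2 - B2) + (beta - alpha)

assert A1 >= B1                 # C1: females do at least as well in Dept 1
assert A2 >= B2                 # C2: ... and in Dept 2
assert beta >= alpha            # C3: yet males do better overall (reversal)
assert theta > 0                # C4: the degenerate all-equal case is excluded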
Theorem 1 Simpson’s paradox results only if A1 ≠ A2. Proof Assume to the contrary that A1 = A2. We derive a contradiction with condition (C3) or (C4). First, given our assumption, α = aA1 + (1 − a)A2 = aA1 + (1 − a)A1 = aA1 + A1 − aA1 = A1 = A2. There are three cases to consider with respect to B1 and B2.
Fig. 1 Graph illustrating proof of (i)
(i) Suppose that B1 > B2. In this case, the following relationships hold among B1, B2, and β: bB1 + (1 − b)B1 > bB1 + (1 − b)B2 = β ⇒ B1 > β. Geometrically, this places B1 above β. See Fig. 1, where the vertical axis represents β and the horizontal axis represents α. We use the line α = β to place B1 and β on the α-axis in order to compare them to A1 and α. By (C1), A1 ≥ B1, which forces the inequalities β < B1 ≤ A1 = α. This contradicts condition (C3). The relative positions of these quantities are illustrated on the horizontal axis of Fig. 1.
(ii) Suppose B2 > B1. As in (i), bB2 + (1 − b)B2 > bB1 + (1 − b)B2 = β ⇒ B2 > β. Using the relation A2 ≥ B2 from (C2), we conclude similarly that β < B2 ≤ A2 = α, contradicting (C3). (To see this in Fig. 1, replace B1 by B2 and A1 by A2.)
(iii) Lastly, take B1 = B2. As above, A1 = α and B1 = β. From (C1), A1 ≥ B1, so α ≥ β. And by (C3), α ≤ β. The only way for both inequalities to hold is for α = β. Yet in that case A1 = B1 and A2 = B2 as well, so θ = 0, contradicting (C4).
Therefore, equality of A1 and A2 is incompatible with Simpson’s paradox.
Theorem 2 Simpson’s paradox arises only if B1 ≠ B2.
We first prove a lemma.
Lemma 1 The following relationships hold:
LM1: If A1 > A2, then A1 > α > A2.
LM2: If A2 > A1, then A2 > α > A1.
LM3: If B1 > B2, then B1 > β > B2.
LM4: If B2 > B1, then B2 > β > B1.
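The lemma is just the familiar fact that a strict convex combination lies strictly between its endpoints; assuming 0 < a, b < 1 (which the strict inequalities require), it is easy to spot-check numerically:

# Numeric spot-check of LM1-LM4: a strict convex combination
# a*x1 + (1 - a)*x2 with 0 < a < 1 lies strictly between x1 and x2.
import random

for _ in range(10_000):
    x1, x2 = random.random(), random.random()
    a = random.uniform(0.01, 0.99)   # strictly between 0 and 1
    mix = a * x1 + (1 - a) * x2
    if x1 > x2:
        assert x1 > mix > x2         # LM1/LM3 pattern
    elif x2 > x1:
        assert x2 > mix > x1         # LM2/LM4 pattern
    # if x1 == x2 the combination equals both and the lemma makes no claim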
Proof These relationships are symmetric with respect to the indices, so we prove only LM1 and LM3; the other cases are handled by swapping variables and indices.
LM1: We have already made use of the following algebraic identity: A1 = aA1 + A1 − aA1 = aA1 + (1 − a)A1. For LM1, where A1 > A2, it follows that A1 = aA1 + (1 − a)A1 > α = aA1 + (1 − a)A2 > aA2 + (1 − a)A2 = A2. Hence, if A1 > A2, then A1 > α > A2.
LM3: Similarly, when B1 > B2, B1 = bB1 + (1 − b)B1 > β = bB1 + (1 − b)B2 > bB2 + (1 − b)B2 = B2. Thus, whenever B1 > B2, we can bound β by B1 > β > B2.
Proof of Theorem 2 We proceed by supposing that B1 = B2. From the algebraic identity used in the above lemma it follows that β = B1 = B2. Since we have shown that A1 cannot equal A2, we assume without loss of generality that A1 > A2; thus A1 > α > A2 by LM1. Yet, by condition (C2), A2 ≥ B2. This implies that α > A2 ≥ B2 = β. In particular this forces α > β, contradicting (C3). Thus, the case where B1 = B2 and A1 > A2 cannot arise. Figure 2 shows the case where A2 is strictly greater than B2. If instead A2 > A1, then we switch the A’s and use LM2 to conclude that α > A1 ≥ B1 = β. Therefore, it is incompatible with Simpson’s paradox for B1 to equal B2.
Theorem 3 Simpson’s paradox arises only if (A1 ≠ A2) if and only if (B1 ≠ B2). By definition, Theorem 3 says Simpson’s paradox arises only if {(A1 ≠ A2) ⇒ (B1 ≠ B2)} and {(B1 ≠ B2) ⇒ (A1 ≠ A2)}.
Fig. 2 Graph showing the case when B2 = β < A2 < α < A1
Proof Consider the first conjunct, (A1 ≠ A2) ⇒ (B1 ≠ B2). This conditional is logically equivalent to its contrapositive, (B1 = B2) ⇒ (A1 = A2). The antecedent of that conditional is false in any case of Simpson’s paradox, by Theorem 2. Therefore, (B1 = B2) ⇒ (A1 = A2) is true, which proves (A1 ≠ A2) ⇒ (B1 ≠ B2). The proof of the second conjunct, (B1 ≠ B2) ⇒ (A1 ≠ A2), is similar: it is logically equivalent to (A1 = A2) ⇒ (B1 = B2), and the latter is true because its antecedent is false by Theorem 1. Therefore, the conjunction is true, which establishes Theorem 3.
Bibliography
Blyth, C. (1972). On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association, 67(338), 364–366 (Theory and Method Section).
Cartwright, N. (1979). Causal laws and effective strategies. Noûs, 13, 419–437.
Cartwright, N. (1999). The dappled world: A study of the boundaries of science. Cambridge: Cambridge University Press.
Clark, M. (2002). Paradoxes from A to Z. London: Routledge.
Eells, E., & Sober, E. (1983). Probabilistic causality and the question of transitivity. Philosophy of Science, 50, 35–57.
Freedman, D., Pisani, R., & Purves, R. (1999). Statistics (3rd ed.). New York: W. W. Norton & Company.
Good, I. J., & Mittal, Y. (1987). The amalgamation and geometry of two-by-two contingency tables. The Annals of Statistics, 15(2), 694–711.
Greenland, S., Robins, J. M., & Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14(1), 29–46.
Hausman, D. (1998). Causal asymmetries. Cambridge: Cambridge University Press.
Hoover, K. (2001). Causality in macroeconomics. Cambridge: Cambridge University Press.
Kahneman, D., Slovic, P., & Tversky, A. (Eds.). (1982). Judgment under uncertainty: Heuristics and biases. Cambridge: Cambridge University Press.
Kyburg, H. (1997). The rule of adjunction and reasonable inference. Journal of Philosophy, 94(3), 109–125.
Lindley, D., & Novick, M. (1981). The role of exchangeability in inference. The Annals of Statistics, 9(1), 45–58.
Malinas, G. (2001). Simpson’s paradox: A logically benign, empirically treacherous hydra. The Monist, 84(2), 265–283.
Meek, C., & Glymour, C. (1994). Conditioning and intervening. British Journal for the Philosophy of Science, 45, 1001–1021.
Mittal, Y. (1991). Homogeneity of subpopulations and Simpson’s paradox. Journal of the American Statistical Association, 86, 167–172.
Morton, A. (2002). If you’re so smart why are you ignorant? Epistemic causal paradoxes. Analysis, 62(2), 110–116.
Novick, M. R. (1983). The centrality of Lord’s paradox and exchangeability for all statistical inference. In H. Wainer & S. Messick (Eds.), Principles of modern psychological measurement. Hillsdale, NJ: Erlbaum.
Otte, R. (1985). Probabilistic causality and Simpson’s paradox. Philosophy of Science, 52(1), 110–125.
Pearl, J. (2000). Causality (1st ed.). Cambridge: Cambridge University Press.
Pearl, J. (2009). Causality (2nd ed.). Cambridge: Cambridge University Press.
Rothman, K., & Greenland, S. (1998). Modern epidemiology (2nd ed.). Philadelphia: Lippincott Williams & Wilkins.
Savage, L. (1954). The foundations of statistics. New York: Wiley.
Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B, 13(2), 238–241.
Skyrms, B. (1980). Causal necessity. New Haven: Yale University Press.
Sloman, S. (2005). Causal models: How people think about the world and its alternatives. New York: Oxford University Press.
Sober, E., & Wilson, D. (1998). Unto others: The evolution and psychology of unselfish behavior. Cambridge, MA: Harvard University Press.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). Cambridge, MA: MIT Press.
Synthese (2011) 181:209–226 DOI 10.1007/s11229-010-9798-z
Reductionism and the Micro–Macro Mirroring Thesis Eric Hiddleston
Received: 3 December 2009 / Accepted: 26 July 2010 / Published online: 7 September 2010 © Springer Science+Business Media B.V. 2010
Abstract This paper concerns reductionist views about psychology and the special sciences more generally. I identify a metaphysical assumption in reductionist views which I dub the ‘Micro–Macro Mirroring Thesis’. The Mirroring Thesis says that the relation between the entities of any legitimate higher-level science and their lower-level realizers is similar to that between the entities of thermodynamics and statistical mechanics. I argue that reductionism implies the Thesis, and that the Thesis is not a priori. It is more difficult to tell whether the Thesis is true, and I indicate some relevant considerations.
Keywords Reduction · Philosophy of mind · Functionalism · Second-order property
Multiple realization has long been the standard objection to reductionism. I think we should agree with reductionists that the objection is at least somewhat superficial. Multiple realization forces some uncomfortable semantic choices on the reductionist, but as Jaegwon Kim has long maintained, for example, it is compatible with multiple realization that the autonomy of higher-level sciences remains merely taxonomic. On this view, the special sciences merely reclassify lower-level causal and nomic relations in convenient ways, but ways that are metaphysically somewhat arbitrary. I will argue that there is logical space for a more causally robust type of nonreductionism. My strategy in what follows is to look first at the problem of multiple realization and the initial semantic response to it. I will suggest that there is a metaphysical assumption hidden in this semantic response, which I call the ‘Micro–Macro Mirroring Thesis’. Then I will argue that this assumption is not a priori.
E. Hiddleston (B) Wayne State University, Detroit, MI, USA e-mail: [email protected]
1 Functional states v. functional descriptions
Functionalism about X is in general the view that X is “defined by what it does.” Functionalist views of the mental became prominent in the 1960s, and are widely accepted at least for intentional mental states (belief, desire). The view is plausible for many properties of the special sciences. Some even hold that all genuinely causal or explanatory properties are functional (Shoemaker 1984, 1998). One of the main distinctions among functionalist views is whether the view is taken to apply to properties or merely to predicates. A functional property is, at a minimum, one that has some “causal powers” essentially.1 The idea is that an object or event is not an acid, a predator, a gene, or a belief, unless it has the potential to enter into certain causal/nomic relations. A functional predicate is one whose conditions for correctly applying to an object make reference to such causal/nomic relations. One prominent strand of functionalist thought holds that functional predicates express nonfunctional properties, but do so by means of functional descriptions. This was the original idea of the Mind-Brain Identity Theory, especially in the versions of David Armstrong and David Lewis. Armstrong expresses the view as follows:
The concept of a mental state is primarily the concept of a state of the person apt for bringing about a certain sort of behaviour. (Armstrong 1968, p. 82)
It may now be asserted that, once it be granted that the concept of a mental state is the concept of a state of a person apt for the production of certain sorts of behaviour, the identification of these states with physico-chemical states of the brain is, in the present state of knowledge, nearly as good a bet as the identification of genes with the DNA molecule. (Armstrong 1968, p. 90)
Lewis sums up the view in an argument (1972, pp. 248–249):
Mental state M = the occupant of causal role R (by definition of M).
Neural state N = the occupant of causal role R (by the physiological theory).
∴ Mental state M = neural state N (by the transitivity of =).
The idea of this version of functionalism is that the predicate supplies some “causal role,” and the predicate expresses the nonfunctional property that “occupies” that role. This view is usually called “Occupant” (or “Realizer” or “Filler”) Functionalism. The usual alternative to this view takes functional predicates to express functional properties. This view is often called “Role Functionalism,” though I prefer to call it the “Functional State Identity Theory.” There is a standard objection to Occupant Functionalism and a standard response. The objection is that functional predicates correctly apply to creatures which share only a second-order, functional property (if they share anything at all). ‘Pain’ correctly applies to humans and to octopuses, Putnam suggested, and yet these creatures need
share no pain-specific physiological equipment. So, ‘pain’ does not express any brain property, but only a second-order, functional one. The standard response to this objection is to restrict the domain of objects in question. Lewis says: We may say that some state occupies a causal role for a population.... If the concept of pain is the concept of a state that occupies that role, then we may say that a state is pain for a population. Then we may say that a certain pattern of firing of neurons is pain for the population of actual Earthlings..., whereas the inflation of certain cavities in the feet is pain for the population of actual Martians.... Human pain is the state that occupies the causal role for humans. Martian pain is the state that occupies that same role for Martians. (1980, p. 126) Lewis’s idea is that on a given occasion of use, context selects a relevant population p, and ‘pain’ as used on that occasion expresses the property that occupies the pain role in p. He says that ‘pain’ is a “nonrigid designator.” On different occasions, it expresses different properties. The commonality among these uses is similar to that of an indexical: on each occasion ‘pain’ expresses the property that plays the role in the selected population. Kim also has advocated “local reductions” and species- or structure-specific identifications (1993, pp. 327–329; 1998a, pp. 110–111).
2 Interdefinitions
It seems to me that there is a metaphysical assumption hidden in this response to the multiple realization objection, which comes out clearly only when we look to the case of interdefinitions. One of the main objections to behaviorism in the philosophy of mind was that there are no characteristic effects of belief and desire in isolation. A belief that p causes different behavior in subjects with different desires; a desire that q causes different behavior in subjects with different beliefs. Lewis proposed to handle this case by interdefining ‘belief’ and ‘desire’ using Ramsey sentences. His general idea was that we can specify a causal role for a pair or n-tuple of properties, and then find realizers that occupy the relevant slots. ‘x is pain’ means something like: there is an n-tuple of realizer properties P1, . . . , Pn that play some role with respect to each other, and x has P4 (the one that fills the slot for ‘pain’). The general strategy of restricting the domain of objects under consideration is based on the allegedly a priori assumption that there will always be some realizer for a given functional predicate. Whenever such a predicate F applies to an object x, x must have some realizer or other of F. Kim is fairly explicit about this, suggesting that even if we had to restrict the population to microphysical duplicates, his “local reductions” would still be available (1998a, pp. 94–95). It seems to me that in the case of interdefinitions, there is no terribly good reason to think that restricting the population will ever leave us with a unique n-tuple of realizers. To get at the problem, consider how Kim sees the situation with interdefinitions. Kim (1998b, p. 105) suggests a toy Ramsey sentence for the psychological predicates ‘is in pain’, ‘is normally alert’ and ‘is in distress’. He calls this sentence ‘T’, and its Ramsey sentence ‘TR’. He says:
Suppose that the original psychology, T, is true of both humans and Martians.... Then TR, too, would be true for both humans and Martians: It is only that the triple of physical states ⟨H1, H2, H3⟩ which realizes the three mental states pain, normal alertness, distress and which therefore satisfies TR, is different from the triple ⟨I1, I2, I3⟩, which realizes the mental triple in Martians. (Kim 1998b, p. 111)
Kim assumes that the causal role in question will be played by one triple of realizers in the case of humans, and a distinct second triple in the case of Martians. This is one way in which triples of realizers could line up, but it is certainly not the only one. Maybe humans and Martians could share a pain-center. Maybe human pain-centers would work equally well when transplanted into Martians, just as a pig’s heart might function in a human. Or maybe, even among humans, there are multiple properties that fit each of the three slots, and any combination of them works equally well. That is, maybe among humans there are belief realizers B1, . . . , B10 and desire realizers D1, . . . , D10 such that any Bi and Dj will satisfy the overall role. Let me give two somewhat more realistic examples. There are multiple definitions of acids and bases. The Brønsted–Lowry one does not define them individually, but in conjugate acid–base pairs. On this account, acids and bases are pairs where the first donates protons and the second accepts them. On this account, the usual acids are components in conjugate acid–base pairs which have H2O as the base: ⟨HCl, H2O⟩, ⟨HCN, H2O⟩, ⟨H2SO4, H2O⟩, etc. But there are also conjugates in which H2O is the acid, such as ⟨H2O, NH3⟩. And there are pairs that do not involve H2O at all. Here is a second example. Suppose we start with first-order chemical structures C1, . . . , Cn. Some of these structures reproduce themselves. This is an important distinction among the structures. It is a second-order feature, however. The reproducers are not ones that produce C23; the reproducers are the Ci s that produce Ci s. The reproducers do not reproduce in isolation, however. They consume resources in reproducing. Which resources they consume varies from one reproducer to another. So, we will have to interdefine reproducers and resources for them. The definition would be roughly along the following lines: a reproducer/resource pair is a ⟨Ci, Cj⟩ such that Ci and Cj together produce more Ci s and consume Cj s in the process. We might call this the postulate of “organism-theory.” Continue down this second-order road, and eventually we will find population biology, with properties (or predicates) for growth rates of populations of organisms, carrying capacities of environments, predator–prey relations, etc. There can be multiple resources for a given reproducer, and the same thing can be a resource for multiple reproducers. In addition, one reproducer is often a resource for another (predation). These cases have a feature that is troubling for Occupant Functionalism. Lewis and Kim suggest that we should treat interdefinitions as quantifying over n-tuples of properties. They see the relevant Ramsey sentence as of the sort: ∃P1, . . . , Pn T(P1, . . . , Pn). And theoretical terms such as ‘F1’ and ‘F2’ are supposed to denote the first and second members, respectively, of the unique n-tuple ⟨P1, P2⟩ that satisfies such a Ramsey sentence, perhaps in some restricted domain of objects.
But in my two examples, there are no unique n-tuples of realizers; and there are no unique n-tuples, even when we restrict the population in question as drastically as one
desires. Restrict the population to some exact microphysical structure S. Suppose S is a chemically pure sample of some compound. Still there could be multiple further substances T such that S is an acid-relative-to-T, or a base-relative-to-T. Similarly, suppose S is the exact microstate of some organism at a given moment. There could be multiple resources for S, and S could be a resource for some other organism. Restricting the population does nothing to remove the problem of there being no unique realizer.2 The general feature that gives rise to this problem is that relations among second-order properties can correspond to many–many relations among their realizers. The case that I want to focus on here concerns many–many causal relations among realizers. Suppose there are second-order properties A, B, C which each have three (exclusive) realizers: A1, A2, A3, B1, . . . , C3. Suppose A and B suffice for C. Figures 1 and 2 illustrate some sample possibilities for relations among these realizers:
Fig. 1 One–one causal relation
Fig. 2 Many–many causal relation
Figures 1 and 2 show two extreme cases; there are many intermediates. In Fig. 1, A1 combines with B1 to cause C1; similarly with A2, B2 and C2, and with A3, B3 and C3. A1 does not combine with B2 or with B3 to do anything. Since we assumed that A and B suffice for C, it must be that A1 and B2 cannot be combined; if they could be, and did not lead to C, then the assumption would be false. A1 and B1 could be some human-specific realizers of certain beliefs and desires, for example, while A2 and B2 are octopus-specific realizers. Octopus brains are not capable of having A1, and human brains are not capable of having A2. Figure 1 is amenable to the idea of Occupant Functionalism. In that situation we could pick out a pair of realizers ⟨A1, B1⟩ using a phrase such as ‘the unique pair of realizers that causes C in population p’. Figure 2 represents the other extreme possibility. In it, any A-realizer combines with any B-realizer to suffice for C, but not for any specific C-realizer. I will give two
abstract illustrations of how this is possible; I return later to the question of whether there are actual cases. First, suppose indeterminism is correct. Suppose the A- and B-realizers are highly specific microstructural features of some acidic sample and a beaker of water in circumstances in which they are about to be combined. It could be that any (or near enough any) combination of A- and B-realizers yields a probability distribution over later states with two features: (i) every specific microstructural result has negligible probability, while (ii) all (or near enough all) resulting states share a gross feature, such as having a given concentration of hydronium ions. This is a case in which any A-realizer and any B-realizer yield C, while yielding no specific C-realizer. Some C-realizer occurs in each case, but it was not necessitated by prior states. Only C was necessitated. Second, a similar situation is possible under determinism, as well. Suppose that the A- and B-realizers of Fig. 2 are less than maximally specific. Multiple fully detailed microstructural states are compatible with each of A1, A2, etc. Suppose each of the realizers pictured has three further realizers A11, A12, A13, A21, . . . , C33. Suppose pairs of maximally specific A- and B-realizers do suffice for a unique C-realizer. Still, it could be that the results yielded by those pairs do not line up with relations among the mid-level realizers A1, B1, etc. It could be that A11 combines with B11 to yield C11, for example, while A12 combines with B12 to yield C31. Again, the mid-level realizers combine to yield C, but not to yield any specific C-realizer. Figure 2 cases are problematic for Occupant Functionalism. In such a case, there is no pair of realizers that combine to uniquely suffice for C, in any population. ⟨A1, B1⟩ suffices for C, but so do ⟨A1, B2⟩, ⟨A1, B3⟩, ⟨A2, B1⟩, etc. In the case of Fig. 2, a phrase such as ‘the n-tuple that suffices for C in population p’ is improper, whatever the p.
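The contrast can be made concrete with a toy model of my own devising (nothing in Lewis or Kim corresponds to this code; the names fig1, fig2, and unique_pair_for are all stipulated here):

# Toy model (my own illustration) of the Fig. 1 and Fig. 2 cases.
# Each map sends a combinable (A-realizer, B-realizer) pair to what it
# yields; pairs absent from a map cannot be combined at all.
fig1 = {("A1", "B1"): "C1", ("A2", "B2"): "C2", ("A3", "B3"): "C3"}

# In Fig. 2 every pair combines and yields C, but no specific C-realizer
# is necessitated; we record only the set of open possibilities.
fig2 = {(a, b): {"C1", "C2", "C3"}
        for a in ("A1", "A2", "A3") for b in ("B1", "B2", "B3")}

def unique_pair_for(effect, causal_map):
    """Occupant-style description: 'the unique pair of realizers that
    causes effect'. Returns the pair if exactly one exists, else None."""
    pairs = [pair for pair, result in causal_map.items()
             if result == effect
             or (isinstance(result, set) and effect in result)]
    return pairs[0] if len(pairs) == 1 else None

print(unique_pair_for("C1", fig1))   # ('A1', 'B1'): the description refers
print(unique_pair_for("C1", fig2))   # None: nine pairs tie, so it is improper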
3 The Micro–Macro Mirroring Thesis
The semantic view of Occupant Functionalism has turned out to rest on a metaphysical assumption about the sorts of nomic relations that special science properties enter into.3 In this section, I formulate that assumption, which I call the Micro–Macro Mirroring Thesis. In the next section, I argue that reductionism implies the Mirroring Thesis. Then I will argue that the Thesis is not a priori. In Fig. 1, there is a sense in which the nomic relation among second-order properties is a mere summary of relations among realizer properties. In this case, the boundaries drawn around properties A, B, and C are somewhat arbitrary. We could equally have drawn boundaries around properties (A1 ∨ A2), (B1 ∨ B2), and (C1 ∨ C2). Call these properties A-minus, etc. In Fig. 1, our language would be none the worse for expressing causal/nomic relations if we had chosen to use terms for A-minus, B-minus, and
3 It may not be strictly true that the semantic view itself rests on this assumption. One could potentially hold on to Occupant Functionalism while accepting my objection, and conclude that special science terms are similar to ‘caloric fluid’ and ‘phlogiston’. They purport to denote the unique realizer (or n-tuple) that φs, while there is no such thing. This is contrary to the intentions of my model reductionists, however.
C-minus, together with ones for A3, B3, and C3. In Fig. 1, the relations among A, B, and C could equally well be expressed in two cases: (A-minus & B-minus) → C-minus, and (A3 & B3) → C3. In Fig. 2, however, the new boundaries fail to line up with the causal/nomic relations. In Fig. 2, it is false that A-minus and B-minus suffice for C-minus: sometimes A-minus and B-minus result in C3 instead. The nomic relation holds only among properties that the new terminology would force us to formulate as disjunctions: [(A-minus ∨ A3) & (B-minus ∨ B3)] → (C-minus ∨ C3). So, in Fig. 1 cases, there is a sense in which the nomic relations among second-order properties are themselves disjunctive. The nomic relation among A, B, and C obtains in a class of cases that is a mere union of cases of further nomic relations. Here is an attempt to capture that:
Definition A nomic relation among second-order properties (A & B) → C is disjunctive iff in every instance of (A & B) → C, there are realizers AR, BR, and CR, of A, B, and C, respectively, such that (AR & BR) → CR.
The nomic relation of Fig. 1 is disjunctive, while that in Fig. 2 is not. In Fig. 2, it is false that (A1 & B1) → C1, for example. If we start shaving off realizers from the second-order properties A, B, and C, we lose the nomic relation. Occupant Functionalism presupposed the metaphysical assumption that any nomic relations among second-order properties would break down as Fig. 1 illustrates. The view appears to require this assumption4:
The Micro–Macro Mirroring Thesis: Any nomic relation among disjunctive (/second-order) properties is disjunctive.
I call this the “Mirroring Thesis” because it says that the second-order nomic relation has many copies in relations among realizers. I have not seen the Mirroring Thesis explicitly formulated, but I will argue in the next section that it captures a central metaphysical issue separating reductionists and nonreductionists.
4 Reductionism implies the Mirroring Thesis
In this section, I argue that reductionism implies the Mirroring Thesis. If the Thesis is false, then reductionism is false. I also argue that if the Mirroring Thesis is true, reductionism is at least defensible, even given multiple realization. It is somewhat unclear what reductionism is in the first place, so I start with that question. The reductionist view I consider here is something of an amalgamation of Lewis’s and Kim’s views. In the conclusion, I will consider whether their views should be distinguished, and in what ways.
4 Occupant Functionalism seems to require more than the Mirroring Thesis, as I formulate it here. The Mirroring Thesis says in effect that Fig. 2 cases do not obtain. Occupant Functionalism requires that all cases are Fig. 1 cases. There are further cases in which the Mirroring Thesis holds and in which Occupant Functionalism is still in trouble. I formulate the Thesis in this weaker fashion in an attempt to raise even more trouble.
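Continuing the toy model from Sect. 2 above (again my own stipulation), the Definition can be checked mechanically: a relation is disjunctive just in case every combinable pair of realizers necessitates exactly one realizer of the effect:

# Check the Definition of a disjunctive nomic relation against the toy
# maps fig1 and fig2 defined earlier. (A & B) -> C is disjunctive iff
# every instance is backed by a realizer-level law (AR & BR) -> CR,
# i.e. each combinable pair necessitates exactly one C-realizer.
def is_disjunctive(causal_map):
    for pair, result in causal_map.items():
        outcomes = result if isinstance(result, set) else {result}
        if len(outcomes) != 1:
            return False
    return True

print(is_disjunctive(fig1))   # True: Mirroring holds in the Fig. 1 case
print(is_disjunctive(fig2))   # False: the Fig. 2 relation is not disjunctive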
There are both semantic and metaphysical views in the vicinity of reductionism. Occupant Functionalism and the Functional State Identity Theory are semantic views in the first instance: views about which predicates express which properties. But obviously enough the views are motivated by conflicting metaphysical assumptions. Occupant Functionalism is motivated by the view that second-order functional properties are defective in some way. Kim treats functional properties as a subclass of second-order properties, and argues that the second-order ones do not exist. They are not properties, but mere “concepts” (1998a, pp. 110–111; 2008, pp. 111–112). Lewis says that functional properties are “excessively disjunctive” (1994, p. 307). So, they cannot occupy any causal roles and cannot serve as the denotations of functional predicates. Lewis treats properties as Carnapian intensions: functions from possible worlds to extensions. These are the “abundant” properties. Lewis then distinguishes among the “natural” and “unnatural” properties (1983). The natural properties are “sparse.” Lewis holds that functional properties are highly unnatural. All parties to the current dispute accept at least some properties. Setting strictly nominalist scruples aside, we at least have Lewis’s intensions to choose from. So, there does not appear to be much real difference between Kim’s view that functional properties are not properties at all but mere concepts, and Lewis’s view that they are properties but highly unnatural and disjunctive ones. So, the relevant metaphysical component of reductionism is the view that second-order, functional properties either do not exist or are at best highly unnatural properties.5 I will simply call this view ‘reductionism’, though I allow that it may fit better with prior usage to reserve the term ‘reductionism’ for the semantic view of Occupant Functionalism, or for the conjunction of that view and the metaphysical one.6 So, the issue separating reductionists and nonreductionists concerns naturalness of properties. Unfortunately, no one has any terribly good account of what naturalness of properties is. Lewis officially takes naturalness as primitive (1983). This approach runs the risk of making disputes about reductionism fruitless. There is an uninterpreted primitive term ‘N’. I say it applies to x, you deny it. That is an impasse. Lewis does give us some marks of naturalness to go on, however. Natural properties capture real similarities and differences of objects. They are the properties that figure in laws, and in virtue of which states or events can be causes. Lewis also holds that the perfectly natural properties are physically fundamental. (Lewis introduces degrees of naturalness based upon the length of a proper definition in a suitably fundamental language.) In the current context, it is question-begging to assume that only fundamental properties can be natural. The nonreductionist holds that nonfundamental properties can be natural, and as evidence for this view she argues that they have the other natural-making (or natural-indicating) features. Schaffer (2004) notes this possibility, and suggests that we revise Lewis’s concept of naturalness to distinguish between the fundamentality component and the objective similarity-making component.
5 We should in general distinguish functional from second-order properties. I lump them together now for simplicity of discussion, and distinguish them in the conclusion. 6 There are other acceptable uses of ‘reduction’, as well.
This distinction seems to be required in order to avoid ruling out the nonreductionist view merely by definition. However exactly naturalness is understood, one key feature of natural properties is that they figure in laws, and that the divisions they induce among objects correspond to nomic differences among those objects. But we have to be careful about how we understand “figure in laws.” This must amount to more than there being nomically necessary truths about a property. There are two problems here. First, everything that is either an electron or an apple is either negatively charged or a fruit. This nomically necessary fact does not make being an electron or an apple a natural property. Call this the problem of “arbitrary disjunctions.” Second, all the necessary truths about electrons are equally true of electrons that are within 2 m of President Obama. That fact does not make being an electron within 2 m of President Obama a natural property. Call this the problem of “irrelevant conjunctions.” Natural properties have to avoid both problems. The moral of these problems is that natural properties have to be such that weakenings of them (with larger extensions) do not stand in the same nomic relations, while strengthenings of them (with smaller extensions) do stand in the same relations. The extensions of natural properties have to be big enough to capture nomic truths, and just big enough. The problem with being an electron within 2 m of Obama is that its extension is too small. Its being too small consists in the fact that nomic truths about this property are equally nomic truths about electrons. The problem with being an electron or an apple is that its extension is too large. Its being too large consists in the fact that the nomic truths about it break down into two classes: the truths about electrons and the truths about apples. So, at least to a first approximation, the relevant concept of “figuring in laws” that distinguishes a natural property P has two features. On the one hand, P stands in nomic relations that weakenings of P do not stand in. On the other hand, the nomic relations that P stands in do not break down into mere disjunctions of further nomic relations. Otherwise, breaking P into disjuncts P1 and P2 would leave us with perfectly good laws. Irrelevant conjunctions such as being an electron within 2 m of Obama fail to satisfy the first feature: the same necessary truths hold of the weaker property being an electron. Arbitrary disjunctions such as being an electron or an apple fail to satisfy the second: their disjuncts have their own distinctive features. This criterion certainly does not amount to an analysis. Any universal proposition Every P is Q will break down into further laws if we are allowed arbitrary subsets of P and Q: every P-before-the-year-2000 is a Q-before-the-year-2000, and similarly for the after-2000 versions. Instead, the criterion is intended as a constraint on what sets of properties could qualify as the natural ones. Reductionists and nonreductionists alike assume that basic physical properties will be natural. Maybe there are other base-level natural properties, too. But being an electron or an apple fails to qualify if we already have being an electron as natural (unless there are some highly unexpected similarities among apples and electrons). Given this understanding of naturalness, reductionism requires the Mirroring Thesis. If the Thesis is false, there are genuinely new nomic relations among second-order properties.
To illustrate this newness, imagine erasing the circles that represent second-order properties from Figs. 1 and 2. In Fig. 1, there remain nomic relations
among the realizers. In erasing the second-order properties, we lose only a way of lumping together preexisting nomic relations. But if we erase the second-order properties from Fig. 2, the nomic relations simply disappear. In Fig. 2, there are no realizer properties that can fill the slots on both sides of a nomic conditional ‘→’. Only when we allow disjunctions or unions of these realizers do there come to be any nomic relations at all, even ones among the realizers themselves. For example, A1 stands in this nomic relation: [A1 & (B1 ∨ B2 ∨ B3)] → (C1 ∨ C2 ∨ C3). Without the disjunction of B- and C-realizers, A1 is not nomically related to anything (or at least to anything pictured). All of the nomic relations in Fig. 2 cases involve second-order properties (or at least their intensions).7 By contrast, in the Fig. 1 case of Mirroring, adding the second-order properties A, B, and C does not create anything genuinely new. It merely reclassifies what was already there. Failure of the Mirroring Thesis yields natural second-order properties. Suppose that we attempt to draw borders around sets of objects that correspond to nomic potentials of those objects. We want to draw the borders around sets that are large enough but not too large: large enough to avoid irrelevant conjunctions, but not so large as to admit arbitrary disjunctions. In Fig. 2, we need the borders around the second-order properties B and C in order to capture the nomic potentials of the specific A-realizers. No specific A-realizer is nomically related to any pair of specific B- and C-realizers, but only to the disjunctions of them. And now consider A1, which stands in this relation: [A1 & (B1 ∨ B2 ∨ B3)] → (C1 ∨ C2 ∨ C3). Weakenings of A1 also stand in that relation: A2 or A3 would equally do the trick. At least as far as this nomic potential is concerned, A1 is an irrelevant conjunction. So, in these cases, to draw borders around sets of objects that capture their nomic potentials, we must draw borders that correspond to second-order properties, and not (just) to their realizers.8 So, failure of the Mirroring Thesis yields natural second-order properties. These properties bring new laws with them, and thus are natural. We could dub this view “Robust” nonreductionism:
Robust Nonreductionism: The Mirroring Thesis is false; there are nomic relations among disjunctive (and second-order) properties that are not themselves disjunctive.
Robust Nonreductionism accepts natural second-order properties while remaining inegalitarian about naturalness. Not just any way of lumping realizers together yields a new law, and thus not just any way of lumping realizers together yields a natural property. So, on this view there are more and less natural second-order properties. I do not know how to measure degrees of naturalness in any detail, but I see no bar in principle to second-order properties being about as natural as one wishes, if they
7 Later I consider the objection that the relevant laws require only disjunctions of first-order properties, and do not require any disjunctive properties. I set this objection aside for now. My argument here is that the intensions corresponding to (either disjunctive properties or disjunctions of properties) carve nature at the nomic joints. 8 I assume that the realizers also correspond to natural ways of drawing borders. They stand in further
nomic relations not pictured in Fig. 2.
can be at all natural to begin with.9 So, if there are nondisjunctive second-order nomic relations, then there are natural second-order properties. Reductionism says there are none. So, reductionism implies the Mirroring Thesis. In addition, if the Mirroring Thesis is true, then reductionism is at least defensible. Multiple realization raises difficulties for reductionism, but these difficulties seem to me to be primarily semantic in nature. The reductionist wants to say that it is somehow in virtue of this or that realizer property that a second-order term applies to an object. It is not easy to capture this idea in a workable semantic view. But the reductionist’s main metaphysical motivation remains largely untouched by multiple realization. The reductionist holds that ways of grouping realizers together into second-order properties are ultimately arbitrary. Maybe some disjunctions are more useful to note for some purposes of creatures like us, but ontologically speaking there is nothing to choose between one grouping and another. As in the case of Fig. 1, there is nothing much to choose between a scheme that had a predicate expressing A (= A1 ∨ A2 ∨ A3) and an alternative scheme that had predicates expressing A-minus (= A1 ∨ A2) and A3. So, reductionism is false if the Mirroring Thesis is; and if the Thesis is true, reductionism is at least a defensible option. Multiple realization raises semantic difficulties for the reductionist, but the Mirroring Thesis seems to capture a central metaphysical motivation for the view. So, is the Mirroring Thesis true? Is it a priori?
5 The Mirroring Thesis is not a priori
In this section, I argue that the Mirroring Thesis is not a priori. I cannot say whether it is actually true or false; I indicate some relevant considerations. The nomic relations depicted in Fig. 2 are clearly possible a priori, and so the Mirroring Thesis is not a priori. There is one way to deny this conclusion, which comes in a few versions. One could maintain that it is necessary and a priori that second-order properties do not exist; thus, there are no nomic relations involving them; thus, all such relations are vacuously disjunctive. Alternatively, one could maintain that it is necessary and a priori that second-order properties are unnatural; but only natural properties can figure in laws, so no relation among second-order properties qualifies as ‘nomic’; so, again, all such relations are vacuously disjunctive. These responses are question-begging, obviously enough. They build the truth of reductionism into their definitions (of either ‘property’ or ‘naturalness’). One might maintain that these definitions are part of the best overall philosophical system, but that seems simply wrong to me. If the reductionist insists, I will give her the term ‘natural’ (or ‘property’). Let the natural* properties be the ones that correspond to distinctions among the nomic potentials of objects. It seems pretty clear that the natural* properties are the ones that are causally, explanatorily, and practically important.
9 Lewis suggests measuring degrees of naturalness by length of a definition in a language which contains terms only for fundamental properties. Second-order properties could qualify as highly natural by these lights, though probably not perfectly natural. My suggested interdefinition of organisms and resources for them was reasonably short, for example. It is a good question why Lewis does not simply accept that such properties are natural. I return to this issue briefly in the conclusion.
So, given the new terminology, the issue concerning reductionism becomes whether second-order properties are natural*. My illustrations show that it is possible a priori that some second-order properties are natural*. So, the Mirroring Thesis is not a priori. Reductionism implies the Mirroring Thesis; so, reductionism is not a priori. There are more subtle ways of begging the question here; I will mention one. Figure 2 depicts a nomic relation of this sort:
[(A1 ∨ A2 ∨ A3) & (B1 ∨ B2 ∨ B3)] → (C1 ∨ C2 ∨ C3)   (1)
One might think something like the following. One can accept that (1) is true, while reading it only to contain terms for first-order realizers. ‘A1’ is really short for ‘A1x’, with ‘x’ universally quantified. So, (1) really only contains predicates that express first-order properties. It is an extra step to see (1) as describing a relation among second-order properties such as (A1 ∨ A2 ∨ A3)x. So, one can accept (1) while rejecting second-order properties. This objection seems to me to be question-begging again, though a bit more subtly than before. All parties to this debate are realists about properties to some extent. So, we have Lewis’s intensions to choose from. These intensions are not inherently disjunctive, conjunctive, etc. There is no difference between an intension expressed by disjoining predicates and one expressed by disjoining sentences: λx(B1x ∨ B2x ∨ B3x) = λx(B1 ∨ B2 ∨ B3)x. The question that faces us is whether that one thing corresponds to a property (/natural property).10 The current objection says that it does not; but why? The only answer I can see is that it is not a first-order property; so, it does not qualify as a ‘property’. That is the question-begging response again. The Mirroring Thesis is not a priori; it is much more difficult to determine whether it actually obtains. Let’s start by looking at the classic case of the reduction of thermodynamics to statistical mechanics. In this reduction, macroscopic properties are identified with gross properties of ensembles of particles. The paradigmatic example is the identification of temperature with mean molecular kinetic energy. Nomic relations among macroscopic states, such as that P leads to Q, are then reduced to relations among sets of microscopic states. All (or near enough all) microstates that have the gross feature P lead to other microstates that have the gross feature Q. The evolution of these systems is deterministic and the relations among microstates are one–one. The macroscopic relation of P’s leading to Q is realized on the microlevel as an infinite class of relations of the sort: P1 leads to Q1, P2 leads to Q2, etc. Figures 3 and 4 illustrate (respectively) what the Mirroring Thesis and an extreme case of its failure would look like in the current example. Figure 3 and Mirroring hold in the classical reduction. Yet it is currently not clear whether Fig. 3 is ultimately correct. If the underlying dynamics at the quantum level is indeterministic, then the one–one relations depicted in Fig. 3 will not in general hold among microstates.
I do insist that necessarily coextensive properties are alike with respect to naturalness. This seems to be required by the general conception that links naturalness to nomic roles and relations. So, questions of naturalness arise about intensions in the first instance.
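To display the identity invoked in the question-begging objection above, note that applying either lambda term to an arbitrary object o yields the same condition (a sketch under the assumption, made in the text, that properties are modeled as Lewisian intensions, and reading the disjunction of predicates applied to x as the disjunction of the applications):

\[
\bigl(\lambda x\,(B_1 x \lor B_2 x \lor B_3 x)\bigr)(o) \;=\; B_1 o \lor B_2 o \lor B_3 o \;=\; \bigl(\lambda x\,(B_1 \lor B_2 \lor B_3)\,x\bigr)(o).
\]

By extensionality, the two terms determine one and the same intension; disjunctiveness attaches to the mode of expression, not to the intension itself.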
[Fig. 3 Mirroring for statistical mechanics: realizers A1, A2, ..., An paired one–one with C1, C2, ..., Cn]

[Fig. 4 Failure of Mirroring: realizers A1, A2, ..., An and C1, C2, ..., Cn, with the one–one pairing absent]
Some ways in which they could possibly fail to hold would be ones in which both thermodynamics and statistical mechanics are unsalvageably false. No one expects that to happen. Presumably, if the underlying dynamics were indeterministic, it would still be the case that gross generalizations about temperature, heat, entropy, etc. would hold as good approximations. Microstates which realized these gross features would be overwhelmingly likely to evolve into microstates with corresponding gross features, though no particular microstates would be necessitated. Something like Fig. 4 could obtain in such a case. More realistically, we would expect this case to be incredibly complicated. Figures 3 and 4 are only the extreme possibilities; Fig. 5 represents an intermediate one that is more plausible for the case of indeterministic thermodynamics. In the current case, there are continuum-many microstates in the place of the Ai, and each one yields some probability distribution over later microstates. We expect these distributions to be consistent with the approximate truth of macroscopic generalizations about temperature, heat, etc. Beyond that, I admit I do not know what such relations would be like. The question is beyond my own level of expertise. In these intermediate cases, Mirroring does not obtain, but neither does it fail completely.

Intermediate cases can raise complications. It seems possible for some cases to work out so that neither the reductionist nor the nonreductionist will be fully satisfied. Suppose that A1 ∨ A2 yields C1 ∨ C2, as a failure of Mirroring, but that a long list of further Ai each yield a unique Ci. Then it seems that the considerations of naturalness I raised before would recommend that there are natural disjunctive properties (A1 ∨ A2) and (C1 ∨ C2), contra the reductionist's desires. But those considerations would not yield that A and C are natural, contra the nonreductionist's desires. The natural properties might be the two disjunctive ones, and then each of the further Ai and Ci.
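The indeterministic possibility can be made vivid with a toy model (a purely illustrative sketch; every number in it is arbitrary). Microstates realizing a gross feature P evolve probabilistically; the macro-generalization 'P leads to Q' holds in every run, yet no microstate has a unique successor, so the one–one relations of Fig. 3 fail:

    import random

    # Toy illustration of Fig. 4/5-style dynamics: the micro-level is
    # chancy, while the macro-regularity is preserved in every run.
    random.seed(0)
    P_STATES = range(100)        # microstates realizing gross feature P
    Q_STATES = range(100, 200)   # microstates realizing gross feature Q

    def evolve(p_state):
        # Each P-microstate yields a distribution over an overlapping
        # subset of Q-microstates, not a unique successor.
        window = [100 + (p_state + k) % 100 for k in range(10)]
        return random.choice(window)

    # The macro-claim 'P leads to Q' holds in every run...
    assert all(evolve(p) in Q_STATES for p in P_STATES)
    # ...but repeated runs from one microstate disagree, so Mirroring's
    # one-one micro-relations fail.
    print(len({evolve(0) for _ in range(50)}) > 1)  # True

In this toy the successor distributions overlap, as in Fig. 5; nothing in the toy settles whether actual quantum-statistical dynamics behaves this way, only that the structure is coherent.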
[Fig. 5 An intermediate case: each realizer Ai yields a probability distribution over an overlapping subset of C1, ..., Cn]
In the case pictured in Fig. 5, however, the nonreductionist would seem to have what she wants. There is a nomic truth: each A-realizer gives rise to a probability distribution over some subset of the C-realizers, and those subsets are jointly exhaustive of the C-realizers. In the Fig. 5 case, those subsets of C-realizers are overlapping. So, there is no partition of the sets of A- and C-realizers into subsets that satisfies Mirroring with respect to A and C. The borders corresponding to A and C carve nature at a nomic joint. Consider a strengthening C− of C, which includes some but not all of the C-realizers. The nomic truth I mentioned is false of C−: some A-realizers have associated distributions that include elements not in C−. And consider a similar strengthening of A to A−. The realizers in A− give rise to distributions limited to C-realizers, but not exhaustive of them. There is a gain in naturalness in expanding A− to A. Also, it is implicit in Fig. 5 that further expansions of A and C to A+ and C+, which include more realizers, would fail to stand in a relation similar to that of A and C. So, in the Fig. 5 case, A and C appear to be natural.

It is at least epistemically possible at this point in time that the basic dynamics is indeterministic; so it is at least epistemically possible that something like Fig. 4 or Fig. 5 is actual too. So, it seems to me to be currently unsettled whether Mirroring really holds in this important case. It may be worth noting that even with classical, deterministic dynamics, it is unclear whether the "reduction" of thermodynamics supplies what our "reductionists" need. One way of reading the "reduction" is as a vindication of the macroscopic properties and relations. At least on this reading, the macroscopic properties still exist, are still second-order, and still stand in nomic relations.11

I have argued that the truth of the Mirroring Thesis is necessary for reductionism; it is not sufficient, or at least it is highly debatable whether it is sufficient. Failures of Mirroring are possible under determinism, as well, so long as the realizer properties Ai and Bi are less than maximally specific.

11 For example, having mean molecular kinetic energy x is clearly a second-order property. It involves being composed of particles with some or other kinetic energy properties which have an average of x. Kim's objections to second-order properties apply equally well here. His main idea is that there is no new causal work for second-order properties; anything they could do would already be accomplished by their realizers. If this argument works at all, it clearly disqualifies temperatures (= mean kinetic energy properties). Anything the mean value could do would already be accomplished by the exact microstate. It is less clear whether the case would be problematic for Lewis; I discuss some questions of Lewis scholarship in the conclusion.
Even if determinism is true, it is at least abstractly possible that there are second-order properties A, B, and C such that (A & B) → C, and maximally specific, state-of-the-universe realizers which mirror this relation, while there are no Mirroring realizers in between. This is an important possibility. I have treated reductionism as the highly general view that second-order properties are unnatural. But reductionists standardly want more specific versions of reduction too, such as a reduction of mental processes to neurochemical ones. These more specific reductions require more specific versions of the Mirroring Thesis, such as a version that links mental processes and their neurochemical realizers. The neurochemical realizers are mid-level ones. So, the possibility that Mirroring could fail for them, even given determinism, is worth a look.

It is difficult for me to see whether Mirroring holds for mid-level realizers. Ultimately, the issue is a posteriori. I will argue for the weaker claim that it is not implausible to suppose that it does often fail, even assuming underlying determinism. There are many cases in which a mechanism has been identified for a macroscopic process. But identifying a mechanism need not amount to a derivation of the dynamics of the macro-level process from a dynamical account of the lower-level one. Mechanisms can provide a sort of how-this-is-possible explanation, rather than a derivation of the sort from statistical mechanics to thermodynamics.

For example, population biology contains models of population growth and of the relations between growth rates for predator and prey populations. Imagine a population of geckos which eat flies. We do not currently have any laws capable of predicting the trajectories of individual geckos and flies in any great detail, though we do know that a greater density of geckos in a region will lead to a lesser density of flies in that region. Models of predator–prey relations among populations relate such gross features. Even supposing that fundamental determinism is true, it could be that nothing short of microphysics will be able to predict the positions of individual organisms in sufficient detail to satisfy the Mirroring Thesis by telling us (for example) which geckos will eat which flies at what times. Maybe at the biological level of individual organisms, we can say only that organisms will establish territories with certain features and relations and engage in certain sorts of foraging behavior. We may have knowledge of various mechanisms which are responsible for these individual behaviors. Our knowledge of such facts could help us to pin down the values of parameters in the population-level model without allowing anything like a derivation of the behavior of an aggregate of individual organisms that would satisfy Mirroring.

The relevant features in this last example may well hold of the relation between psychological and neurophysiological states, too. For example, suppose we have in hand a molecular-level account of long-term potentiation of neurons. This is not an account of memory, but is merely the underpinnings of an account of how memory is possible for creatures like us. The neurophysiological realizers for human memory involve vast numbers of neurons. Maybe there are nomic relations of the sort: given a high enough density of neurons of type N in brain region R, and given a certain sort of input I, enough of the N-neurons will become potentiated in such a way as to fix a memory.
But at the same time, it might be that nothing short of microphysics will tell us exactly which neurons will enter exactly which states. These examples are admittedly speculative, but my point is only that it is an a posteriori question whether the Mirroring Thesis holds for any mid-level realizers, or for
which ones. I have attempted to argue merely that it is not implausible to suppose that it fails, even given underlying determinism. If it does fail, then more specific versions of reductionism will fail as well. In addition, at least in the case I mentioned of models of population growth, the question of whether the mid-level realizers line up as Mirroring says does not appear to be of burning scientific interest. Maybe there will come to be some theory of bio-statistical mechanics that stands to population dynamics as statistical mechanics stands to thermodynamics. Or maybe not. It appears to be enough for the population-level explanatory practice to be acceptable that there are mechanisms which in aggregation make possible the relevant macro-level relations. It does not appear to be necessary that those mechanisms fit the model of the classical reduction of thermodynamics to statistical mechanics. So, we have at least a prima facie case that the truth of Mirroring is not required by special science practice. That is not to say that Mirroring or any specific version of it is false. But if scientific practice does not require Mirroring, and if Mirroring lacks any compelling a priori justification, then reductionism appears to be somewhat unmotivated.
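The population-level case can be made concrete with the standard Lotka–Volterra predator–prey model (a minimal sketch; the parameter values are arbitrary illustrations, not empirical estimates):

    # Lotka-Volterra sketch: the model relates gross densities only and is
    # silent about which gecko eats which fly. Parameters are illustrative.
    def lotka_volterra(flies, geckos, steps=1000, dt=0.01,
                       a=1.0, b=0.1, c=1.5, d=0.075):
        trajectory = [(flies, geckos)]
        for _ in range(steps):
            dflies = a * flies - b * flies * geckos    # prey growth minus predation
            dgeckos = d * flies * geckos - c * geckos  # predation gain minus death
            flies += dflies * dt
            geckos += dgeckos * dt
            trajectory.append((flies, geckos))
        return trajectory

    print(lotka_volterra(flies=40.0, geckos=9.0)[-1])

Everything the model tracks is a density. Satisfying Mirroring would require, in addition, lower-level laws pairing the trajectories of individual organisms, and nothing in the model supplies or presupposes them.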
6 Conclusion

I have argued that reductionism (of the metaphysical type) requires the Micro–Macro Mirroring Thesis, and that the Mirroring Thesis is not a priori. It appears to be a currently open question whether the Thesis is true; I attempted to indicate some relevant considerations. One could potentially argue (contra my prima facie case in the last section) that the Mirroring Thesis is in fact an important theoretical and methodological assumption in scientific practice. It seems to me that this is the way reductionists should argue. I do not now know whether such arguments can succeed, but they seem to be required in order for reductionism to be justified. The main current arguments for reductionism seem to me to be unsuccessful. The gist of these arguments is that second-order properties do not exist (or are unnatural) because there is no causal work for them to do; anything they might do would already be accomplished by their realizers. I have attempted to illustrate how it is possible at least in principle for there to be new causal work for second-order properties in cases of Mirroring failure.

I will conclude by briefly discussing some issues of Lewis scholarship. In defining 'reductionism' earlier, I lumped together functional and second-order properties. For both Kim and Lewis, functional properties are a species of the genus of second-order properties.12 Kim's objections are directed against the existence (/naturalness) of second-order properties in the first instance.

12 Some authors (such as Shoemaker 1984, 1998) allow that functional properties can be first-order, basic properties in their own right. I have attempted to follow standard usage with terms such as 'second-order property', 'disjunctive property', etc. The idea of this usage is that the physically basic properties are first-order, nondisjunctive, etc. by stipulation. Let L be a language which contains basic terms for all and only such properties. Other properties are disjunctive, second-order, etc. when their best expression in L is disjunctive, second-order, etc. The qualification "best" is difficult to spell out in detail.
It is less clear to me whether Lewis's worries are directed at functional properties only, or at the genus of second-order properties.13 Lewis clearly holds that functional properties are unnatural. Some of his remarks suggest that this conclusion extends to second-order properties more generally.14 But Lewis's general metaphysical views seem to allow for natural second-order properties, and even for natural functional properties, in a sense.

We could distinguish fully functional properties from quasi-functional properties. The fully functional properties have their causal powers with metaphysical necessity; the quasi-functional properties have their powers with merely nomological and not metaphysical necessity. So, for example, in any world anything with the fully functional property of acidity would dissolve metals in the right circumstances; but in a world with different laws, something with the quasi-functional property of acidity might transform metals into candy instead. A quasi-functional property is the property of having certain potentials given actual laws. A fully functional property is the property of having certain potentials simpliciter. Both sorts of properties are second-order, at least on Lewis's and Kim's views.

Lewis's general metaphysical views seem to me to be consistent with both Robust Nonreductionism and the Functional State Identity Theory when those views are understood to concern quasi-functional properties. The existence (or naturalness) of fully functional properties would violate Lewis's Humean metaphysics; the existence (or naturalness) of quasi-functional ones does not. Quasi-functional properties can count as at least reasonably natural by Lewis's own lights. For example, the definitions I sketched of second-order relations of being a conjugate acid-base pair, and an organism-resource pair, need not be terribly lengthy in Lewis's fundamental vocabulary. I myself see no bar in principle to Lewis's accepting that these second-order relations are natural and are thus suited to be denotations of special science terms. For whatever reason, this is not the view that Lewis actually adopted in the case of functional properties (and seemingly sometimes in the case of second-order properties more generally).

Multiple realization is no problem at all for the Functional State Identity Theory with quasi-functional properties. Humans, octopuses, and martians could all share quasi-functional pain, for example. So, there is no need to restrict domains for Ramsey sentences, or to engage in any of the other moves people have undertaken in response to multiple realization worries.

13 This difference matters in the case of thermodynamics. Having a given mean kinetic energy value is clearly a second-order property, and so Kim's objections apply to it. It is not, or at least is not obviously, a functional property, however. Its definition uses second-order quantification, but that quantification is over properties of parts, and does not directly mention any effects. So, one who rejects second-order properties will have to treat the reduction of thermodynamics as an elimination of the macro-level properties, while one who rejects only functional properties could see it as a vindication.

14 Here are two remarks that support this interpretation. First, in "How To Define Theoretical Terms" (1970), Lewis considers the property of having some property that satisfies a term τi in a second-order Ramsey sentence for theory T. He says of this property, "It is not named by τi in any world, unless T is a very peculiar theory" (p. 86). Presumably the reason it is not named is that it is a defective property. This remark appears to apply to any Ramsey-Lewis definition, functional or otherwise. Second, in objecting to functional properties in "Reduction of Mind" (1994), Lewis briefly raises a Kim-style exclusion problem: "To admit it [a functional property] as causally efficacious would lead to an absurd double-counting of causes. It would be like saying that the meat fried in Footscray cooked because it had the property of being either fried in Footscray or boiled in Bundoora" (1994, p. 307). This objection appears to apply to disjunctive and second-order properties generally.
Lewis famously makes such moves, though I do not see why he has to. My conclusions may well be consistent with Lewis's general metaphysics.

Acknowledgment For valuable comments, I would like to thank Christopher Hitchcock, Michael McKinsey, Larry Powers, Susan Vineberg, and an audience at the 2009 meeting of the Society for Exact Philosophy.
References

Armstrong, D. (1968[1993]). A materialist theory of the mind (revised ed.). London: Routledge.
Hiddleston, E. (forthcoming). Second-order properties and three varieties of functionalism. Philosophical Studies. doi:10.1007/s11098-010-9518-z.
Kim, J. (1993). Multiple realization and the metaphysics of reduction. In J. Kim (Ed.), Supervenience and mind (pp. 309–335). Cambridge: Cambridge University Press.
Kim, J. (1998a). Mind in a physical world. Cambridge, MA: MIT Press.
Kim, J. (1998b). Philosophy of mind. Boulder: Westview Press.
Kim, J. (2008). Reduction and reductive explanation: Is one possible without the other? In J. Hohwy & J. Kallestrup (Eds.), Being reduced (pp. 93–114). Oxford: Oxford University Press.
Lewis, D. (1970[1983]). How to define theoretical terms. In D. Lewis (Ed.), Philosophical papers (Vol. 1, pp. 78–95). Oxford: Oxford University Press.
Lewis, D. (1972[1999]). Psychophysical and theoretical identifications. In D. Lewis (Ed.), Papers in metaphysics and epistemology (pp. 248–261). Cambridge: Cambridge University Press.
Lewis, D. (1980[1983]). Mad pain and martian pain. In D. Lewis (Ed.), Philosophical papers (Vol. 1, pp. 122–130). Oxford: Oxford University Press.
Lewis, D. (1983[1999]). New work for a theory of universals. In D. Lewis (Ed.), Papers in metaphysics and epistemology (pp. 8–55). Cambridge: Cambridge University Press.
Lewis, D. (1994[1999]). Reduction of mind. In D. Lewis (Ed.), Papers in metaphysics and epistemology (pp. 291–324). Cambridge: Cambridge University Press.
Schaffer, J. (2004). Two conceptions of sparse properties. Pacific Philosophical Quarterly, 85, 92–102.
Shoemaker, S. (1984[2003]). Causality and properties. In S. Shoemaker (Ed.), Identity, cause, and mind (expanded ed., pp. 206–233). Oxford: Oxford University Press.
Shoemaker, S. (1998[2003]). Causal and metaphysical necessity. In S. Shoemaker (Ed.), Identity, cause, and mind (expanded ed., pp. 407–426). Oxford: Oxford University Press.
Synthese (2011) 181:227–240 DOI 10.1007/s11229-010-9799-y
Trumping and contrastive causation Christopher Hitchcock
Received: 30 July 2010 / Accepted: 30 July 2010 / Published online: 15 September 2010 © Springer Science+Business Media B.V. 2010
Abstract Jonathan Schaffer introduced a new type of causal structure called 'trumping'. According to Schaffer, trumping is a species of causal preemption. Both Schaffer and I have argued that causation has a contrastive structure. In this paper, I analyze the structure of trumping cases from the perspective of contrastive causation, and argue that the case is much more complex than it first appears. Nonetheless, there is little reason to regard trumping as a species of causal preemption.

Keywords Causation · Contrastive causation · Overdetermination · Preemption · Schaffer (Jonathan) · Trumping

Redundant causation occurs when there are two (or possibly more) distinct events C and D; one or both of these events cause some third event E, either singly or collectively; and either one of them would have caused E had the other been absent. Redundant causation is usually thought to come in two flavors: preemption, where one event (say C) causes E and somehow prevents the other from causing E, and overdetermination, where both events cause E, either singly or collectively.

How do we distinguish between the two types of redundant causation? Here is one suggestion. Had C not occurred, there is a process that would have run from D to E in virtue of which D would have caused E. In cases where C preempts D, this process is not allowed to run to completion: there is some step in the process that would have occurred in C's absence, which does not occur in the actual case where C is present. By contrast, in cases where C and D overdetermine E, this process is allowed to run to completion (and likewise for the process connecting C and E). Schaffer (2000)
C. Hitchcock (B) Division of Humanities and Social Sciences, MC 101-40, California Institute of Technology, Pasadena, CA 91125, USA e-mail:
[email protected]
introduced a type of case called ‘trumping’ that is supposed to refute this suggestion. It is, putatively, a case in which C preempts D, even though the process connecting D to E is allowed to run to completion. Lewis (2000) endorsed this interpretation of trumping. In this paper, I will argue ad hominem against Schaffer that his own theoretical commitments undermine this interpretation of trumping. Specifically, both Schaffer and I have defended contrastivism about causation (Hitchcock 1996b; Schaffer 2005). Contrastivism entails that the moral to be drawn from trumping cases is more drastic: the taxonomy of redundant causation is far more complex than the simple distinction between preemption and overdetermination. Since we have good, independent reasons to accept contrastivism, we should reject the overly simple dichotomy between preemption and overdetermination.
1 Redundant causation

According to a simple counterfactual theory of causation (henceforth called the simple theory), endorsed by no one, C is a cause of E just in case E counterfactually depends upon C. More precisely, if C and E are distinct events that both occurred, then C is a cause of E just in case E would not have occurred had C not occurred. For example, suppose that Suzy throws a rock at a bottle; her aim is true, the rock hits the bottle, and the bottle shatters. There are no other rock-throwers about, and no other threats to the bottle's integrity. If Suzy had not thrown her rock, the bottle would not have shattered. The simple theory straightforwardly rules that Suzy's throw caused the bottle to break.

No one endorses the simple theory because it is known to fail in cases of redundant causation. Suppose that Billy is ready and able to throw his rock at the bottle, in case Suzy wavers in her resolve. Suzy still throws and shatters the bottle, but if she hadn't thrown, the bottle would have shattered anyway, due to Billy's rock. Or suppose that Billy throws just a fraction of a second after Suzy. Once again, the bottle would have shattered even if Suzy hadn't thrown. These are both examples of causal preemption: the first is an example of early preemption, the second late preemption. Suzy's throw caused the bottle to shatter, preempting another event (Billy's standing at the ready in the case of early preemption, Billy's throw in the case of late preemption) that did not cause the bottle to shatter, but which would have done so if Suzy had not thrown.

Had Suzy not thrown, there would have been a chain of events—Billy's standing at the ready, Billy's throw, Billy's rock hurtling through the air, Billy's rock striking the bottle, the bottle's shattering—leading from Billy to the broken glass. In neither of our preemption cases is this chain allowed to run to completion. In the case of early preemption, Billy's throw does not occur; neither does his rock's hurtling through the air nor his rock's striking the bottle; although the bottle does shatter. In the case of late preemption, Billy throws, and his rock hurtles through the air, but Billy's rock does not strike the bottle. In these examples, we will say that Suzy's throw is a preempting cause, and that Billy's standing at the ready (in the case of early preemption) or Billy's throw (in the case of late preemption) are preempted backups.
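In symbols (a standard regimentation, not the paper's own notation), writing O(X) for 'X occurs' and \(\Box\!\to\) for the counterfactual conditional, the simple theory says that for distinct occurrent events C and E:

\[
C \text{ causes } E \;\iff\; \neg O(C) \;\Box\!\to\; \neg O(E).
\]

The preemption cases just described refute the left-to-right direction: Suzy's throw is a cause of the shattering even though the counterfactual dependence fails.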
Now suppose that Billy and Suzy both throw their rocks at the same time; both rocks strike the bottle at the same time, each with sufficient force to shatter the bottle on its own. The bottle shatters. This is a case of overdetermination. What should we say about this case? Let's start with what we shouldn't say. We shouldn't say both that Suzy's throw is a cause and that Billy's isn't, or vice versa. Due to the symmetry between the two throws, whatever we say about one, we should say about the other. One conceivable judgment is that neither throw, taken individually, caused the bottle to shatter; rather, it is the mereological sum of the two throws, or the two throws taken collectively, that causes the bottle to shatter. Schaffer (2003) calls this view 'collectivism'. If collectivism is correct, then the simple theory survives unscathed in this example: the shattering does not counterfactually depend upon either throw taken individually, but it does depend upon their mereological sum. Many authors, however, report having the intuition that each throw, taken individually, is a cause of the shattering. Schaffer (2003) provides the most compelling arguments for this view, which he calls 'individualism'. If individualism is correct, then the simple theory is once again in trouble, for the shattering does not counterfactually depend upon either throw. I will not presuppose an answer to this question, but will speak of the two throws as overdetermining causes (where overdetermining causes need not be causes simpliciter). In cases of overdetermination, each causal chain runs to completion, just as it would have if the other overdetermining cause had not been present. Schaffer (2003) appeals to this fact in arguing that overdetermining causes are causes simpliciter.

Early preemption, late preemption, and overdetermination all have the following common structure: there are two events C and D; one or both of these events cause some third event E, either singly or collectively; and either one of them would have caused E had the other been absent. Any case having this structure is a case of redundant causation, and C and D will be called redundant causes. Note that redundant causes need not be causes simpliciter. If C is a cause of E, but not a redundant cause of E, then we will say that C is a non-redundant cause of E. If C is neither a cause of E nor a redundant cause of E, we will say that C is causally irrelevant to E. The definition of redundancy can be easily extended to cases involving more than two redundant causes. C1, …, Cn are redundant causes of E just in case: (i) either at least one Ci is a cause of E, or some subset of the Ci's make up a collective cause of E; and (ii) each Ci would have caused E if all of the other Cj's had been absent.

Although we will primarily be concerned with cases involving only two redundant causes, cases involving more than two causes have an interesting feature that is worth noting. Suppose, for example, both Billy and Suzy throw, their rocks hitting the bottle at the same time. Cindy, standing at the ready, would have thrown just in case Suzy had not (and her rock would have hit the bottle at the same time that Billy's did). How should we label Suzy's throw, Billy's throw, and Cindy's standing at the ready? The answer is that we cannot classify them categorically as preempting cause, overdetermining cause, or preempted backup.
Rather, we have to classify them relationally: Suzy’s throw and Billy’s throw are related as overdetermining causes of the bottle’s shattering; Suzy’s throw preempts Cindy’s standing at the ready; while Billy’s throw and Cindy’s standing at the ready do not stand in any relation of redundant causation
(since it is false that if Billy had not thrown, Cindy's standing at the ready would have caused the bottle to shatter). As we shall see, once we adopt a contrastive view of causation, this kind of relational structure becomes necessary even when there are only two redundant causes.

2 A note on terminology

I have used the term 'redundant causation' to name the genus, and 'preemption' and 'overdetermination' to name the two species. This terminology is common, but by no means universal. Lewis (1973a) uses 'overdetermination' as the generic term, and distinguishes between 'symmetric' and 'asymmetric overdetermination' (what we have called 'overdetermination' and 'preemption', respectively). Even those who use 'overdetermination' in the narrower sense I have given it sometimes retain the modifier 'symmetric'. The idea seems to be this: In cases of preemption, C is a cause of E, while D is not; hence the asymmetry. In cases of overdetermination, C and D are both causes of E (for the individualist), or both parts of a cause of E (for the collectivist); the situation is symmetric with respect to C and D.
2 A note on terminology I have used the term ‘redundant causation’ to name the genus, and ‘preemption’ and ‘overdetermination’ to name the two species. This terminology is common, but by no means universal. Lewis (1973a) uses ‘overdetermination’ as the generic term, and distinguishes between ‘symmetric’ and ‘asymmetric overdetermination’ (what we have called ‘overdetermination’ and ‘preemption’, respectively). Even those who use ‘overdetermination’ in the narrower sense I have given it sometimes retain the modifier ‘symmetric’. The idea seems to be this: In cases of preemption, C is a cause of E, while D is not; hence the asymmetry. In cases of overdetermination, C and D are both causes of E (for the individualist), or both parts of a cause of E (for the collectivist); the situation is symmetric with respect to C and D. I think that the terms ‘symmetric’ and ‘asymmetric’ are misleading. For example, Lewis (2000) analyzes causation in terms of ‘influence’. Never mind the details; the important feature of this account for present purposes is that influence is something that comes in degrees. Now we might have a situation where C and D are overdetermining causes in my sense—each is a cause of E in its own right, or part of a collective cause of E—even though C exerts much more influence over E than D does.1 In this case, the situation is not symmetric with respect to C and D in all relevant respects. It would, therefore, be misleading to call this a case of symmetric overdetermination. But neither is it a case of asymmetric overdetermination in the sense of Lewis (1973a), for D is not completely preempted by C. Not just any relevant asymmetry between C and D guarantees that we have a case of preemption rather than overdetermination. As we will see shortly, I think that a similar conflation of asymmetry and preemption is at work in Schaffer’s interpretation of trumping. In the meantime, I will eschew talk of symmetric and asymmetric overdetermination so as not to encourage this conflation.
3 The counterfactual signature of redundant causation

There have been numerous attempts to modify the simple theory to accommodate redundant causation within the framework of a counterfactual theory of causation. Among the most promising attempts are Halpern and Pearl (2005), Hall (2007) and Hitchcock (2001, 2007), although it is safe to say that none of these accounts is
1 The details are a little messy. Influence is defined in terms of counterfactual dependence, and as we have seen, cases of overdetermination tend to involve a lack of counterfactual dependence. What is important here is not that we can construct a specific example, but the purely logical point that we can have overdetermination in the presence of some relevant asymmetry.
perfect.2 Although the details of these accounts are different, they rely on essentially the same structural features in their treatments of redundant causation. If C causes E, preempting D, or if C and D are overdetermining causes of E, then there will be a characteristic pattern of counterfactuals:

(1) E does not depend counterfactually on C;

(2) If D had not occurred, then (i) E would still have occurred, and (ii) E would have been counterfactually dependent upon C.

Call this pattern the counterfactual signature of redundant causation.3 We may motivate the claim that redundant causes bear this counterfactual signature in the following way. First, suppose that C is a non-redundant cause of E. This implies that there is no further event, D, that would have caused E in C's absence (nor is there some collective that would have caused E). It seems natural to conclude that E would not occur in C's absence. This assumes that E does not occur spontaneously. It does not assume that no event can ever occur uncaused, or even that E can never occur uncaused, but only that if C is in fact a cause of E, then E would not have occurred uncaused in C's absence.4 Thus if C is a non-redundant cause of E, then E will counterfactually depend upon C. So now suppose that C and D are redundant causes of E. If C had been absent, D would have caused E, hence E would have occurred anyway. It follows that E does not depend counterfactually upon C. But now suppose that D had not occurred.5 By removing D, we make C a non-redundant cause of E. Hence, E would have occurred and would have depended counterfactually upon C.

The counterfactual signature of redundant causation is not sufficient for causation, for C will also carry this signature if it is preempted by D. So a successful counterfactual theory of causation will need a way to determine when the sort of secondary counterfactual dependence described in clause (2) suffices for causation. The accounts of Halpern and Pearl (2005) and Hall (2007) exploit the fact that while preempting and overdetermining causes initiate processes that run to completion, preempted backups (at least in cases of early and late preemption) do not. This is not to say that these accounts define causation in terms of processes running to completion; rather, this is the difference between preempting and overdetermining causes, on the one hand, and preempted backups, on the other, that the accounts happen to be sensitive to.6

2 See, for example, Hall (2007) for a critique of Halpern and Pearl (2005) and Hitchcock (2001), and Hitchcock (2009) for a critique of Hall (2007).

3 If there are more than two redundant causes of E, then the antecedent of the counterfactual in clause 2 requires that none of the redundant causes except for C occur.

4 I suppose that it may be logically possible for C to preempt a spontaneous occurrence of E, but know of no plausible example that actually has this structure.

5 As well as any further redundant causes, besides C.

6 Other broadly counterfactual theories of causation also exploit this difference, e.g., Lewis (1986a), McDermott (1995), Ramachandran (1997), Menzies (1989), and Noordhof (1999).
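The signature can be checked in a toy structural-equations model of early preemption (a sketch in the spirit of the accounts just cited; the function and variable names are invented for the example):

    def bottle_shatters(suzy_throws, billy_ready):
        # Early preemption: Billy throws only if Suzy does not.
        billy_throws = billy_ready and not suzy_throws
        return suzy_throws or billy_throws

    # Clause (1): no counterfactual dependence of the shattering on Suzy.
    assert bottle_shatters(True, True) == bottle_shatters(False, True) == True
    # Clause (2): delete Billy from the model and dependence is restored.
    assert bottle_shatters(True, False) is True
    assert bottle_shatters(False, False) is False

Removing Billy (billy_ready=False) plays the role of supposing that D had not occurred; the shattering then depends counterfactually on Suzy's throw, exactly as clause (2) requires.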
Hitchcock's (2007) account of early preemption, which is parasitic upon Halpern and Pearl's, also hinges on this distinction. By contrast, Hitchcock's (2007) account of late preemption hinges on the fact that the process running from the preempted backup does not run to completion before the effect occurs. Hitchcock's account therefore allows for the possibility of late preemption where the process initiated by the preempted backup does eventually run to completion. In order to avoid repeating a lengthy disjunction, however, I will stipulate that the phrase 'complete process' will exclude this possibility in the remainder of this paper.

4 Trumping

Schaffer (2000) introduced the following sort of case, which he called 'trumping'. Soldiers are trained to obey the order given by the highest ranking soldier that outranks them. Suppose that this is a law of soldier psychology. A major and a sergeant simultaneously give the order to march; the soldiers (privates and corporals) march. It is stipulated that the process initiated by the sergeant's order is not blocked. The soldiers hear the order, they process it, and act in the knowledge that it has been given.7 Schaffer claims that the major's order preempts the sergeant's order; thus we have a case of preemption, even though the process initiated by the preempted backup is allowed to run to completion. Lewis (2000) endorses this interpretation of trumping. If this is correct, then trumping will pose a problem for counterfactual theories of causation that rely on process completion to distinguish between preempting and overdetermining causes, on the one hand, and preempted backups on the other.

What reason do we have for thinking that this is a case of preemption, rather than overdetermination? Schaffer gives two main arguments. The first is that the major's order (but not the sergeant's order) and the soldiers' action are subsumed under the relevant law, and hence the major's order (but not the sergeant's) is a cause of the soldiers' marching. At first blush, this is not very persuasive. The law—that the soldiers obey the order of the highest ranking officer—describes a regularity. But this regularity can equally well be described by saying that the soldiers obey all orders given by higher ranking officers, except in the case of conflict, in which case they obey only the order given by the highest ranking officer. This formulation employs the same basic predicates, so there can be no reason to prefer one formulation over the other on the basis of naturalness. But in the second formulation, both the sergeant's and the major's orders are subsumed under the law.8

Schaffer (2000) responds to a version of this objection. His response has two parts, corresponding to different ways of understanding laws. First, he argues that in the Mill–Ramsey–Lewis account of laws (see, e.g., Lewis 1973b), a law is an axiom or theorem in the systematization of regularities that best optimizes simplicity and strength. He claims that the second formulation of the law adds complexity, with no corresponding gain in strength. But this does not matter: in the Mill–Ramsey–Lewis account, lawhood is preserved under deductive entailment, and since the two formulations are logically equivalent, if the first is an axiom or theorem in the best system, the second will be an axiom or theorem as well.

7 Schaffer gives another example involving magic spells that operate at a spatiotemporal distance, where we needn't worry about this additional stipulation.

8 See McDermott (2002), Halpern and Pearl (2005), and Paul (MS) for similar comments.
The second part of Schaffer's response is that if laws are not analyzed in terms of regularities, for example, if laws are primitive or relations among universals, then it may be the case that one is a law but the other is not. [Schaffer mentions the accounts of Armstrong (1983), Dretske (1977a), and Tooley (1987) in this connection; see also Carroll (1994) and Maudlin (2007, Chap. 1).] There are a number of problems with this response. First, note that it is particularly inappropriate to be appealing to Tooley in this argument, since he denies that singular causation is determined by laws together with other non-causal matters of fact. So on Tooley's account, an appeal to laws won't settle the issue of whether the sergeant's order is a cause (or part of a cause) of the soldiers' action. Second, while I grant that on some of these conceptions of law, it is a logical possibility that 'soldiers obey the order of the highest ranking officer' is a law, while 'soldiers obey all orders, except in the case of conflict, in which case they obey only the order given by the highest ranking officer' is not, there is nothing about the example to suggest that this is the case. More specifically, it seems highly implausible that either is a fundamental law; and if either is a derivative law, then it seems that the other should be as well. Finally, I worry that this response proves too much. If (unlike Tooley) we think that the causal facts supervene on the laws and other non-causal matters of fact, and if we are free to construct hypothetical examples in which we can stipulate what the laws are, independently of any other features of the example, then I suspect we can construct counterexamples to just about any claim about causation. But then we are operating without any constraints whatsoever.

The second reason Schaffer gives for thinking that trumping is a type of preemption is that the soldiers' action is sensitive to the command of the major in a way that it is not sensitive to the command of the sergeant. If the major had given the order to retreat, to fire, or to surrender, the soldiers would have acted accordingly; if the sergeant had given a different order, the soldiers still would have marched. Note that the lack of counterfactual dependence of the soldiers' action on the sergeant's order does not, by itself, give us reason to think that the sergeant's order is a preempted backup, rather than an overdetermining cause. For as we have seen, effects as a rule do not counterfactually depend upon overdetermining causes. So the argument must be that the asymmetry between the major's order and the sergeant's order gives us reason to view the major's order as a preempting cause, and the sergeant's order as a preempted backup. As we have seen in Sect. 2 above, however, the presence of an asymmetry does not, by itself, give us reason to view the case as one of preemption rather than overdetermination. Further investigation is in order here.
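The dependence pattern just described can be exhibited in a toy model of the scenario (an illustrative sketch only; the ranks, dictionary encoding, and command strings are invented):

    RANK = {'major': 2, 'sergeant': 1}

    def soldiers_act(orders):
        # Soldiers follow the command of the highest-ranking officer who
        # actually gives one; if no order is given, they do nothing.
        given = {officer: cmd for officer, cmd in orders.items() if cmd is not None}
        if not given:
            return None
        return given[max(given, key=lambda officer: RANK[officer])]

    # Actual case: both order 'march'; the soldiers march.
    assert soldiers_act({'major': 'march', 'sergeant': 'march'}) == 'march'
    # Vary the major's order and the action tracks it...
    assert soldiers_act({'major': 'fire', 'sergeant': 'march'}) == 'fire'
    # ...vary the sergeant's order and the soldiers still march.
    assert soldiers_act({'major': 'march', 'sergeant': 'fire'}) == 'march'
    # But absent any order from the major, the sergeant's order suffices.
    assert soldiers_act({'major': None, 'sergeant': 'march'}) == 'march'

The last line records the fact that Sect. 7 will exploit: relative to the contrast in which the major gives no order at all, the sergeant's order does make a difference to whether the soldiers march.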
On the other hand, the presence of a complete process connecting the sergeant’s order to the soldiers’ action gives us a prima facie reason to think that the sergeant’s order is a cause (or at least part of a cause) of the soldiers’ action.9 Indeed, Schaffer himself, in arguing for individualism about overdetermination, claims that the existence of a complete process is a ‘quasi-sufficient’ condition for causation (Schaffer 9 Schaffer (personal communication) informs me that he agrees that there is a complete process in a weak sense: all of the relevant events occur. But he thinks that there is a stronger sense in which the process is not complete: the events are not nomically linked, as they would be if the major had not given his order. Since this involves a view about the underlying laws that I have already rejected, I will not consider this distinction further.
2003). What he means by this is that in almost all cases where there is a complete process connecting C to E, C is a cause of E. He claims to know of only four sorts of counterexample, and trumping does not belong to one of these types. Thus, by Schaffer's own lights, the sergeant's order and the soldiers' action stand in a 'causally suggestive relationship' (Schaffer 2003). Moreover, as we have seen, our most promising theoretical treatments of causation rely on completed processes to distinguish preempting causes from preempted backups. There are thus theoretical pressures to reject the possibility of preempted backups with completed processes. So the case for viewing trumping as a species of preemption rather than overdetermination is far from convincing.

5 Contrastive causation

Schaffer and I have both argued that causation is contrastive (Hitchcock 1993, 1995, 1996a,b; Schaffer 2005). Contrastivism about causation can take two different forms. Schaffer and I have both advocated an explicit or relational contrastivism. According to this view, the causal relation is not a binary relation among events with the form 'C caused E'. Rather, it has the form 'C rather than C* caused E' (in Hitchcock 1996b) or 'C rather than C* caused E rather than E*' (in Schaffer 2005), where C* and E* are specific alternatives, or contrasts, to C and E respectively (see also Maslen 2004). An alternative version of contrastivism, suggested, e.g., by Menzies (2004), is implicit, or contextual. The idea is this: While it may not be possible to reductively analyze causation in terms of counterfactuals, causation and counterfactuals are closely related. Counterfactuals are notoriously context-sensitive, so we should expect causation to be context-sensitive in the same way. In particular, the truth value of the counterfactual 'If C hadn't occurred, C* would have occurred in its place' will depend upon context. The same sorts of shifts in context that change the truth value of this counterfactual can also change the truth value of causal claims involving C. Implicit or contextual contrastivism has the advantage that it preserves the surface structure of ordinary causal claims. It has the disadvantage that the context-dependence cannot be read off the surface form of a causal claim.

Since I am challenging Schaffer's interpretation of trumping, I will here adopt his version of explicit or relational contrastivism, which takes causation to be a four-place relation having the form: 'C rather than C* caused E rather than E*'. It is easy enough to translate into other versions of contrastivism. To translate into Hitchcock's ternary relation, simply drop the final relatum. To translate into contextual contrastivism, read 'C rather than C* caused E rather than E*' as 'C caused E' uttered or inscribed in a context which, inter alia, makes true the counterfactual that C* would have occurred in C's absence.

The benefits of contrastivism may be seen by examining an example originally due to Dretske (1977b). Consider the following claim:

1. Susan's stealing the bicycle caused her to be arrested.

Dretske argued that this claim was ambiguous. We can draw out the ambiguity by making explicit the contrastive structure:
2. Susan’s stealing the bicycle, rather than buying it, caused her to be arrested rather than remain free. 3. Susan’s stealing the bicycle, rather than the car, caused her to be arrested rather than remain free. In many circumstances we might judge that 2 is true, while 3 is false. It is easy to modify the simple theory to accommodate contrastivism. Indeed, such a modification might readily be judged to be a clarification of the simple theory. In order to evaluate 1, the simple theory requires that we determine what would have happened if Susan had not stolen the bicycle. But there are many different ways in this event might have failed to come about, and these different ways will not univocally lead to arrest or freedom. The contrastive formulation tells us more explicitly what we are to imagine occurring in place of Susan’s stealing the bicycle.10 Thus, according to the modified simple theory, ‘C rather than C* caused E rather than E*’ is true just in case (i) C and E are distinct events that occurred, (ii) C* and E* are alternatives to C and E, respectively, (iii) and if C* had occurred instead of C, then E* would have occurred instead of E. According to the form of explicit contrastivism assumed here, binary causal claims are not well-formed, and hence do not have truth values.11 Saying ‘C caused E’ is like saying ‘Alice is taller’. In a given conversational context, however, particular alternatives may be salient. In such a case, the claim ‘C caused E’ can be understood as elliptical for a full contrastive claim, where context supplies the missing relata. (Analogously, if we are talking about how tall Sam is, I could utter ‘Alice is taller’ and be understood as asserting that Alice is taller than Sam.) Alternatively, if all of the alternatives to C that are in conversational play would result in alternatives to E, then ‘C caused E’ might be understood as elliptical for a quantified claim such as ‘for all alternatives C* to C, there is some alternative E* to E such that if C* had occurred, E* would have occurred’. I will not embark upon a detailed defense of contrastivism here, but will just briefly note its advantage over a chief rival. According to Lewis (1986a,b) and Yablo (1992), 1 is ambiguous because ‘Susan’s stealing the bicycle’ can refer to different events. These events differ in which properties they bear essentially, and which they bear accidentally. One event is essentially a stealing, and only accidentally involves a bicycle; the other event is only accidentally a stealing, and essentially involves a bicycle. A possible world in which only accidental properties of event C are different is a possible world in which that very event still occurs, whereas a possible world in which no event has all of the essential properties of C is one in which C does not occur. If ‘Susan’s stealing the bicycle’ refers to an event that is essentially a stealing, then the simple theory directs us to worlds such in which Susan does not steal; thus the alternative in which Susan steals a car (and is arrested) is not relevant. If ‘Susan’s stealing the bicycle’ refers to an event that essentially involves a bicycle, then the alternative in which Susan buys the bicycle (and is not arrested) is not relevant. This approach has the disadvantage of multiplying entities beyond necessity. Moreover, it leaves us with 10 In implicit versions of contrastivism, this is settled by context, rather than given explicitly. 
11 In implicit versions of contrastivism, such claims are well-formed, but they do not have context-inde-
pendent truth values.
no guidance as to whether the event that is essentially the stealing of a bicycle is a cause of Susan's arrest.

6 Contrastivism and redundant causation

Just as C rather than C1 can cause E rather than E1 while C rather than C2 does not cause E rather than E2, the status of a redundant cause may depend upon the contrasting alternatives. Thus it may be that C rather than C1 is a non-redundant cause of E rather than E1, C rather than C2 is a preempting cause of E rather than E2, C rather than C3 is an overdetermining cause of E rather than E3, C rather than C4 is a preempted backup of E rather than E4,12 and C rather than C5 is causally irrelevant to E rather than E5. Thus there need be no simple fact of the matter as to whether a case is one of preemption or overdetermination.

But even this does not capture the full complexity of the situation. We saw in Sect. 1 that if there are more than two redundant causes, C, D, and B, then it is possible, e.g., for C to preempt D while C and B overdetermine E. The same thing may happen when we consider different alternatives to D. C rather than C* and D rather than D1 may overdetermine E rather than E*, C rather than C* may preempt D rather than D2 in causing E rather than E*, and C rather than C* may be preempted by D rather than D3 in causing E rather than E*.13 Note that if C rather than C* is a non-redundant cause of E rather than E*, it cannot be any kind of redundant cause; it does not matter which alternative to D we choose. This is because effects depend counterfactually on their non-redundant causes, but not upon their redundant causes. There is a fourth possibility: C rather than C* is causally irrelevant to E rather than E* given D rather than D4. This seems counterintuitive: surely whether C rather than C* is a cause of E rather than E* should not depend upon which alternative to D we consider? In order to see how this is possible, we will need to re-formulate our criterion for redundant causation, and re-examine its counterfactual signature.

Before we went contrastive, we said that C and D are redundant causes of E just in case: (1) either (i) C causes E, (ii) D causes E, (iii) both C and D cause E (for the individualist), or (iv) C and D collectively cause E (for the collectivist); (2) if C had not occurred, then D would have caused E; and (3) if D had not occurred, then C would have caused E. In order to reformulate this in contrastive terms, we need to replace all of the causal claims with suitable contrastive versions, and we need to replace the counterfactuals about the non-occurrence of events with counterfactuals about the occurrence of specific alternatives. Making these substitutions yields: C rather than C* and D rather than D* are redundant causes of E rather than E* just in case:

1. Either (i) C rather than C* causes E rather than E* (given D rather than D*), or

12 However, if we require that preempted causes lack complete processes, then this possibility will not be compatible with the first three, unless different contrasts give rise to distinct processes.

13 However, if we require that preempted causes lack complete processes, then only one of these three possibilities is possible.
(ii) D rather than D* causes E rather than E* (given C rather than C*), or (iii) both (i) and (ii) (for the individualist), or (iv) CD rather than C*D* causes E rather than E*;14 and

2. if C* had occurred, then D rather than D* would have caused E rather than E*; and

3. if D* had occurred, then C rather than C* would have caused E rather than E*.

In addition to making the required substitutions, it has been necessary to add the parenthetical clause to parts 1(i)–1(iii). This is because it is possible for C rather than C* and D rather than D* to be redundant causes of E rather than E*, while C rather than C* is causally irrelevant to E rather than E* given D rather than D†. Here's how that could happen. Suppose all of the clauses above are satisfied for C, C*, D, D*, E and E*. Note that clause 2 entails that if C* had occurred, then E would still have occurred. Thus C rather than C* is not a non-redundant cause of E rather than E*. This remains true, regardless of which alternative to D we consider. Now suppose, moreover, that if D† had occurred instead of D, then E† would have occurred, where E† is incompatible with E. Thus D rather than D† is a non-redundant cause of E rather than E†. Now C, C*, D, D†, E, E* will not satisfy clause 3: if D† had occurred, then E† would have occurred, and hence C could not have caused E, given any contrasts. Since C rather than C* is not a non-redundant cause of E rather than E*, and C rather than C* is not a redundant cause of E rather than E* with D rather than D†, it follows that C rather than C* cannot be any kind of cause of E rather than E* given D rather than D†.

Let us make similar changes to our formulation of the counterfactual signature. If C rather than C* and D rather than D* are redundant causes of E rather than E*, then:

1. If C* had occurred, then (i) E would still have occurred, and (ii) if D* had occurred, then E* would have occurred; and

2. If D* had occurred, then (i) E would still have occurred, and (ii) if C* had occurred, then E* would have occurred.

This can be readily generalized to the case of more than two redundant causes.

14 This seems to be the natural way to formulate the collectivist idea in contrastive terms. In cases of redundant causation, CD rather than CD* and CD rather than C*D are not causes of E rather than E*.
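One way to regiment this contrastive signature (a sketch, not the paper's official formulation; \(\Box\!\to\) is the counterfactual conditional, O(X) reads 'X occurs', and the nesting follows the clause structure above):

\[
O(C^*) \;\Box\!\to\; \Bigl( O(E) \wedge \bigl( O(D^*) \;\Box\!\to\; O(E^*) \bigr) \Bigr),
\]
\[
O(D^*) \;\Box\!\to\; \Bigl( O(E) \wedge \bigl( O(C^*) \;\Box\!\to\; O(E^*) \bigr) \Bigr).
\]

On this reading, the inner counterfactual of each clause is evaluated under the supposition that the other redundant cause has already been replaced by its starred alternative.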
7 Trumping reconsidered

We are now in a position to re-examine Schaffer's trumping scenario from a contrastivist perspective. We will use the revised counterfactual signature to check for redundant causation. We will also initially rely on completed processes to distinguish preempting and overdetermining causes from preempted backups. Of course this is tendentious, since trumping cases are alleged to be cases of preemption in which the preempted backup still initiates a process that runs to completion. But we will see what the strategy of relying on completed processes gives us, and retain the right to re-consider our verdicts later.

In order to simplify the presentation of the results of our revised analysis, let us introduce some notation. Let M1 represent the major's order to march; M0 represents the failure of the major to give an order, and M2, M3, … represent various alternative orders that the major might have given. S0, S1, S2, S3, … represent the corresponding orders issued (or not issued) by the sergeant, and A0, A1, A2, A3, … represent the corresponding actions (or inaction) of the soldiers. Then our contrastive account yields the following verdicts:

1. For all i > 1, M1 rather than Mi is a non-redundant cause of A1 rather than Ai.

2. For all i ≠ 1, j ≠ 1, k > 1, S1 rather than Si is causally irrelevant to A1 rather than Aj, given M1 rather than Mk.

3. For all i ≠ 1, M1 rather than M0 and S1 rather than Si overdetermine A1 rather than Ai.
1 tells us, for example, that the major's giving the order to march rather than the order to fire is a non-redundant cause of the soldiers' marching rather than firing. It is not a redundant cause because if the major had instead given the order to fire, there is no other event that would have caused the soldiers to march (relative to any contrast). Thus the relation of the major's ordering the soldiers to march rather than fire to the soldiers marching rather than firing is not one of either preemption or overdetermination. Thus, whatever we might decide about the status of redundant causes with complete processes, trumping is not purely an example of either preemption or overdetermination.

2 tells us, for example, that the sergeant's giving the order to march rather than giving no order is rendered irrelevant to the soldiers' action by the major's giving the order to march rather than fire. This is not to say that the sergeant's giving the order to march rather than no order is preempted by the major's giving the order to march rather than fire; in order for that to be the case, the sergeant's order to march rather than no order would have to cause the soldiers to march (rather than, say, to perform no action) in the event that the major instead gives the order to fire. But that would not have happened: the troops would have fired, rather than marched.

3 tells us, for example, that the major's giving the order to march rather than giving no order and the sergeant's giving the order to march rather than giving no order overdetermine the soldiers' marching rather than performing no action. The judgment that they are overdetermining causes, rather than one preempting the other, stems from the fact that both processes run to completion.

So, should we withdraw our reliance upon completed processes, and replace the judgment of overdetermination with one of preemption? The first thing to note is that
even if we do so, this will affect only judgment 3. Judgments 1 and 2 do not involve redundant causation at all, hence there can be no question of whether they involve preemption or overdetermination. Second, the primary incentive for judging this to be a case of preemption was the asymmetry between the way the soldiers' action depends upon the orders of the major and the sergeant. But judgments 1–3 clearly indicate an asymmetry, even without the need to posit preemption in 3. There are two main asymmetries here. First, for i > 1, M1 rather than Mi is a non-redundant cause of A1 rather than Ai, while S1 rather than Si is either a redundant cause of A1 rather than Ai, or else causally irrelevant to A1 rather than Ai (depending upon our choice of contrast for M1). Second, M1 rather than M0 is an overdetermining cause of A1 (relative to some contrast) for any choice of alternative to S1, while S1 rather than S0 only gets to be an overdetermining cause of A1 (relative to some contrast) for one specific alternative to M1, namely M0; for all other alternatives to M1, S1 rather than S0 is causally irrelevant to A1. These asymmetries nicely capture the different causal roles of the major's and sergeant's orders, without positing preemption. Given the theoretical utility of relying on completed processes to discriminate between overdetermining causes and preempting causes, on the one hand, and preempted backups, on the other, it seems preferable to retain the judgment of overdetermination in judgment 3.

Acknowledgments For comments, criticism, and discussion, I would like to thank Eric Hiddleston, Laurie Paul, Jonathan Schaffer, Jim Woodward and Jiji Zhang.
References

Armstrong, D. (1983). What is a law of nature?. Cambridge: Cambridge University Press.
Carroll, J. (1994). Laws of nature. Cambridge: Cambridge University Press.
Dretske, F. (1977a). Laws of nature. Philosophy of Science, 44, 248–268.
Dretske, F. (1977b). Referring to events. Midwest Studies in Philosophy, 2, 90–99.
Hall, N. (2007). Structural equations and causation. Philosophical Studies, 132, 109–136.
Halpern, J., & Pearl, J. (2005). Causes and explanations: A structural-model approach. Part I: Causes. British Journal for the Philosophy of Science, 56, 843–887.
Hitchcock, C. (1993). A generalized probabilistic theory of causal relevance. Synthese, 97, 335–364.
Hitchcock, C. (1995). The mishap at Reichenbach fall: Singular vs. general causation. Philosophical Studies, 78, 257–291.
Hitchcock, C. (1996a). The role of contrast in causal and explanatory claims. Synthese, 107, 395–419.
Hitchcock, C. (1996b). Farewell to binary causation. Canadian Journal of Philosophy, 26, 267–282.
Hitchcock, C. (2001). The intransitivity of causation revealed in equations and graphs. Journal of Philosophy, 98(6), 273–299.
Hitchcock, C. (2007). Prevention, preemption, and the principle of sufficient reason. Philosophical Review, 116, 495–532.
Hitchcock, C. (2009). Structural equations and causation: Six counterexamples. Philosophical Studies, 144, 391–401.
Lewis, D. (1973a). Causation. Journal of Philosophy, 70, 556–567. (Reprinted in Philosophical Papers, Vol. II, pp. 159–172, by D. Lewis, 1986, Oxford: Oxford University Press.)
Lewis, D. (1973b). Counterfactuals. Cambridge, MA: Harvard University Press.
Lewis, D. (1986a). Events. In Philosophical papers (Vol. II, pp. 241–269). Oxford: Oxford University Press.
Lewis, D. (1986b). Postscripts to 'Causation'. In Philosophical papers (Vol. II, pp. 172–213). Oxford: Oxford University Press.
Lewis, D. (2000). Causation as influence. Journal of Philosophy, 97, 182–197. (Expanded version in Causation and counterfactuals, pp. 75–106, by J. Collins, N. Hall, & L. Paul, Eds., 2004, Cambridge, MA: The MIT Press.)
Maslen, C. (2004). Causes, contrasts, and the nontransitivity of causation. In J. Collins, N. Hall, & L. Paul (Eds.), Causation and counterfactuals (pp. 341–358). Cambridge, MA: The MIT Press.
Maudlin, T. (2007). The metaphysics within physics. Oxford: Oxford University Press.
McDermott, M. (1995). Redundant causation. British Journal for the Philosophy of Science, 46, 523–544.
McDermott, M. (2002). Causation: Influence vs. sufficiency. Journal of Philosophy, 99, 84–101.
Menzies, P. (1989). Probabilistic causation and causal processes: A critique of Lewis. Philosophy of Science, 56, 642–663.
Menzies, P. (2004). Difference-making in context. In J. Collins, N. Hall, & L. Paul (Eds.), Causation and counterfactuals (pp. 139–180). Cambridge, MA: The MIT Press.
Noordhof, P. (1999). Probabilistic causation, preemption, and counterfactuals. Mind, 108, 95–125.
Paul, L. (MS). Understanding trumping.
Ramachandran, M. (1997). A counterfactual analysis of causation. Mind, 106, 263–277.
Schaffer, J. (2000). Trumping preemption. Journal of Philosophy, 97, 165–181. (Reprinted in Causation and counterfactuals, pp. 59–73, by J. Collins, N. Hall, & L. Paul, Eds., 2004, Cambridge, MA: The MIT Press.)
Schaffer, J. (2003). Overdetermining causes. Philosophical Studies, 114, 23–45.
Schaffer, J. (2005). Contrastive causation. Philosophical Review, 114, 327–358.
Tooley, M. (1987). Causation: A realist approach. Oxford: Oxford University Press.
Yablo, S. (1992). Cause and essence. Synthese, 93, 403–449.
Synthese (2011) 181:241–253 DOI 10.1007/s11229-010-9800-9
Two kinds of a priori infallibility

Glen Hoffmann
Received: 3 December 2009 / Accepted: 26 July 2010 / Published online: 4 September 2010 © Springer Science+Business Media B.V. 2010
Abstract On rationalist infallibilism, a wide range of both (i) analytic and (ii) synthetic a priori propositions can be infallibly justified (or absolutely warranted), i.e., justified to a degree that entails their truth and precludes their falsity. Though rationalist infallibilism is indisputably running its course, adherence to at least one of the two species of infallible a priori justification refuses to disappear from mainstream epistemology. Among others, Putnam (1978) still professes the a priori infallibility of some category (i) propositions, while Burge (1986, 1988, 1996) and Lewis (1996) have recently affirmed the a priori infallibility of some category (ii) propositions. In this paper, I take aim at rationalist infallibilism by calling into question the a priori infallibility of both analytic and synthetic propositions. The upshot will be twofold: first, rationalist infallibilism unsurprisingly emerges as a defective epistemological doctrine, and second, more importantly, the case for the a priori infallibility of one or both categories of propositions turns out to lack cogency.

Keywords Rationalism · Infallibilism · Analyticity · Syntheticity · A Priori · Defeasibility · Self knowledge · Descartes · Burge

1 Introduction: rationalist infallibilism

On rationalist infallibilism, a wide range of both (i) analytic and (ii) synthetic a priori propositions can be infallibly or absolutely justified, i.e., justified to a degree that entails their truth and precludes their falsity.1 In particular, on this doctrine, at least

1 The second clause of this definition (viz. the exclusion of falsity) is intended to avoid explicit commitment to the law of non-contradiction, according to which for all statements X, it is not the case that X & ∼X.
G. Hoffmann (B)
Department of Philosophy, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada
e-mail: [email protected]
two main classes of a priori propositions are susceptible of infallible justification: (i) logical, conceptual and mathematical propositions, and (ii) so-called self justifying propositions. Though rationalist infallibilism is undoubtedly running its course, adherence to at least one of the two species of infallible a priori justification refuses to disappear from mainstream epistemology. Among others, Putnam (1978) still professes the a priori infallibility of some category (i) propositions,2 while Burge (1986, 1988, 1996) and Lewis (1996) have recently affirmed the a priori infallibility of some category (ii) propositions.

In this paper, I take aim at rationalist infallibilism by disputing the a priori infallibility of both analytic and synthetic propositions. There will be two main outcomes of our inquiry. First, rationalist infallibilism predictably emerges as a fundamentally defective epistemological doctrine. Second, more importantly, the case for the a priori infallibility of one or both categories of propositions (erected by rationalists or empiricists) turns out to lack cogency.

2 Analytic propositions

The rationalist quest for infallible certitude often begins and sometimes ends with analytic propositions. There is a long tradition in philosophy, mathematics, the sciences and other disciplines of proclaiming the a priori infallibility of putative logical, conceptual and/or mathematical truths.3 While these are not mutually exclusive categories, the following kinds of examples have been canvassed:

(Logical Truth) Millie is either in the study or not in the study (M ∨ ∼M),
(Conceptual Truth) Jerry cannot both be a bachelor and married,
(Conceptual Truth) Zoran cannot be in Moscow and London simultaneously,
(Conceptual Truth) an object cannot be both blue and green all over at the same time, and
(Mathematical Truth) 2 + 3 = 5.
In philosophy the infallibility thesis concerning analytic propositions is primarily associated with rationalists such as Descartes (1996) and Frege (1967, 1974), but has also been endorsed by empiricists such as Hume (1992, 1993), Putnam (1978), Ayer (1936, 1940, 1956), Carnap (1935, 1950) and some (other) proponents of logical positivism. Whether rationalist or empiricist, the case for the infallibility of logical, conceptual and mathematical propositions is essentially modal in character. At least some propositions in these domains, it is urged, are logically necessary (and thus infallibly justified) truths: they are true on all truth value assignments and false on no truth value assignment. Any straightforward employment of deductive reasoning, the claim runs, yields the transparency of this fact.

2 Of course, Putnam merely argues that there is at least one infallible a priori truth, what he calls 'the minimal principle of contradiction', leaving open whether there are others (1978, p. 155). Moreover, he claims that this a priori truth is infallibly justifiable in the sense that it is rationally impossible to disbelieve it (1978, p. 155ff), a kind of infallibility that may or may not be covered by our definition.
3 See Boghossian (1994, p. 117ff) for a brief discussion of this history.
On one standard formulation of the modal argument for analytic infallibility it is urged that some logical, conceptual and mathematical propositions have rigid meanings, meanings completely specifiable on the basis of syntactic and semantic principles.4 A superficial inspection of the syntax and semantics of sentences expressing such propositions, the reasoning runs, reveals that they are true by meaning, and consequently, are necessary truths. For example, the mode of composition and the meaning of the lexical components of 'Jerry cannot both be a bachelor and married' are sufficient to yield the necessary truth, and corresponding infallibility, of the proposition expressed by this sentence.

On another (related) variant of the modal argument it is urged that some logical, conceptual and mathematical propositions have a special property concomitant with their rigid meanings: they make formally specifiable assertions, assertions whose meanings are entirely spelled out by syntactic and semantic principles (cf. above).5 Since these propositions make formally specifiable assertions, the reasoning runs, they are susceptible of irrefutable proof by way of deductive logic (i.e., the universal laws of classical truth functional logic), a proof that cannot be overturned ex post facto. For example, a four row truth table or the application of two rules of natural deduction supplies irrevocable proof of any assertion of the form X ∨ ∼X, by establishing its truth on any possible truth value assignment.
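The deductive check just described can be displayed explicitly; the following LaTeX table is an editorial illustration, not part of the original text. Compressed over the single atom X, two truth value assignments exhaust the cases:

```latex
\[
\begin{array}{c|c|c}
X & \sim X & X \vee \sim X \\
\hline
\mathrm{T} & \mathrm{F} & \mathrm{T} \\
\mathrm{F} & \mathrm{T} & \mathrm{T}
\end{array}
\]
```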
4 This formulation of the argument is typically advanced in support of the analytic infallibility of so-called conceptual truths. It has been made in some form or another by Kant (2007), Ayer (1936, 1940) and Carnap (1935), among others.
5 This formulation of the argument is typically advanced in support of the analytic infallibility of so-called logical and mathematical truths.
6 By 'minimally acceptable' I have in mind a semantic framework in which, like all standard proposals, logical and non-logical terms have a determinate and consistent reference across all possible worlds. This would rule out bizarre Goodmanian-style logics in which '+' might refer to the addition function except in cases where it follows '2', where it will refer to the subtraction function, or in which 'bachelor' refers to an adult unmarried male except in cases where that male is a cultural relativist.
In spite of its appeal at various periods in history, there is a seemingly decisive rebuttal to the modal argument for the infallibility of analytic propositions. Doubtless, it is acknowledged, some logical, conceptual and mathematical propositions are necessary truths and correspondingly conclusively justified within a specific logical/semantic framework. Presupposing classical truth functional logic, Millie is either in the study or not in the study is a necessary, conclusively justified proposition. Presupposing a minimally acceptable semantic framework for logical and non-logical terms,6 2 + 3 = 5, one cannot be a married bachelor, and an object cannot be blue and red all over at the same time are necessary, conclusively justified propositions. The problem, evidently, is that propositions of this kind are not infallibly justifiable (or susceptible of absolute warrant), justifiable to a degree that is truth entailing and falsity precluding. Insofar as logical/semantic frameworks require some kind of confirmation in their own right, reason cannot infallibly justify elementary logical, conceptual and mathematical propositions (or apparently any other analytic proposition). In particular, deductive reasoning falls short in this case since inevitably it fails to establish the truth of the analytic propositions in question on any possible logic, semantics or interpretative standpoint.7 Deductive reasoning by its very nature cannot deliver analytic infallibility.8

This well rehearsed line of argument enshrines a view of reason that on the surface looks compelling. In the long standing debate concerning epistemological infallibilism, stretching approximately from Plato to the present, an appealing principle has gained widespread recognition: reason cannot furnish absolute warrant for any analytic proposition. If reason is not an autonomous vehicle of justification, it cannot infallibly justify analytic propositions. But it would seem that reason is not sui generis in this sense. The veracity of reason is not something that can be established ex hypothesi: whether or not reason is being exercised correctly is seemingly insensitive to data. Many, including Klein (2003, pp. 40–42) and Davidson (1986), consider this to be a fundamental defect of foundationalism in general and rationalist foundationalism in particular (infallibilist or otherwise): since reason is not sui generis, it cannot deliver direct unmediated warrant that is transferable to propositions, statements or beliefs.

Leaving to one side the anti-foundationalist invective, the force of the fallibilist view of reason seems to spring, at bottom, from a falsifiability thesis: any analytic proposition, the truth of which is purportedly established by reason, is susceptible to falsification. Reflection on familiar skeptical hypotheses seems to reinforce the conviction that there is at least one defeasor available for every analytic proposition. Consider some of the radical otherworldly thought experiments concocted by skeptics to the effect that all of our previous beliefs could turn out to be wrong: e.g., Descartes' evil demon hypothesis, Putnam's brain-in-a-vat hypothesis and Russell's false memory of past experience hypothesis. It is surely conceded by all manner of skeptic, including the radical skeptic, that reason ab initio can deliver the firmest of convictions about the truth of a wide range of analytic propositions. But as the skeptics would have it (at least provisionally), reason cannot justify, certainly not infallibly, belief in such propositions, since it cannot be excluded that insidious deception is occurring. It is a venerable skeptical insight that there is at least one defeasor available for every analytic proposition in the form of an otherworldly skeptical hypothesis.

The hypotheses canvassed by radical skeptics underscore the sense in which the exercise of reason seems to require presupposing its veracity. Evidently, one can never be sure that reason has not misfired, since the possibility cannot be completely ruled out that one is mistaken about the veracity of reason itself (e.g., when one is deceived by an omnipotent malevolent force). The exercise of reason cannot conclusively establish the falsity of the skeptical hypotheses since by virtually all accounts the veracity of reason itself depends on their falsity.9
7 This line of argument has a close kinship with the Quinean repudiation of analyticity (1953), since it impugns analytic infallibility on the basis of the logical defeasibility of all analytic propositions. Needless to say, though, it does not directly confront the viability of the analytic/synthetic distinction.
8 Although strictly speaking it has not been ruled out that there might be some other way to deliver analytic infallibility.
9 A great deal of the philosophical import of the radical skeptic's assault on justified belief and knowledge is that it shines light on the most pernicious form of circularity plaguing the rationalist's case for a priori analytic infallibility.
If there is a way to escape the fallibilist/skeptical problematic, it must involve rejecting some fundamental modal intuitions. The champion of analytic infallibility is essentially required to reject the possible falsity of propositions such as Millie is either in the study or not in the study, Jerry cannot be a married bachelor, 2 + 3 = 5, and the corresponding possibility of an evil demon deceiving us about the truth of these propositions. Such a strategy is likely to involve recruiting a broadly pragmatic semantics, found in various forms in the work of Wittgenstein (1969, p. 10ff), Strawson (1957), Putnam (1978) and others.10 On this basic approach, the putative falsity of the analytic propositions in question would be viewed as a variety of semantic error.11 To allow the falsity of 2 + 3 = 5, Millie is either in the study or not in the study, and the like (or to allow that an evil demon could be deceiving us about these matters), it might be claimed, is merely to misuse the logical and/or non-logical terms contained in such propositions; it involves misconstruing the meaning of one or more of the relevant terms.

Whatever the merits of pragmatic semantics, a concern is that the attempt to dispel deep seated modal intuitions on its basis looks prima facie unpromising, insofar as these intuitions do not look to be directly informed by theory (semantic or otherwise). At first blush, semantic theories are supported by intuitive modal reflection: i.e., reflection about what is conceivable, acceptable or possible. The standard pragmatic defense of a conception of logical/metaphysical possibility on which the analytic propositions in question cannot be false (and on which radical skeptical propositions are logically/metaphysically impossible), on this score, looks to reverse the natural order of explanatory priority. If this is correct, the analytic infallibilist encounters a stiff challenge in deposing ingrained intuitions concerning the logical possibility of the falsity of analytic propositions.

But even were our opponent to meet the heavy burden of subduing staunch presentiment concerning the possibility of analytic falsehood along pragmatic lines, it is unclear what she will have accomplished. Suppose it turns out that the apparent possibility of analytic falsehood involves some kind of semantic mistake, a misuse of the logical and/or non-logical terms contained in analytic propositions or a misconstrual of their meanings. In this case, the pragmatist can be understood to have established the rational impossibility, incoherence or inconsistency of disbelieving propositions such as Millie is either in the study or not in the study, Jerry cannot be a married bachelor, 2 + 3 = 5, and the like. Crucially, though, she would not have established the logical impossibility of analytic falsehood or the logical possibility of analytic infallibility, in the way these concepts are being used. The rational impossibility of disbelieving an (analytic) proposition is equivalent to the logical impossibility of an (analytic) proposition's falsity only on the condition that logical possibility is a species of epistemic possibility, doxastic possibility, psychological possibility or some close cognate. Now while the latter thesis cannot be dismissed outright, and may in fact be
10 Although none of these philosophers, except perhaps Putnam, can be described as defending a kind of analytic infallibility, some pragmatic semantic maneuver seems to be the only riposte available to the analytic infallibilist in this case.
11 Thanks to Danny Goldstick for discussion of this point.
thought in some way presumptive of pragmatic semantics, this would not threaten the doctrine of analytic fallibilism. The bottom line is that granting the pragmatist thesis that the apparent possibility of analytic falsity is a kind of semantic mistake, and the consequent logical possibility ↔ epistemic possibility thesis, in actuality involves recasting the analytic infallibility/fallibility debate in a way that leaves analytic fallibilism unscathed.

Exploiting pragmatic semantics of this type effectively involves redefining the alethic concepts of truth and falsity in anti-realist terms. The problem is that such revisionary semantics, whatever its credentials, undoubtedly forecloses on the possibility of the kind of analytic infallibility under consideration, an infallibility that involves certitude of the highest measure (viz. incorrigible certitude). Wittgenstein, for his part, is happy to acknowledge this point. The brand of pragmatics sketched in On Certainty, Wittgenstein concedes, rules out the possibility of certainty as the concept is ordinarily understood, i.e., in the realist terms of indefeasibility, a concept that connotes "… I can't be wrong" (1969, p. 7; Wittgenstein's emphasis).

In short, contra analytic infallibilism, reason cannot infallibly justify analytic propositions. At a minimum, the compelling case for the fallibility of reason, made inter alia by anti-foundationalists and skeptics, effectively raises the specter of a priori uncertainty vis-à-vis logical, conceptual and mathematical propositions (and apparently all other analytic propositions), no matter how faint the specter. Naturally, this does not preclude the a priori justification of logical, conceptual, mathematical or any other analytic proposition tout court, nor that these kinds of propositions might be susceptible of a high degree of a priori justification.

3 Synthetic propositions

The other primary candidates for a priori infallibility include synthetic propositions about the external world, material objects and the self. For example, the former two proposals for synthetic a priori infallibility have been canvassed inter alia by Descartes (1996, p. 117ff), C.I. Lewis (1946), Price (1953) and Unger (1975). There is a near consensus, though, that these kinds of proposals for synthetic a priori infallibility, like all proposals vis-à-vis the 'external world', have not survived serious scrutiny. At any rate, whatever the merits of the case for synthetic a priori infallibility of this kind, discussion here will be focused on a subset of synthetic propositions, so-called self justifying propositions, whose a priori infallibility still seems to be a live issue.12

Traditionally, self justifying propositions have been at the heart of the rationalist infallibilist program. For the rationalist, self justifying propositions are those whose sincere assertion is supposed to be a priori sufficient to establish their truth. More specifically, an a priori self justifying proposition is one whose sincere assertion about an extant first person cognitive state purportedly establishes its truth. Descartes' proposals

12 Another proposal for synthetic a priori infallibility is the putative ontologically necessary truth to the effect that 'something exists'. I leave open that some such synthetic proposition is an infallibly justified a priori truth. In the event that it is, I minimally revise my view to be that no synthetic a priori proposition of substantive cognitive import can be infallibly justified. Thanks to Henry Jackman and David Hunter for discussion of this proposal.
in the Meditations on First Philosophy (1996, pp. 80ff) are the locus classicus for infallible a priori self justification:

(Cogito) I think therefore I exist as a thinking thing

and along the same lines

(Dubito) I doubt therefore I exist as a doubting thing.

For Descartes, (Cogito) and (Dubito) are self justifying propositions since they can be thought only if they are understood (their esse is their percipi), i.e., they are objects of thought with which one has direct unmediated comprehension (1996, pp. 80–81). Independent of whether Descartes' archetypal a priori propositions are in fact self justifying in any interesting normative sense is the question of whether the propositions are susceptible of infallible self justification, i.e., whether their sincere assertion entails their truth and precludes their falsity. Sustained reflection suggests (Cogito), (Dubito) and analogous proposals for a priori self justification fail on this grade.

The focal point of the denial of infallible a priori self justification is a compelling rationale recommended in some form or another by such diverse philosophers as Sellars (1997), Hume (1992, 1993), Ayer (1936, 1956), Chisholm (1977) and Klein (1990, 1999, 2003). It is commonly professed nowadays that mental states cannot be vehicles of infallible justification since they cannot ex nihilo guarantee the truth of any fact. Mental states, like third person cognitive states (or observations), cannot be guarantors of truth since any assertion about such a state is of necessity at a remove from it: it involves saying something about the state. In asserting something about a mental state (immediate or mediated), it is claimed, the possibility for error inevitably surfaces. As Ayer puts the point (1956, p. 19): "there will not be a formal contradiction in saying both that a man's state of mind is such that he is absolutely sure that a given statement is true, and that the statement is false".13 On this view it follows that for any factive mental state M there is a possible world W1 in which the fact M is directed towards is false, even when the fact M is directed towards is its own existence.

The fallibilist view of a priori self justification represents, in no small way, a variation of an influential assault on formal foundationalism Sosa has called the doxastic assent argument, according to which no cognitive state can be both (i) infallibly self justifying and (ii) bear content. Sosa glosses the argument as follows (1980, p. 6):
a. (i) If a mental state incorporates a propositional attitude, then it does not give us direct contact with reality, e.g., with pure experience, unfiltered by concepts or beliefs.
   (ii) If a mental state does not give us direct contact with reality, then it provides no guarantee against error.
   (iii) If a mental state provides no guarantee against error, then it cannot serve as a foundation for knowledge.
   (iv) Therefore, if a mental state incorporates a propositional attitude, then it cannot serve as a foundation for knowledge.
13 Hume in effect supplies what is perhaps the most fundamental ground for this claim (1992, pp. 86–87): "there is no object, which implies the existence of any other if we consider these objects in themselves, and never look beyond the idea which we form of them".
b. (i) If a mental state does not incorporate a propositional attitude, then it is an enigma how such a state can provide support for any hypothesis, raising its credibility selectively by contrast with its alternatives.
   (ii) If a mental state has no propositional content and cannot provide logical support for any hypothesis, then it cannot serve as a foundation for knowledge.
   (iii) Therefore, if a mental state does not incorporate a propositional attitude, then it cannot serve as a foundation for knowledge.
c. Every mental state either does or does not incorporate a propositional attitude.
d. Therefore, no mental state can serve as a foundation for knowledge. (From a(iv), b(iii), and c.)
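For readers who want the logical skeleton of the argument, here is a schematic propositional rendering in LaTeX (an editorial gloss, not Sosa's own notation; P abbreviates "the state incorporates a propositional attitude" and F abbreviates "the state can serve as a foundation for knowledge"):

```latex
\begin{align*}
\text{(a)}\quad & P \rightarrow \lnot F
  && \text{from a(i)--a(iii), by hypothetical syllogism} \\
\text{(b)}\quad & \lnot P \rightarrow \lnot F
  && \text{from b(i)--b(ii)} \\
\text{(c)}\quad & P \lor \lnot P \\
\text{(d)}\quad & \lnot F
  && \text{by cases, from (a), (b) and (c)}
\end{align*}
```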
Whatever the case may be regarding the merits of the prevailing anti-foundationalist argument schema, an important lesson can be drawn from it. Whether or not brute cognitive states, first person or otherwise, can be viewed as potential foundational justifiers, the possibility of infallible first person justification looks spurious on the face of it. First person cognitive states cannot be infallible justifiers, it would seem, since any assertion about them necessarily engenders the possibility of faulty inference. It follows mutatis mutandis that no mental state can be infallibly self justifying. Considering (Cogito) as a case in point, when it is sincerely asserted I think therefore…, the truth of whatever claim follows the ellipsis is never entailed by what precedes it, since there is always the possibility that a faulty inference is made. There is a conspicuous epistemological gulf between 'I think' and 'I exist as a thinking thing', since in moving from the former assertion to the latter a judgment is made about the proto mental state thought, namely, that there is an I or subject bearing it.

At least this is Hume's view of the matter (1992, 1993): (Cogito) cannot be an infallible (or fallible) self justifying proposition since from an epistemological standpoint there is a normative distinction between asserting the existence of thought and asserting the existence of a thinking thing. For Descartes, on the other hand, Cogito-like propositions are instances of direct unmediated ratiocination that give us irrevocable acquaintance with the reality of the subject. Lewis (1996, pp. 564ff) similarly construes Cogito-like propositions as pure rational intuitions, minus the Cartesian metaphysics, intuitions that furnish unmediated access to the reality of the subject. For Lewis these intuitions give rise to a specific kind of infallibility regarding subjective reality, an infallibility restricted to the specious present of the subject (what she experiences in the here and now) and that eludes her upon any kind of reflection or second order contemplation (1996, pp. 559–561).14

If I am correct, neither Descartes' nor Lewis' defense of infallible self justification withstands serious scrutiny. Securing warrant for self reflective judgments about one's own thoughts necessarily involves reflecting on the basis of such judgments—judgments about the content of the proto thought. And when one reflects on the warrant

14 This is why for Lewis knowledge about the self, though infallible in some contexts, is intrinsically 'elusive'.
for a judgment regarding one’s thoughts, the thought one is reflecting on is distinct from or independent of the reflecting thought.15 Since the numerical independence of these thoughts manifestly implies their epistemological independence, Cogito-like propositions cannot be infallibly self verifying either contextually (e.g., in a way that is restricted to the specious present) or unrestrictedly.16 This line of thinking is more or less a direct product of the Humean position concerning purported self verifying judgments such as (Cogito) and (Dubito). (Cogito) and (Dubito) cannot be infallibly self verifying judgments since there is an epistemological gap between the existence of thought/doubt and the existence of a thinking/doubting thing. (Cogito) and (Dubito) must, contra Descartes and Lewis, be (inchoate) inferences since they are falsifiable by at least one proposition the truth of which cannot be excluded in principle: ∼ (thoughts and doubts belong to subjects). One can rightly claim here that this is exactly the line of reasoning Descartes and Lewis reject in principle. For Descartes and Lewis, it is possible to have direct unreflective intuitions about oneself, intuitions that do not admit of or require reflection. This is the professed basis of infallible judgments about the self. The problem with this outlook, in my view, is that judgment is an essentially normative notion that implies the possibility of being right or wrong. On this way of thinking, judgments are inherently reflective or discursive in nature. But self justifying propositions such as (Cogito) and (Dubito) by all accounts are rational intuitions and per force a species of judgment. The implication, if this is correct, is that Cogito-like propositions are subject to the same discursive requirement that all judgments are, the in principle commitment to supply reasons for one’s judgment—for judging that p.
4 Synthetic infallible self-justification

While the fallibilist view of self justification is becoming increasingly entrenched, it is not without its detractors (e.g., Burge 1986, 1988, 1996; Parent 2007; Lewis 1996).17 Most notably, Burge has recently defended a species of infallible self justification (and knowledge) involving privileged access that has become definitive of the rationalist infallibilist stance concerning self justification.18

15 Lewis in a sense agrees with this latter claim since he concedes that once one reflects on one's first order thought, judgments concerning it are stripped of their infallibility (1996, pp. 559–560). For Lewis, though, since one can have direct unmediated intuitions about oneself, one can have an infallible Cogito-like self judgment about one's own thought without reflecting on the first order thought. Of course Lewis's view of unmediated rational intuition incurs significant theoretical commitments, including a contextualist/variantist view of epistemological justification and the elusive character of warrant for certain kinds of rational judgments, commitments many find suspect.
16 Macdonald (2007, p. 369) defends a similar position regarding the nature and limits of self verifying judgment.
17 Note, Burge and Lewis's defenses of infallible self justification (and knowledge) are part of their ambitious project to establish the compatibility of a specific kind of externalism with certain kinds of self justification (and self knowledge), a project that has received considerable attention in the literature on self knowledge in the last few decades (see Bealer (1999), Parent (2007) and Macdonald (2007)). In the discussion to follow, I suppress debate concerning the possibility of self justification or knowledge as such (i.e., fallible self justification and knowledge) and concerning the compatibility of such knowledge with externalist accounts of intentional content.
On Burge's account, privileged access to some first order thoughts about the self is borne out by the reflexive or self referential character of a specific class of second order thought (i.e., thought about one's thoughts). First order thoughts, it is claimed, are logically locked onto their second order counterparts: i.e., they are a proper part of the cognition that constitutes the second order thought. As Burge puts the point (1988, pp. 659–660),

In basic self knowledge, one simultaneously thinks through a first-order thought (that water is liquid) and thinks about it as one's own. The content of the first-order (contained) thought is fixed by nonindividualistic background conditions. And by its reflexive, self-referential character, the content of the second-order judgment is logically locked (self-referentially) onto the first-order content which it both contains and takes as its subject matter.

On this view, if a subject has a second order thought that (for instance) she thinks P, then she has the first order thought P. This implies, along the lines of (Cogito), that if I have the second order thought 'I am now thinking', the first order thought it contains as a constituent must be true and cannot be false—i.e., it is infallibly justified.19

Burge's infallibilism about self justification is predicated on his view of the self referential character of second order thought. The primary thesis in this connection concerns the logical locking of some second order thoughts about the self onto their first order constituents: a subject thinking she thinks P (e.g., I am now thinking that water is a liquid) formally entails she thinks P (Burge 1996, pp. 95ff). Needless to say, generally speaking, a subject thinking P does not formally entail P. The question, then, is whether self referential thought is an exception to this rule, whether a subject thinking she thinks P formally entails she thinks P, and if so, why.

In my view, Burge's thesis concerning the self referential character of certain kinds of second order thought, whatever its merits, is not a thesis of pure reason that can engender infallible a priori self justification. If some second order thoughts infallibly self refer to their first order constituents, this ostensibly is an empirical fact—one whose infallibility is purchased a posteriori by reflecting on the evidence. After all, the logical locking thesis is certainly not a logically necessary truth: as a matter of pure logic, there can be no guarantee that any second order thought about the self logically locks on to its first order constituent. As Hume effectively showed, there is at least one defeasor vis-à-vis the logical locking thesis, i.e., the negation of Descartes' implicit (Cogito) inference: ∼(thoughts → subjects).20 If thoughts don't require subjects, then second order thoughts about the self don't entail their first order constituents, e.g., the second order thought 'I am now thinking' doesn't imply the truth of the first order thought it contains. If Burge is to deliver infallible self justification, it appears, then, to come at the cost of apriority, since he will be required to adduce evidence to confute Hume's dismissal of the (Cogito) inference.

18 Though Burge's account of infallible self justification (and knowledge), we will see, differs from Descartes' account in certain respects. Even so, I argue it suffers the same basic defects as the Cartesian account.
19 Parent also defends the logical locking thesis (2007, pp. 415ff), but on empirical rather than putative a priori grounds.
20 In bald terms, the Humean insight is that thoughts → subjects is not a logically (or epistemologically) unassailable inference.
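In schematic form (an editorial LaTeX gloss; the notation is not Burge's), the locking thesis asserts an exception to the usual failure of factivity:

```latex
% Burge's logical locking thesis: the second order judgment entails
% its first order constituent ...
\[
\mathrm{Judge}\bigl(\text{I think that } p\bigr) \;\models\; \mathrm{Think}(p)
\]
% ... even though, in general, thinking that p does not entail p:
\[
\mathrm{Think}(p) \;\not\models\; p
\]
% Hume's candidate defeasor, as invoked in the text:
\[
\sim(\text{thoughts} \rightarrow \text{subjects})
\]
```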
The central problem with Burge's view of self referential justification is that it ignores the inescapable epistemological chasm between thoughts about the self and the subject matter (i.e., subject) of those thoughts. If we are correct, in reflecting on the warrant for a judgment regarding one's thoughts, the reflecting thought is distinct from the reflected thought. This gives rise to an epistemological lacuna that rules out the possibility of infallible a priori self justification. This is just to say, in a manner of speaking, that the logical locking thesis (or any thesis regarding the mechanics of the reference fixing of self referential judgments) is not a logical truth, since it is not logically indefeasible.

In fact, the logical locking thesis does not even seem to be 'relatively uncontroversial', as Burge maintains (1988, p. 660). If Heil (1988, pp. 240–241) and Macdonald (2007, pp. 369–370) are correct, the logical locking thesis is not the most perspicuous way of explaining so-called privileged access or first person authority, the phenomenon in which a subject is supposed to have better epistemic access to her own thoughts than do others.21 Parent, unlike Heil and Macdonald, is unwavering in his commitment to the logical locking thesis, and while he rejects Burge's a priori argument for this thesis, does not rule out the possibility of its a priori justification (2007, p. 420). It is perhaps telling, though, that Parent's defense of the logical locking thesis and the corresponding infallibility of self justification (of certain forms) is explicitly a posteriori. The principal basis of Parent's argument for infallibilism about self justification is a thesis (about a Fodorian language of thought) he acknowledges to be empirical: "…thoughts are composed of concepts according to specific formation and transformation rules, i.e., a 'grammar'" (2007, p. 415). Parent's position regarding infallible self justification obliquely reveals what Burge attempts to conceal: that infallible self justification cannot be purchased a priori.

In the end, Burge, for reasons similar to Descartes, has failed to furnish us with a brand of a priori infallible self justification that is "… self-referential in a way that insures the object of reference just is the thought being thought" (1988, p. 659). This finding should be no real surprise, since it is a relatively direct consequence of the fallibilist view of reason sketched and championed here. If even the most diligent exercise of reason is not failsafe, pure reason cannot deliver any variety of infallible justification, including infallible self justification. Of course whether (Cogito), (Dubito) and analogous proposals for infallible a priori self justification are amenable to a lesser degree of justification is an entirely separate question.

5 Concluding remarks

If I am correct, pure reason cannot furnish absolute warrant for any kind of proposition. Two direct consequences follow from this recognition. First, rationalist infallibilism,

21 As with the debates concerning fallible self justification (and knowledge) and the compatibility of self knowledge with externalism, I suppress discussion concerning the possibility, scope and limits of privileged access or first person authority in this paper.
the doctrine according to which a wide range of both analytic and synthetic a priori propositions can be infallibly justified (or absolutely warranted), is completely without merit (contra Descartes (1996), C.I. Lewis (1946), Unger (1975) and Price (1953)). Second, more importantly, the two component theses of rationalist infallibilism lack credibility: neither (i) analytic a priori propositions nor (ii) synthetic a priori propositions can be infallibly justified (contra Putnam (1978), Burge (1986, 1988), Lewis (1996), Price (1953) and Unger (1975)). On the other hand, no more wide ranging conclusions can be drawn for the origins, structure or possibility of justified belief (or knowledge), since no explicit reason has been given for rejecting alternative fallibilist varieties of rationalism or foundationalism, or, strictly speaking, for rejecting infallibilist empiricism. While it might turn out that a compelling dismissal of foundationalism, rationalism and/or infallibilist empiricism can be erected on the basis of the resilient anti-foundationalist/fallibilist insights of Hume, Sellars, Quine and others, final judgment on this matter demands further investigation.

Acknowledgments I would like to thank Danny Goldstick, Henry Jackman, David Hunter, Derek Brown and an anonymous referee from the Society for Exact Philosophy for helpful comments on this paper. This research was supported by the Ryerson University SIG-SSHRC Research Fund.
References

Ayer, A. J. (1936). Language, truth, and logic. London: Gollancz.
Ayer, A. J. (1940). The foundations of empirical knowledge. London: Macmillan.
Ayer, A. J. (1956). The problem of knowledge. London: Macmillan.
Bealer, G. (1999). A theory of the a priori. Philosophical Perspectives, 13, 29–55. (Cambridge, MA: Basil Blackwell).
Boghossian, P. (1994). Analyticity and conceptual truth. Philosophical Issues, 5, 117–131.
Burge, T. (1986). Individualism and psychology. The Philosophical Review, 95, 3–45.
Burge, T. (1988). Individualism and self-knowledge. The Journal of Philosophy, 85, 649–663.
Burge, T. (1996). Our entitlement to self-knowledge. Proceedings of the Aristotelian Society, 96, 91–116.
Carnap, R. (1935). Philosophy and logical syntax. Bristol, UK: Thoemmes.
Carnap, R. (1950). Empiricism, semantics, ontology. Revue Internationale de Philosophie, 4, 20–40.
Chisholm, R. (1977). The truths of reason. In Theory of knowledge. Englewood Cliffs: Prentice-Hall.
Davidson, D. (1986). A coherence theory of truth and knowledge. In E. LePore (Ed.), Truth and interpretation: Perspectives on the philosophy of Donald Davidson. Oxford: Basil Blackwell.
Descartes, R. (1996). Meditations on first philosophy (J. Cottingham, Trans.). Cambridge: Cambridge University Press.
Frege, G. (1967). Concept script, a formal language of pure thought modelled upon that of arithmetic (S. Bauer-Mengelberg, Trans.). In J. van Heijenoort (Ed.), From Frege to Gödel: A source book in mathematical logic, 1879–1931. Cambridge, MA: Harvard University Press.
Frege, G. (1974). The foundations of arithmetic (J. L. Austin, Trans.). Oxford: Basil Blackwell.
Heil, J. (1988). Privileged access. Mind, 97, 238–251.
Hume, D. (1992). A treatise of human nature. Buffalo: Prometheus Books.
Hume, D. (1993). In E. Steinberg (Ed.), An enquiry concerning human understanding. Indianapolis: Hackett Publishing Company.
Kant, I. (2007). Critique of pure reason (N. Kemp Smith, Trans.). Edinburgh: Blunt Press.
Klein, P. (1990). Epistemic compatibilism and canonical beliefs. In M. Roth & G. Ross (Eds.), Doubting: Contemporary perspectives on skepticism. Dordrecht: Kluwer Academic Publishers.
Klein, P. (1999). Human knowledge and the infinite regress of reasons. Philosophical Perspectives, 13, 297–325.
Klein, P. (2003). How a Pyrrhonian skeptic might respond to academic skepticism. In The skeptics: Contemporary essays. Aldershot: Ashgate.
Lewis, C. I. (1946). An analysis of knowledge and valuation. La Salle: Open Court.
Lewis, D. (1996). Elusive knowledge. Australasian Journal of Philosophy, 74(4), 549–567.
Macdonald, C. (2007). Introspection and authoritative self knowledge. Erkenntnis, 67, 355–372.
Parent, T. (2007). Infallibilism about self-knowledge. Philosophical Studies, 133, 411–424.
Price, H. H. (1953). Thinking and experience. Cambridge, MA: Harvard University Press.
Putnam, H. (1978). There is at least one a priori truth. Erkenntnis, 13, 153–170.
Quine, W. (1953). From a logical point of view. Cambridge: Harvard University Press.
Sellars, W. (1997). In R. Brandom (Ed.), Empiricism and the philosophy of mind. Cambridge, MA: Harvard University Press.
Sosa, E. (1980). The raft and the pyramid. Midwest Studies in Philosophy, 5, 3–25. (Minneapolis: University of Minnesota Press).
Strawson, P. (1957). Propositions, concepts and logical truths. Philosophical Quarterly, 7.
Unger, P. (1975). Ignorance: A case for scepticism. Oxford: Clarendon Press.
Wittgenstein, L. (1969). On certainty (E. Anscombe & G. H. von Wright, Eds.). Oxford: Basil Blackwell.
Synthese (2011) 181:255–275 DOI 10.1007/s11229-010-9801-8
Self-organisation in dynamical systems: a limiting result

Richard Johns
Received: 4 December 2009 / Accepted: 26 July 2010 / Published online: 7 September 2010 © Springer Science+Business Media B.V. 2010
Abstract There is presently considerable interest in the phenomenon of "self-organisation" in dynamical systems. The rough idea of self-organisation is that a structure appears "by itself" in a dynamical system, with reasonably high probability, in a reasonably short time, with no help from a special initial state, or interaction with an external system. What is often missed, however, is that the standard evolutionary account of the origin of multi-cellular life fits this definition, so that higher living organisms are also products of self-organisation. Very few kinds of object can self-organise, and the question of what such objects are like is a suitable mathematical problem. Extending the familiar notion of algorithmic complexity into the context of dynamical systems, we obtain a notion of "dynamical complexity". A simple theorem then shows that only objects of very low dynamical complexity can self-organise, so that living organisms must be of low dynamical complexity. On the other hand, symmetry considerations suggest that living organisms are highly complex, relative to the dynamical laws, due to their large size and high degree of irregularity. In particular, it is shown that since dynamical laws operate locally, and do not vary across space and time, they cannot produce any specific large and irregular structure with high probability in a short time. These arguments suggest that standard evolutionary theories of the origin of higher organisms are incomplete.

Keywords Self organisation · Complexity · Information · Evolution · Limitative · Symmetry · Irregularity · Dynamics
R. Johns (B)
Department of Philosophy, University of British Columbia, 1866 Main Mall/E370, Buchanan Building, Vancouver, BC V6T 1Z1, Canada
e-mail: [email protected]
1 Introduction

Self organisation, or "order for free", is an important (and expanding) area of inquiry. Self-organised structures occur in many contexts, including biology. While these structures may be intricate and impressive, there are some limitations on the kinds of structure that can self-organise, given the dynamical laws. (William Paley pointed out, for example, that a watch cannot be produced by "the laws of metallic nature".) In this paper I will demonstrate that certain fundamental symmetries in the laws of physics constrain self organisation in an interesting way. Roughly speaking, structures that are both large and non-self-similar cannot self organise in any dynamical system.

2 What is self-organisation?

The term "self-organisation" (SO for short) is used to describe the emergence of an object or structure "by itself", or "spontaneously", within a dynamical system. Of course the structure isn't entirely uncaused—it arises from the dynamics. The easiest way to make the notion of SO precise is to exclude other possible causes of the structure, as follows:

1. The appearance of the object does not require a special, "fine-tuned" initial state.
2. There is no need for interaction with an external system.
3. The object is likely to appear in a reasonably short time.

The first two conditions are clear enough, ruling out cases where the structure is latent in the initial state of the system (like an oak tree from an acorn), and where the structure comes from outside (like an artist carving a sculpture). The third condition rules out cases of dumb luck and dogged persistence. A purely random dynamics, for example, might produce a watch with some fantastically small probability, or with a large probability given some fantastically long time, but these are not cases of self organisation.

There are many kinds of object that appear by self organisation. Crystals are one obvious example. The vortex created by draining the bath tub is another. Living organisms are a case where self-organisation is largely, although not quite entirely, responsible, according to the standard evolutionary picture. This case is of particular interest, and will be discussed separately in Sects. 11 and 12.

3 Limits to self-organisation

It is obvious enough that there are limits to self-organisation, as even simple arithmetic will show. Any given set of dynamical laws might produce some kinds of object spontaneously, but cannot produce all kinds of object that way. Consider, for example, the first 1,000 objects that a set of laws produces, from a random initial state. It is clear that there cannot be more than 1,000 distinct objects that are guaranteed to be in this set. And, similarly, there cannot be more than 100,000 objects with a better than 1% chance of being in this set.
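The second bound follows from a one-line expectation argument; the following LaTeX derivation is an editorial gloss on the arithmetic, not part of the original text:

```latex
% Let S be the set of the first 1,000 objects produced, and suppose each of
% K distinct objects o_1, ..., o_K has P(o_i \in S) > 0.01. Then
\[
0.01\,K \;<\; \sum_{i=1}^{K} P(o_i \in S)
\;=\; \mathbb{E}\bigl[\#\{\, i : o_i \in S \,\}\bigr] \;\le\; 1000,
\]
% and hence K < 100,000.
```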
For any given set of dynamical laws, therefore, we can ask such questions as: "Which types of object can these laws produce?", and "Which types of object can these laws not produce?" Of course these questions are not too precise, as for a stochastic system it might turn out that every conceivable object is a possible member of the first 1,000 products, but it's still true for most dynamical systems that some kinds of structure tend to be produced much more quickly and probably than others. The need to describe this situation precisely will lead us to the concept of salience below. In short, even if any object can be produced at any time, some objects are still far more salient than others, with respect to the dynamics.

4 Dynamical symmetries

In examining the question of which objects tend to be produced by a given set of dynamical laws, one important feature of those laws will be the symmetries they contain. The idea that symmetry in a cause constrains its possible effects (and more generally the probability function over its possible effects) is familiar enough. In a deterministic world, for example, Buridan's ass will starve, since eating either bale will break the initial symmetry. And in a stochastic world, the two bales have equal probability of being eaten. More generally I assume, in cases where two possible events A and B are symmetric with respect to both the dynamical laws and the initial state, that A and B have the same chance of occurrence.

In the following argument, I will focus on just two types of symmetry that dynamical laws typically possess:

(i) Invariance under spatial and temporal translation.
(ii) Locality.

The first symmetry is just the familiar idea that the laws of physics are the same everywhere and at all times. The second says that what happens in one place x at time t depends directly on what happened just prior to t in the neighbourhood of x. There is no direct action at a distance, or across times. (You may not think of this second property as a symmetry, but it is in some sense at least.)

The argument below is made in the context of cellular automata, rather than dynamical systems with continuous space and time, for simplicity. I hope that the results will generalise fairly easily, however. The conclusion of this argument is that, in a dynamical system with the two symmetries stated above, the only large structures that can have high salience are regular, or self-similar ones. More precisely, I will show that a large object with high salience must be highly determined by its local structure. Let us therefore define this term.

5 Local structure

Suppose you are provided with a square grid of cells, 1,000 cells wide and 1,000 cells high, for a million cells in all. Each cell may be filled with either a white or a black
Fig. 1 Local blocks in a highly regular structure
counter. You're also provided with a black-and-white digital photograph, which has one million pixels in a 1,000 by 1,000 grid. You're given the task of placing counters into your grid to produce an exact copy of the "target" photograph. Simple enough? To make it more of a challenge, let's suppose that you can view the target only through a thin straw, which permits you to see only a 3 by 3 block of pixels at one time. Also, when you're looking at such a "local block", you can have no idea where in the target image it lies. These two constraints, of being able to see only local blocks, and being ignorant of their positions, may or may not greatly hamper one's ability to complete the task.

Suppose, for example, that the local blocks turn out to be of only two different kinds, as in Fig. 1. In this case, the target is clearly one of two different things, so you are bound to complete the task in your first two attempts. If, on the other hand, when you look through the straw you see all 512 possible kinds of local block, and with about the same frequency, then it's a lot more difficult, for the target image may be any one of a rather large set of possible states. One can then do no better than guess which one it is. You will complain that the task is practically impossible.

One way to describe the situation is in terms of "local structure". Looking at the image through the straw tells you its local structure. We can define the local structure more precisely as a function from the 512 block types to their frequencies in the image. The difficulty of this task then depends on the extent to which the target image is determined by its local structure. In the first case, where there were only two block types, the local structure almost completely determined the image. In the second case, however, the image was largely undetermined by the local structure.

Using this notion of local structure we can define the irregularity of an image s in terms of the number N of possible images that have the same local structure as s. For reasons of convenience, I actually define the irregularity of s as log N.

Suppose the target is s, and s′ is locally equivalent to s (i.e. s and s′ have the same local structure). Let Fr(s) be the event that you manage to produce s among the first r attempts. We then see that, in the absence of additional information, so that you're reduced to guessing the global structure of the target, P(Fr(s)) = P(Fr(s′)). It is possible to see this equality as a result of symmetry in your information, even though it is not a straightforward geometrical symmetry. Your information about the target image (i.e. the local structure of the target) is symmetric with respect to the N-membered set of images with that local structure, in the sense that it does not allow you to single out any member of the set.
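To make the notion concrete, here is a small Python sketch (an editorial illustration, not the author's apparatus) that computes the local structure of a binary image as a frequency table over the 512 possible 3 by 3 block types. The wrap-around treatment of the grid edges is an assumption; the text does not specify boundary handling.

```python
from collections import Counter
from typing import Dict, List

def local_structure(img: List[List[int]]) -> Dict[int, int]:
    """Map each 3x3 block type (encoded as a 9-bit integer) to its frequency.

    Blocks are read with wrap-around at the edges (an assumption), so every
    cell sits at the centre of exactly one block, as in the text.
    """
    h, w = len(img), len(img[0])
    counts: Counter = Counter()
    for y in range(h):
        for x in range(w):
            code = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    code = (code << 1) | img[(y + dy) % h][(x + dx) % w]
            counts[code] += 1
    return dict(counts)

# A tiny all-black image has a single block type, so its local structure
# pins it down completely (cf. the two-block-type case in Fig. 1).
solid = [[1] * 4 for _ in range(4)]
assert local_structure(solid) == {0b111111111: 16}
```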
6 Local dynamics

Dynamical laws, as stated in Sect. 4, operate at the local level. Thus they are restricted in something like the way one is restricted by looking at the target through a straw. But there's an important difference: instead of looking at the target through a straw, the dynamical law looks at the present state of the system through a straw. To see how this works, let's consider the image problem again. Suppose that when you look at the target image through the straw you see all 512 kinds of block, in equal frequency. You complain that the task cannot be done by any clever means, but only by sheer luck, (very) dogged persistence, or both. In response, a new problem is set, where you are shown the image all at once, which turns out to be a portrait of Abraham Lincoln. But the catch is that you're now only allowed to look at your own grid through the straw, not knowing which block you're looking at. You decide the next colour of a given cell after examining just its present colour and those of the surrounding eight cells. (You're not allowed to take into account any knowledge of other parts of your grid, but you can use your knowledge of the target.)

For the sake of clarity it may help to present this new problem in a different way. For each time t your assistant looks at the state of your grid at t, and prepares a "t-sheet", which lists all the local blocks in the state at time t. Every individual block is listed, not just each type of block, so that there are exactly as many blocks as cells. (Each cell is, of course, at the centre of exactly one local block.) The blocks are however listed in random order, so that you have no information about the location of each block in the grid. You move through the t-sheet, making a decision about each block, whether to keep the central cell as it is or change it to the other colour. This decision is based entirely on the colours of the cells within that block, not on any other blocks in the t-sheet. The decision may be either deterministic or probabilistic, however. The most general case therefore is that you have a set of 512 different "toggle probabilities", i.e. probabilities for toggling (changing) the central cell, based on the 512 different possible colour combinations of that cell and its surrounding cells. Your assistant takes all these toggle/keep decisions and uses them to update the state of the grid, and then provides you with a (t + 1)-sheet. Then you make a similar set of decisions about the blocks on the (t + 1)-sheet, and so on. In making these decisions you are exactly mimicking the work done by a (local and invariant) dynamical law of a cellular automaton.

In the new problem the difficulty has been shifted. Instead of having restricted information about the target image, you have even more tightly restricted information about the present state of the grid. (Your information is slightly less in the new problem than in the old, since in the new problem you never see the entire local structure of your grid. You just see one local block at a time.) How do the two problems compare in difficulty? In Appendix 1 the following answer is demonstrated.

Theorem 1 The new problem is at least as hard as the old one.

Theorem 1 is used to convert results about the old problem into results about dynamical systems. In fact, some of the results in this paper apply primarily to the old problem. But then, using Theorem 1, we derive a corresponding result about the new problem, i.e. about dynamical systems.
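In code, the assistant's update rule is just a stochastic cellular automaton driven by a 512-entry table of toggle probabilities. A minimal sketch, reusing the block encoding (and NumPy import) from the sketch above; the array name toggle_probs and the synchronous implementation are my own:

```python
def step(grid, toggle_probs, rng):
    # One synchronous update. toggle_probs[b] is the probability of
    # flipping the central cell of a block whose 3x3 pattern encodes
    # to the integer b; a deterministic law uses only 0s and 1s.
    code = np.zeros_like(grid, dtype=np.int64)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            code = 2 * code + np.roll(np.roll(grid, di, axis=0), dj, axis=1)
    flips = rng.random(grid.shape) < toggle_probs[code]
    return np.where(flips, 1 - grid, grid)
```

Because the rule is indexed only by the local block pattern, it is automatically local and invariant under translation, the two symmetries assumed in Sect. 4.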
7 Salience

As mentioned in Sect. 3, the notion of salience is needed to express the fact that some types of object tend to appear more quickly than others, in a given dynamical system, from a random initial state. I define the salience of an object type, with respect to a dynamical law, as follows. First we define the r-salience of an object type s:

Definition Let the proposition F_r s say that s is among the first r (distinct) types of object that appear in the history. Then Sal_r(s) = P(F_r s)/r.

Note that if the object s tends to be produced fairly quickly, from most initial states, then its r-salience will be quite high for some (low) values of r. If s is rarely produced, on the other hand, for almost all initial states, then its r-salience will be low for all r. This fact suggests that we define salience from r-salience in the following way.

Definition Sal(s) = max_r {Sal_r(s)}.

In other words, we define the salience of s as its maximum r-salience, over all values of r. Thus, if a type of object tends to be produced quickly by the dynamics, so that its r-salience is quite high for some small r, then its salience will also be quite high. An object that is unlikely to be produced in any short time1 will have low salience. For convenience I also define the "dynamical complexity" of s as minus the log of its salience.

Definition Comp(s) = −log Sal(s).

Note that if s and s′ are locally equivalent, then they are equally likely to be guessed, by a player who views the target through a straw, as in the old problem. In other words, for such a player: P(F_r s) = P(F_r s′). We then have the following theorem (see Appendix 2 for a proof).

Theorem 2 If s is one of N objects in a locally-equivalent set S = {s_1, s_2, …, s_N}, then Sal(s) ≤ 1/N.

Further, according to Theorem 1, which states that a local dynamical law is at least as severely restricted as such a player, we infer that Sal(s) ≤ 1/N for a dynamical system as well.

It will be useful to consider the salience of an n-bit binary string in a couple of rather trivial dynamical systems. The first such system, which we'll call the completely random system, is one in which each cell evolves completely at random, and independently of the other cells. It's as if the content of each cell at each time is determined by the outcome of a fair coin toss, with there being a separate toss for each cell at each time.

1 More precisely, a low salience object type is unlikely to be produced with any low rank, i.e. a low ordinal number in the sequence of objects produced. If a system rarely produces new kinds of object, then even an object that requires trillions of years to be produced could still have low rank, and hence low salience.
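These definitions can be estimated directly by simulation. A rough sketch (my own construction; run_history is a hypothetical generator that yields a system's products, in order of first appearance, starting from a random initial state):

```python
def estimate_r_salience(run_history, s, r, trials, rng):
    # Monte Carlo estimate of Sal_r(s) = P(F_r s)/r, where F_r s says
    # that s is among the first r distinct object types produced.
    hits = 0
    for _ in range(trials):
        seen = set()
        for obj in run_history(rng):
            if obj in seen:
                continue
            seen.add(obj)
            if obj == s:
                hits += 1
                break
            if len(seen) == r:
                break
    return hits / trials / r

# Sal(s) is then approximated by the maximum of these estimates
# over a range of r values.
```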
In considering the salience (and hence dynamical complexity) of an n-bit string here, the size of the system (i.e. the number of cells) is a complicating factor, so to begin with let's suppose that the system is a one-dimensional array of n cells. In this case, the salience of every string is the same, namely 2^−n, so that the complexity is n. Note that this result regards the target string (s say) as distinct from its mirror image, s⁻. (I.e. s⁻ is just s in reverse.) If we regard s and s⁻ as identical, then the complexity of this object is n − 1.

This simple case also supposes that the system is linear, so that there are edge cells. This implies that each bit of s has just one cell in which it can appear. Another possibility, however, is that the system is a closed loop of cells, so that there are no edges. In that case, the first bit of s can appear in any of the system's n cells. There are then n products on each time step, so that s is likely to appear much more quickly. The probability of s being somewhere in the initial state, for example, is now n·2^−n. But since there are n objects in that state, the n-salience is 2^−n. It is easily shown that the r-salience of s is also 2^−n, for all r, so that the complexity of s is still n.

What if the system is larger than n cells, however? In a larger system there is (one might say) more guessing going on, so that any given n-bit string (s say) is likely to be guessed sooner. On the other hand, there are also more "products" at each time step, since each n-bit section of the state might be considered a product. What will the net effect on the salience of s be, as the system size is increased? Actually there will be no change at all. Suppose, for example, that there are m cells, where m > n. Then s can appear in m different places in the system, so that there are m products on each time step. In that case the probability of s in the initial state is m·2^−n, so that the m-salience (and indeed the r-salience) is still 2^−n. The same situation obtains in two- and three-dimensional systems. But note that in a two-dimensional system we might allow the string to appear in a column, going up as well as down, as well as backwards in a row. Thus the salience of this set of four objects will be 4·2^−n, and its complexity n − 2. For large n this difference is rather trivial, however.
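The claim that enlarging the system leaves the salience at 2^−n can be spot-checked, at least for the random initial state. A small Monte Carlo sketch of my own: for an aperiodic target string, the chance of finding it somewhere in a random m-cell loop should be close to m·2^−n, so that with m products per step the m-salience comes out near 2^−n.

```python
import numpy as np

def prob_in_random_loop(s, m, trials=200_000, seed=0):
    # Chance that the bit-string s occurs (reading around the loop)
    # somewhere in a uniformly random loop of m cells.
    rng = np.random.default_rng(seed)
    n, hits = len(s), 0
    target = tuple(s)
    for _ in range(trials):
        state = rng.integers(0, 2, size=m)
        wrapped = np.concatenate([state, state[: n - 1]])
        if any(tuple(wrapped[i:i + n]) == target for i in range(m)):
            hits += 1
    return hits / trials

# prob_in_random_loop([1, 0, 1, 1, 0, 1, 0, 0], m=12) is near
# 12 * 2**-8, so the m-salience of this 8-bit string is near 2**-8.
```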
8 Complexity and information

In the previous section the 'dynamical complexity', or Comp, is defined in a merely arithmetical way from the salience of s. The reader is therefore left in the dark as to what (if anything) dynamical complexity means. In this section I shall briefly explain why I regard it as measuring the 'information content' of an object.

To answer this it is best to begin with the meaning of 'information' in the epistemic context, of a thinker who has a particular epistemic state, or state of knowledge, at a given time. In that context one can define Inf_K(A), the information content of a proposition A, relative to the epistemic state K, as −log2 P_K(A), where P_K(A) is the epistemic (or evidential) probability of A within K. Note that if A is believed with certainty in K then P_K(A) = 1 and hence Inf_K(A) = 0. Thus we see that Inf_K measures the amount of information in A, over and above what is already known in K, i.e. it measures the information in A that is lacking in K. To understand the meaning of Inf more concretely, it helps to consider a decomposition of A into a conjunction of
propositions that are mutually independent, and which each have probability 1/2. If Inf_K(A) = n, then A will clearly decompose into n such propositions. If we call such a proposition a 'bit', then it's very natural to say that A contains n bits of information.

One may wonder what value there is in introducing Inf, since it allows one to say only things that can already be expressed in terms of P. In fact some relations, while they can be expressed perfectly precisely in terms of P, seem more natural and intuitive when expressed using Inf. This in turn renders some important logical facts easier to see. Consider, for example, the conjunction rule for epistemic probabilities2:

P_K(A ∧ B) = P_K(B)·P_K(A|B)

This same relation, in terms of information, is:

Inf_K(A ∧ B) = Inf_K(B) + Inf_K(A|B)

This is very intuitive—much more intuitive than the conjunction rule for probabilities. For consider that one way to learn the conjunction (A ∧ B) is to learn B, and then learn A. The information gained in the first step is Inf_K(B), of course, putting one in the new state K + B. Then, when one learns A, the extra information gained is Inf_{K+B}(A), i.e. Inf_K(A|B). Since Inf_K(A ∧ B) can also be expressed as Inf_K(A) + Inf_K(B|A), we see here a kind of "path independence" involved in learning (this can be proved generally). The quantity of information gained along a "learning path" of expanding epistemic states depends only on the end points, and is independent of the path taken.

Such path independence is vaguely reminiscent of conservative force fields, where the work done in moving a particle from point A to point B is independent of the path taken. One might therefore wonder whether the 'information' thus defined, like energy, is some sort of conserved quantity. In fact there are some conservation theorems of this sort, involving Inf, that make the analogy useful. (Note that there are also many important disanalogies between energy and information.) To understand these information conservation theorems, it is essential to understand that epistemic probability is based on the idea of an ideal agent that is logically omniscient. This means that the agent believes all logical truths (such as tautologies) with certainty. His set of certain beliefs is also deductively closed, so that it contains all the logical consequences of its members. For such an agent, it is easy to see that no new information can be obtained by thought alone. Some sort of external information input is needed.

The basic idea of these conservation theorems is that logical consequence requires the premise to contain at least as much information as the conclusion. (Note that this is only a necessary condition for logical consequence, not a sufficient one.) As Chaitin (1982, p. 942) put it: "I would like to be able to say that if one has ten pounds of axioms and a twenty-pound theorem, then that theorem cannot be derived from those axioms."

2 Note that P_K(A|B) is here defined as P_{K+B}(A), where K + B is the epistemic state K expanded by adding full belief in the proposition B. In a similar way one can define Inf_K(A|B) as Inf_{K+B}(A). Also, A ∧ B means "A and B", i.e. the weakest proposition that entails A and entails B.
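The additivity and path independence of Inf are easy to verify in a toy model. A sketch of my own, with eight equally likely worlds and propositions represented as the sets of worlds in which they are true:

```python
from math import log2

worlds = set(range(8))                  # eight equally likely worlds
A = {0, 1, 2, 3}                        # P(A) = 1/2
B = {0, 1, 4, 5}                        # P(B) = 1/2

P    = lambda X: len(X) / len(worlds)
Inf  = lambda X: -log2(P(X))            # Inf_K(X) = -log2 P_K(X)
InfC = lambda X, Y: -log2(len(X & Y) / len(Y))   # Inf_K(X|Y) = Inf_{K+Y}(X)

assert Inf(A & B) == Inf(B) + InfC(A, B)   # learn B first, then A
assert Inf(A & B) == Inf(A) + InfC(B, A)   # learn A first: same total
```

Both learning paths cost 2 bits in total, illustrating the path independence described above.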
Suppose that K represents one's initial, "background", knowledge, and that B is some theorem one would like to be able to prove. But B isn't certain in K, having instead some non-zero information content Inf_K(B). So let us add some new proposition A to K, in the hope that B will be provable (i.e. certain) in K + A. It is trivial to show that, for this to be possible, P_K(A) ≤ P_K(B), so that Inf_K(A) ≥ Inf_K(B). In other words, if the theorem B weighs twenty pounds (relative to K) then the axioms A must weigh at least twenty pounds as well.

The use of the term "conservation" to describe such results is not ideal, as it might suggest that the weight of the axioms always equals that of the theorems, which is obviously not the case. One can certainly prove a "light" theorem from a "heavy" set of axioms! What is ruled out is an increase in weight, from premises to conclusion. This result might therefore be better described as a non-amplification theorem. This is rather an awkward term, however, so I shall continue to call it a conservation theorem. More generally we have the following (also trivial) conservation theorem.

Theorem If learning A reduces the information content of B by r bits, then the information content of A itself is at least r bits. I.e. if Inf_K(B|A) = Inf_K(B) − r, then Inf_K(A) ≥ r.

Similar conservation laws apply to the notion of dynamical complexity defined in Sect. 7. Before we examine these, I will explain why Comp can actually be thought of as a measure of information content. In the epistemic context, the notion of information content naturally applies to propositions rather than objects. In algorithmic information theory, by contrast, the complexity or information content of an object s is defined as the length of the shortest (binary) program which, put into a given universal Turing machine, generates s as the output. This idea is often loosely expressed by saying that "the complexity of an object is the number of bits of information needed to specify it exactly". The basic idea of dynamical complexity is the same, except that a dynamical system takes the place of the Turing machine. Dynamical systems also produce objects of course (such as ourselves) and they also have inputs (initial conditions).

There are two important differences, however, between Turing machines and dynamical systems. First, dynamical systems do not produce a unique output and then halt. Rather, multiple objects appear within the system, at different times. Second, dynamical systems are often stochastic rather than deterministic, so that probabilities must be considered. First let us consider an object s that is produced with a (physical) probability q rather than with certainty. In this case, the system can be seen as lacking −log q bits of information (taking base-2 logs) for making s. (After all, the proposition 's is produced' has −log q bits of information over and above what is contained in the system's dynamics.) Hence the information content of s is −log q bits, in addition to any other information inputs. Second, suppose that s is produced by the system, but never by itself, only as part of a set of r objects. In that case, after s is produced, it must be "selected" from the set, a task which requires log r bits of information. Putting these two aspects together, let's suppose that s is produced with probability q as one of a set of r objects. In that case, the information needed to specify s from the dynamics is log r − log q, i.e. −log(q/r).
Note that q/r is the r -salience of s. Further, the complexity of s is the minimum information needed for the system to specify s
exactly, so we define Comp(s) as the minimum value of −log(q/r), over all possible values of r.

9 Complexity conservation theorems

The idea of starting the system in a random initial state is that no initial information is provided. In general, however, some initial states will be more likely than others, and many states will be impossible. How will this affect things? In this section we will examine one kind of constraint on the initial state, where one restricts the initial state of the system to a subset of the full state space. It is easy to see that such restrictions can increase (as well as decrease) the salience of a given object, but a conservation theorem applies here. It is shown below that for a restriction of the initial state to reduce the complexity of s by v bits, the restriction itself must contain at least v bits of information.

In cases where the initial state is restricted by a conditionalisation of the chance function, we can regard anything that emerges as a product of conditional self organisation. It is not, as it were, absolute self organisation, since the system had some outside help in getting started. But it is self organisation from that point onward. In order to investigate conditional self organisation, we introduce the notion of a program, as follows:

Definition A program π is a restriction of the initial state to a subset of the state space. (The subset itself is also called π.)

Since the initial state is set at random, each program π has a probability P(π), which in the case of a finite state space is simply the proportion of states that are in π. The more restrictive the program, the lower its probability is. We also define the length of a program as follows:

Definition The length of π, written |π|, is −log P(π).

A trivial theorem then says that if π reduces the complexity of s by v bits, then |π| ≥ v. Rather than prove this conservation theorem, however, it's more convenient to combine it with another that relates the complexity of an object with the time needed to produce it, with reasonable probability.

Suppose, for example, that Comp(s) = n, within a given system. How long will the system take to produce s? In the case of a deterministic system it is easy to see that the system must produce s in a set of no fewer than 2^n objects. For suppose the negation, that s appeared reliably in a smaller set of 2^m objects, say, where m < n. In that case s could be found by letting the system produce 2^m objects, and selecting one of them at random. Such a procedure would yield s with probability 2^−m, giving a complexity of m for s, contrary to the assumption that Comp(s) = n. The conservation theorem below includes this result (see Appendix 3 for a proof). For reasons that will become clear, I call it the random equivalence, or RE theorem.

RE Theorem Suppose Comp(s) = n, and |π*| < n. Then P(F_r* s | π*) ≤ r*·2^(|π*|−n).
The RE theorem says that the probability of the system producing an object of complexity n among the first r* products increases in proportion to r*, but decreases exponentially with the difference between n and the length of the program. (It should be noted that this theorem is independent of Theorem 1.) To see what this theorem really means, consider a case where a very intelligent person is asked to reproduce a hidden 20-bit string that has previously been produced by some purely random process, such as tossing a fair coin. The target string s has epistemic probability 2^−20, of course, since all possible strings are equally likely. Then the information content of s is 20 bits. What can this person do? If he is allowed only one attempt, then he can do no better than flip 20 coins himself, submitting their output as his answer. Note that this corresponds, in the above theorem, to the case where r* = 1 and |π*| = 0. If the person is allowed multiple attempts at producing s (suppose he's allowed r* attempts) then what is the best strategy? There is actually nothing better than making a series of independent random guesses, being sure of course not to make the same guess twice. Using this method, the probability of success is of course exactly r*·2^−n. Finally, we can consider the case where the person is allowed r* attempts at the code, and is provided with additional relevant information in the form of the proposition π*, whose information content is then |π*|. By the conservation theorem of Sect. 8, we see that this can increase the probability of producing s by a factor of 2^|π*| at most, giving a probability of r*·2^(|π*|−n).

In other words, the theorem says that "producing a given object with complexity n in a dynamical system is no easier than producing a given n-bit string in a completely random system". This is a perfectly general result, applying to any system whatsoever. One obvious consequence is that an object whose dynamical complexity in a particular system is very large (a million bits, say) cannot be produced in that system in any reasonable length of time. One might as well wait for monkeys with typewriters to produce Hamlet.
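The guessing game is simple enough to simulate, and the simulation tracks the bound. A sketch of my own construction: the program is modelled as being told the first hint_bits bits of the target, which restricts the guess space in just the way a program π* of length hint_bits restricts the state space.

```python
import random

def guess_game(n, r_star, hint_bits, trials=100_000, seed=0):
    # Empirical chance of hitting a uniform n-bit target within r_star
    # distinct guesses, given the target's first hint_bits bits.
    # RE-theorem bound: r_star * 2**(hint_bits - n).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        target = rng.getrandbits(n)
        prefix = target >> (n - hint_bits)
        # guess distinct suffixes uniformly at random
        for suffix in rng.sample(range(2 ** (n - hint_bits)), r_star):
            if (prefix << (n - hint_bits)) | suffix == target:
                hits += 1
                break
    return hits / trials

# guess_game(12, r_star=8, hint_bits=4) comes out close to
# 8 * 2**(4 - 12) = 0.03125, i.e. r* x 2**(|pi*| - n).
```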
10 Complexity and irregularity

In Sect. 5 we defined the irregularity of a state s as log N, where N is the number of states that are locally equivalent to s. In Sect. 7 we saw that the salience of such a state s is no greater than 1/N, so that the dynamical complexity of an object is always at least as great as its irregularity. Then, in Sect. 9 we saw that states with very low salience, such as 2^−n (where n is one million or greater), effectively cannot be produced by a dynamical system. The question that remains concerns whether any of the objects we see around us have such low salience. In other words: how great is the value of N for real objects?

For a very simple case, consider a binary string s of n bits that is maximally irregular. In other words, the string contains all eight kinds of "local triple", i.e. 000, 001, 010, 011, 100, 101, 110 and 111, in equal frequency. How irregular is s? I don't yet have a strict proof here, but given a very plausible assumption it is easy to show that the dynamical complexity of an irregular string is roughly the same as its length. (See Appendix 4 for a proof relative to that assumption.)
Conjecture If s is a maximally-irregular string of length n, where n is of the order of one million or greater, then the irregularity (and hence dynamical complexity) of s is at least 0.999n.

This conjecture entails that long, irregular strings have very low salience. An irregular string of a billion bits, for example, would have a dynamical complexity of virtually one billion bits. Thus, using the RE theorem, producing a particular billion-bit string of this kind is no easier in a dynamical system than in a completely random system. In other words, it is (effectively) impossible. This impossibility is quite independent of the actual dynamical laws, but depends only on the general features of locality and invariance in the dynamical laws. We thus have the following theorem:

Limitative Theorem A specific large, maximally irregular object cannot appear by self organisation in any dynamical system whose laws are local and invariant.

Proof Suppose that an object s is maximally irregular, and of size n bits. Then its irregularity is approximately n, by the above (practically certain) conjecture. Using Theorem 2, the dynamical complexity of s is at least (approximately) n as well. Then, according to the RE theorem, to produce s with any reasonable probability requires that a total of about 2^n objects are produced. If n is large, say 10^6 or greater, then 2^n is ridiculous.

11 Did life emerge spontaneously?

The appearance of the first self-replicating molecule (or system of molecules) may or may not have been by self-organisation. Some authors, Richard Dawkins for example, have appealed to the vast size of the universe to help explain this event. Having supposed that the probability of a self-replicator appearing on any single planet might be around 10^−9, Dawkins continues:

Yet, if we assume, as we are perfectly entitled to do for the sake of argument, that life has originated only once in the universe, it follows that we are allowed to postulate a very large amount of luck in a theory, because there are so many planets in the universe where life could have originated. If, as one estimate has it, there are 100 billion billion planets, this is 100 billion times greater than even the very low [origin of life probability] that we postulated.

Other authors disagree with Dawkins here, however, claiming that the first self-replicator self-organised. Manfred Eigen seems to have held such a view. I shall steer clear of this issue, however, and assume only that the emergence of life after a self-replicator exists was by self-organisation. This view is very widely held. Dawkins, for example, expresses it as follows:

My personal feeling is that, once cumulative selection has got itself properly started, we need to postulate only a relatively small amount of luck in the subsequent evolution of life and intelligence. Cumulative selection, once it has begun, seems to me powerful enough to make the evolution of intelligence probable, if not inevitable. (Dawkins 1986, p. 146)
Note that, by appealing neither to large amounts of luck, nor to enormous times, nor to external help, Dawkins is claiming that life self-organised (in my sense, from Sect. 2). I am aware that it is unusual to describe all of biological evolution as self-organisation (SO). It is more common to contrast SO with selection, identifying some biological structures as due to selection, and others to self-organisation, and to see these as complementary processes. Blazis (2002) writes, for example:

The consensus of the group was that natural selection and self-organisation are complementary mechanisms. Cole (this volume) argues that biological self-organised systems are expressions of the hierarchical character of biological systems and are, therefore, both the products of, and subject to, natural selection. However, some self-organised patterns, for example, the wave fronts of migrating herds, are not affected by natural selection because, he suggests, there is no obvious genetic connection between the global behavior (the wave front) and the actions of individual animals.

The term SO is applied, it seems, only to cases where the emergence of the structure is not controlled by the genome. Despite this usage, however, it is very important to see that biological evolution, proceeding by the standard mechanisms, satisfies the definition of SO in Sect. 2. Standard biological evolution is, therefore, a case of conditional self-organisation (i.e. conditional on a self-replicator in the initial state).

The first self-replicator on earth must have appeared (very roughly) 4 billion years ago. Since that time, an enormous profusion of complex living organisms has appeared, as a result of the laws of physics operating on that initial state. Now, in view of the size of these organisms, 4 billion years is very much shorter than the time required for assembly of such objects by pure chance. This is why I say that the emergence of life, given the first self-replicator, was by self-organisation. The three criteria from Sect. 2 are met.

Is this fact in conflict with the Limitative Theorem above, that large, irregular objects cannot emerge by SO? It may seem not. For, while living organisms are very large, containing trillions of atoms, they are far from maximally irregular, and the Limitative Theorem applies only to maximally-irregular objects. However, the following two considerations should be borne in mind.

First, while the Limitative Theorem applies only to maximally-irregular objects, there is unlikely to be too much difference in salience between maximally and highly irregular objects. An object has to be highly regular before its global structure becomes highly constrained by its local structure. A more general result, therefore, would surely find a similar situation with all irregular objects, not just maximally-irregular ones. Second, one can apply the Limitative Theorem to the genome instead of the organism itself. Genomes are very small compared to phenotypes, of course, but they are still very large, and (I believe) much more irregular. The shortest bacterial genomes, for example, contain about half a million base pairs, or roughly a million bits. If these genomes are indeed highly irregular, as I suppose, then their production by SO is ruled out by the Limitative Theorem.

At this point we should recall, however, that I am assuming the existence of a self-replicating entity in the initial state. What difference does this make? The following
theorem shows that, while the existence of a self-replicator might well reduce the dynamical complexity of life, such a reduction cannot exceed the complexity of the self-replicator itself. Hence, since the original self-replicator is assumed to be small, and relatively simple, its presence makes very little difference. First we require some definitions.

Definition Sal(s|s′) = Sal(s) given that the initial state contains the object s′.

In other words, one can use the definition of Sal(s) above, but the probability function used is generated from the dynamics by applying a random initial state and then conditionalising on the presence of s′ in that state. Conditional dynamical complexity is then defined from conditional salience:

Definition Comp(s|s′) = −log Sal(s|s′)

Theorem 3 Constraining the initial state of a dynamical system to include some object s′ can reduce the complexity of other objects by no more than Comp(s′).

Proof We shall first prove that Sal(s) ≥ Sal(s′)·Sal(s|s′). Suppose that the value of r that maximises Sal_r(s′) is r_1, and the value of r that maximises Sal_r(s|s′) is r_2. Then one may try to generate s from a random initial state using the following method. One allows the system to evolve for some period of time, producing r_1 objects. One of these objects is then selected at random. The probability of the selected object being s′ is exactly Sal(s′), according to the definition of salience. If s′ is selected then the system is prepared in a random state containing s′, and allowed to evolve again to produce another r_2 objects. One of these r_2 objects is selected at random. Given that the first stage succeeds, the chance of selecting s at the second stage is exactly Sal(s|s′). The overall probability of getting s at the second stage is then Sal(s′)·Sal(s|s′). This is clearly no greater than Sal(s), since Sal(s) involves selecting just s in a history that begins with a random initial state, whereas here we are selecting s′ as well as s in such a history. I.e. Sal(s) ≥ Sal(s′)·Sal(s|s′). Using the definition Comp(s) = −log Sal(s), we immediately obtain the result that Comp(s|s′) ≥ Comp(s) − Comp(s′), as required.

We can roughly gauge the effect of introducing a self-replicator into the initial state by using a few very approximate numbers. Suppose we wish to make a relatively simple organism, such as a bacterium, whose complexity is about 10^6 bits. The complexity of the first replicator must be much less than this, for its appearance not to be a miracle. On Dawkins' view, for example, there might be around 10^20 planets, which "pays for" a chance of about 10^−20 per planet, or about 66 bits. Within each planet there might be many opportunities for the self-replicator to appear, over a billion years or more, which pays for perhaps another few tens of bits, or even as many as 100. In any case, the complexity must surely be below 1,000 bits. But subtracting even 1,000 bits from one million makes almost no difference. Hence the Limitative Theorem cannot be circumvented by imposing a self-replicator in the initial state.
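The closing arithmetic is worth spelling out. A sketch using the rough numbers above (all of them, as in the text, merely illustrative):

```python
from math import log2, log10

organism_bits  = 10**6     # assumed complexity of a simple bacterium
replicator_max = 1_000     # generous bound on the replicator's complexity

print(log2(10**20))        # ~66 bits: the most that 1e20 planets "pay for"

residual = organism_bits - replicator_max   # Theorem 3's lower bound
# The waiting time still scales like 2**residual; as a decimal number
# of digits, that exponent is:
print(residual * log10(2))   # ~300,729 digits -- effectively unchanged
```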
12 Does this result ignore natural selection?

Dawkins' view quoted above, concerning the high probability of intelligent life once a self-replicator exists, is probably an extreme one among biologists. Nevertheless, it is very commonly supposed that the processes of genetic mutation and natural selection allow complexity to emerge far more quickly than pure chance would allow. This idea has been at the heart of evolutionary thinking since Darwin and Wallace. My argument has, for this reason, been suspected of somehow ignoring natural selection, or assuming its absence. After all, my result finds very little difference between purely random processes and general dynamical systems in the time required to produce irregular objects. But surely we know that natural selection can produce complex objects much more quickly than pure chance can? It follows that I must somehow be assuming the absence of natural selection.

The short answer to this worry is that my Limitative Theorem cannot possibly make any assumptions about biological processes of any kind. It cannot assume that such processes are absent, since it does not engage with biological matters in any way! The argument is entirely at the level of physics, being based on symmetries in the dynamical laws. There cannot of course be any contradiction between biological and physical facts—any biological claim that violated the conservation of energy, or the second law of thermodynamics, for example, would be false. (If someone is convinced that some assumption is being made that rules out natural selection, then they are most welcome to identify it.)

While this short answer is correct, it sheds no light on what is going on. So let us examine the general idea of producing complex objects gradually, through a series of small modifications or changes. Theorem 3 perhaps suggests that a gradual approach might make a big difference to the time required. Consider, for example, an object s of complexity n relative to the first self-replicator. According to the RE theorem, it will require at least 2^n modifications to make s with high probability. But now suppose we consider an intermediate object s′, whose complexity is n/2 relative to the first self-replicator. According to Theorem 3, the complexity of s relative to s′ may be as little as n/2 as well. In that case, the production of s′ from the random initial state might take only about 2^(n/2) changes, and the same for the transition from s′ to s. Hence the total could be a mere 2 × 2^(n/2) changes, i.e. 2^(n/2+1), which is a tiny fraction of 2^n. It appears that the insertion of even just one intermediate stage has drastically reduced the time required to produce the object.

The "gradualist" argument of the previous paragraph must be a fallacy, since the proof of the Limitative Theorem is very general, and includes such cases as the one above. But what is wrong with the argument? In order to investigate this, it will be helpful to consider a particular dynamical law, and some intermediate objects. Consider, for example, a 20-bit counter that begins with random values, and then on each time step adds one to the number showing. (After 11111…1 it goes back to 00000…0.) To obtain the target number s will require, on average, about 2^20 (about one million) iterations. Now suppose that the counter is, at some point in its evolution, showing 10 correct bits among the 20. Does this entail that s will be obtained in about 2^10, i.e. about 1,000, further iterations?
It does not, because it all depends on which 10 bits are correct! If the first ten are correct, then
s is indeed close at hand. But if the last 10 are correct, then this means nothing, as those correct bits must be lost before the incorrect bits can change. We are still about a million steps away.

In this example we see that there is some intermediate state, namely where the first ten bits are correct, from which the goal s is very close. And this intermediate state has 10 bits of complexity, since it has salience 2^−10. (There is, after all, a probability 2^−10 of getting it as the initial state.) Yet, interestingly, this state is very unlikely to be obtained in the first 1,000 time steps. Also, there is another intermediate state with 10 bits of complexity that is quickly produced from the initial state (namely, where the last ten bits are correct). But this is unlikely to evolve to s in less than about a million steps. So each intermediate object is either far from the initial state or far from s.

At this point we should recall that the RE theorem is an inequality: it takes at least 2^n objects to have a good chance of producing an object with a complexity of n bits. Moreover, we know from modal logic that ♦A & ♦B does not entail ♦(A & B), i.e. the possibility of A and B together isn't a consequence of the individual possibility of A together with the possibility of B. In this simple example we find that, while a 10-bit object can be close to the initial state, and can be close to the desired 20-bit object, it cannot be close to both of them. Thus, through this counter-example, we have identified a serious fallacy in the gradualist argument. The individual possibility of each small step occurring in a short time does not entail the possibility of the entire sequence of steps each occurring in a short time. We should therefore be wary of any general argument that seeks to show that a complex object can be produced gradually, by a cumulative process, far more rapidly than the Limitative Theorem allows.
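The counter example can be checked with a few lines of code. A sketch (the particular target value is my own; note that the high-bits-correct start is close only when its low bits have not overshot the target's, a case the text glosses over):

```python
def steps(start, target, n=20):
    # Steps needed by a counter that adds 1 (mod 2**n) to reach target.
    return (target - start) % 2**n

target = 0b10110011101111000011             # arbitrary 20-bit target

near = 0b10110011100000000000  # first (high) 10 bits correct, low bits behind
far  = 0b00000000001111000011  # last (low) 10 bits correct, high bits wrong

print(steps(near, target))   # 963      -- about 2**10 steps away
print(steps(far, target))    # 735232   -- roughly 2**19.5 steps away
```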
13 Conclusion

I have argued that there is an important limitation on the kinds of object that can appear spontaneously in a dynamical system. Such systems, with laws that operate locally and invariantly across space and time, are able to control only the local structure of the state. The state as a whole is therefore uncontrolled, except insofar as it is constrained by the local structure. This led us to the Limitative Theorem, which says that a specific irregular object, i.e. one that is largely undetermined by its local structure, cannot easily be produced in a dynamical system. Indeed, it was shown that its production is no easier than the appearance of an object of very similar size in a purely random system.

This result, while relevant to biology, does not of course contradict the theory of evolution in its most general form, i.e. that life evolved through a process of descent with modification. This is just as well, since the historical process of phylogeny is very well supported by the evidence. Nevertheless, the Limitative Theorem does suggest that the currently recognised processes driving evolutionary change are incomplete.
Appendices

Appendix 1: Proof of Theorem 1

First consider a case where one can, in some manner, choose at least the local structure of the next state of the grid. In that case one would, at each iteration, choose a local structure that equals that of the target. (Otherwise one is bound to fail!) It is also clear, on the other hand, that one can do no better than that. Hence in such a case, one's task would be exactly as hard as the old problem. In the new problem, one has strictly less control over the grid than this, since one cannot (directly, in one step) choose even the grid's local structure. Hence the new problem is at least as hard as the old.

Appendix 2: Proof of Theorem 2

(Note that this result applies to the old problem.) We previously defined Sal_r(s) = P(F_r s)/r. Now let I(F_r s_i) be the indicator function for the proposition F_r s_i, so that I(F_r s_i) = 1 when s_i is among the first r products, and I(F_r s_i) = 0 otherwise. Note that, for all i, P(F_r s_i) = E[I(F_r s_i)], where E[·] is the expectation operator. Since there are no more than r members of S among the first r objects, we see that:

Σ_{i=1..N} I(F_r s_i) ≤ r,

and also:

E[Σ_{i=1..N} I(F_r s_i)] ≤ r.

Then, since the expectation operator is linear, it follows that:

Σ_{i=1..N} E[I(F_r s_i)] ≤ r  ⇒  Σ_{i=1..N} P(F_r s_i) ≤ r.

But now, since (for all i) P(F_r s_i) = P(F_r s) in the old problem, it follows that P(F_r s) ≤ r/N. Hence, for every r, the r-salience of s is no greater than 1/N. From this it follows that Sal(s) ≤ 1/N.
Appendix 3: Proof of the RE Theorem

First we prove this useful lemma.

Basic Lemma Comp(s) = min_{r,π} {log r + |π| − log P(F_r s | π)}.

Proof of Basic Lemma Let the O_i^r be all the possible output sets of length r. Then

P(F_r s) = Σ_i P(F_r s | O_i^r) P(O_i^r).

Now P(F_r s | O_i^r) = 1 if s ∈ O_i^r, and is 0 otherwise. Thus:

P(F_r s) = Σ_{i : s ∈ O_i^r} P(O_i^r); also P(F_r s | π) = Σ_{i : s ∈ O_i^r} P(O_i^r | π).

Further, P(O_i^r) = P(O_i^r | π)P(π) + P(O_i^r | ¬π)P(¬π), so P(O_i^r) ≥ P(O_i^r | π)P(π). But, if π is the entire state space, then P(O_i^r) = P(O_i^r | π)P(π). Hence P(O_i^r) = max_π {P(O_i^r | π)P(π)}. Substituting this in the previous equation gives:

P(F_r s) = Σ_{i : s ∈ O_i^r} max_π {P(O_i^r | π)P(π)}.

Thus

Sal(s) = max_r {(1/r) Σ_{i : s ∈ O_i^r} max_π {P(O_i^r | π)P(π)}}
       = max_r {max_π {(P(π)/r) Σ_{i : s ∈ O_i^r} P(O_i^r | π)}}.

But

Σ_{i : s ∈ O_i^r} P(O_i^r | π) = P(F_r s | π),

so

Sal(s) = max_r {max_π {P(π)·P(F_r s | π)/r}} = max_{r,π} {P(π)·P(F_r s | π)/r}.

Then

Comp(s) = −log max_{r,π} {P(π)·P(F_r s | π)/r}
        = min_{r,π} {−log [P(π)·P(F_r s | π)/r]}
        = min_{r,π} {log r − log P(π) − log P(F_r s | π)}.

Then, putting |π| = −log P(π), we get: Comp(s) = min_{r,π} {log r + |π| − log P(F_r s | π)}.

Proof of the RE Theorem From the Basic Lemma, n = min_{r,π} {log r + |π| − log P(F_r s | π)}. Then consider some particular program π* and some value r*. It is then clear that:

n ≤ log r* + |π*| − log P(F_r* s | π*),

and therefore

P(F_r* s | π*) ≤ r*·2^(|π*|−n).
Appendix 4: Counting irregular states

To get a rough estimate of the number of (maximally) irregular strings of n bits we first define the "1-triple form" of a binary sequence. Consider, for example, the 24-bit state 000001010011100101110111. We can break this into three-bit strings, or triples, as follows: 000 001 010 011 100 101 110 111. There are of course 8 possible triples, which we can call 0 (000), 1 (001), etc. up to 7 (111) in the obvious way. We then obtain the 1-triple form of this sequence as: 01234567. I call this the 1-triple form because the first triple begins on bit #1. We could begin on bit #2, and get the 2-triple form, namely: 02471356, i.e. 0 000 010 100 111 001 011 101 11. (Note that I am treating the sequence as a closed loop here.) In a similar way, the 3-triple form is: 05162734, i.e. 00 000 101 001 110 010 111 011 1.
Now: an irregular state is one where each triple occurs with frequency 1/8. This doesn't require, of course, that the triple has frequency 1/8 in the 1-triple form, 2-triple form and 3-triple form individually, but only that it has frequency 1/8 overall. Nevertheless, if the triple does have frequency 1/8 in each of those forms, then it will have frequency 1/8 overall. Note that, in this contrived example, each triple form is irregular, so that the whole state is irregular as well.

But why bother with these triple forms? It's because it's easy to calculate the number of n-bit sequences that are irregular in the 1-triple form (and similarly for both of the other triple forms, of course). This will give us a ballpark estimate of the number of irregular sequences, I think. Suppose you have a bag of n/3 triples, containing n/24 of each type of triple (n is a multiple of 24). By arranging these into a sequence, you're sure to generate a state of n bits that is irregular in its 1-triple form, and moreover you can generate every such (1-triple irregular) sequence this way. So how many ways are there to arrange this bag of triples? Let the number of arrangements be N. We then have:

N = (n/3)! / [(n/24)!]^8.

Applying Stirling's approximation to the factorial, namely n! ≈ √(2πn)·(n/e)^n, we obtain:

N ≈ √8 · 2^n / (πn/12)^(7/2).

To get an idea of how big a fraction of 2^n this is, let's consider log N, i.e.

log2 N ≈ n + 3/2 − (7/2)·log2(πn/12).
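These approximations are easy to check against exact log-factorials. A sketch of my own, using the log-gamma function (which also handles values of n that are not exact multiples of 24):

```python
from math import lgamma, log, log2, pi

def log2_N_exact(n):
    # log2 of N = (n/3)! / ((n/24)!)**8, computed via log-gamma.
    return (lgamma(n / 3 + 1) - 8 * lgamma(n / 24 + 1)) / log(2)

def log2_N_stirling(n):
    # The approximation derived above.
    return n + 1.5 - 3.5 * log2(pi * n / 12)

print(log2_N_exact(10**6))      # ~999938.5
print(log2_N_stirling(10**6))   # ~999938.5 (rounded to 999,939 below)
```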
I am interested in cases where n is in the rough interval from one million to one billion. Let us plug in these values for n. For n = 10^6, log N = 999,939 approx., while for n = 10^9, log N = 999,999,904 approx. In other words, while the 1-triple irregular states are a tiny subset of the whole (being tens of orders of magnitude smaller), on the logarithm scale the sizes are roughly equal, to a tiny fraction of 1%. I think that the number of states that are 1-triple irregular is a very rough estimate of the number of irregular states, but I'd guess that it's an overestimate. To get a (fairly firm but not rock solid) lower bound on the number of irregular states, I think we can take the cube of the proportion of this set in the total. To see this, consider the fact that:

(i) If (but not only if) a state is irregular in all three triple forms, then it is irregular.
Also, (ii) I think we can assume that the proportion of 2-triple irregulars among the 1-triple irregulars is at least as great as the proportion of 2-triple irregulars among the total. And similarly, the proportion of 3-triple irregulars among those that are irregular in both the 1-triple form and the 2-triple form is at least as great as the proportion of 3-triple irregulars among the total.

Given (i) and (ii), the proportion of the intersection of the three sets will be at least as great as the product of these three (equal) proportions. In this way we obtain the lower bound N′ for the number of irregular states of length n:

N′ ≈ 2^n · [√8 / (πn/12)^(7/2)]^3 = 8^(3/2) · 2^n / (πn/12)^(21/2).

This gives similar results to the previous estimate. We now have that log2 N′ is roughly n − (21/2)·log2 n, so that for n = 1,000,000, log2 N′ is roughly 999,791.

References

Blazis, D. E. J. (2002). Introduction. Biological Bulletin, 202(3), 245–246.
Chaitin, G. (1982). Gödel's theorem and information. International Journal of Theoretical Physics, 21, 941–954.
Dawkins, R. (1986). The blind watchmaker. Reprinted by Penguin (1988).
Synthese (2011) 181:277–293 DOI 10.1007/s11229-010-9802-7
General terms, rigidity and the trivialization problem Genoveva Martí · José Martínez-Fernández
Received: 6 December 2009 / Accepted: 26 July 2010 / Published online: 7 September 2010 © Springer Science+Business Media B.V. 2010
Abstract We defend the view that defines the rigidity of general terms as sameness of designated universal across possible worlds from the objection that such a characterization is incapable of distinguishing rigid from non-rigid readings of general terms and, thus, that it trivializes the notion of rigidity. We also argue that previous attempts to offer a solution to the trivialization problem do not succeed. Keywords
Rigidity · Rigid designators · General terms · Trivialization
Our purpose in this paper is to defend the view that defines the rigidity of general terms as sameness of designation across possible worlds. On this view, a general term is rigid just in case it designates the same universal (species, substance or property) in every possible world. The main disagreement among proponents and detractors of the view has to do with whether this characterization of the rigidity of general terms trivializes the notion of rigidity.1 According to the detractors, the view in question is ultimately incapable of distinguishing rigid from non-rigid readings of general terms: when we consider terms such as ‘the color of the sky’, there is no way of distinguishing a non-rigid reading, where the term designates different colors in different possible worlds, from a rigid reading where it designates in every possible world the property 1 Proponents of the view include Linsky (1984, 2006), Salmon (1982, 2003), and LaPorte (2000). The
detractors include Schwartz (2002) and Soames (2002). G. Martí ICREA and Universitat de Barcelona, Barcelona, Spain G. Martí (B) · J. Martínez-Fernández Departament de Lògica, Història i Filosofia de la Ciència, Universitat de Barcelona, Montalegre, 6, 08001 Barcelona, Spain e-mail:
[email protected] J. Martínez-Fernández e-mail:
[email protected]
that things instantiate in a world w when they are the same color as the sky in w. If 'blue' rigidly designates a universal, the color blue, then 'Mary's favorite color' or 'the color of the sky' can also be interpreted as rigidly designating universals (being Mary's favorite color or being the color of the sky). Any sentence of the form t is G (or t is a G) is true just in case t has the property of being G, so any general term G, arguably, designates the property of being G (or being a G) and it does so rigidly. On this approach all general terms may as well be considered rigid.2

It is no solution to try to argue one's way out of the objection by pointing out that the rigid reading of, say, 'Mary's favorite color' is highly unnatural, contending that properties such as being Mary's favorite color do not exist. From a purely semantic point of view there is no reason to deny legitimacy to such properties and to the rigid readings of terms that designate them. As Schwartz (2002, p. 268) has pointed out: "[K]inds may have an important role in our common sense understanding of the world and even in science, but they don't have a metaphysical status that is useful to formal semantics. (…) [P]roperties are not limited to robust things like causal powers – they are simply sets of actual and possible individuals and for every such set there is a property."3

We nevertheless wish to argue that the problem of trivialization is a pseudo-problem. Characterizing rigidity for general terms as sameness of universal designated across possible worlds does not trivialize the notion of rigidity. In the next section we describe the problem and discuss some attempts at providing a solution. Section 2 re-examines the problem and identifies what an adequate response to the charge of trivialization should consist in. Section 3, which exploits ideas presented in Linsky (1984), provides the response.

1 The problem and some non-solutions

To illustrate intuitively how the problem of trivialization arises, let us consider the truth conditions of 'Ann's dress is the color of the sky'. Here 'the color of the sky' appears to have a natural non-rigid reading, according to which it designates the universal *blue* in the actual world and some other color, say *red*, in some other world w. So, when we explain the truth conditions of that sentence, it is natural to say that the sentence is true in the actual world because Ann's blue dress exemplifies *blue*, the color designated by 'the color of the sky' in the actual world, and in w, where Ann's dress continues to be blue, it is false, since Ann's dress does not exemplify the color designated by 'the color of the sky' with respect to w.

2 There are two other important objections that would continue to make the approach unattractive to many, even if there were an answer to the trivialization objection. First, it is argued that the approach overgeneralizes, i.e., that it classifies as rigid terms such as 'pencil' or 'philosopher' which should not be so classified. It is argued also that the approach does not account for the necessity of theoretical identifications involving rigid terms. We address these two objections in Martí and Martínez-Fernández (2010).

3 And moreover, it is arguable that there are some natural properties, properties that exist in the world, that correspond to the rigid readings of some prima facie non-rigid general terms.
For instance, the property of being the color of the sky is a natural property that some surfaces have by virtue of their capacity to reflect colors, and that is the property designated by the rigid interpretation of ‘the color of the sky’.
But suppose that we take 'the color of the sky' to designate rigidly *TCS*, the property that things have in any given world w when they are the color of the sky in w. In that case, the sentence in question is true in the actual world, since Ann's dress, being blue, instantiates being the same color as the sky in the actual world, and it is false in w, for in w Ann's blue dress does not instantiate that property, it is not the same color as the red sky of w. No difference in truth value or truth conditions arises whether we interpret 'the color of the sky' rigidly or non-rigidly.4

The problem of trivialization arises at a relatively intuitive level, when we think about the truth conditions of sentences, and before we start considering any formal system to represent the rigidity of general terms, so it is not just a technical problem generated by the inadequacies of this or that formal apparatus: there seems to be a difference between the rigid and non-rigid readings of general terms such as 'the color of the sky', but when we explain the truth conditions of sentences such as 'Ann's dress is the color of the sky' no difference in truth conditions seems to arise.

Several philosophers have defended the conception of general terms as designators from the trivialization objection and they have provided examples of sentences that, arguably, require distinguishing between rigid and non-rigid readings of general terms. Here we will discuss arguments by LaPorte (2000), Salmon (2003) and López de Sa (2006). As we will argue, the cases they consider fail to provide an adequate response to the objection, for they leave the core of the objection of trivialization intact.

Joseph LaPorte has argued that treating general terms as designators and defining rigidity as sameness of designation does not trivialize the notion of rigidity:

My preferred account does not trivialize rigidity. It is simply not the case that every kind designator rigidly designates its kind (…) 'soda pop', 'soda', and 'pop' all rigidly designate the soda pop kind; but 'the beverage my uncle requests at Super Bowl parties' only accidentally designates the kind (…) 'Soda = the beverage my uncle requests at Super Bowl parties' is true but not necessarily true, since the second designator is not rigid. (LaPorte 2000, pp. 296, 299)

López de Sa (2006), following a similar strategy, argues that if the identity sentence formed with the canonical nominalizations of general terms P and Q (i.e., a sentence of the form being P is being Q) is intuitively true but contingent, at least one of the general terms is non-rigid. So, if 'being blue is being the color of the sky' is intuitively true and contingent, at least one of the two terms must be interpreted non-rigidly, and he uses this to argue that 'the color of the sky' in a sentence such as 'Ann's dress is the color of the sky' is non-rigid or, at least, that it has a significant non-rigid interpretation.
4 It is important to stress that the problem of trivialization consists in the apparent inability of the view that general terms designate universals to distinguish rigid from non-rigid readings of general terms. It is sometimes said that the view that general terms designate universals trivializes the notion of rigidity because it classifies all general terms as rigid; but the problem, rather, is the inability to distinguish the two readings. Soames himself seems to present the problem in these two ways: “…there is no point in defining a notion of rigidity for predicates according to which all predicates turn out, trivially, to be rigid” (Soames 2002, p. 251) and “it is not even clear that the approach …can coherently distinguish between rigid and nonrigid predicates at all” (Soames 2002, p. 261).
The problem with LaPorte's and López de Sa's remarks is that they do not show that there is any truth-conditional difference between rigid and non-rigid interpretations of general terms when those have a clearly predicative role, i.e., when they are used to attribute something to an object. Practically no one denies that rigid and non-rigid readings of general terms can be distinguished when the sentences in which the terms occur are naturally interpreted as making claims about universals.5 As Soames has noted recently:

…there is a natural way of extending Kripke's distinction between rigid and non-rigid designators from singular to general terms—even though it does not extend to predicates constructed from those terms…. (Soames 2004, p. 95)

Thus, Soames is quite ready to accept that there is a difference between rigid and non-rigid readings of 'the color of the sky' in a sentence such as 'blue is the color of the sky' when the sentence is taken to say something about the universals designated by the general terms. On the non-rigid reading the sentence is contingently true, for 'the color of the sky' designates the universal *blue* in the actual world and other universals in other worlds. On the rigid reading, on the other hand, the sentence is simply false, for the universal designated by 'blue' is altogether different from the universal designated by the rigid reading of 'the color of the sky'.

However, when we focus on sentences such as 'Ann's dress is the color of the sky', where 'the color of the sky' is used as a predicate to attribute something to an object, the problem remains, as we saw when we discussed intuitively the truth conditions of that sentence. No difference in truth value and truth conditions surfaces between the rigid and the non-rigid readings of 'the color of the sky'. LaPorte's and López de Sa's arguments do not respond to the trivialization objection. What they show, at most, is that we can use non-rigid terms to talk about properties, substances, colors, species and whatnot, but they do not give us a clear case of a significant difference in a sentence where a general term such as 'the color of the sky' is used to predicate something of an object.

Salmon (2003) responds to the objection by pointing out that the following argument is valid:

My true love's eyes are the color of the sky.
The color of the sky is blue.
Therefore, my true love's eyes are blue.

Since the second premise is a true contingent identity sentence in which 'the color of the sky' is naturally interpreted as designating non-rigidly the color blue, Salmon argues that the same semantic value should be assigned to the general term in the first premise, and hence that the general term is not rigid. Now, Salmon's strategy seems, in principle, better suited than LaPorte's and López de Sa's. If Salmon is right, the occurrence of 'the color of the sky' in the first premise—a sentence where 'the color of the sky' is used to attribute a color to an object—must be interpreted non-rigidly. The intuitive validity of the argument seems to require that
5 Thanks to Benjamin Schnieder for comments on this issue.
the semantic values of the expressions that occur in premises and conclusion be kept constant; and clearly, the contingency of the second premise requires that ‘the color of the sky’ be interpreted non-rigidly. One could protest here that Salmon’s arguments still fall short of providing a convincing example of a sentence in which different interpretations of a general term used predicatively, as rigid and as non-rigid, result in different truth conditions. But we think that Salmon’s proposal has other problems. Even if ‘the color of the sky’ were interpreted rigidly throughout, the argument would still be valid: there is no index of evaluation at which the premises are both true and the conclusion is false. Suppose that ‘the color of the sky’ is interpreted as rigidly designating the universal that objects instantiate at a world when they are the same color as the sky. A world w in which the first premise is true is a world in which object e (designated by ‘my true love’s eyes’) instantiates that property. The only way the second premise can be true at w is for all the objects that instantiate the property of being the color of the sky to be also objects that instantiate the property of being blue. But that is a world where object e instantiates the property of being blue, as the conclusion demands. One might argue that if ‘the color of the sky’ is interpreted rigidly, what the second premise says is false in w, for the property of being the color of the sky cannot be identical to what ‘blue’ designates, since that property is, after all, not a color. Fair enough; but then the second premise is false at all worlds and the argument is still valid, even if vacuously so.

It is no surprise that Salmon’s defense of the significance of the distinction between rigid and non-rigid predicative uses of general terms faces a problem. For the problem is precisely that we do not seem to be able to detect any difference in truth conditions between examples of sentences that contain predicatively used general terms when those terms are interpreted rigidly and when they are interpreted non-rigidly. So the fact that we cannot detect a difference in the validity of arguments is to be expected.

It might be objected here that we are not respecting the spirit of Salmon’s point. Salmon argues that the natural interpretation of ‘the color of the sky’ in the second premise is a non-rigid interpretation, and that in order to explain the argument’s intuitive validity, we should interpret ‘the color of the sky’ as non-rigid in the first premise as well. The assumption is that in order for the argument to be valid, the terms involved must have the same semantic value. Although in our response so far we respected the assumption of consistency (interpreting ‘the color of the sky’ as rigid in both premises), we, arguably, failed to pay sufficient attention to the natural intuitive reading of premise two. The point, however, is that the assumption that ‘the color of the sky’ has to be interpreted consistently as non-rigid to preserve the intuitive validity of the argument is incorrect. There are arguments whose intuitive validity is not affected by such lack of consistency. Suppose that we have been discussing the property of reflecting colors, a property possessed by some surfaces; in particular, we have been discussing a property that the surfaces of large bodies of water have: the property of reflecting the color of the sky, or being the color of the sky. Consider now the following chain of reasoning:
Loch Ness is the color of the sky, that is, it has the property of being the color of the sky.
The color of the sky happens to be blue.
So, Loch Ness is blue.

The natural interpretation of ‘the color of the sky’ in the first sentence, given the conversational setting we are envisaging here, is clearly a rigid interpretation. The natural interpretation in the second sentence is, also clearly, a non-rigid one. Yet it would be hard to deny that we are entitled to draw the conclusion from these two premises. So, again, even if it is natural to interpret the occurrence of ‘the color of the sky’ as non-rigid in the second premise, we do not have compelling reasons, based on the intuitive validity of the argument, to interpret its occurrence in the first premise as non-rigid. This is of particular importance since it is only in the first premise that ‘the color of the sky’ has a predicative role.6 Soames’ challenge remains unanswered: nothing shows that the notion of rigidity can be extended to predicative uses of general terms.

2 The form of a solution

None of the responses to the trivialization objection we discussed succeeds in addressing the problem.7 Nevertheless, we think that the approach to rigidity defended here does not trivialize the notion of rigidity, for there are differences in truth conditions between rigid and non-rigid readings of general terms when those are used predicatively. In fact, the tools to solve the problem are to be found in Linsky (1984), and they were available well before the problem was posed. Bernard Linsky observed that a sentence such as ‘her eyes are the color of the sky’ has the same truth condition on both the rigid and the non-rigid reading of ‘the color of the sky’. However, “the way this condition is determined is different in the two cases (…). Only in the presence of certain world sensitive operators, such as actuality operators” will the difference show up (Linsky 2006, pp. 659–660).

Linsky, we think, is right, but in order to apply the 1984 tools to the solution of the trivialization problem, we first need to reflect on how to separate what Soames calls ‘general terms’ (in bold) from predicates within an approach that is committed to the basic claim that general terms designate universals. In order to do this, we need to distinguish cases in which a general term is used to talk about the universal it designates from cases in which the universal designated by a general term is attributed to particulars.

There are two different uses of general terms in natural language. There are sentences such as ‘This [pointing to a ring] is gold’ or ‘Tommy is a tiger’; these are cases in which, intuitively, the subject matter of the claim expressed is a particular and a universal is attributed to it. Being faithful to the view that treats general terms as designators of universals, we should say that the truth value of the claim expressed depends on a particular and on whether or not it exemplifies the universal designated by the

6 In Sect. 3.2 we discuss Salmon’s argument formally.
7 Other answers to the objection go approximately along the lines suggested by the authors we have considered here.
general term. In these cases, and according to the position being explored here, the general term designates a universal and the copula (or a similar syntactic mechanism) indicates that the universal is being attributed. The predicative role of a general term is not determined by the position it occupies in a natural language sentence (the predicate) but rather by the fact that the truth value at an index of evaluation depends on the things (or samples) that instantiate the universal designated. For instance, on the most natural reading of the sentence ‘tigers are cute’ the term ‘tiger’ plays a predicative role, because the truth value of the sentence at an index depends on whether the individuals that are exemplars of the species designated by ‘tiger’ are or aren’t cute.

There is, on the other hand, a different kind of sentence, one in which the subject matter of the claim expressed is the universal (species, substance, property) itself. When we say ‘gold is a precious metal’ or ‘the dodo is extinct’ we are talking about the substance, gold, not about its samples, and we are talking about the species, dodo, not about the sadly gone individual dodos.8 The general term designates the entity we are talking about. It has, hence, no attributive role, and the truth value of the claim expressed depends on which universal is designated, independently of which objects exemplify it.

These two types of sentences have different types of truth conditions: for sentences of the first type the truth value at a possible world is determined by the extension at that world of the universal designated at that world by the general term. For instance, the truth value of ‘Tommy is a tiger’ at a world w depends on whether the animal ‘Tommy’ designates belongs to the extension at w of the universal designated by ‘tiger’ at w. We will call this way of giving the truth conditions of a sentence involving a general term G exemplifier semantics (for G), since the truth value is determined by the instantiators of the universals involved. For sentences of the second type the truth value at a possible world is determined by the universal designated at the world by the general term. We will call this way of giving truth conditions kind semantics for the general term in question. For instance, the truth value of ‘gold is a precious metal’ at w depends on whether the substance designated by ‘gold’ at w is one of the universals in the extension at w of the property designated by ‘precious metal’ at w. In this sentence we give a kind semantics for ‘gold’ and an exemplifier semantics for ‘precious metal’.9,10

Traditionally, general terms have been formalized in possible worlds semantics as predicates, with sets of objects or samples as their extension. From that point of view,

8 It would be tempting to argue that the species’ being extinct is just the fact that all the things that were, at one time or another, its members are dead, so that ultimately the claim is about the things that were dodos. That, we think, would be a mistake. We shouldn’t confuse what a claim is about with the multiplicity of facts that contribute to its truth value. When we say ‘Barack Obama is the President of the US’ it surely is the case that the truth of the claim depends entirely on the number of people who voted for him. But the claim is about Mr. Obama and the property of being President, not about individual voters and their ballots.
9 Different occurrences of a general term in a sentence may have different types of semantics, as in ‘if this is water, water is not drinkable’.
10 The distinction between kind semantics and exemplifier semantics goes along the lines of what Soames (2002) presents as the second strategy to characterize rigidity for general terms. Soames thinks that the strategy ends up trivializing the notion of rigidity.
predicates designate those sets relative to evaluation indices. The truth value, at any world, of any sentence containing a general term then depends on which objects fall under the extension of the predicate at the world of evaluation.11 This is, first of all, not germane to the view that general terms are designators of universals and, second, it confuses the two types of sentences which, on our view, should be distinguished.

According to the view defended here, general terms always designate universals. The universal designated can function either attributively or as subject matter. What role the universal plays in a given claim depends on whether the truth value at an index of the claim in question is determined by the objects or samples that exemplify the universal or by the universal itself.12

In possible worlds semantics universals are typically represented as intensions, functions from indices of evaluation to sets, intuitively the sets of things that exemplify the universal at the index in question. Since on this approach general terms designate universals, formally general terms will denote intensions relative to indices. A rigid general term such as ‘blue’ will designate the same intension, the same function from worlds to sets, in every possible world. A non-rigid term such as ‘the color of the sky’ will designate different functions. On the other hand, ‘the color of the sky’, interpreted as rigidly designating the property of being the color of the sky, will rigidly designate a function whose value at each index is, intuitively, the set of things that are the same color as the sky at that index. Obviously, although the universals designated by the rigid and the non-rigid reading of ‘the color of the sky’ at a given index w are different, the values at w of the functions designated at w by the non-rigid interpretation of ‘the color of the sky’ and by the rigid interpretation will coincide: the things under the extension of whatever universal (color) is designated by the non-rigid interpretation of ‘the color of the sky’ at w should also be the things that exemplify at w the property of being the color of the sky (designated by the rigid interpretation of ‘the color of the sky’); after all, if the sky is red in w, the set of red things in w will exemplify the property of being the color of the sky in that world.

The statement of truth conditions then becomes a bit more complicated than in traditional intensional semantics: for instance, a sentence such as ‘Ann’s dress is blue’ is no longer true at a given index w just in case Ann’s dress is in the extension of ‘blue’ at w, for the extension of a general term at an index is not a set of objects. The truth of ‘Ann’s dress is blue’ at a given index depends on whether Ann’s dress is in the set of things that constitute the value at w of the function designated by ‘blue’

11 For instance, Soames assumes that if a general term designates a universal, then its only interpretation is the interpretation that would make a sentence containing the term a sentence about the universal, and that, interpreted as a predicate, a general term, if it designates at all, designates the set of objects or samples that constitutes its extension. This is obvious in the fragment from Soames (2004) mentioned before. But it is also in Soames (2002), Chap. 9, especially pp. 248–250 and 259–262, as well as Chap. 11, especially pp. 306–311.
12 It can be argued that all we are doing is providing a relational reading of sentences such as ‘Tommy is a tiger’ and hence that we are not really using the general term as a predicate. But we think that the essential difference between a term that is a predicate and one that is not is precisely what is captured by our distinction between exemplifier semantics and kind semantics.
at w. Since ‘blue’ is rigid, that function is constant. General terms such as ‘the color of the sky’ will be treated as ambiguous, i.e., as having both a rigid and a non-rigid reading.13

The truth, at w, of a sentence such as ‘this (pointing to my ring) is gold’ will depend on whether the ring in question is a member of the value of the function *gold* at w. On the other hand, the truth, at w, of a sentence such as ‘gold is the substance with atomic number 79’ will depend on whether ‘gold’ and ‘the substance with atomic number 79’ designate the same universal, the same intension, at w. The semantics for ‘gold’ in the first sentence is exemplifier; in the latter it is kind semantics.14

This approach breaches the standard intensional semantics assumption that a general term G designates a set of objects relative to an index. But it violates that assumption in order to capture the idea that G designates a universal, an abstract entity, and it does so using the kinds of representational tools that possible worlds semantics traditionally uses.

We are now in a position to state what a solution to the challenge posed by the trivialization objection consists in: we need to show that there are sentences containing general terms whose truth conditions differ depending on whether the terms are interpreted rigidly or non-rigidly, when those terms have an exemplifier semantics. Following Linsky’s clue, we will see that such sentences can be found.

3 The case against the objection

3.1 First example

Consider the sentence:

(1) It might have been the case that Loch Ness was not the actual color of the sky.

Suppose ‘the color of the sky’ has a non-rigid reading in sentence (1). Then ‘the actual color of the sky’ rigidly designates the universal that is the color of the sky in the actual world, that is, it rigidly designates the color *blue*. If the sky were red, the surface of Loch Ness would appear red and not blue. So, with this interpretation, (1) is intuitively true.

Suppose instead that ‘the color of the sky’ rigidly designates the property that things have when they are the color of the sky (i.e., it designates the property of being the same color as the sky). Now, on the rigid interpretation of ‘the color of the sky’ both ‘the actual color of the sky’ and ‘the color of the sky’ designate, rigidly, the same

13 Notice that in our explanation of the truth conditions of these sentences we use the index of evaluation twice: first to determine which property is designated by the general term, and then to determine which extension is relevant for the determination of truth value. It might be argued that we would achieve the same results with a system of double indexing, where the interpretation of a general term P is a binary function that, for any indices u and v, gives the extension at v of the property designated by P at u. We think our choice is preferable, because it highlights that general terms are designators of universals.
14 See also Martí and Martínez-Fernández (2007) for an initial discussion of these issues. For a discussion of how this bears on the explanation of the necessity of true theoretical identifications, see Martí and Martínez-Fernández (2010).
property, for the same reason that, in the singular case, a rigid description such as ‘the successor of eight’ designates in every world the same number as ‘the actual successor of eight’. What, then, is the intuitive truth value of (1)? Since lakes owe their color to the reflection of the color of the sky, it is clear that a lake cannot fail to be the color of the sky; it cannot fail to be one of the things that have the property of being the same color as the sky. And since on the rigid reading ‘the color of the sky’ and ‘the actual color of the sky’ designate the same property, no lake, including Loch Ness, can fail to be the actual color of the sky, and (1) is false.

We see that there is a difference between a rigid and a non-rigid reading of ‘the color of the sky’ even in sentences in which the general term is used predicatively, i.e., when the truth conditions of the sentence are given by appeal to the individuals that exemplify the property designated by the general term.

We will construct a simple model to clarify how the evaluation of the sentence should proceed in both cases. Suppose there are just two worlds, @ and w, the domain of individuals is {0, 1, 2, 3}, ‘Loch Ness’ names 0 and ‘the sky’ names 1. We suppose, to simplify, that all individuals exist at all worlds. The non-rigid interpretation of ‘the color of the sky’, which we will represent by TCSnr, designates different colors in different possible worlds:

TCSnr: @ → *blue*
       w → *red*

with *blue* and *red* defined as:

*blue*: @ → {0, 1, 2}
        w → {2, 3}

*red*: @ → {3}
       w → {0, 1}

Since ‘the actual color of the sky’ rigidly denotes the property that the non-rigid interpretation of ‘the color of the sky’ designates at the actual world, the right intension for that expression (represented as @TCSnr) is:15

@TCSnr: @ → *blue*
        w → *blue*

Let us now evaluate sentence (1) in this model according to the non-rigid reading of ‘the color of the sky’. (1) says that it is possible for Loch Ness not to be the actual color of the sky, that is, that there is a world u in which the object designated by ‘Loch Ness’ does not exemplify the property designated by ‘the actual color of the sky’ at u. This is true in our model, since 0 ∉ {2, 3} = *blue*(w), where *blue* = (@TCSnr)(w). 0 is red in w, not blue, so it is not the actual color of the sky at w. Sentence (1) is true, as expected.

Let us now consider sentence (1) under the rigid interpretation of the definite description. On its rigid reading, ‘the color of the sky’ designates the property that things exemplify at a given world when they are the same color as the sky in that world. Let us call that property *TCS*. The property *TCS* applies in the actual world to things that are blue (i.e., the color of the sky at @) at @, and it applies in w to the things that are red (i.e., the color of the sky at w) at w. So it is represented by this function:

*TCS*: @ → {0, 1, 2}
       w → {0, 1}

The intension corresponding to the rigid interpretation of ‘the color of the sky’ is thus

TCSr: @ → *TCS*
      w → *TCS*

Since this intension is already rigid, it coincides with the intension corresponding to ‘the actual color of the sky’ in its rigid interpretation (represented by @TCSr):

@TCSr: @ → *TCS*
       w → *TCS*

With this rigid reading, (1) is true just in case there is a world u in which the object denoted by ‘Loch Ness’ does not exemplify the property that ‘the actual color of the sky’ on its rigid reading designates at u, that is, in which 0 does not exemplify the property *TCS*. But we can see that 0 ∈ *TCS*(@) and that 0 ∈ *TCS*(w), so sentence (1) is false.

The two readings really diverge in their truth conditions. On the non-rigid interpretation of ‘the color of the sky’ (where ‘the actual color of the sky’ rigidly designates *blue*) the fact that Loch Ness reflects a red sky in w makes the embedded predication ‘Loch Ness is the actual color of the sky’ false at w, and so makes (1) true. But that very same fact, i.e., that Loch Ness has the property of reflecting the color of the sky, a property rigidly designated by the rigid interpretation of ‘the color of the sky’, makes the predication so interpreted true at w, and so makes (1) false. In other words, Loch Ness’ being red in w makes the predication false on one reading, true on the other.16

15 The intension of a general term is a higher-level function that assigns universals (i.e. lower-level intensions) to the different indices of evaluation.
16 Notice that sentences similar to this one are not so rare. For example, if Tom is a chameleon which is now looking reddish on a red background, the sentence ‘It might have been the case that Tom was not the actual color of the background’ is true if ‘the actual color of the background’ rigidly designates what the non-rigid reading of ‘the color of the background’ designates in the actual world, namely, the property *red*, because Tom could have looked brown instead of red, in a different circumstance. But, on the other hand, supposing folk science to be correct, chameleons change their color according to the background. Hence Tom has to be the actual color of the background—in the rigid sense of ‘the color of the background’—because Tom always exemplifies the property of being the same in color as the background. Again we find a difference in the truth value of the sentence depending on whether general terms are interpreted rigidly or non-rigidly.
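Since the model is finite, the two evaluations can be checked mechanically. The following Python sketch is ours, not part of the paper: a minimal encoding of the two-world model just displayed, with hypothetical names such as TCS_nr for the intensions, running the exemplifier truth conditions for (1) under both readings.

# Two worlds; the domain is {0, 1, 2, 3}; 'Loch Ness' names 0, 'the sky' names 1.
WORLDS = ['@', 'w']

# Universals (properties) are intensions: maps from worlds to extensions.
blue = {'@': {0, 1, 2}, 'w': {2, 3}}
red  = {'@': {3},       'w': {0, 1}}
tcs  = {'@': {0, 1, 2}, 'w': {0, 1}}   # *TCS*: being the same color as the sky

# General terms designate universals relative to a world (higher-level intensions).
TCS_nr  = {'@': blue, 'w': red}    # non-rigid 'the color of the sky'
aTCS_nr = {'@': blue, 'w': blue}   # 'the actual color of the sky' (non-rigid base)
TCS_r   = {'@': tcs,  'w': tcs}    # rigid 'the color of the sky'
aTCS_r  = TCS_r                    # actualizing a rigid term changes nothing

def sentence_1(actualized_term, loch_ness=0):
    """(1): at some world u, Loch Ness fails to exemplify the property
    the actualized term designates at u (exemplifier semantics)."""
    return any(loch_ness not in actualized_term[u][u] for u in WORLDS)

print(sentence_1(aTCS_nr))  # True: 0 is not in *blue*(w) = {2, 3}
print(sentence_1(aTCS_r))   # False: 0 is in *TCS*(@) and in *TCS*(w)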
We will follow Linsky and use second-order modal logic to formalize our examples. In this logic all simple terms are interpreted rigidly (singular terms are interpreted by an element of the domain, first-order predicates are assigned a property17—i.e., a function from possible worlds to extensions—and second-order predicates are assigned a second-order property—a function from worlds to sets of first-order properties).18 In general, from a purely formal point of view, nothing forces this decision. We could have rigid and non-rigid simple general, and even singular, terms if we so wish. But it seems to us that there is no semantic reason whatsoever to do so. When it comes to semantic function, we think that terms such as ‘water’ or ‘tiger’ (terms for natural kinds) behave in exactly the same way as ‘bachelor’, ‘philosopher’, ‘computer’ or ‘pencil’, pretty much for the same reason that we think that all proper names are rigid, independently of whether they name a person or a robot.19

This logic codifies the distinction between exemplifier semantics and kind semantics by the different position of the general term in a sentence. Consider a general term P. If P is used as a predicate that is being attributed to an object, the form of the sentence will be Pt, and the sentence will be true when the object denoted by t is in the extension of the property denoted by P at the index of evaluation. When P has a kind semantics, some predicates are applied to it, so it appears in a context of the form R(P), where R is a second-order predicate. In this context the sentence will be true when the property denoted by P (and not just the extension of that property at the index of evaluation) belongs to the extension of the term R at the index of evaluation. For the case of the identity predicate, we will accordingly evaluate a sentence P = Q as true exactly when both terms denote the same property.

Non-rigid readings of complex general terms will be represented by second-order definite descriptions. An expression of the form ‘the F such that ϕ’, when evaluated at different indices, will (in general) denote different properties, as intuitively required.20 ‘The color of the sky’ will then be represented as the Russellian description (ιF)C(s, F), where C(x, Y) is a second-order predicate that says that object x has color Y and s stands for the sky. The term ‘the actual color of the sky’ will then correspond to (ιF)AC(s, F), where A is the operator of actuality.21 Then the non-rigid reading of sentence (1) is given by the sentence

17 Following traditional terminology, we do not distinguish here between properties, species, substances, colors, etc.
18 We will use a standard semantics. See Gallin (1975, Sect. 9) for the general semantics of higher-order modal logic. Here we concentrate on the simplest case.
19 So, our reply to the so-called overgeneralization problem is that it is not a problem at all. For discussion see Martí and Martínez-Fernández (2010).
20 The same observations can be made for the modal languages of higher orders: all simple general terms, of any order, have a rigid interpretation, and the non-rigid interpretation is obtained using definite descriptions. The contexts in which a general term has an exemplifier or a kind semantics are the same, mutatis mutandis, as those in the second-order case.
21 We use the quantificational notation for descriptions to disambiguate scope: in the formula [(ιF)ϕ]ψ, the expression ‘[(ιF)ϕ]’ acts as a quantifier that binds any free occurrence of the variable F in ψ. The definition of the operator is [(ιF)ϕ]ψ =def ∃F(∀G(ϕ(G) ↔ G = F) ∧ ψ).
(2) [(ιF)AC(s, F)]♦¬F(l)
where l is the name of Loch Ness. It is easy to see that this sentence gives the right truth value in the previous model where 0 is Loch Ness and 1 is the sky.

The rigid reading of (1) requires that we first give a formal representation of the rigid reading of ‘the color of the sky’. Linsky has already shown how to do that.22 He defines the operator (βFx), which applies to a formula ϕ(x) and gives as a result a general term that designates, at each index of evaluation, the same property: the property that is exemplified at each index by the individuals that satisfy the formula ϕ(x) at that index. We may say that (βFx) rigidifies the formula ϕ(x). The definition is (βFx)ϕ(x) =def (ιF)(∀x)(ϕ(x) ↔ Fx). The rigid reading of ‘the color of the sky’ then corresponds to (βFx)[(ιG)C(s, G)]Gx.23 However, the result of applying ‘actual’ to the rigid reading of ‘the color of the sky’ cannot be represented as (βFx)[(ιG)AC(s, G)]Gx, since that is defined as the rigidification of the formula ‘x is the actual color of the sky’, and that would amount to ‘x is blue’, giving as a result a general term that denotes *blue* at every index of evaluation. We want to apply the actuality operator to a general term that designates *TCS* with respect to every index. This reading is captured by the expression (ιF)A(∀x)([(ιG)C(s, G)]Gx ↔ Fx), which we will abbreviate as (AβFx)[(ιG)C(s, G)]Gx.24 We may finally express the rigid reading of (1) as
(3) [(AβFx)[(ιG)C(s, G)]Gx]♦¬F(l).
An advantage of using this formal language is that it provides different precise formal translations, (2) and (3), for the different meanings that can be associated with the rigid and non-rigid readings of the general term in (1). The natural language sentence (1) is ambiguous between those two interpretations. Which of those readings is more natural, and how speakers select the intended one, is a pragmatic matter that depends on the context of utterance.

22 See Definition Schema II in Linsky (1984, p. 273). In Linsky’s notation the term that rigidifies the formula ϕ(x) is (βx)ϕ(x). We have changed the notation to (βFx)ϕ(x) to specify which second-order variable will be bound by the quantifier expression (i.e., when we write [(βFx)ϕ(x)]ψ, every free occurrence of F in ψ will refer to the rigidification of the formula ϕ, just as in [(ιF)ϕ]ψ every free occurrence of F in ψ refers to the unique property F such that ϕ).
23 This expression is unpacked as (*) (ιF)(∀x)(([(ιG)C(s, G)]Gx) ↔ Fx). It may seem that this is a mistake, since the following appears to be the natural way to formalize the rigid reading of ‘the color of the sky’: (**) (ιF)(∀x)([(ιG)C(s, G)](Gx ↔ Fx)). In a model such that at every index there is a unique G satisfying C(s, G), both expressions designate at a given index of evaluation the same property F, which is the rigidification of Gx; but in a model in which there are some indices i where none or more than one property satisfies C(s, G), the expression (*) designates with respect to an index of evaluation a property that has an empty extension at the i indices, while the expression (**) does not designate a property at all at any index. Hence, in this case any sentence of the form [(βFx)[(ιG)C(s, G)]Gx]ψ will be false if interpreted with (**). We think that (*) is more natural: just as ‘the teacher of Alexander’ represents an individual in the actual world even though it could fail to designate in another world (either because Alexander had no tutor or because he had more than one), we want to say that ‘the color of the sky’ designates a property in the actual world even if there are worlds where the sky has more than one color or none at all.
24 Notice that this formula is the instantiation of the general form (AβFx)ϕ(x) =def (ιF)A(∀x)(ϕ(x) ↔ Fx) when applied to the formula ϕ(x) =def [(ιG)C(s, G)]Gx.
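Model-theoretically, the β operator has a simple functional content. The sketch below is our illustration (the helper name rigidify is hypothetical, and the data come from the toy two-world model used above): it collects the satisfiers of ϕ(x) at each index into a single property and then designates that one property everywhere.

# A condition is a map from worlds to the set of individuals satisfying it there.
def rigidify(condition, worlds):
    """(beta F x)phi(x): at every world, designate the single property whose
    extension at each index is the set of satisfiers of phi at that index."""
    prop = {u: set(condition[u]) for u in worlds}
    return {u: prop for u in worlds}

# 'x is the color of the sky': at @ the sky is blue (extension {0, 1, 2}),
# at w it is red (extension {0, 1}); the rigidification is *TCS* everywhere.
cond = {'@': {0, 1, 2}, 'w': {0, 1}}
TCS_r = rigidify(cond, ['@', 'w'])
assert TCS_r['@'] == TCS_r['w'] == {'@': {0, 1, 2}, 'w': {0, 1}}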
3.2 An interlude: Salmon’s argument

This formal language also allows us to see more clearly what our criticism of Salmon’s proposal consists in. The arguments we discussed in Sect. 2 were instances of the following general schema:

a exemplifies the property F such that ϕ(F)
The F such that ϕ(F) is B
Therefore, a is B.

We argue that this general form of argument is valid independently of whether the occurrences of ‘the F such that ϕ(F)’ have a rigid or a non-rigid interpretation. The interesting case is the one in which the general term is interpreted rigidly in the first premise and non-rigidly in the second:

[(βHx)[(ιF)ϕ]Fx]Ha
[(ιF)ϕ](F = B)
∴ Ba
Consider a model M and an arbitrary index u. Let us call I(B) the property that the model assigns to the constant B. If the second premise is true, there is a unique property, call it f, that satisfies the formula ϕ, and that property coincides with I(B). In particular, the extension of I(B) at u coincides with the extension of f at u. The truth of the first premise implies that I(a) (the interpretation of the constant a in M) belongs to the extension of the rigidification of the property f at u. By the definition of the β operator, the extension of f at u coincides with the extension of the rigidification of f at u. It follows that I(a) belongs to the extension of I(B) at u, and the conclusion is true at u with respect to M.

3.3 Second example

Our second example is a bit trickier to analyze, since the difference in truth value between the rigid and the non-rigid interpretation will depend on how the actuality operator is interpreted. Consider the sentence ‘the color of the sky could have been different from what it actually is’. The truth value of this sentence depends on whether the sentence

(4) The color of the sky is different from the actual color of the sky

is, although actually false, true at some worlds or false at all worlds.

In (4), intuitively, we seem to be talking about colors as universals. Hence the more natural way to give the truth conditions of (4) is by using kind semantics: (4) is true at a world u if the property designated by ‘the color of the sky’ at u is different from the property designated by ‘the actual color of the sky’ at u. The examples discussed by Joseph LaPorte already show that in cases like the present one there is a difference in truth value between the rigid and non-rigid readings of the sentence. Let us use the model defined for sentence (1) above, and let us suppose we give a non-rigid
reading of (4). Then (4) is true at a world u if TCSnr(u) is different from @TCSnr(u). We check that

TCSnr(@) = *blue* = @TCSnr(@),
TCSnr(w) = *red* ≠ *blue* = @TCSnr(w),

so (4) is false at @ and true at w. On the rigid reading, (4) is true at u if, and only if, TCSr(u) is different from @TCSr(u). Now we have

TCSr(@) = *TCS* = @TCSr(@),
TCSr(w) = *TCS* = @TCSr(w),

and (4) is false at both worlds. This shows the difference in truth conditions. The difference is to be expected, for on the rigid reading the sentence could only be true if the property of being the color of the sky could fail to be the property it actually is.

From the exemplifier semantics perspective, what sentence (4) says is that the things that are the color of the sky are different from the things that are the actual color of the sky. We believe that two different readings can be given of the expression ‘the actual color of the sky’. On one reading, let us call it the global reading of the actuality operator, (4) says that the things that are the color of the sky are different from the actual things that are the actual color of the sky. In more formal terms, (4) is true at a world u if the extension at u of the property designated by ‘the color of the sky’ at u is different from the extension at the actual world25 of the property designated by ‘the color of the sky’ at the actual world. On the other reading, which we will call the local reading, the actuality operator contributes only to determining the property we are talking about: (4) is true at a world u if the extension at u of the property designated by ‘the color of the sky’ at u is different from the extension at u of the property designated by ‘the color of the sky’ at the actual world. In the latter case, the truth value of the sentence will depend on whether things in u exemplify in u the property actually designated by ‘the color of the sky’.

Let us evaluate sentence (4) in the previous model. First, let us consider the global reading. On the non-rigid reading, (4) is true at u if TCSnr(u)(u) ≠ @TCSnr(@)(@). The sentence is of course false at @, but true at w, since

TCSnr(w)(w) = *red*(w) = {0, 1} ≠ {0, 1, 2} = *blue*(@) = @TCSnr(@)(@).

On the rigid reading, (4) is true at u if TCSr(u)(u) ≠ @TCSr(@)(@). We find again that (4) is false at @ and true at w, since

TCSr(w)(w) = *TCS*(w) = {0, 1} ≠ {0, 1, 2} = *TCS*(@) = @TCSr(@)(@).

25 Strictly speaking, we should consider those actual things that exist at world u but, again, we assume that all individuals exist at all worlds, to simplify the exposition.
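The kind-semantics computation can again be run mechanically. The following lines are our sketch, with the same hypothetical Python encoding of the toy model repeated so the fragment runs on its own; (4) is checked by comparing the properties designated at each world.

# Toy model, as before: properties are maps from worlds to extensions.
WORLDS = ['@', 'w']
blue = {'@': {0, 1, 2}, 'w': {2, 3}}
red  = {'@': {3},       'w': {0, 1}}
tcs  = {'@': {0, 1, 2}, 'w': {0, 1}}
TCS_nr, aTCS_nr = {'@': blue, 'w': red}, {'@': blue, 'w': blue}
TCS_r = aTCS_r = {'@': tcs, 'w': tcs}

def sentence_4_kind(term, actualized):
    """(4), kind semantics: true at u iff the universals designated at u differ."""
    return {u: term[u] != actualized[u] for u in WORLDS}

print(sentence_4_kind(TCS_nr, aTCS_nr))  # {'@': False, 'w': True}
print(sentence_4_kind(TCS_r, aTCS_r))    # {'@': False, 'w': False}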
Sentence (4), under the global interpretation, does not provide a counterexample to the trivialization problem, since the truth value is identical whether we consider a rigid or a non-rigid reading of the general term.

Let us move on to the local non-rigid reading. Now (4) is true at u if TCSnr(u)(u) ≠ @TCSnr(@)(u). At @ the sentence is obviously false, and at w it is true:

TCSnr(w)(w) = *red*(w) = {0, 1} ≠ {2, 3} = *blue*(w) = @TCSnr(@)(w).

The local rigid reading makes the sentence false at both worlds, because

TCSr(w)(w) = *TCS*(w) = {0, 1} = *TCS*(w) = @TCSr(@)(w).

With the local reading, (4) gives another example of a sentence whose truth value differs depending on whether the general terms appearing in it are rigid or non-rigid.26

The formalization of (4) in terms of kind semantics can be given using methods similar to those used for sentence (1). The non-rigid reading is given by the sentence

(5) [(ιF)C(s, F)][(ιG)AC(s, G)] F ≠ G

and the rigid reading by the sentence

(6) [(βFx)[(ιH)C(s, H)]Hx][(AβGx)[(ιJ)C(s, J)]Jx] F ≠ G

Let us focus now on the exemplifier semantics interpretation. If we use the local reading of the actuality operator, then the rigid and non-rigid versions of (4) are expressed by sentences similar to (5) and (6), replacing the condition that properties F and G are different by the condition that they are not coextensional. Thus the non-rigid reading is

(7) [(ιF)C(s, F)][(ιG)AC(s, G)] ∼∀x(Fx ↔ Gx)

and the rigid reading is given by

(8) [(βFx)[(ιH)C(s, H)]Hx][(AβGx)[(ιJ)C(s, J)]Jx] ∼∀x(Fx ↔ Gx).

When we think about the truth conditions of sentence (4) with the global reading of the actuality operator, we realize that what we have is in fact two occurrences of the actuality operator, with one of them anchoring the property designated by ‘the actual color of the sky’ to the actual world. So the non-rigid reading is

(9) [(ιF)C(s, F)][(ιG)AC(s, G)] ∼∀x(Fx ↔ AGx)

and the rigid reading is given by

(10) [(βFx)[(ιH)C(s, H)]Hx][(AβGx)[(ιJ)C(s, J)]Jx] ∼∀x(Fx ↔ AGx).

In this section we have presented sentences whose truth conditions are given in terms of the things exemplifying the properties that the general terms designate. There is a difference in the truth conditions of those sentences depending on whether the general terms occurring in them are given a rigid or a non-rigid interpretation. We think that this shows that there is no trivialization problem.

26 Obviously, the plausibility of sentence (4) as a counterexample to the trivialization problem depends on the plausibility of the local reading of the actuality operator that occurs in (4), but we think it is arguably a reasonable reading.
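For the exemplifier readings, the same toy model separates the global and local interpretations of the actuality operator. The sketch below is ours (the same hypothetical encoding, repeated so it is self-contained); it computes (4) on all four combinations of reading and rigidity.

# Toy model once more: higher-level intensions as nested world-indexed maps.
WORLDS = ['@', 'w']
blue = {'@': {0, 1, 2}, 'w': {2, 3}}
red  = {'@': {3},       'w': {0, 1}}
tcs  = {'@': {0, 1, 2}, 'w': {0, 1}}
TCS_nr, aTCS_nr = {'@': blue, 'w': red}, {'@': blue, 'w': blue}
TCS_r = aTCS_r = {'@': tcs, 'w': tcs}

def global_4(term, actualized):
    """Global reading: compare the extension at u with the extension at @
    of the property designated at @ (both slots anchored to the actual world)."""
    return {u: term[u][u] != actualized['@']['@'] for u in WORLDS}

def local_4(term, actualized):
    """Local reading: only the property is anchored to @;
    both extensions are taken at u."""
    return {u: term[u][u] != actualized['@'][u] for u in WORLDS}

print(global_4(TCS_nr, aTCS_nr))  # {'@': False, 'w': True}
print(global_4(TCS_r, aTCS_r))    # {'@': False, 'w': True}   same: no counterexample
print(local_4(TCS_nr, aTCS_nr))   # {'@': False, 'w': True}
print(local_4(TCS_r, aTCS_r))     # {'@': False, 'w': False}  the readings diverge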
Acknowledgments We are very grateful to Bernard Linsky for helpful comments and discussion. We also thank the audience of the 37th Annual Meeting of the Society for Exact Philosophy, held in Alberta, Canada, in May 2009. The research for this paper has been partly funded by the Spanish MICINN, under grants 2008-FFI04263 and Consolider-Ingenio 2010 (CSD2009-0056), and by the European Commission’s Seventh Framework Programme FP7/2007-2013 under grant agreement FP7-238128.
References

Gallin, D. (1975). Intensional and higher-order modal logic. North-Holland.
LaPorte, J. (2000). Rigidity and kind. Philosophical Studies, 97, 293–316.
Linsky, B. (1984). General terms as designators. Pacific Philosophical Quarterly, 65, 259–276.
Linsky, B. (2006). General terms as rigid designators. Philosophical Studies, 128, 655–667.
López de Sa, D. (2006). Flexible property designators. Grazer Philosophische Studien, 73, 221–230.
Martí, G., & Martínez-Fernández, J. (2007). General terms and non-trivial rigid designation. In C. Martínez (Ed.), Current topics in logic and analytic philosophy (pp. 103–116). Santiago de Compostela: Universidad de Santiago de Compostela.
Martí, G., & Martínez-Fernández, J. (2010). General terms as designators. A defense of the view. In H. Beebee & N. Sabbarton-Leary (Eds.), The semantics and metaphysics of natural kinds (pp. 46–63). New York: Routledge.
Salmon, N. (1982). Reference and essence. Oxford: Basil Blackwell.
Salmon, N. (2003). Naming, necessity and beyond. Mind, 112, 475–492.
Schwartz, S. (2002). Kinds, general terms and rigidity: A reply to LaPorte. Philosophical Studies, 109, 265–277.
Soames, S. (2002). Beyond rigidity. Oxford: Oxford University Press.
Soames, S. (2004). Replies to Gómez Torrente and Ezcurdia. Crítica, 36, 83–114.
Synthese (2011) 181:295–315 DOI 10.1007/s11229-010-9803-6
Worlds and times: NS and the master argument
Peter K. Schotch · Gillman Payette
Received: 3 December 2009 / Accepted: 26 July 2010 / Published online: 15 September 2010 © Springer Science+Business Media B.V. 2010
Abstract In the fourteenth century, Duns Scotus suggested that the proper analysis of modality required not just moments of time but also “moments of nature”. In making this suggestion, he broke with an influential view first presented by Diodorus in the early Hellenistic period, and might even be said to have been the inventor of “possible worlds”. In this essay we take Scotus’ suggestion seriously, devising first a double-index logic and then introducing the temporal order. Finally, using the temporal order, we define a modal order. This allows us to present modal logic without the usual interpretive questions arising concerning the relation called variously ‘accessibility’, ‘alternativeness’, and ‘relative possibility’. The system in which this analysis is done is one of those which have come to be called hybrid logics.

Keywords Hybrid logic · Modal logic · Master argument of Diodorus
Nihil invita Minerva
P. K. Schotch
Department of Philosophy, Dalhousie University, Halifax, NS, Canada
e-mail: [email protected]

G. Payette (B)
Department of Philosophy, University of Calgary, Calgary, AB, Canada
e-mail: [email protected]
1 Introduction

From the time of the earliest philosophers, the notions of necessity and possibility have been key ideas.1 But having introduced these modal ideas, how shall we adjudicate their use in philosophical arguments? Concern with this issue not only engaged the great minds of antiquity, but remains to puzzle us today.

During the Hellenistic period Diodorus, of the Megarian school, proposed his famous ‘ruling’ or ‘master’ argument, which seemed to have as a consequence that the proper account of the modals is a temporal one. More particularly, Diodorus seems to show that in order to reject the position that possibility is merely current or future truth, one must also reject some other principle of compelling intuitive force. This argument was brought to the attention of the twentieth-century logical community by Arthur Prior—see especially Prior (1967).

Although the temporal account of possibility seems to have held sway for a long time (and still sways at least some philosophers and logicians), the medieval account of the modals showed that it was not entirely adequate. The main innovation occurs in the debate between Ockham and Scotus on predestination. During that debate Scotus insists that in order to properly frame the concept of the absolute properties of God, one must have recourse not only to moments or instants of time, but also to moments or instants of nature. For Scotus it seems that x may be naturally, but not temporally, prior to y, when y occurs at an instant of nature subsequent to that at which x occurs, even though both x and y occur at exactly the same instant of time.2 Ockham was not very receptive to this doctrine, but a discerning eye might see here the beginnings of all that later came under the heading of ‘possible worlds semantics’. Or better, a semantics for the modals that deploys worlds as well as times. It is tempting to call such an account Scotian.

Predestination and the absolute properties of God might seem a bit arcane to contemporary readers, but we can construct something very close to the Scotus motivation in more familiar terms. Suppose we say, as some historians have, that in 1940 it was possible that Germany win the second world war (so called) but that by 1943 it was no longer possible. On the Diodorean analysis of possibility, that assertion, and all which like it suggest that possibilities might evaporate over time, must be false. Since Germany did not, in 1940 nor at any subsequent date, actually win the second world war, it was not possible in 1940, nor at any other time.

In this paper we develop a hybrid logic to look at historical necessity and the issues raised by the master argument. In Sect. 2 we present the semantics and some proof theory for the logic, which we have named NS.3 In Sect. 3 we consider how these models match intuitions about time and look at frame conditions for models of NS. We then bring this lore to a discussion of the master argument.

1 Consider, for example, that the notion of necessity occurs in the single fragment that we have from the work of Anaximander (McKirahan 1994, p. 43).
2 See especially Wolter (1962, pp. 52–61).
3 The reason for this name is that we think of the system as Neo-Scotian or perhaps even Nova Scotian.
2 The logic NS

2.1 The language

The language of NS is constructed from the following primitive elements:

– A (countable) set of sentence letters, Let = {A, B, C, A1, …}
– A (countable) set of time nominals, T = {t0, t1, …} (the nominal t0 is used to indicate the first time, if need be)
– A (countable) set of world nominals, W = {w, w1, …}

Members of I, known as linguistic indices, are the pairs drawn from the set W × T, while individual members of the latter sets will be known as coordinates. For every coordinate c, the nominal c represents a kind of world-proposition in the object language. We use N for the set of nominals, and since they are atomic, we may define the set of atomic formulas At as Let ∪ N.4

– The usual classical sentence operators {∧, ∨, ¬, ⊃, ≡}
– A function @ defined over indices such that for every ⟨w, t⟩, @(⟨w, t⟩), which will be indicated by @wt, is a unary sentence operator.
– The temporal operators F (informally, ‘It will be the case that …’) and G (informally, ‘It will always be the case that …’), as well as P (‘It has been the case that …’) and H (‘It has always been the case that …’).

We define the notion of formula in the usual way, using α, β, … with and without subscripts as variables over formulas, the set of which will be denoted F.

2.2 Semantics

The semantics of NS is of the multidimensional sort. But things go differently from the way they do in the usual temporal logic, and even in unusual ones like Thomason’s.5 One difference is that the ‘coordinates’ we use are not uniform in the type of their components. To begin with sets, we recognize the times already mentioned in connection with the syntax of NS, i.e.

4 In doing this we are abandoning the Wittgensteinian criterion, which requires that there be no logical relations between atomic sentences, in favor of the criterion on which a formula is atomic if and only if it has no proper part which is also a formula.
5 We have in mind his work on historical necessity. Hans Kamp has also worked on this subject. The semantics of the two approaches are similar, but they don’t use hybrid logics. The semantics used in this area, by everyone involved, are very close to those for the stit logics of Belnap and others. The histories that are commonly used are what we call worlds. The reader should note that we are after necessities, but not merely historical necessity.
– The set T of times. We assume that this set has an order < on it.

Next there is a new set:

– The set S of states, informally understood as possible states of some world.

We shall, in a sense to be made clear below, construct our worlds as sequences of states.

2.2.1 The differences begin

The semantic indices, which is to say the objects with respect to which formulas take on truth values, are drawn from the set S × T; alternatively, the propositions are subsets of that product. However, not every such subset is guaranteed to be a proposition. Every world is a (temporal) sequence of states, but not every such sequence need be a world. We require that every member of T be a temporal component of some world, although not every member of S need be a state of every world. Indeed, we allow the possibility that there are ‘virtual’ states which are not states of any world. We can make this a bit more formal as follows: our semantics requires a set Φ of functions such that for each f ∈ Φ, f : T → S. These functions will be used to construct the set of worlds for each semantic structure.

2.2.2 Propositions

We take the set of propositions to be those subsets X of S × T such that for every pair ⟨s, t⟩ ∈ X there is some f ∈ Φ for which f(t) = s. A pair which has this property will be called propositional. The set of all propositional pairs will be referred to as ‘P’. So we may say that propositions are sets of propositional indices. It ought to be clear that the set of propositions, which is to say the set of subsets of P, is closed under intersection, union, and relative (to P) complement.

A frame (for NS) is a structure F = ⟨S, (T, <), Φ⟩, where each of the sets, structures, and functions is as described above. An NS model M is a pair ⟨F, V⟩, where V is a function (the valuation function) from At to P(P): the set of propositions (relative to the frame F). For nominals, the valuation is defined in the following ways. A time nominal t is assigned all propositional pairs ⟨s, t⟩ where the s coordinate varies. This means that for some unique t ∈ T, V(t) = (S × {t}) ∩ P. The propositions assigned to world nominals are, as was hinted at, determined by the functions in Φ. So a nominal w is assigned a set V(w) from {{⟨s, t⟩ ∈ P : f(t) = s} : f ∈ Φ} in a model M.

Some observations are in order. Notice that V(w) ∩ V(t) will always be a unit set. When we want to refer to the element of this unit set we will simply write V(w, t) for ease of reference. As another convention, and a harmless abuse of notation, we shall employ ⟨s, V(t)⟩ to refer to the index with the t coordinate picked out by the nominal t. The world nominals persist throughout all of time since they are determined by functions each of whose domain is T. From a realist perspective, this kind of model will make sense since even if there is nothing around, things will still be either true or false. It is possible, as we shall see, that two world nominals can be true at the same semantic index. When that happens we say that they are coincident at that index or that
they coincide there. The class of all worlds that coincide at a given index is referred to as the coincidence class of that index.

Relative to a model M and a propositional index ⟨s, t⟩, we define the notion of truth relative to an index inductively:

[Atomic] M, ⟨s, t⟩ ⊨ α ⟺ ⟨s, t⟩ ∈ V(α), for α ∈ At

[jump] M, ⟨s, t⟩ ⊨ @w′t′ α ⟺ M, V(w′, t′) ⊨ α

In the standard way for the Boolean connectives and ⊥.

The temporal operators need some explanation.

2.2.3 The future isn’t what it used to be

In standard temporal logic we take the future of an instant t to consist of all those successor instants t′.6 Now we have been forced to think in terms of worlds, since not all pairs ⟨s, t⟩ are propositional. There could be some successor t′ of t such that the pair ⟨s′, t′⟩ does not belong to the future of the (assumed to be propositional) pair ⟨s, t⟩. So what then is the future of ⟨s, t⟩? If we cast our net as widely as possible (and why shouldn’t we?) then the future of ⟨s, t⟩ will be the ‘future parts’ of all the worlds which belong to the coincidence class of ⟨s, t⟩. The future parts, that is, of all the worlds which “pass through” ⟨s, t⟩. Given this somewhat larger view of things, these seem to be the truth-conditions which suggest themselves:

[F] M, ⟨s, t⟩ ⊨ Fα ⟺ (∀f ∈ Φ s.t. f(t) = s)(∃t′ > t & M, ⟨f(t′), t′⟩ ⊨ α)

[G] M, ⟨s, t⟩ ⊨ Gα ⟺ (∀f ∈ Φ s.t. f(t) = s)(∀t′ > t, M, ⟨f(t′), t′⟩ ⊨ α)

It will not have escaped the keen attention of the experienced reader that this is not temporal logic as it is usually presented. We feel no great need to apologize on that ground, though. Given the framework within which we are working, it simply isn’t possible that F and G be duals of each other. Anybody who claims that they must be duals is just begging the question against our use of these structures which, we claim, are entirely intuitive. We shall have more to say on this matter soon.

2.2.4 And neither is the past

Our account of F and G is a bit reminiscent of the situation in the usual temporal logic with ‘branching time,’ modulo the fact that the operators are not the duals of each other and that time isn’t required to branch. In NS models, it is nature that is

6 This is an obvious over-simplification. We might, for example, be thinking in terms of intervals rather than instants.
branching. With these caveats, though, the situation looks rather familiar. In the usual approach the intuition is that while the future branches, the past must be linear. We shall take up this idea in more detail later, but for now we shall simply say that we are interested in the most general way of doing things. At the moment we haven’t quite got the apparatus we would require to impose something like a linear past, so we can’t simply write down the obvious truth-conditions when it comes to H and P—the conditions aren’t obvious. We are reduced, if that is the correct word, to making our treatment of P and H mirror our earlier treatment of F and G respectively.

[P] M, ⟨s, t⟩ ⊨ Pα ⟺ (∀f ∈ Φ s.t. f(t) = s)(∃t′ < t & M, ⟨f(t′), t′⟩ ⊨ α)

[H] M, ⟨s, t⟩ ⊨ Hα ⟺ (∀f ∈ Φ s.t. f(t) = s)(∀t′ < t, M, ⟨f(t′), t′⟩ ⊨ α)

2.2.5 What can duality do for you?

While neither of the pairs F, G nor P, H are dual to one another, each of them surely has a dual. The time has come to raise the question of the significance of these duals. We shall indicate them by F_G, G_G, P_G, and H_G, where F_G is ¬G¬, G_G is ¬F¬, P_G is ¬H¬ and H_G is ¬P¬. The reason for this notation should become clear shortly. First let’s derive the truth conditions for these:

[F_G] M, ⟨s, t⟩ ⊨ F_G α ⟺ (∃f ∈ Φ s.t. f(t) = s)(∃t′ > t & M, ⟨f(t′), t′⟩ ⊨ α)

[G_G] M, ⟨s, t⟩ ⊨ G_G α ⟺ (∃f ∈ Φ s.t. f(t) = s)(∀t′ > t, M, ⟨f(t′), t′⟩ ⊨ α)

[P_G] M, ⟨s, t⟩ ⊨ P_G α ⟺ (∃f ∈ Φ s.t. f(t) = s)(∃t′ < t & M, ⟨f(t′), t′⟩ ⊨ α)

[H_G] M, ⟨s, t⟩ ⊨ H_G α ⟺ (∃f ∈ Φ s.t. f(t) = s)(∀t′ < t, M, ⟨f(t′), t′⟩ ⊨ α)

As we say below, F_G and P_G look like a kind of possibility, a fact which will bear on our discussion of the master argument. But perhaps more interestingly, they seem to echo work done in the 1970s by Woolhouse on the subject of tensed modality. In particular, F_G and G_G might be thought to be close to the going to be future tenses first discussed by him.7 We add the past-looking equivalent of going to be, which we render as going to have been.
7 See in particular Woolhouse (1973, 1975).
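Because F and G each quantify universally over the worlds passing through the index, they come apart from the duals just defined. The following Python sketch is our illustration, not part of the paper’s formal apparatus: a miniature NS frame with two worlds through one index, on which Fα fails while its would-be dual F_G α (that is, ¬G¬α) holds.

# A miniature NS frame: times ordered by <, worlds as functions from T to S.
T = [0, 1]
worlds = [
    {0: 'a', 1: 'b'},   # f
    {0: 'a', 1: 'c'},   # g: coincides with f at <a, 0>, then diverges
]
alpha = {('b', 1)}      # the proposition alpha holds only at <b, 1>

def through(s, t):
    """The worlds passing through the index <s, t>."""
    return [f for f in worlds if f[t] == s]

def F(s, t):            # every world through <s, t> has a later alpha-point
    return all(any((f[u], u) in alpha for u in T if u > t)
               for f in through(s, t))

def F_G(s, t):          # the dual of G: some world has a later alpha-point
    return any(any((f[u], u) in alpha for u in T if u > t)
               for f in through(s, t))

print(F('a', 0))    # False: world g never reaches an alpha-point
print(F_G('a', 0))  # True: world f does, so F is strictly stronger than its dual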
The basic intuition for Woolhouse8 is that the going to be future contains implicit reference to a plan or script. To say, then, that α is going to be the case is just to say that if all goes according to the (accepted) plan, α will be the case. This has the effect of constraining the worlds to those which are consistent with the plan or script. Our truth condition for F_G α has it that there is at least one world passing through the index of evaluation containing a future index at which α is true. The world (or worlds) which satisfy the truth condition, if any, are those which conform to some plan or script. For G_G α, it is always going to be the case that α: α must be true at every future index of the world or worlds passing through the index of evaluation, which world or worlds conform to the plan or script. For the past tenses ‘going to have been’ and ‘always going to have been’ we do the same thing, except looking backward at the past portions of the world or worlds which pass through the index of evaluation. Again the assumption is that such worlds conform to the dictates of a certain script or plan.

2.2.6 Some things aren’t different

If a formula α is true at every index of a model M, we say that α is true in the model and write M ⊨ α. If, and only if, α is true in every model, we say that α is logically true or valid and write ⊨ α. When there is no index of any model at which the formula α is false while every member of the set of formulas Γ is true, we say that Γ semantically entails α, written Γ ⊨ α.

2.3 Some proof-theory of NS

2.3.1 The hybrid part

It is easy to see that under the proposed semantics the ‘@’ operators are, one and all, truth operators, in the sense that they are homomorphisms with respect to the classical operators. That is, they distribute, strongly, over the Boolean connectives. In the proof-theory there are axioms for each such operator.

[DIST¬] @wt ¬α ≡ ¬@wt α
[DIST] @wt (α ∘ β) ≡ ((@wt α) ∘ (@wt β)), where ∘ ∈ {∧, ⊃, ≡, ∨}

Operators of this kind are sometimes called ‘jump’ operators. They jump you to a world-time (a state-time, semantically speaking): the world-time at which you evaluate the formula that comes after the operator. Let us suppose that ik = ⟨wk, tk⟩; then we can see that the jump operators have the permutation property:

@i1 … @in−1 @in α ≡ @is(1) … @is(n−1) @in α

where i1, …, in is any finite string of linguistic indices, and s is any permutation of 1, …, n − 1.

8 Though assuredly not for his student Lloyd Humberstone, who also wrote on the subject in his DPhil thesis.
They also have the idempotency property:

@i1 · · · @in−1 @in α ≡ @in α

A formula which is not equivalent to one of the form @i α is referred to as a tensed formula. We need only a simple account of consistency, viz.:

[Con] @i ⊥ ≡ ⊥, for any linguistic-index i.

It is easy to see that anything like a ‘truth operator’ must validate the rule:

[RT] from Γ ⊢ α infer @i[Γ] ⊢ @i α

where “@i[Γ]” stands for “{@i β | β ∈ Γ}” and i is any linguistic-index. This also gives us the standard necessitation rule:

[Nec@] from ⊢ α infer ⊢ @i α

We also require some standard rules from hybrid logic. For instance, we must notice that for any coordinates w, w′, t, t′, if @wt t′ and @wt α are true, then @wt′ α is true. Thus, the axiom schema ‘exchange’

[EXGt] (@wt t′ ∧ @wt α) ⊃ @wt′ α

is valid. There is a similar axiom for worlds, as follows:

[EXGw] (@wt w′ ∧ @wt α) ⊃ @w′t α

On a semantic note, the @wt t′ formula forces t and t′ to be the same by the truth conditions, i.e., V(t) = V(t′). We also have to have, for any w, w′, t, t′:
[WT] @wt (w ∧ t)
[ST] @wt t′ ⊃ @w′t t′

[WT] says that at the index referred to by w, t those nominals are true. [ST] just says that if t and t′ are the same time relative to one world, they are so everywhere. These are obviously sound axioms for our models. We also must make explicit in the object language what the connection between the nominals and the @ operators is, namely:

[POINT] (w ∧ t ∧ α) ⊃ @wt α

2.3.2 The temporal part

It is not so clear that the usual proof theory for G and H obtains in the new setting, since F and P are no longer the duals of G and H, respectively. But the operators (at least G and H) remain normal. This is to say that, for instance, the rules
[RG] from Γ ⊢ α infer G[Γ] ⊢ Gα
[RH] from Γ ⊢ α infer H[Γ] ⊢ Hα
are sound. To validate RG and RH is standard. The only difference is the extra universal quantifier in the truth condition, but it causes no worry. If there were some ⟨s, t⟩ at which G[Γ] was satisfied but Gα wasn't, then there would have to be an f ∈ Ω that passes through ⟨s, t⟩ and a t′ ≻ t such that ⟨f(t′), t′⟩ doesn't model α. But that point has to model Γ by assumption, and so, since Γ ⊢ α, that point has to model α: a contradiction. This also implies that we have the necessitation rules [NecH] and [NecG] for these operators. The normality continues with the usual distribution properties for G and H over conjunction. However, since we don't have the duality between the G–F and H–P operators anymore, we don't get all that we usually have. It is a bookkeeping matter to check that we do have the following:

[DisF] (Fα ∨ Fβ) ⊃ F(α ∨ β)
[DisP] (Pα ∨ Pβ) ⊃ P(α ∨ β)

Now we can use the hybrid operators in standard ways to express the conditions on ≺ of linearity, irreflexivity, transitivity, and the condition that each world is defined for every moment of time.

[LIN] @wt (Pt′ ∨ t′ ∨ Ft′), for all w ∈ W and t, t′ ∈ T.
[IRR] ¬@wt Pt ∧ ¬@wt Ft
for all w ∈ W, t ∈ T.
[TRANSF] @wt FFt′ ⊃ @wt Ft′, for all w ∈ W and t, t′ ∈ T.
[TRANSP] @wt PPt′ ⊃ @wt Pt′, for all w ∈ W and t, t′ ∈ T.
[SERW] @wt ¬F¬w.

For those unfamiliar with hybrid logic, some comments are in order. [LIN] says that in each world the times indicated by the nominals either are the same or follow one another, which is to say that ≺ is linear. [IRR] says that no time can follow or precede itself in the temporal order, i.e., that ≺ is irreflexive. Irreflexivity is usually inexpressible as an axiom in modal logic, but in a hybrid language we do not have that problem. [TRANSF] and [TRANSP] clearly express transitivity. Finally, [SERW] says that the worlds never stop; the worlds are defined for each time. This condition might be taken as saying that worlds are serial. What should be noticed is that the F and P operators are not ‘normal.’ In fact, there is a breakdown, in general, between the usual formulas and conditions on frames. For instance, Gα ⊃ GGα implies that frames are transitive, but not all transitive frames will validate it. We will discuss this particular aspect of things below. The proof theory presented here doesn't constitute a complete axiomatization of NS; that is provided in the appendix.
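Unpacking [IRR] may help those unfamiliar with hybrid logic see why nominals buy this extra expressive power. Under the truth conditions reconstructed above (and with worlds total, per [SERW]), the nominal t is true at an index just in case that index's time coordinate is V(t); so, as a sketch:

M ⊨ @wt Ft ⟺ M, V(wt) ⊨ Ft ⟺ V(t) ≻ V(t).

Hence ¬@wt Ft holds in a model exactly when no time precedes itself; an orthodox modal formula, having no way to name the particular time t, cannot pin the evaluation point down in this way.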
Fig. 1 Model with a first moment of time t0
2.4 The Models of NS

But what do these models look like? The most general model will probably look something like that in Fig. 1. That model, however, has a first moment of time. It is not necessary that the models, in general, have such a first moment, but it can be imposed on them using the proof theory:

[BEGIN] @wt0 H⊥

where t0 is the special “first” nominal in T, and w ∈ W. [BEGIN] says that in all worlds, at all times preceding the one picked out by t0, the impossible is true. The only way that can hold is for there to be no times earlier than t0. We have said already that our notion of the past may be a bit too branchy to suit many consumers of temporal logic. For them, and for Diodorus too, the past is linear: there is no backwards branching. We now embark upon a quest to recover the linear past within our framework. It will turn out to revolve around adding other axioms to make the models conform to that picture. On this notion the past is like a set of worlds bundled or twisted together into a kind of cable, which may unwind over time as strands of worlds split off from the main trunk. Each of those split-off worlds is lost from our past if it has diverged before the index of evaluation. The past of the semantic index ⟨s, t⟩ consists of only those worlds which are still bound into the cable. This idea is quite familiar to anybody who thinks that worlds can be the same up to a point in time and then diverge after that. What is wanted are models in which once a world branches off it can never return to the fold. This is pictured in Fig. 2, where each line represents bundles of functions. These models could also be thought of as trees, in the mathematical sense. It is important to note that the model has a common starting state for all the worlds, something analogous to the ‘big bang’ perhaps, but there need not be only one, as in the example. There could be models of this kind in which worlds have independent start states. In such models, however, those worlds will remain independent for the rest of time, analogous to parallel universes perhaps. We can achieve this kind of model if we add the following axiom schema:

[ATS] @wt ((w ∧ w′) ⊃ H(w ∧ w′)), for all t ∈ T and w, w′ ∈ W.
Fig. 2 Model with first moment t0 and ATS
Fig. 3 Model where G A ⊃ GG A fails
This says that if two worlds are coincident at a point, then they have always been that way; they were always the same (ATS). This excludes the possibility of another world crashing into this ‘bundle,’ because H's truth condition won't allow it: if such a thing did happen, then there would be a point in the past of that index at which the current world nominal under investigation is not true. We are certain that this doesn't even come close to exhausting the set of principles one might want to add to the base logic in aid of pursuing this project or that. In these models, which we call NS_D models, a few things happen. First, the H and P operators are duals again. But also, the NS_D models guarantee that Gα ⊃ GGα is always valid. From the model in Fig. 3 we can see what goes wrong without ATS. Suppose GA is true at ⟨s, t⟩. That is, on this frame, assuming no other branching than what is shown, we have the following containment:
{⟨s*, t*⟩ : f(t*) = s* for some f ∈ Ω with f(t) = s, and t ≺ t*} ⊆ V(A)
We can also impose that ⟨s″, t″⟩ ∉ V(A), where ⟨s″, t″⟩ lies in the future of some ⟨s′, t′⟩ (itself a future index of ⟨s, t⟩) along a world that does not pass through ⟨s, t⟩. Thus we have that ⟨s′, t′⟩ ⊭ GA; therefore, ⟨s, t⟩ ⊭ GGA. This cannot happen in the NS_D case, since any such world f would have to pass through ⟨s, t⟩. Now that we have some idea of the models and the way that they can be manipulated with proof theory, we can look at an application of this semantics.
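To see the failure concretely, here is a small executable sketch of a Fig. 3-style frame (our own illustration; the state names and the three-point time line are invented for the example): two worlds coincide at the middle time, but only one of them passes through the index of evaluation.

# A Fig. 3-style finite countermodel to G A -> GG A, sketched in Python.
# Times are 0 < 1 < 2; a "world" is a function from times to states,
# represented as a tuple of (hypothetical) state names.
TIMES = (0, 1, 2)

f = ('s', 's1', 's2')   # passes through ('s', 0)
g = ('x', 's1', 'u')    # merges with f at time 1, but never visits ('s', 0)
OMEGA = (f, g)

def A(state, time):
    # Extension of A: everywhere except the stray index ('u', 2).
    return (state, time) != ('u', 2)

def G(phi, state, time):
    # G phi at (state, time): phi holds at every strictly later index
    # of every world in OMEGA passing through (state, time).
    return all(phi(w[u], u)
               for w in OMEGA if w[time] == state
               for u in TIMES if u > time)

print(G(A, 's', 0))                        # True: only f passes through ('s', 0)
print(G(lambda s, u: G(A, s, u), 's', 0))  # False: at ('s1', 1), g leads to ('u', 2)

Adding [ATS] rules this frame out: f and g coincide at time 1 without having always coincided, so Gα ⊃ GGα is restored in the NS_D models.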
3 Now comes Diodorus

It is time now to visit the master argument. To start with, here is an authoritative account:

The following three propositions mutually conflict: ‘Every past truth is necessary’; ‘Something impossible does not follow from something possible’; and ‘There is something possible which neither is nor will be true.’ [Long and Sedley 1995, p. 230]

It is clear from the context that ‘mutually conflict’ here does not mean pairwise conflict. Only one of the statements need be rejected to secure a consistent set. That's what makes this so convincing: on an ordinary understanding of the first two propositions, it seems that no reasonable person could reject either. We take such an understanding to be carried by the glosses ‘If something has been true, it cannot now or at any future time not have been true’ and ‘Whatever cannot be true, cannot be a logical consequence of premises which can be (simultaneously) true.’

Some, Arthur Prior for example, have argued that this argument is really an enthymeme and requires some filling out. We find a difficulty in Prior's account, which uses explicitly modal words even though we have been given no way of interpreting the modals; we represent them by means of the usual □ and ♦, for necessity and possibility respectively. This lack will trouble us below. Let's look at a reconstruction of the argument. In the following argument the temporal operators are to have their intuitive meanings.

(1) ‘Every past truth is necessary’: Pα ⊃ □Pα;
(2) ‘Something impossible does not follow from something possible’: □(α ⊃ β) ⊃ (♦α ⊃ ♦β);
(3) ‘There is something possible which neither is nor will be true’: ♦A ∧ (¬A ∧ G¬A);
(4) If α is not true, and never will be, then there is some time in the past where α never will be true: (¬A ∧ G¬A) ⊃ PG¬A;
(5) If α is true, then it has always been the case that at some time in the future α will be true: A ⊃ HFA.

It is suggested9 that the argument proceeds as follows. Suppose ♦A and ¬A ∧ G¬A. Principle (4) is seen not to hold in general, since it will fail if time is dense; but we can suppose that it holds for the sake of this argument. Then we can detach PG¬A and note that this is a past truth. Thus it must be necessary if we grant (1), so □PG¬A. This is supposed to be equivalent to ¬♦¬PG¬A, and again equivalent to ¬♦HFA, assuming the duality of the temporal operators. From (5) and the master argument's assumption that □(α ⊃ β) ⊃ (♦α ⊃ ♦β), we can get ♦A ⊃ ♦HFA, and by modus tollens we get ¬♦A. Thus, A is impossible. It might be claimed that the argument is not actually enthymematic, but Prior (1967) argues quite extensively that (4) and (5) were accepted generally, and thus must be part of the argument. Now we can turn to our diagnosis of the argument.

9 Here we reconstruct a version of Prior's version of the argument from Bobzien (2004).
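For reference, here is the chain of steps just rehearsed, tabulated (our own summary of the reconstruction above, nothing added to it):

1. ♦A, ¬A ∧ G¬A          (assumptions, from (3))
2. PG¬A                   (1, by (4))
3. □PG¬A                  (2, by (1))
4. ¬♦¬PG¬A                (3, duality of □ and ♦)
5. ¬♦HFA                  (4, since ¬PG¬A ≡ H¬G¬A ≡ HFA)
6. ♦A ⊃ ♦HFA              ((5) and (2))
7. ¬♦A                    (5, 6, modus tollens), contradicting 1.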
In NS, the master argument as stated is valid. In fact, it is valid no matter what we take the temporal order to be. We accept the validity of the argument because we reject the necessity of the past.10 For us, a past truth is one which is true at some past index of every world which passes through the index of evaluation. At some future index of evaluation there is no guarantee that the worlds in the past of that index all make the same things true as the worlds at the previous index. For us the past can evaporate (and the future as well, we hasten to add). But that may not be the way to read (1). The problem is that we haven't been given a way to interpret the □ and ♦. In a sense we can look at the well-behaved models of NS_D, and then our gloss of (1) in the last paragraph will turn out to be true. In fact all three propositions turn out true and the master argument fails. But our gloss is Pα ⊃ GPα, which is not the way that (1) is formalized above. In a sense, the alternative gloss of (1), the one formalized by Pα ⊃ □Pα, judges the past through the eyes of the present. It says that if something has been the case, then necessarily it has been the case. If what is implied by that is that things couldn't have been otherwise, that things had to evolve the way they did, that is simply to beg the question. There is a similar problem in accepting premise (5); it says pretty much the same thing. Our diagnosis, then, is to say that the master argument trades on equivocation. The kind of necessity that is used in our expression of (1), using GPα, is what makes the premise plausible, unless you are a determinist who doesn't mind settling the issue as a matter of logic. But the master argument is right on a certain account. There is a kind of possibility for which it cannot be that ♦A and ¬A ∧ G¬A both hold. But that is a future-looking use of ‘possible’ that is captured by F_G in the language of NS. The kind of possibility that makes (3) consistent is the kind that looks back and checks the alternate histories/worlds that could have been taken, given the way that the world is at the current time. One might express that in our language as P_G F_G α (at some time in some history, there is a time in some future at which α holds). With this language we can discern the various kinds of possibility, but we can give them all a temporal explanation. This doesn't make the master argument invalid, particularly when we give the question-begging readings to its premises. In the class of models which have only a single history/world, the argument comes out sound when we give our GPα reading to necessity. Instead we provide a family of logics based on NS which we can use to present different treatments of the master argument, and correlatively of the modals. On the reading that we have given to necessity as GPα, the master argument is valid on some classes of models and not on others. And this is the way it should be, since we must bend every effort to avoid settling the determinism question by logical fiat.
Appendix

We should mention the lemma that underwrites the proofs used in the soundness results for NS.
10 The Stoics also accepted the argument, but rejected the premise that nothing impossible could follow
from something possible.
Lemma 1 Given a model M and a formula α that contains no mention of i ∈ W ∪ T, and a model M* that is exactly like M except on i, for any ⟨s, t⟩ in the domain of M (thus in the domain of M*):

M, ⟨s, t⟩ ⊨ α ⟺ M*, ⟨s, t⟩ ⊨ α

Proof The usual induction on the complexity of formulas. This depends on what sort of nominal i is, of course.

For the axiomatization we use all of those axioms mentioned in Sect. 2.3, and the following:

Hybrid axioms

[TGEN] from @wt[Γ] ⊢ @wt α infer @wt*[Γ] ⊢ @wt* α, for any t* ∈ T, where t is not mentioned in Γ ∪ {α}.
[@I] from Γ ⊢ @i α infer Γ ⊢ @j α, for any j, where neither i nor its coordinates are mentioned in Γ ∪ {α}.
[@E] from @i[Γ] ⊢ @i α infer Γ ⊢ α, where i is foreign to Γ ∪ {α}.
[CutW] from Γ, @wt w′ ⊢ α infer Γ ⊢ α.
[CutT] from Γ, @wt t′ ⊢ α infer Γ ⊢ α.

when neither w′ nor t′ is mentioned in Γ ∪ {α}. In the proof of soundness one sets V*(w′) = V(w), and similarly for t, t′.

Temporal and bridge rules

[@GE] G@i α ≡ @i α (where i is any index).
[F] (@wt Ft′ ∧ @wt′ Ft″) ⊃ @wt FFt″
[P] (@wt Pt′ ∧ @wt′ Pt″) ⊃ @wt PPt″
[TConF] @wt Ft′ ⊃ ¬@wt Pt′
[TConP] @wt Pt′ ⊃ ¬@wt Ft′

[TemIF] from Γ, @wt Ft′, @wt′ α ⊢ ψ infer Γ, @wt Fα ⊢ ψ
[TemIP] from Γ, @wt Pt′, @wt′ α ⊢ ψ infer Γ, @wt Pα ⊢ ψ

where t′ is not mentioned in Γ ∪ {α, ψ}, and w ∈ W.
[CONTF] from Γ ⊢ @w′t Ft′ infer Γ ⊢ @wt Ft′
[CONTP] from Γ ⊢ @w′t Pt′ infer Γ ⊢ @wt Pt′

where w′ is foreign to Γ. Note that these rules follow from [@I].

[TemIG¬] from Γ, @wt Ft′, @wt′ ¬α ⊢ ψ infer Γ, @wt ¬Gα ⊢ ψ
[TemIH¬] from Γ, @wt Pt′, @wt′ ¬α ⊢ ψ infer Γ, @wt ¬Hα ⊢ ψ
[WrdF¬] from Γ, @wt w′, @wt (G(w′ ⊃ ¬α)) ⊢ ψ infer Γ, @wt ¬Fα ⊢ ψ
[WrdP¬] from Γ, @wt w′, @wt (H(w′ ⊃ ¬α)) ⊢ ψ infer Γ, @wt ¬Pα ⊢ ψ

where t′ is not mentioned in Γ ∪ {α, ψ}, w, w′ ∈ W, and w′ is foreign to Γ, α, ψ.

[G] (@wt Gα ∧ @wt Ft′) ⊃ @wt′ α.
[H] (@wt Hα ∧ @wt Pt′) ⊃ @wt′ α.

These will clearly be true, since whatever function is assigned to w will be one of the “futures” passing through the semantic index referenced by w, t, and the time referenced by t′ is in the future; and the truth condition for G requires that α be true at every state in a “future” passing through the current state of evaluation. Finally, we add all (classical) tautologies.

One last note: what we are defining is a Hilbert-style calculus in another guise. A rule like

[TemIF] from Γ, @wt Ft′, @wt′ α ⊢ ψ infer Γ, @wt Fα ⊢ ψ

should be considered as a rule schema of the form:

from ⊢ (β ∧ @wt Ft′ ∧ @wt′ α) ⊃ ψ infer ⊢ (β ∧ @wt Fα) ⊃ ψ

with the same side conditions, where β is arbitrary. β is intended to be a conjunction of some finite number of elements of Γ. The soundness proofs for this system are standard, so most will be omitted. Here is one of the more complicated ones.

Proof (partial proof of soundness for NS) For the [TemI] cases where the operator is G or H, the argument follows the same pattern mutatis mutandis. For [TemIF]: suppose that Γ, @wt Ft′, @wt′ α ⊢ ψ, and suppose that for some M and index ⟨s, u⟩ we have M, ⟨s, u⟩ ⊨ Γ ∪ {@wt Fα}. Let V(wt) = ⟨s*, t*⟩ and let f be the function assigned to w. Notice that we must have some t** ≻ t* such that M, ⟨f(t**), t**⟩ ⊨ α. Then construct M′ as in the other cases, so that the fresh nominal t′ is interpreted as t**. Then M′, ⟨s, u⟩ ⊨ Γ ∪ {@wt Ft′, @wt′ α}, and thus M′, ⟨s, u⟩ ⊨ ψ. Since t′ is not mentioned in any of Γ ∪ {α, ψ}, we know by Lemma 1 that M, ⟨s, u⟩ ⊨ ψ. For [TemA], the truth conditions for @wt ¬G¬α are simply given by those for @wt Ft′ and @wt′ α.
For [CONT] (constant time) we can see soundness from the fact that the temporal order is fixed: if we vary the function (i.e., the world) of evaluation, we always have the same order. So, if t is before t′ in one world, it is so in every world.

For [WrdF¬], suppose that Γ, @wt w′, @wt (G(w′ ⊃ ¬α)) ⊢ ψ and that M, ⟨s, u⟩ ⊨ Γ ∪ {@wt ¬Fα}. Let V(wt) = ⟨s′, t′⟩. Then we know that there is an f ∈ Ω such that f(t′) = s′ where

(1) for all t″ ≻ t′, M, ⟨f(t″), t″⟩ ⊭ α, i.e., M, ⟨f(t″), t″⟩ ⊨ ¬α.

Let M′ be just like M, but let w′ be interpreted as f. Then, in M′, we have M′, ⟨s, u⟩ ⊨ Γ ∪ {@wt ¬Fα}, and M′, ⟨s′, t′⟩ ⊨ w′; thus M′, ⟨s, u⟩ ⊨ @wt w′. Now let f′ ∈ Ω with f′(t′) = s′, and let t″ ≻ t′. Either f′(t″) = f(t″) or not. If not, then M′, ⟨f′(t″), t″⟩ ⊭ w′, in which case M′, ⟨f′(t″), t″⟩ ⊨ w′ ⊃ ¬α. If so, then ⟨f′(t″), t″⟩ = ⟨f(t″), t″⟩, and so M′, ⟨f′(t″), t″⟩ ⊨ ¬α by (1), since t′ ≺ t″; so again M′, ⟨f′(t″), t″⟩ ⊨ w′ ⊃ ¬α. Since f′ and t″ were arbitrary, we have M′, ⟨s′, t′⟩ ⊨ G(w′ ⊃ ¬α), which means that M′, ⟨s, u⟩ ⊨ @wt (G(w′ ⊃ ¬α)). But then ψ must also be satisfied, and, since w′ is foreign to Γ ∪ {α, ψ}, Lemma 1 carries this back to M.

For the rest we offer a detailed sketch of the completeness proof. The completeness proof follows a Henkin-type construction to derive a canonical model for each maximally consistent set of a certain kind, which we will call a monad.

Definition 1 A monad μ is a maximally NS-consistent set of formulas that contains a conjunction w ∧ t of nominals and satisfies the following:

1. if @w′t′ Fα ∈ μ and @w′t′ w′ ∈ μ, then there is some t* such that @w′t′ Ft* ∈ μ and @w′t* α ∈ μ;
2. if @w′t′ Pα ∈ μ and @w′t′ w′ ∈ μ, then there is some t* such that @w′t′ Pt* ∈ μ and @w′t* α ∈ μ;
3. if @w′t′ ¬Fα ∈ μ, then there is some w* such that @w′t′ w* ∈ μ and @w′t′ (G(w* ⊃ ¬α)) ∈ μ;
4. if @w′t′ ¬Pα ∈ μ, then there is some w* such that @w′t′ w* ∈ μ and @w′t′ (H(w* ⊃ ¬α)) ∈ μ;
5. if @w′t′ ¬Gα ∈ μ, then there are w* and t* such that @w′t′ w* ∈ μ, @w*t′ Ft* ∈ μ, and @w*t* ¬α ∈ μ;
6. if @w′t′ ¬Hα ∈ μ, then there are w* and t* such that @w′t′ w* ∈ μ, @w*t′ Pt* ∈ μ, and @w*t* ¬α ∈ μ.

Since the truth-operators @i all have the ‘strong distribution’ properties with respect to the classical operators, it must be the case that @i(Γ) (which is the set {α : @i α ∈ Γ}) is maximally NS-consistent whenever Γ is. In other words, for every index i, ‘@i(·)’ maps the set of maximally NS-consistent sets into itself. Realizing this, we can now define:

Definition 2 Γ has inherent coordinates w, t if, and only if, @wt(Γ) = Γ.
We will see that all monads in our language have this property. The completeness proof for this language follows a standard proof of completeness for a many-sorted hybrid logic; the ‘many-sortedness’ refers to the multiple types of nominals in the language. Now we must show:

Lemma 2 (Extension Lemma for NS) Every NS-consistent set of sentences can be extended to a monad.

Proof Let Γ be an arbitrary NS-consistent set of formulas over the language L. Extend the language by adding countably many nominals (and the corresponding @ operators) to each of the sets W and T. That gives the language L*. Then take the logic NS over L*, and let δ1, . . . , δk, . . . (k ∈ ℕ) be some fixed ordering of all the formulas of NS over L*. We now define an infinite sequence of sets:

Γ0 = Γ ∪ {w ∧ t}

(where w, t are the ‘first’ nominals in L* not mentioned in Γ), and

Γk = Γk−1, if Γk−1 ∪ {δk} ⊢ ⊥;
Γk = Γk−1 ∪ Δk, if Γk−1 ∪ {δk} ⊬ ⊥ and one of 1–6 holds;
Γk = Γk−1 ∪ {δk}, otherwise;

where 1–6 are as follows:

1. δk = @w′t′ (Fα ∧ w′),
2. δk = @w′t′ ¬Fα,
3. δk = @w′t′ ¬Gα,
4. δk = @w′t′ (Pα ∧ w′),
5. δk = @w′t′ ¬Pα,
6. δk = @w′t′ ¬Hα,

and Δk is given by:

Δk = {@w′t′ (Fα ∧ w′), @w′t′ Ft*, @w′t* α} in case 1;
Δk = {@w′t′ ¬Fα, @w′t′ w*, @w′t′ (G(w* ⊃ ¬α))} in case 2;
Δk = {@w′t′ ¬Gα, @w′t′ w*, @w*t′ Ft*, @w*t* ¬α} in case 3;
Δk = {@w′t′ (Pα ∧ w′), @w′t′ Pt*, @w′t* α} in case 4;
Δk = {@w′t′ ¬Pα, @w′t′ w*, @w′t′ (H(w* ⊃ ¬α))} in case 5;
Δk = {@w′t′ ¬Hα, @w′t′ w*, @w*t′ Pt*, @w*t* ¬α} in case 6.

Note that t*, w* are the first coordinates of the extended language not occurring in Γk−1 or δk.
Define Γ+ to be the union of the sequence of Γk's. We claim that Γ+ is an NS-monad. It ought to be clear that Γ+ is maximal, and so deductively closed. Γ+ has the inherent coordinates w, t, since if α ∈ Γ+ then w ∧ t ∧ α ∈ Γ+ by closure, and so @wt α ∈ Γ+, also by closure. Thus @wt(Γ+) = Γ+. But we must reassure ourselves that Γ+ is NS-consistent.

First suppose, for reductio, that Γ0 is not consistent. Then Γ ⊢ ¬(w ∧ t). Thus @wt[Γ] ⊢ @wt ¬(w ∧ t). But that means that @wt[Γ] ⊢ ⊥, since @wt (w ∧ t) is a theorem. Then @wt @wt[Γ] ⊢ @wt ⊥ by [RT]; but everything in @wt @wt[Γ] follows from @wt[Γ], so @wt[Γ] ⊢ @wt ⊥. Since the index w, t is foreign to Γ, we have, by [@E], that Γ ⊢ ⊥, contrary to assumption.

For the induction cases, suppose first that

Γk−1 ∪ {@w′t′ (Fα ∧ w′), @w′t′ Ft*, @w′t* α} ⊢ ⊥.

By the distribution properties of the @-operators and the closure of the set, we can assert that

Γk−1 ∪ {@w′t′ Fα, @w′t′ w′, @w′t′ Ft*, @w′t* α} ⊢ ⊥.

Then by [TemIF] we have

Γk−1 ∪ {@w′t′ Fα, @w′t′ w′, @w′t′ Fα} ⊢ ⊥,

but @w′t′ Fα, @w′t′ w′ ⊢ @w′t′ Fα, so by [Cut] we get

Γk−1 ∪ {@w′t′ Fα, @w′t′ w′} ⊢ ⊥,

and by the distribution properties we get

Γk−1 ∪ {@w′t′ (Fα ∧ w′)} ⊢ ⊥,

which is contrary to construction.

In the case of the third clause, suppose

Γk−1 ∪ {@w′t′ ¬Gα, @w′t′ w*, @w*t′ Ft*, @w*t* ¬α} ⊢ ⊥.

Then we use [TemIG¬] to get

Γk−1 ∪ {@w′t′ ¬Gα, @w′t′ w*, @w*t′ ¬Gα} ⊢ ⊥,

and with [Cut] and [CutW] we get Γk−1 ∪ {@w′t′ ¬Gα} ⊢ ⊥, which is contrary to construction.

For the second clause, suppose

Γk−1 ∪ {@w′t′ ¬Fα, @w′t′ w*, @w′t′ (G(w* ⊃ ¬α))} ⊢ ⊥.
Since w* is foreign to Γk−1, α, and ⊥, we can use [WrdF¬] to get

Γk−1 ∪ {@w′t′ ¬Fα, @w′t′ ¬Fα} ⊢ ⊥,

which is simply Γk−1 ∪ {@w′t′ ¬Fα} ⊢ ⊥; but that is also contrary to construction. For the last clauses the procedure is the same, with H and P respectively. Thus, no stage can be inconsistent. But if Γ+ were inconsistent, there would have to be some stage which is inconsistent. Thus, Γ+ is consistent.

Since ∅ is consistent, by soundness there are monads. Given a monad μ, we can define the NS-canonical model of μ, Mμ, as follows:

Definition 3 Mμ = ⟨(Tμ, ≺μ), Sμ, Ωμ, Vμ⟩
First, Tμ is made from the set T of time nominals in μ, as Tμ = T/∼μ, where t ∼μ t′ ⟺ @wt t′ ∈ μ for all w ∈ W. The set Sμ is constructed in stages. From the set W of world nominals mentioned in μ, we can define the set Wμ as a set of equivalence classes: Wμ = W/∼μ, where w ∼μ w′ ⟺ @wt w′ ∈ μ for all t ∈ T. The set Sμ is then {w[t] : w ∈ W, [t] ∈ Tμ}, where each w[t] = {[w′] ∈ Wμ : @wt w′ ∈ μ}. One should note that the w[t]'s are not equivalence classes on W; however, they partition Wμ: what they do is define the coincidence classes relative to the [t] coordinate. The set Ωμ is defined as follows:

Ωμ = {fw : Tμ → Sμ | w ∈ W}, where fw([t]) = w[t] for all [t] ∈ Tμ.

Of course the canonical domain Pμ is defined in the usual way. The canonical valuation Vμ is given by:

for a propositional letter A: Vμ(A) = {⟨w[t], [t]⟩ : @wt A ∈ μ};
for w ∈ W: Vμ(w) = {⟨fw([t]), [t]⟩ | [t] ∈ Tμ};
for t ∈ T: Vμ(t) = Pμ ∩ (Sμ × {[t]}).

The relation ≺μ is defined by considering the nominals associated to the ordering through their occurrence in temporal formulas. Thus we can define:

[t] ≺μ [t′] ⟺ @wt Ft′ ∈ μ or @wt′ Pt ∈ μ, for all w ∈ W.

In fact, the second clause is redundant, since the two are equivalent. Finally we can state the fundamental theorem.

Theorem 1 (Fundamental Theorem for NS) For all α and pairs w, t:

Mμ, ⟨w[t], [t]⟩ ⊨ α ⟺ @wt α ∈ μ
First let's consider a few things that might give us pause. The equivalence relations are well defined, since the rules allow the exchange of nominals via the rules called [EXG] above. One may also worry about two formulas of the form @wt t′ and @w′t Ft′ occurring together in a monad, which would cause problems for the equivalence classes. However, we have @wt t′ ∧ @w′t Ft′ ⊃ ⊥ as a theorem for each w, w′, t, t′ of the language, by [TCon]. One should also note that there are no pairs ⟨w[t], [t′]⟩ in the model where [t] ≠ [t′]; thus, we don't have to worry about mismatched times.

Proof of Theorem 1 The proof is the usual induction on the structure of α. The basis cases are handled by the definition of Vμ; we will do those, together with the cases for F and G (the remaining cases are similar).

By definition Mμ, ⟨w[t], [t]⟩ ⊨ t′ iff ⟨w[t], [t]⟩ ∈ Vμ(t′) iff [t′] = [t]. But that is the case iff @w′t t′ ∈ μ for all w′; so, because @wt t′ ≡ @w′t t′ from [ST], we have @wt t′ ∈ μ as a particular case. Conversely, if @wt t′ ∈ μ then by [ST] we get @w′t t′ ∈ μ for all w′. By definition Mμ, ⟨w[t], [t]⟩ ⊨ w′ iff ⟨w[t], [t]⟩ ∈ Vμ(w′) iff fw′([t]) = w[t] iff [w′] ∈ w[t] iff @wt w′ ∈ μ, as we wanted. Since the pair w, t was arbitrary, the basis cases hold.

The induction hypothesis is: for all β of less complexity than α and all pairs w, t, Mμ, ⟨w[t], [t]⟩ ⊨ β ⟺ @wt β ∈ μ. In the induction step the interesting cases are those of the temporal operators; we will do the F and G cases. The Boolean cases follow by use of the idempotency property and the distribution rules for the @ operators. (We will omit the μ subscripts.)

Suppose that Mμ, ⟨w[t], [t]⟩ ⊨ Fα, and suppose for reductio that @wt Fα ∉ μ; then by maximality and distribution the formula @wt ¬Fα is a member of μ. By the stipulation on μ, @wt w* ∈ μ and @wt (G(w* ⊃ ¬α)) ∈ μ. We will also have @w*t (G(w* ⊃ ¬α)) ∈ μ, by [EXGw]. Now, by the truth definition, for each fw′ with fw′([t]) = w[t] there is [t′] such that [t] ≺ [t′] and M, ⟨w′[t′], [t′]⟩ ⊨ α. Since @wt w* ∈ μ gives fw*([t]) = w[t], there is some [t′] later than [t] at which α holds along w*. By definition of [t] ≺ [t′], we have @w*t Ft′ ∈ μ. Applying the inductive hypothesis, we will have @w*t′ α ∈ μ. By [G] on @w*t (G(w* ⊃ ¬α)) we will have @w*t′ (w* ⊃ ¬α) ∈ μ. Using the distribution laws, [WT] on w*, t′, and the closure of μ, we have ¬@w*t′ α ∈ μ, which would imply that μ was inconsistent. Thus @wt Fα ∈ μ.

Suppose that @wt Fα ∈ μ, and let fw′([t]) = w[t]; thus @wt w′ ∈ μ. By closure @w′t (Fα ∧ w′) ∈ μ, and so @w′t Ft′, @w′t′ α ∈ μ for some t′. By definition of ≺μ we have [t] ≺μ [t′], so by the IH M, ⟨w′[t′], [t′]⟩ ⊨ α. Since fw′ was arbitrary, Mμ, ⟨w[t], [t]⟩ ⊨ Fα.

Suppose that Mμ, ⟨w[t], [t]⟩ ⊨ Gα, and suppose for reductio that @wt Gα ∉ μ, so that its negation is in μ. In that case @wt w* ∈ μ and @w*t Ft*, @w*t* ¬α ∈ μ, by the definition of μ. Since @wt w* ∈ μ we have fw*([t]) = w[t], and then from the truth condition on G we have Mμ, ⟨fw*([t*]), [t*]⟩ ⊨ α, which gives us, since fw*([t*]) = w*[t*], that @w*t* α ∈ μ. With the distribution laws, that would mean that μ was inconsistent.

Suppose that @wt Gα ∈ μ. Let fw′([t]) = w[t] and [t] ≺ [t′]. Then we know by the various [EXG] laws that @w′t w′ ∈ μ, and so @w′t Ft′ ∈ μ. We also get @w′t Gα ∈ μ. But then we can use [G] to get @w′t′ α ∈ μ. By the IH we have Mμ, ⟨w′[t′], [t′]⟩ ⊨ α, which is to say Mμ, ⟨fw′([t′]), [t′]⟩ ⊨ α. Since fw′ and [t′] were arbitrary, Mμ, ⟨w[t], [t]⟩ ⊨ Gα.
Now what remains to be checked is whether the relations of the canonical models have the properties we want them to have; that is, we check that they are irreflexive, transitive, and linear. To see that the relations are irreflexive, notice that ¬@wt Ft is a theorem for any w, t. So if [t] ≺ [t] held, then @wt Ft ∈ μ, which would make μ inconsistent. For transitivity, assume that [t] ≺ [t′] and [t′] ≺ [t″]. So @wt Ft′ ∈ μ and @wt′ Ft″ ∈ μ. These two sentences imply @wt FFt″, by [F], so that sentence is also in μ. Notice that @wt FFt″ ⊃ @wt Ft″ is a theorem, so we have @wt Ft″ ∈ μ. But that means that [t] ≺ [t″]. Linearity comes in the form of the trichotomy axiom schema @wt Ft′ ∨ @wt t′ ∨ @wt Pt′. Suppose that there were [t] and [t′] such that none of [t] ≺ [t′], [t] = [t′], [t′] ≺ [t] holds, i.e., that none of @wt Ft′, @wt t′, @wt Pt′ is in μ. Then the negation of each of these is in μ. But then μ ⊢ ¬(@wt Ft′ ∨ @wt t′ ∨ @wt Pt′). However, it is easily seen by distribution that @wt Ft′ ∨ @wt t′ ∨ @wt Pt′ is a theorem, which would imply that μ was inconsistent.

We can then take the restriction of the canonical model to the original language of Γ to get a model of Γ. So, if Γ ⊬NS α, we can construct a monad μΓ,¬α whose canonical model satisfies Γ but not α; contraposing, Γ ⊨NS α implies Γ ⊢NS α. So we have proved:

Theorem 2 (Completeness of NS) Γ ⊢NS α ⟺ Γ ⊨NS α.

References

Bobzien, S. (2004). Dialectical school. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Summer 2004 ed.). http://plato.stanford.edu/entries/dialectical-school/.
Long, A. A., & Sedley, D. N. (2001). The hellenistic philosophers (Vol. 1). Cambridge: Cambridge University Press.
McKirahan, R. (1994). Philosophy before Socrates. Indianapolis, IN: Hackett Publishing Company.
Prior, A. (1967). Past, present and future. Oxford: Oxford University Press.
Wolter, A. (1962). Duns Scotus: Philosophical writings. Edinburgh, UK: Nelson.
Woolhouse, R. S. (1973). Tensed modalities. Journal of Philosophical Logic, 2, 393–415.
Woolhouse, R. S. (1975). Leibniz's principle of predeterminate history. Studia Leibnitiana, 7, 207–228.
Synthese (2011) 181:317–352 DOI 10.1007/s11229-010-9804-5
Reasoning defeasibly about probabilities John L. Pollock
Received: 4 December 2009 / Accepted: 12 May 2010 / Published online: 21 January 2011 © Springer Science+Business Media B.V. 2011
Abstract In concrete applications of probability, statistical investigation gives us knowledge of some probabilities, but we generally want to know many others that are not directly revealed by our data. For instance, we may know prob(P/Q) (the probability of P given Q) and prob(P/R), but what we really want is prob(P/Q&R), and we may not have the data required to assess that directly. The probability calculus is of no help here. Given prob(P/Q) and prob(P/R), it is consistent with the probability calculus for prob(P/Q&R) to have any value between 0 and 1. Is there any way to make a reasonable estimate of the value of prob(P/Q&R)? A related problem occurs when probability practitioners adopt undefended assumptions of statistical independence simply on the basis of not seeing any connection between two propositions. This is common practice, but its justification has eluded probability theorists, and researchers are typically apologetic about making such assumptions. Is there any way to defend the practice? This paper shows that on a certain conception of probability—nomic probability—there are principles of “probable probabilities” that license inferences of the above sort. These are principles telling us that although certain inferences from probabilities to probabilities are not deductively valid, nevertheless the second-order probability of their yielding correct results is 1. This makes it defeasibly reasonable to make the inferences. Thus I argue that it is defeasibly reasonable to assume statistical independence when we have no information to the contrary. And I show that there is a function Y(r, s, a) such that if prob(P/Q) = r , prob(P/R) = s, and prob(P/U ) = a (where U is our background knowledge) then it is defeasibly reasonable to expect that prob(P/Q&R) = Y(r, s, a). Numerous other defeasible inferences are licensed by
John L. Pollock—Deceased. J. L. Pollock (B) Department of Philosophy, University of Arizona, Tucson, AZ 85721, USA
similar principles of probable probabilities. This has the potential to greatly enhance the usefulness of probabilities in practical application.

Keywords Probability · Statistical independence · Defeasible reasoning · Direct inference · Nomic probability · Epistemology

1 The problem of sparse probability knowledge

The use of probabilities is ubiquitous in philosophy, science, engineering, artificial intelligence, economics, and many other disciplines. It is generally supposed that the logical and mathematical structure of probabilities is well understood, and completely characterized by the probability calculus. The probability calculus is typically identified with some form of Kolmogoroff's axioms, often supplemented with an axiom of countable additivity. Mathematical probability theory is a mature subdiscipline of mathematics based upon these axioms, and forms the mathematical basis for most applications of probabilities in the sciences. There is, however, a problem with the supposition that this is all there is to the logical and mathematical structure of probabilities. The uninitiated often suppose that if we know a few basic probabilities, we can compute the values of many others just by applying the probability calculus. Thus it might be supposed that familiar sorts of statistical inference provide us with our basic knowledge of probabilities, and then appeal to the probability calculus enables us to compute other previously unknown probabilities. The picture is of a kind of foundations theory of the epistemology of probability, with the probability calculus providing the inference engine that enables us to get beyond whatever probabilities are discovered by direct statistical investigation. Regrettably, this simple image of the epistemology of probability cannot be correct. The difficulty is that the probability calculus is not nearly so powerful as the uninitiated suppose. If we know the probabilities of some basic propositions P, Q, R, S, . . ., it is rare that we will be able to compute, just by appeal to the probability calculus, a unique value for the probability of some logical compound like ((P&Q) ∨ (R&S)). To illustrate, suppose we know that prob(P) = .7 and prob(Q) = .6. What can we conclude about prob(P&Q)? All the probability calculus enables us to infer is that .3 ≤ prob(P&Q) ≤ .6, and that does not tell us much. Similarly, all we can conclude about prob(P ∨ Q) is that .7 ≤ prob(P ∨ Q) ≤ 1.0. In general, the probability calculus imposes constraints on the probabilities of logical compounds, but it falls far short of enabling us to compute unique values. Unless we come to a problem already knowing a great deal about the relevant probabilities, the probability calculus will not enable us to compute the values of unknown probabilities that subsequently become of interest to us. Suppose a problem is described by logical compounds of a set of simple propositions P1, . . . , Pn. Then to be able to compute the probabilities of all logical compounds of these simple propositions, what we must generally know is the probability prob((∼)P1 & . . . & (∼)Pn) of every conjunction of this form, where the tildes enclosed in parentheses can be either present or absent. These n-fold conjunctions are called Boolean conjunctions, and jointly they constitute a “partition”. Given fewer than all but one of them, the only constraint the probability calculus imposes on the probabilities of the remaining
Boolean conjunctions is that the sum of all of them must be 1. Together, the probabilities of all the Boolean conjunctions determine a complete “probability distribution”: an assignment of unique probabilities to every logical compound of the simple propositions. In theoretical accounts of the use of probabilities in any discipline, it is generally assumed that we come to a problem equipped with a complete probability distribution. However, in real life this assumption is totally unrealistic. In general, given n simple propositions, there will be 2^n logically independent probabilities of Boolean conjunctions. As Harman (1986) observed years ago, for a rather small number of simple propositions, there is a completely intractable number of logically independent probabilities. For example, given just 300 simple propositions, a grossly inadequate number for describing many real-life problems, there will be 2^300 logically independent probabilities of Boolean conjunctions. 2^300 is approximately equal to 10^90. To illustrate what an immense number this is, recent estimates of the number of elementary particles in the universe put it at 10^80–10^85. Thus to know the probabilities of all the Boolean conjunctions, we would have to know 5–10 orders of magnitude more logically independent probabilities than the number of elementary particles in the universe. Lest one think this is an unrealistic problem, consider a simple example. Pollock (2006a) describes a challenge problem for AI planners, which generalizes Kushmerick et al.'s (1995) “slippery gripper” problem. We are presented with a table on which there are 300 numbered blocks, and a panel of correspondingly numbered buttons. Pushing a button activates a robot arm which attempts to pick up the corresponding block and remove it from the table. We get 100 dollars for each block that is removed. Pushing a button costs two dollars. The hitch is that half of the blocks are greasy. If a block is not greasy, pushing the button will result in its being removed from the table with probability 1.0, but if it is greasy the probability is only 0.01. We are given exactly 300 opportunities to either push a button or do nothing. Between button pushes, we are given the opportunity to look at the table, which costs one dollar. Looking will reveal what blocks are still on the table, but will not reveal directly whether a block is greasy. What should we do? Humans find this problem terribly easy. An informal survey reveals that most people quickly produce the optimal plan: push each button once, and don't bother to look at the table. But when Pollock (2006a) surveyed existing AI planners, most could not even encode this problem, much less solve it. The difficulty is that there are too many logically independent probabilities. For every subset K of the 300 blocks, let p_{K,i} be the probability that, when K is the set of blocks on the table, block i is still on the table after the button corresponding to block i is pushed. There are 2^300 choices of K, so there are more than 2^300 probabilities p_{K,i} such that i ∈ K. Furthermore, none of them can be derived from any of the others. Thus they must each be encoded separately in describing a complete probability distribution for the problem. It seems to be impossible for a real cognitive agent to encode such a probability distribution. Although we humans cannot encode a complete probability distribution for the preceding problem, we can deal with problems like the slippery blocks problem. How do we do that?
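Incidentally, the counting claims above are easy to verify mechanically; here is a quick Python check (our own illustration, not part of the original argument):

# 2^300 has 91 digits, i.e., it is on the order of 10^90.
n = 2**300
print(len(str(n)))    # 91
print(n // 10**90)    # 2, so 2^300 is roughly 2 x 10^90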
It is, apparently, computationally impossible for the requisite probabilities to be stored in us from the start, so they must be produced one at a time as we need them. If they are produced as we need them, there must be some kind of inference
mechanism that has the credentials to produce rationally acceptable estimates. We have seen that, unless we begin with more information than it is computationally possible for us to store, we cannot derive the new probability estimates from previously accepted probabilities by way of the probability calculus. So there must be some other rational inference procedures that enable us to generate new probability estimates that do not follow logically, via the probability calculus, from prior probability estimates. What might these rational inference procedures be? I will call this the problem of sparse probability knowledge. It is computationally impossible for us to store explicit knowledge of a complete probability distribution. At any given time, our knowledge of probabilities is worse than just incomplete. The set of probabilities we know is many orders of magnitude smaller than the set of all true probabilities. How then can we be as successful as we are in applying probability to real-world problems? It is noteworthy that in applying probabilities to concrete problems, probability practitioners commonly adopt undefended assumptions of statistical independence. The probabilities prob(P) and prob(Q) are statistically independent iff prob(P&Q) = prob(P) · prob(Q). An equivalent definition is that prob(P/Q) = prob(P). In the practical use of probabilities it is almost universally assumed, often apologetically, that probabilities are independent unless we have some reason for thinking otherwise. In most real-world applications of probabilities, if we did not make such assumptions about independence we would not be able to compute any of the complex probabilities that interest us. Imagine a case in which we know that the probability is .3 of a Xian (a fictional Chinese car) having a defective door lock if it has power door locks and was manufactured in a certain plant, whereas the probability of its having a defective door lock otherwise is only .01. We also know that the probability of a Xian being manufactured in that plant is .33, and the probability of a Xian having power door locks is .85. If we know nothing else of relevance, we will normally assume that whether the car has power door locks is statistically independent of whether it was manufactured in that plant, and so compute prob(power-locks & plant) = .33 × .85 = .28. Then we can compute the general probability of a Xian having defective door locks:

prob(defect) = prob(defect/power-locks & plant) · prob(power-locks & plant)
               + prob(defect/∼(power-locks & plant)) · (1 − prob(power-locks & plant))
             = .3 × .28 + .01 × (1 − .28) = .09.

We could not perform this, or similar computations, without the assumption of independence. The independence assumption is a defeasible assumption, because obviously we can discover that conditions we thought were independent are unexpectedly correlated. The probability calculus can give us only necessary truths about probabilities, so the justification of such a defeasible assumption must have some other source.
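For concreteness, the computation just performed can be spelled out in a few lines of Python (a sketch of the worked example above; the variable names are ours):

# Known generic probabilities for the fictional Xian example.
p_plant = 0.33                 # prob(plant)
p_locks = 0.85                 # prob(power-locks)
p_defect_given_both = 0.30     # prob(defect / power-locks & plant)
p_defect_otherwise = 0.01      # prob(defect / ~(power-locks & plant))

# The (defeasible) independence assumption licenses the product:
p_both = p_locks * p_plant     # prob(power-locks & plant), about .28

# Total probability over the two-cell partition:
p_defect = (p_defect_given_both * p_both
            + p_defect_otherwise * (1 - p_both))
print(round(p_both, 2), round(p_defect, 2))   # 0.28 0.09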
If we have a problem in which we can assume that most propositions are statistically independent of one another, there are compact techniques for storing complete probability distributions using what are called “Bayesian nets” (Pearl 1988). The use of Bayesian nets allows us to explicitly store just that subset of probabilities that cannot be derived from each other by assuming statistical independence, and provides an efficient inference mechanism for recovering derivable probabilities from them. However, this is not the entire solution to the problem of sparse probability knowledge, because in the slippery blocks problem none of the probabilities p_{K,i} can be derived from others, so they would all have to be encoded separately in a Bayesian net, and that would make the Bayesian net impossibly large. I will argue that a defeasible assumption of statistical independence is just the tip of the iceberg. There are multitudes of defeasible inferences that we can make about probabilities, and a very rich mathematical theory grounding them. It is these defeasible inferences that enable us to make practical use of probabilities without being able to deduce everything we need via the probability calculus. I will argue that, on a certain conception of probability, there are mathematically derivable second-order probabilities to the effect that various inferences about first-order probabilities, although not deductively valid, will nonetheless produce correct conclusions with probability 1, and this makes it reasonable to accept these inferences defeasibly. The second-order principles are principles of probable probabilities.

2 Two kinds of probability

No doubt the currently most popular theory of the foundations of probability is the subjectivist theory due originally to Ramsey and Savage, and developed at length by many more recent scholars. However, my solution to the problem of sparse probability knowledge requires that we start with objective probabilities. Historically, there have been two general approaches to probability theory. What I will call generic probabilities1 are general probabilities, relating properties or relations. The generic probability of an A being a B is not about any particular A, but rather about the property of being an A. In this respect, its logical form is the same as that of relative frequencies. I write generic probabilities using lower case “prob” and free variables: prob(Bx/Ax). For example, we can talk about the probability of an adult male of Slavic descent being lactose intolerant. This is not about any particular person; it expresses a relationship between the property of being an adult male of Slavic descent and the property of being lactose intolerant. Most forms of statistical inference or statistical induction are most naturally viewed as giving us information about generic probabilities. On the other hand, for many purposes we are more interested in propositions that are about particular persons, or more generally, about specific matters of fact. For example, in deciding how to treat Herman, an adult male of Slavic descent, his doctor may want to know the probability that Herman is lactose intolerant. This illustrates the need for a kind of probability that attaches to propositions rather than relating properties

1 In the past, I followed Jackson and Pargetter (1973) in calling these “indefinite probabilities”, but I never
liked that terminology.
and relations. These are sometimes called “single case probabilities”, although that terminology is not very good, because such probabilities can attach to propositions of any logical form. For example, we can ask how probable it is that there are no human beings over the age of 130. In the past, I called these “definite probabilities”, but now I will refer to them as singular probabilities. The distinction between singular and generic probabilities is commonly overlooked by contemporary probability theorists, perhaps because of the popularity of subjective probability (which has no way to make sense of generic probabilities). But most objective approaches to probability tie probabilities to relative frequencies in some essential way, and the resulting probabilities have the same logical form as the relative frequencies. That is, they are generic probabilities. The simplest theories identify generic probabilities with relative frequencies (Russell 1948; Braithwaite 1953; Kyburg 1961, 1974a; Sklar 1970, 1973).2 The simplest objection to such “finite frequency theories” is that we often make probability judgments that diverge from relative frequencies. For example, we can talk about a coin being fair (and so the generic probability of a flip landing heads is 0.5) even when it is flipped only once and then destroyed (in which case the relative frequency is either 1 or 0). For understanding such generic probabilities, we need a notion of probability that talks about possible instances of properties as well as actual instances. Theories of this sort are sometimes called “hypothetical frequency theories”. C. S. Peirce was perhaps the first to make a suggestion of this sort. Similarly, the statistician R. A. Fisher, regarded by many as “the father of modern statistics”, identified probabilities with ratios in a “hypothetical infinite population, of which the actual data is regarded as constituting a random sample” (1922, p. 311). Popper (1956, 1957, 1959) endorsed a theory along these lines and called the resulting probabilities propensities. Kyburg (1974a) was the first to construct a precise version of this theory (although he did not endorse the theory), and it is to him that we owe the name “hypothetical frequency theories”. Kyburg (1974a) also insisted that von Mises should be considered a hypothetical frequentist. There are obvious difficulties in spelling out the details of a hypothetical frequency theory. More recent attempts to formulate precise versions of what might be regarded as hypothetical frequency theories are van Fraassen (1981), Bacchus (1990), Halpern (1990), Pollock (1990), and Bacchus et al. (1996). I will take my jumping-off point to be the theory of Pollock (1990), which I will sketch briefly in section three. After brief thought, most philosophers find the distinction between singular and generic probabilities intuitively clear. However, this is a distinction that sometimes puzzles probability theorists, many of whom have been raised on an exclusive diet of singular probabilities. They are sometimes tempted to confuse generic probabilities with probability distributions over random variables. Although historically most theories of objective probability were theories of generic probability, mathematical probability theory tends to focus exclusively on singular probabilities. When mathematicians talk about variables in connection with probability, they usually mean “random variables”, which are not variables at all but functions assigning values to the different members of a population. Generic probabilities have single numbers as
When mathematicians talk about variables in connection with probability, they usually mean “random variables”, which are not variables at all but functions assigning values to the different members of a population. Generic probabilities have single numbers as 2 Kneale (1949) traces the frequency theory to R. L. Ellis, writing in the 1840’s, and Venn (1888) and
C. S. Peirce in the 1880’s and 1890’s.
their values. Probability distributions over random variables are just what their name implies—distributions of singular probabilities rather than single numbers. It has always been acknowledged that for practical decision-making we need singular probabilities rather than generic probabilities. For example, in deciding whether to trust the door locks on my Xian, I want to know the probability of its having defective locks, not the probability of Xians in general having defective locks. So theories that take generic probabilities as basic need a way of deriving singular probabilities from them. Theories of how to do this are theories of direct inference. Theories of objective generic probability propose that statistical inference gives us knowledge of generic probabilities, and then direct inference gives us knowledge of singular probabilities. Reichenbach (1949) pioneered the theory of direct inference. The basic idea is that if we want to know the singular probability prob(Fa), we look for the narrowest reference class (or reference property) G such that we know the generic probability prob(F x/Gx) and we know Ga, and then we identify prob(Fa) with prob(F x/Gx). For example, actuarial reasoning aimed at setting insurance rates proceeds in roughly this fashion. Kyburg (1974a) was the first to attempt to provide firm logical foundations for direct inference. Pollock (1990) took that as its starting point and constructed a modified theory with a more epistemological orientation. The present paper builds upon some of the basic ideas of the latter. The appeal to generic probabilities and direct inference has seemed promising for avoiding the computational difficulties attendant on the need for a complete probability distribution. Instead of assuming that we come to a problem with an antecedently given complete probability distribution, one can assume more realistically that we come to the problem with some limited knowledge of generic probabilities and then infer singular probabilities from the latter as we need them. For example, I had no difficulty giving a description of the probabilities involved in the slippery blocks problem, but I did that by giving an informal description of the generic probabilities rather than the singular probabilities. We described it by reporting that the generic probability prob(Gx/Bx) of a block being greasy is .5, and the generic probability prob(∼ T x(s + 1)/T xs & P xs & Gx) of a block being successfully removed from the table at step s if it is greasy is .01, but prob(∼ T x(s + 1)/T xs & P xs & ∼ Gx) = 1.0. We implicitly assumed that prob(∼ T x(s + 1)/ ∼ T xs) = 1. These probabilities completely describe the problem. For solving the decision-theoretic planning problem, we need singular probabilities rather than generic probabilities, but one might hope that these can be recovered by direct inference from this small set of generic probabilities as they are needed. Unfortunately, I do not think that this hope will be realized. The appeal to generic probabilities and direct inference helps a bit with the problem of sparse probability knowledge, but it falls short of constituting a complete solution. The difficulty is that the problem recurs at the level of generic probabilities. Direct statistical investigation will apprise us of the values of some generic probabilities, and then others can be derived by appeal to the probability calculus. But just as for singular probabilities, the probability calculus is a weak crutch. 
We will rarely be able to derive more than rather broad constraints on unknown probabilities. A simple illustration of this difficulty arises when we know that prob(Ax/Bx) = r and prob(Ax/Cx) = s, where r ≠ s, and we know both that Ba and Ca. What should we conclude about the value of
prob(Aa)? Direct inference gives us defeasible reasons for drawing the conflicting conclusions that prob(Aa) = r and prob(Aa) = s, and standard theories of direct inference give us no way to resolve the conflict, so they end up telling us that there is no conclusion we can justifiably draw about the value of prob(Aa). Is this reasonable? Suppose we have two unrelated diagnostic tests for some rare disease, and Bernard tests positive on both tests. Intuitively, it seems this should make it more probable that Bernard has the disease than if we only have the results of one of the tests. This suggests that, given the values of prob(Ax/Bx) and prob(Ax/Cx), there ought to be something useful we can say about the value of prob(Ax/Bx & Cx), and then we can apply direct inference to the latter to compute the singular probability that Bernard has the disease. Existing theories give us no way to do this, and the probability calculus imposes no constraint at all on the value of prob(Ax/Bx & Cx). I believe that standard theories of direct inference are much too weak to solve the problem of sparse probability knowledge. What I will argue in this paper is that new mathematical results, coupled with ideas from the theory of nomic probability introduced in Pollock (1990), provide the justification for a wide range of new principles supporting defeasible inferences about the expectable values of unknown probabilities. These principles include familiar-looking principles of direct inference, but they include many new principles as well. For example, among them is a principle enabling us to defeasibly estimate the probability of Bernard having the disease when he tests positive on both tests. I believe that this broad collection of new defeasible inference schemes provides the solution to the problem of sparse probability knowledge and explains how probabilities can be truly useful even when we are massively ignorant about most of them.

3 Nomic probability

Pollock (1990) developed a possible worlds semantics for objective generic probabilities,3 and I will take that as my starting point for the present theory of probable probabilities. The proposal was that we can identify the nomic probability prob(Fx/Gx) with the proportion of physically possible G's that are F's. A physically possible G is defined to be an ordered pair ⟨w, x⟩ such that w is a physically possible world (one compatible with all of the physical laws) and x has the property G at w. Let us define the subproperty relation as follows:

F ⊑ G iff it is physically necessary (follows from true physical laws) that (∀x)(Fx → Gx).
F ≅ G iff it is physically necessary (follows from true physical laws) that (∀x)(Fx ↔ Gx).

We can think of the subproperty relation ⊑ as a kind of nomic entailment relation (holding between properties rather than propositions). More generally, F and G can have any number of free variables (not necessarily the same number), in which case F ⊑ G iff the universal closure of (F → G) is physically necessary.

3 Somewhat similar semantics were proposed by Halpern (1990) and Bacchus et al. (1996).
Given a suitable proportion function ρ, we could stipulate that, where F and G are the sets of physically possible F's and G's respectively:

probx(Fx/Gx) = ρ(F, G).4

However, it is unlikely that we can pick out the right proportion function without appealing to prob itself, so the postulate is simply that there is some proportion function related to prob as above. This is merely taken to tell us something about the formal properties of prob. Rather than axiomatizing prob directly, it turns out to be more convenient to adopt axioms for the proportion function. Proportion functions are a generalization of the measure functions studied in measure theory. Pollock (1990) showed that, given the assumptions adopted there, ρ and prob are interdefinable, so the same empirical considerations that enable us to evaluate prob inductively also determine ρ.

Note that probx is a variable-binding operator, binding the variable x. When there is no danger of confusion, I will omit the subscript "x", but sometimes we will want to quantify into probability contexts, in which case it will be important to distinguish between the variables bound by "prob" and those that are left free. To simplify expressions, I will often omit the variables, writing "prob(F/G)" for "prob(Fx/Gx)" when no confusion will result. It is often convenient to write proportions in the same logical form as probabilities, so where ϕ and θ are open formulas with free variable x, let ρx(ϕ/θ) = ρ({x|ϕ & θ}, {x|θ}). Note that ρx is a variable-binding operator, binding the variable x. Again, when there is no danger of confusion, I will typically omit the subscript "x".

I will make three classes of assumptions about the proportion function. Let #X be the cardinality of a set X. If Y is finite, I assume:

ρ(X, Y) = #(X ∩ Y)/#Y.
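In the finite case the proportion function is just relative frequency, and it is trivial to compute. The following minimal Python sketch is my illustration only (the names are mine, not part of Pollock's formal apparatus; his own implementation work was in Common LISP):

```python
from fractions import Fraction

def rho(X, Y):
    """Finite-case proportion function: rho(X, Y) = #(X ∩ Y) / #Y.
    Only defined here for finite, nonempty Y."""
    X, Y = set(X), set(Y)
    return Fraction(len(X & Y), len(Y))

# The proportion of even numbers among 0,...,9:
print(rho({n for n in range(10) if n % 2 == 0}, range(10)))  # 1/2
```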
However, for present purposes the proportion function is most useful in talking about proportions among infinite sets. The sets F and G will invariably be infinite, if for no other reason than that there are infinitely many physically possible worlds in which there are F's and G's. My second set of assumptions is that the standard axioms for conditional probabilities hold for proportions. These axioms automatically hold for relative frequencies among finite sets, so the assumption is just that they also hold for proportions among infinite sets. That further assumptions are needed derives from the fact that the standard probability calculus is a calculus of singular probabilities rather than generic probabilities. A calculus of generic probabilities is related to the calculus of singular probabilities in a manner roughly analogous to the relationship between the predicate calculus and the propositional calculus.

4 Probabilities relating n-place relations are treated similarly. I will generally just write the one-variable versions of various principles, but they generalize to n-variable versions in the obvious way.
123
326
Synthese (2011) 181:317–352
Thus we get some principles pertaining specifically to relations that hold for generic probabilities but cannot even be formulated in the standard probability calculus. For instance, Pollock (1990) endorsed the following two principles:

Individuals
prob(Fxy/Gxy & y = a) = prob(Fxa/Gxa).

PPROB
prob(Fx/Gx & prob(Fx/Gx) = r) = r.

I will not assume either of these principles in this paper, but I mention them just to illustrate that there are reasonable-seeming principles governing generic probabilities that are not even well formed in the standard probability calculus. What I do need in the present paper is three assumptions about proportions that go beyond merely imposing the standard axioms for the probability calculus. The three assumptions I will make are:

Finite set principle
For any set B, N > 0, and open formula ϕ, ρX(ϕ(X) / X ⊆ B & #X = N) = ρx1,...,xN(ϕ({x1,...,xN}) / x1,...,xN are pairwise distinct & x1,...,xN ∈ B).

Projection principle
If 0 ≤ p, q ≤ 1 and (∀y)(Gy → ρx(Fx/Rxy) ∈ [p, q]), then ρx,y(Fx/Rxy & Gy) ∈ [p, q].5

Crossproduct principle
If C and D are nonempty, ρ(A × B, C × D) = ρ(A, C) · ρ(B, D).

Note that these three principles are all theorems of elementary set theory when the sets in question are finite. For instance, to illustrate the finite case of the projection principle, let F be "x is an even non-negative integer", let Rxy be "x and y are non-negative integers and x ≤ y", and let Gy be "y ∈ {5, 6, 7}". Then ρx(Fx/Rx5) = ρx(Fx/Rx7) = 1/2 and ρx(Fx/Rx6) = 4/7. Thus (∀y)(Gy → ρx(Fx/Rxy) ∈ [1/2, 4/7]). And ρx,y(Fx/Rxy & Gy) = 11/21 ∈ [1/2, 4/7].

5 Note that this is a different (and more conservative) principle than the one called "Projection" in Pollock (1990).
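These finite-case claims are easy to check by direct computation. Here is a short Python sketch (mine, for illustration only) that verifies the projection-principle example above and spot-checks the crossproduct principle on small sets:

```python
from fractions import Fraction
from itertools import product

def rho(X, Y):
    X, Y = set(X), set(Y)
    return Fraction(len(X & Y), len(Y))

# Projection-principle example: F = even non-negative integers,
# Rxy = "x and y are non-negative integers and x <= y", G = {5, 6, 7}.
F = {x for x in range(8) if x % 2 == 0}
R = lambda y: set(range(y + 1))          # the set {x : Rxy}, for fixed y

for y in (5, 6, 7):
    print(y, rho(F, R(y)))               # 1/2, 4/7, 1/2 -- all in [1/2, 4/7]

pairs = [(x, y) for y in (5, 6, 7) for x in R(y)]
fav = [p for p in pairs if p[0] in F]
print(Fraction(len(fav), len(pairs)))    # 11/21, also in [1/2, 4/7]

# Crossproduct principle, checked on small finite sets:
A, B, C, D = {1, 2}, {1}, {1, 2, 3}, {1, 2}
lhs = rho(set(product(A, B)), set(product(C, D)))
print(lhs == rho(A, C) * rho(B, D))      # True
```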
The crossproduct principle holds for finite sets because #(A × B) = (#A) · (#B), and hence

ρ(A × B, C × D) = #((A × B) ∩ (C × D))/#(C × D)
  = #((A ∩ C) × (B ∩ D))/#(C × D)
  = (#(A ∩ C) · #(B ∩ D))/(#C · #D)
  = (#(A ∩ C)/#C) · (#(B ∩ D)/#D)
  = ρ(A, C) · ρ(B, D).
My assumption is simply that ρ continues to have these algebraic properties even when applied to infinite sets. I take it that this is a fairly conservative set of assumptions. I often hear the objection that in affirming the Crossproduct Principle, I must be making a hidden assumption of statistical independence. However, that is to confuse proportions with probabilities. The Crossproduct Principle is about proportions—not probabilities. For finite sets, proportions are computed by simply counting members and computing ratios of cardinalities. It makes no sense to talk about statistical independence in this context. For infinite sets we cannot just count members any more, but the algebra is the same. It is because the algebra of proportions is simpler than the algebra of probabilities that it is useful to axiomatize nomic probabilities indirectly by adopting axioms for proportions.

The preceding amounts to a "realistic possible worlds semantics" for nomic probability. A realistic possible worlds semantics takes possible worlds, objects in possible worlds, properties, relations, and propositions as basic. There are many different approaches to how these concepts are to be understood, but for the most part it makes no difference to the present paper what approach is taken. All that my mathematics requires is that propositions, properties, and relations are closed under various operations that everyone grants them to be closed under. As long as the proportion function satisfies my postulates, the mathematical results follow.

To be contrasted with realistic possible worlds semantics are model-theoretic semantics (e.g., Halpern 1990; Bacchus et al. 1996). A model-theoretic approach constructs set-theoretic models and interprets formal languages in terms of them. It is mathematically precise, but it is only as good as the model theory. You can construct model theories that validate almost anything. If your objective is to use model theory to illuminate pre-analytic concepts, it is important to justify the model theory. Model-theoretic approaches to modalities rely upon formal analogues to possible worlds, but it has become apparent that the formal analogues are not precise. The simplest analogue generates Carnap's modal logic, which no one thinks is right. To get even S5 one must make basically ad hoc moves regarding the accessibility relation. This is a topic I discussed at great length in my (1984a). What I argued was that to get the model theory right, you have to start with a realistic possible worlds semantics and justify it. The appeal to model theory cannot replace the appeal to a realistic possible worlds semantics.

Pollock (1990) derived the entire epistemological theory of nomic probability from a single epistemological principle coupled with a mathematical theory that amounts to a calculus of nomic probabilities. The single epistemological principle that underlies probabilistic reasoning is the statistical syllogism, which can be formulated as follows:
Statistical syllogism
If F is projectible with respect to G and r > 0.5, then Gc & prob(F/G) ≥ r is a defeasible reason for Fc, the strength of the reason being a monotonic increasing function of r.

I take it that the statistical syllogism is a very intuitive principle, and it is clear that we employ it constantly in our everyday reasoning. For example, suppose you read in the newspaper that George Bush is visiting Guatemala, and you believe what you read. What justifies your belief? No one believes that everything printed in the newspaper is true. What you believe is that certain kinds of reports published in certain kinds of newspapers tend to be true, and this report is of that kind. It is the statistical syllogism that justifies your belief.

The projectibility constraint in the statistical syllogism is the familiar projectibility constraint on inductive reasoning, first noted by Goodman (1955). One might wonder what it is doing in the statistical syllogism. But it was argued in Pollock (1990), on the strength of what were taken to be intuitively compelling examples, that the statistical syllogism must be so constrained. Furthermore, it was shown that without a projectibility constraint, the statistical syllogism is self-defeating, because for any intuitively correct application of the statistical syllogism it is possible to construct a conflicting (but unintuitive) application to a contrary conclusion. This is the same problem that Goodman first noted in connection with induction. Pollock (1990) then went on to argue that the projectibility constraint on induction derives from that on the statistical syllogism.

The projectibility constraint is important, but also problematic because no one has a good analysis of it. I will not discuss it further here. I will just assume, without argument, that the second-order probabilities employed below in the theory of probable probabilities satisfy the projectibility constraint, and hence can be used in the statistical syllogism. The statistical syllogism is a defeasible inference scheme, so it is subject to defeat. I believe that the only primitive (underived) principle of defeat required for the statistical syllogism is that of subproperty defeat:
Subproperty defeat for the statistical syllogism
If H is projectible with respect to G, then Hc & prob(F/G&H) < prob(F/G) is an undercutting defeater for the inference by the statistical syllogism from Gc & prob(F/G) ≥ r to Fc.6

In other words, information about c that lowers the probability of its being F constitutes a defeater. Note that if prob(F/G&H) is high, one may still be able to

6 There are two kinds of defeaters. Rebutting defeaters attack the conclusion of an inference, and undercutting defeaters attack the inference itself without attacking the conclusion. Here I assume some form of the OSCAR theory of defeasible reasoning (Pollock 1995). For a sketch of that theory see Pollock (2006b).
make a weaker inference to the conclusion that Fc, but from the distinct premise Gc & prob(F/G&H) = s.

Pollock (1990) argued that we need additional defeaters for the statistical syllogism besides subproperty defeaters, and formulated several candidates for such defeaters. But one of the conclusions of the research described in this paper is that the additional defeaters can all be viewed as derived defeaters, with subproperty defeaters being the only primitive defeaters for the statistical syllogism.

4 Indifference

Principles of probable probabilities are derived from combinatorial theorems about proportions in finite sets. I will begin with a very simple principle that is in fact not very useful, but will serve as a template for the discussion of more useful principles. Suppose we have a set of 10,000,000 objects. I announce that I am going to select a subset, and ask you how many members it will have. Most people will protest that there is no way to answer this question. It could have any number of members from 0 to 10,000,000. However, if you answer, "Approximately 5,000,000," you will almost certainly be right. This is because, although there are subsets of all sizes from 0 to 10,000,000, there are many more subsets whose sizes are approximately 5,000,000 than there are of any other size. In fact, 99% of the subsets have cardinalities differing from 5,000,000 by less than .08%. If we let "x ≈δ y" mean "the difference between x and y is less than or equal to δ", the general theorem is:

Finite indifference principle
For every ε, δ > 0 there is an N such that if U is finite and #U > N, then ρX(ρ(X, U) ≈δ 0.5 / X ⊆ U) ≥ 1 − ε.

In other words, the proportion of subsets of U which are such that ρ(X, U) is approximately equal to .5, to any given degree of approximation, goes to 1 as the size of U goes to infinity. To see why this is true, suppose #U = n. If r ≤ n, the number of r-membered subsets of U is C(n, r) = n!/(r!(n − r)!). It is illuminating to plot C(n, r) for variable r and various fixed values of n.7 See Fig. 1. This illustrates that the sizes of subsets of U will cluster around n/2, and they cluster more tightly as n increases. This is precisely what the Indifference Principle tells us. The reason the Indifference Principle holds is that C(n, r) becomes "needle-like" in the limit. As we proceed, I will state a number of similar combinatorial theorems, and in each case they have similar intuitive explanations.

7 Note that throughout this paper I employ the definition of n! in terms of the Euler gamma function. Specifically, n! = ∫0^∞ t^n e^(−t) dt. This has the result that n! is defined for any positive real number n, not just for integers, but for the integers the definition agrees with the ordinary recursive definition. This makes the mathematics more convenient.
Fig. 1 C(n, r ) for n = 100, n = 1000, and n = 10000
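The clustering that Fig. 1 depicts can also be seen numerically. The following Python sketch (my illustration; the 1% cutoff is an arbitrary choice of δ) computes, exactly, the fraction of subsets of an n-element set whose relative size is within 1% of one half:

```python
from math import comb
from fractions import Fraction

def mass_near_half(n, delta):
    """Exact fraction of the 2**n subsets of an n-set whose relative
    size lies within delta of 0.5."""
    lo = max(int((0.5 - delta) * n), 0)
    hi = min(int((0.5 + delta) * n), n)
    near = sum(comb(n, r) for r in range(lo, hi + 1))
    return float(Fraction(near, 2 ** n))

for n in (100, 1_000, 10_000):
    print(n, round(mass_near_half(n, 0.01), 4))
# Roughly 0.24, 0.47, 0.95: the mass near one half climbs toward 1
# as n grows, as the finite indifference principle requires.
```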
The cardinalities of relevant sets are products of terms of the form C(n, r), and their distribution becomes needle-like in the limit. In this paper, I will omit the proofs of theorems. They will be presented elsewhere in detail, and can be found on my website in a much longer version of this paper (http://oscarhome.soc-sci.arizona.edu/ftp/PAPERS/Probable%20Probabilities%20with%20proofs.pdf).

The finite indifference principle is a mathematical theorem about finite sets. It tells us that for fixed ε, δ > 0, there is an N such that if U is finite but contains at least N members, then the proportion of subsets X of U which are such that ρ(X, U) ≈δ 0.5 is greater than 1 − ε. This suggests that the proportion is also greater than 1 − ε when U is infinite. But if the proportion is greater than 1 − ε for every ε > 0, it follows that the proportion is 1. In other words:

If U is infinite then for every δ > 0, ρX(ρ(X, U) ≈δ 0.5 / X ⊆ U) = 1.
Given the rather simple assumptions I made about ρ in section three, we can derive this infinitary principle from the finite principle. First, we can use familiar-looking mathematics to prove:

Law of large numbers for proportions
If B is infinite and ρ(A/B) = p, then for every ε, δ > 0 there is an N such that ρX(ρ(A/X) ≈δ p / X ⊆ B & X is finite & #X ≥ N) ≥ 1 − ε.
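A Monte Carlo illustration may help (mine; the particular numbers are arbitrary, and random sampling merely stands in for the exact counting the theorem is about). Draw many N-membered subsets X of a large finite B with ρ(A/B) = 0.2 and see what fraction have ρ(A/X) within .02 of 0.2:

```python
import random

random.seed(0)
B = range(200_000)               # a large finite stand-in for an infinite B
in_A = lambda x: x % 5 == 0      # rho(A/B) = p = 0.2 exactly

N, trials, delta, hits = 2_500, 500, 0.02, 0
for _ in range(trials):
    X = random.sample(B, N)      # a random N-membered subset of B
    hits += abs(sum(map(in_A, X)) / N - 0.2) <= delta
print(hits / trials)             # close to 1: almost all X agree with B
```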
Note that unlike Laws of Large Numbers for probabilities, the Law of Large Numbers for Proportions does not require an assumption of statistical independence. This is because it is derived from the crossproduct principle, and as remarked in section three, no such assumption is required (or even intelligible) for the crossproduct principle. Given the law of large numbers, the finite indifference principle can be shown to entail:
Infinitary indifference principle
If U is infinite then for every δ > 0, ρX(ρ(X, U) ≈δ 0.5 / X ⊆ U) = 1.

Nomic probabilities are proportions among physically possible objects. For any property F that is not extraordinarily contrived, the set F of physically possible F's will be infinite.8 Thus the infinitary indifference principle for proportions implies an analogous principle for nomic probabilities:

Probabilistic indifference principle
For any property G and for every δ > 0, probX(prob(X/G) ≈δ 0.5 / X ⊑ G) = 1.9

Next note that we can apply the statistical syllogism to the probability formulated in the probabilistic indifference principle. For every δ > 0, this gives us a defeasible reason for expecting that if F ⊑ G, then prob(F/G) ≈δ 0.5, and these conclusions jointly entail that prob(F/G) = 0.5. For any property F, (F & G) ⊑ G, and prob(F/G) = prob(F & G/G). Thus we are led to a defeasible inference scheme:

Indifference principle
For any properties F and G, it is defeasibly reasonable to assume that prob(F/G) = 0.5.

The indifference principle is my first example of a principle of probable probabilities. We have a quadruple of principles that go together: (1) the finite indifference principle, which is a theorem of combinatorial mathematics; (2) the infinitary indifference principle, which follows from the finite principle given the law of large numbers for proportions; (3) the probabilistic indifference principle, which is a theorem derived from (2); and (4) the Indifference Principle, which is a principle of defeasible reasoning that follows from (3) with the help of the statistical syllogism.

8 The following principles apply only to properties for which there are infinitely many physically possible instances, but I will not explicitly include the qualification "non-contrived" in the principles.

9 If we could assume countable additivity for nomic probability, the Indifference Principle would imply that probX(prob(X/G) = 0.5 / X ⊑ G) = 1. Countable additivity is generally assumed in mathematical probability theory, but most of the important writers in the foundations of probability theory, including de Finetti (1974), Reichenbach (1949), Jeffrey (1983), Skyrms (1980), Savage (1954), and Kyburg (1974a), have either questioned it or rejected it outright. Pollock (2006a) gives what I consider to be a compelling counter-example to countable additivity. So I will have to remain content with the more complex formulation of the Indifference Principle.
All of the principles of probable probabilities that I will discuss have analogous quadruples of principles associated with them. Rather than tediously listing all four principles in each case, I will encapsulate the four principles in the simple form:

Expectable indifference principle
For any properties F and G, the expectable value of prob(F/G) = 0.5.

So in talking about expectable values, I am talking about this entire quadruple of principles.

I have chosen the indifference principle as my first example of a principle of probable probabilities because the argument for it is simple and easy to follow. However, as I indicated at the start, this principle is only occasionally useful. If we were choosing the properties F in some random way, it would be reasonable to expect that prob(F/G) = 0.5. However, pairs of properties F and G which are such that prob(F/G) = 0.5 are not very useful to us from a cognitive perspective, because knowing that something is a G then carries no information about whether it is an F. As a result, we usually only enquire about the value of prob(F/G) when we have reason to believe there is a connection between F and G such that prob(F/G) ≠ 0.5. Hence in actual practice, application of the indifference principle to cases that really interest us will almost invariably be defeated. This does not mean, however, that the indifference principle is never useful. For instance, if I give Jones the opportunity to pick either of two essentially identical balls, in the absence of information to the contrary it seems reasonable to take the probability of either choice to be .5. This can be justified as an application of either the indifference principle or the generalized indifference principle.

That applications of the indifference principle are often defeated illustrates an important point about nomic probability and principles of probable probabilities. The fact that a nomic probability is 1 does not mean that there are no counter-instances. In fact, there may be infinitely many counter-instances. Consider the probability of a real number being irrational. Plausibly, this probability is 1, because the cardinality of the set of irrationals is infinitely greater than the cardinality of the set of rationals. But there are still infinitely many rationals. The set of rationals is infinite, but it has measure 0 relative to the set of real numbers.

A second point is that in classical probability theory (which is about singular probabilities), conditional probabilities are defined as ratios of unconditional probabilities:

prob(P/Q) = prob(P&Q)/prob(Q).

However, for generic probabilities, there are no unconditional probabilities, so conditional probabilities must be taken as primitive. These are sometimes called "Popper functions". The first people to investigate them were Popper (1938, 1959) and the mathematician Renyi (1955). If conditional probabilities are defined as above, prob(P/Q) is undefined when prob(Q) = 0. However, for nomic probabilities, prob(F/G&H)
can be perfectly well-defined even when prob(G/H) = 0. One consequence of this is that, unlike in the standard probability calculus, if prob(F/G) = 1, it does not follow that prob(F/G&H) = 1. Specifically, this can fail when prob(H/G) = 0. Thus, for example, prob(2x is irrational/x is a real number) = 1 but prob(2x is irrational/x is a real number & x is rational) = 0. In the course of developing the theory of probable probabilities, we will find numerous examples of this phenomenon, and they will generate defeaters for the defeasible inferences licensed by our principles of probable probabilities.

5 Independence

Now let us turn to a truly useful principle of probable probabilities. It was remarked above that probability practitioners commonly assume statistical independence when they have no reason to think otherwise, and so compute that prob(A&B/C) = prob(A/C) · prob(B/C). In other words, they assume that A and B are statistically independent relative to C. This assumption is ubiquitous in almost every application of probability to real-world problems. However, the justification for such an assumption has heretofore eluded probability theorists, and when they make such assumptions they tend to do so apologetically. We are now in a position to provide a justification for a general assumption of statistical independence. Although it is harder to prove than the finite indifference principle, the following combinatorial principle holds in general:

Finite independence principle
For 0 ≤ r, s ≤ 1 and for every ε, δ > 0 there is an N such that if U is finite and #U > N, then ρX,Y,Z(ρ(X ∩ Y, Z) ≈δ r·s / X, Y, Z ⊆ U & ρ(X, Z) = r & ρ(Y, Z) = s) ≥ 1 − ε.

In other words, for a large finite set U, subsets X, Y and Z of U tend to be such that ρ(X ∩ Y, Z) is approximately equal to ρ(X, Z) · ρ(Y, Z), and for any fixed degree of approximation, the proportion of subsets of U satisfying this approximation goes to 1 as the size of U goes to infinity. Given the law of large numbers for proportions, the finite independence principle entails:
Infinitary independence principle
For 0 ≤ r, s ≤ 1, if U is infinite then for every δ > 0: ρX,Y,Z(ρ(X ∩ Y, Z) ≈δ r·s / X, Y, Z ⊆ U & ρ(X, Z) = r & ρ(Y, Z) = s) = 1.

As before, this entails:

Probabilistic independence principle
For 0 ≤ r, s ≤ 1 and for any property U, for every δ > 0: probX,Y,Z(prob(X&Y/Z) ≈δ r·s / X, Y, Z ⊑ U & prob(X/Z) = r & prob(Y/Z) = s) = 1.

Again, applying the statistical syllogism to the second-order probability in the probabilistic independence principle, we get:

Principle of statistical independence
prob(A/C) = r & prob(B/C) = s is a defeasible reason for prob(A&B/C) = r · s.

Again, we can encapsulate these four principles in a single principle of expectable values:

Principle of expectable statistical independence
If prob(A/C) = r and prob(B/C) = s, the expectable value of prob(A&B/C) = r · s.

So a provable combinatorial principle regarding finite sets ultimately makes it reasonable to expect, in the absence of contrary information, that properties will be statistically independent of one another. This is the reason why, when we see no connection between properties that would force them to be statistically dependent, we can reasonably expect them to be statistically independent. The assumption of statistical independence sometimes fails. Clearly, this can happen when there are causal connections between properties. But it can also happen for purely logical reasons. For example, if A = B, A and B cannot be independent unless r = 0 or r = 1. More general defeaters for the principle of statistical independence will emerge below.
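To see the finite principle at work, here is a small Monte Carlo sketch (my illustration; uniform random sampling of subsets stands in for the exact counting in the theorem). Random subsets X, Y, Z of a 10,000-element U almost always satisfy the product approximation:

```python
import random

random.seed(1)
U = range(10_000)

def random_subset(U):
    # Each element in or out with chance 1/2, i.e. uniform over all subsets.
    return {u for u in U if random.random() < 0.5}

delta, trials, hits = 0.02, 300, 0
for _ in range(trials):
    X, Y, Z = (random_subset(U) for _ in range(3))
    lhs = len(X & Y & Z) / len(Z)
    rhs = (len(X & Z) / len(Z)) * (len(Y & Z) / len(Z))
    hits += abs(lhs - rhs) <= delta
print(hits / trials)   # virtually 1: X and Y look independent relative to Z
```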
6 The probable probabilities theorem

Principles like that of Statistical Independence are supported by a general combinatorial theorem, which underlies the entire theory of probable probabilities. Given a list of variables X1,...,Xn ranging over subsets of a set U, Boolean compounds of these sets are compounds formed by union, intersection, and set-complement. So, for example, (X ∪ Y) − Z is a Boolean compound of X, Y, and Z. Linear constraints on the Boolean compounds either state the values of certain proportions, e.g., stipulating that ρ(X, Y) = r, or they relate proportions using linear equations. For example, if we know that X = Y ∪ Z, that generates the linear constraint ρ(X, U) = ρ(Y, U) + ρ(Z, U) − ρ(Y ∩ Z, U). Our general theorem is:

Probable proportions theorem
Let U, X1,...,Xn be a set of variables ranging over sets, and consider a finite set LC of linear constraints on proportions between Boolean compounds of those variables. If LC is consistent with the probability calculus, then for any pair of Boolean compounds P, Q of U, X1,...,Xn there is a real number r between 0 and 1 such that for every ε, δ > 0 there is an N such that if U is finite and #U > N, then ρX1,...,Xn(ρ(P, Q) ≈δ r / LC & X1,...,Xn ⊆ U) ≥ 1 − ε.
This is the theorem that underlies all of the principles developed in this paper. Given the law of large numbers for proportions, we can prove:

Limit principle for proportions
Consider a finite set LC of linear constraints on proportions between Boolean compounds of a list of variables U, X1,...,Xn. For any real number r between 0 and 1, if for every ε, δ > 0 there is an N such that for any finite set U such that #U > N, ρX1,...,Xn(ρ(P, Q) ≈δ r / LC & X1,...,Xn ⊆ U) ≥ 1 − ε, then for any infinite set U, for every δ > 0: ρX1,...,Xn(ρ(P, Q) ≈δ r / LC & X1,...,Xn ⊆ U) = 1.

Given the limit principle for proportions, the Probable Proportions Theorem entails:
Probable probabilities theorem
Let U, X1,...,Xn be a set of variables ranging over properties and relations, and consider a finite set LC of linear constraints on probabilities between truth-functional compounds of those variables. If LC is consistent with the probability calculus, then for any pair of truth-functional compounds P, Q of U, X1,...,Xn there is a real number r between 0 and 1 such that for every δ > 0, probX1,...,Xn(prob(P/Q) ≈δ r / LC & X1,...,Xn ⊑ U) = 1.
In other words, given the constraints LC, the expectable value of prob(P/Q) = r. This establishes the existence of expectable values for probabilities under very general circumstances. The theorem can probably be generalized further, e.g., to linear inequalities, or even to nonlinear constraints, but this is what I have established so far.

The Probable Probabilities Theorem tells us that there are expectable values. It turns out that there is a general strategy for finding and proving theorems describing these expectable values, and I have written a computer program (in Common LISP) that will often do this automatically, both finding the theorems and producing human-readable proofs. It can be downloaded from http://oscarhome.soc-sci.arizona.edu/ftp/OSCAR-web-page/CODE/Code-for-probable-probabilities.zip. I will go on to illustrate these general results with several interesting theorems about probable probabilities.

7 Nonclassical direct inference

Pollock (1984a) noted (a restricted form of) the following limit principle:

Finite principle of agreement
For 0 ≤ a, b, c, r ≤ 1 and for every ε, δ > 0, there is an N such that if U is finite and #U > N, then: ρX,Y,Z(ρ(X, Y ∩ Z) ≈δ r / X, Y, Z ⊆ U & ρ(X, Y) = r & ρ(X, U) = a & ρ(Y, U) = b & ρ(Z, U) = c) ≥ 1 − ε.

In the theory of nomic probability (Pollock 1984a, 1990), this was used to ground a theory of direct inference. We can now improve upon that theory. As above, the Finite Principle of Agreement yields a principle of expectable values:

Nonclassical direct inference
If prob(A/B) = r, the expectable value of prob(A/B & C) = r.
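The same kind of finite simulation as before illustrates why this is reasonable (again my sketch, with arbitrary numbers): fix A with ρ(A, B) = 0.25 and strengthen B by a random C; the proportion almost never moves far.

```python
import random

random.seed(2)
B = range(10_000)
A = set(range(0, 10_000, 4))          # rho(A, B) = 0.25 exactly

delta, trials, hits = 0.02, 300, 0
for _ in range(trials):
    C = {b for b in B if random.random() < 0.5}   # random C; B & C = C here
    hits += abs(len(A & C) / len(C) - 0.25) <= delta
print(hits / trials)   # nearly 1: strengthening B by C leaves the value ~0.25
```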
This is a kind of "principle of insufficient reason". It tells us that if we have no reason for thinking otherwise, we should expect that strengthening the reference property in a nomic probability leaves the value of the probability unchanged. This is called "nonclassical direct inference" because, although it only licenses inferences from generic probabilities to other generic probabilities, it turns out to have strong formal similarities to classical direct inference (which licenses inferences from generic probabilities to singular probabilities), and as we will see in section eight, principles of classical direct inference can be derived from it.

It is important to realize that the principle of agreement, and the corresponding principle of nonclassical direct inference, are equivalent (with one slight qualification) to the probabilistic product principle and the defeasible principle of statistical independence. This turns upon the following simple theorem of the probability calculus:

Independence and agreement theorem
If prob(C/B) > 0 then prob(A/B&C) = prob(A/B) iff A and C are independent relative to B.

As a result, anyone who shares the commonly held intuition that we should be able to assume statistical independence in the absence of information to the contrary is also committed to endorsing nonclassical direct inference. This is important, because I have found that many people do have the former intuition but balk at the latter. There is a variant of the principle of agreement that is equivalent to the first version but often more useful:

Finite principle of agreement II
For 0 ≤ r ≤ 1 and for every ε, δ > 0, there is an N such that if U is finite and #U > N, then: ρX,Y,Z(ρ(X, Z) ≈δ r / X, Y ⊆ U & Z ⊆ Y & ρ(X, Y) = r) ≥ 1 − ε.

This yields an equivalent variant of the principle of nonclassical direct inference:

Nonclassical direct inference II
If C ⊑ B and prob(A/B) = r, the expectable value of prob(A/C) = r.

The principle of nonclassical direct inference supports many defeasible inferences that seem intuitively reasonable but are not licensed by the probability calculus. For example, suppose we know that the probability of a 20-year-old male driver in Maryland having an auto accident over the course of a year is .07. If we add that his girlfriend's name is "Martha", we do not expect this to alter the probability. There is no way to
This yields an equivalent variant of the principle of nonclassical direct inference: Nonclassical direct inference II If C B and prob(A/B) = r , the expectable value of prob(A/C) = r . The principle of nonclassical direct inference supports many defeasible inferences that seem intuitively reasonable but are not licensed by the probability calculus. For example, suppose we know that the probability of a 20 year old male driver in Maryland having an auto accident over the course of a year is .07. If we add that his girlfriend’s name is “Martha”, we do not expect this to alter the probability. There is no way to
justify this assumption within a traditional probability framework, but it is justified by nonclassical direct inference.

Nonclassical direct inference is a principle of defeasible inference, so it is subject to defeat. The simplest and most important kind of defeater is a subproperty defeater. Suppose C ⊑ D ⊑ B and we know that prob(A/B) = r, but prob(A/D) = s, where s ≠ r. This gives us defeasible reasons for drawing two incompatible conclusions, viz., that prob(A/C) = r and that prob(A/C) = s. The principle of subproperty defeat tells us that because D ⊑ B, the latter inference takes precedence and defeats the inference to the conclusion that prob(A/C) = r:

Subproperty defeat for nonclassical direct inference
C ⊑ D ⊑ B and prob(A/D) = s ≠ r is an undercutting defeater for the inference by nonclassical direct inference from C ⊑ B and prob(A/B) = r to prob(A/C) = r.

We obtain this defeater by noting that the principle of nonclassical direct inference is licensed by an application of the statistical syllogism to the probability

probA,B,C(prob(A/C) ≈δ r / A, B, C ⊑ U and C ⊑ B and prob(A/B) = r) = 1.   (1)

We can easily establish the following principle, which appeals to a more comprehensive set of assumptions:

probA,B,C(prob(A/C) ≈δ s / A, B, C, D ⊑ U and C ⊑ D and D ⊑ B and prob(A/B) = r and prob(A/D) = s) = 1.   (2)

If r ≠ s then (2) entails:

probA,B,C(prob(A/C) ≈δ r / A, B, C, D ⊑ U and C ⊑ D and D ⊑ B and prob(A/B) = r and prob(A/D) = s) = 0.   (3)

The reference property in (3) is more specific than that in (1), so (3) gives us a subproperty defeater for the application of the statistical syllogism to (1). A simpler way of putting all of this is that corresponding to (2) we have the following principle of expectable values:
Subproperty defeat for nonclassical direct inference
If C ⊑ D ⊑ B, prob(A/D) = s, prob(A/B) = r, prob(A/U) = a, prob(B/U) = b, prob(C/U) = c, and prob(D/U) = d, then the expectable value of prob(A/C) = s (rather than r).

As above, principles of expectable values that appeal to more information take precedence over (i.e., defeat the inferences from) principles that appeal to a subset of that information. Because the principles of nonclassical direct inference and statistical independence are equivalent, subproperty defeaters for nonclassical direct inference generate analogous defeaters for the principle of statistical independence:

Subproperty defeat for statistical independence
(B&C) ⊑ D ⊑ C and prob(A/D) = p ≠ r is an undercutting defeater for the inference by the principle of statistical independence from prob(A/C) = r & prob(B/C) = s to prob(A&B/C) = r · s.

This is because prob(A&B/C) = r · s only if prob(A/B&C) = prob(A/C), and this defeater makes it unreasonable to believe the latter, and hence the former.

8 Classical direct inference

Direct inference is normally understood as being a form of inference from generic probabilities to singular probabilities rather than from generic probabilities to other generic probabilities. However, I showed in my (1990) that these inferences are derivable from nonclassical direct inference if we identify singular probabilities with a special class of generic probabilities. The present treatment is a generalization of that given in my (1984a and 1990).10

Let K be the conjunction of all the propositions the agent knows to be true, and let K be the set of all physically possible worlds at which K is true ("K-worlds"). I propose that we define the singular probability prob(P) to be the proportion of K-worlds at which P is true. Where P is the set of all P-worlds: prob(P) = ρ(P, K). More generally, where Q is the set of all Q-worlds, we can define: prob(P/Q) = ρ(P, Q ∩ K). Formally, this is analogous to Carnap's (1950, 1952) logical probability, with the important difference that Carnap took ρ to be logically specified, whereas I take the identity of ρ to be a contingent fact. ρ is determined by the values of contingently true nomic probabilities, and their values are discovered by various kinds of statistical induction.

10 Bacchus (1990) gave a somewhat similar account of direct inference, drawing on my 1983 and 1984b.
It turns out that singular probabilities, so defined, can be identified with a special class of nomic probabilities:

Representation theorem for singular probabilities
(1) prob(Fa) = prob(Fx/x = a & K);
(2) If it is physically necessary that [K → (Q ↔ Sa1...an)] and that [(Q&K) → (P ↔ Ra1...an)], and Q is consistent with K, then prob(P/Q) = prob(Rx1...xn/Sx1...xn & x1 = a1 & ... & xn = an & K);
(3) prob(P) = prob(P & x = x/x = x & K).

prob(P) is a kind of "mixed physical/epistemic probability", because it combines background knowledge in the form of K with generic probabilities.11 The probability prob(Fx/x = a & K) is a peculiar-looking nomic probability. It is a generic probability, because "x" is a free variable, but the probability is only about one object. As such it cannot be evaluated by statistical induction or other familiar forms of statistical reasoning. However, it can be evaluated using nonclassical direct inference. If K entails Ga, nonclassical direct inference gives us a defeasible reason for expecting that prob(Fa) = prob(Fx/x = a & K) = prob(Fx/Gx). This is a familiar form of "classical" direct inference—that is, direct inference from nomic probabilities to singular probabilities. More generally, we can derive:

Classical direct inference
Sa1...an is known and prob(Rx1...xn/Sx1...xn & Tx1...xn) = r is a defeasible reason for prob(Ra1...an/Ta1...an) = r.

Similarly, we get subproperty defeaters:

Subproperty defeat for classical direct inference
V ⊑ S, Va1...an is known, and prob(Rx1...xn/Vx1...xn & Tx1...xn) ≠ r is an undercutting defeater for the inference by classical direct inference from Sa1...an is known and prob(Rx1...xn/Sx1...xn & Tx1...xn) = r to prob(Ra1...an/Ta1...an) = r.

Because singular probabilities are generic probabilities in disguise, we can also use nonclassical direct inference to infer singular probabilities from singular probabilities. Thus prob(P/Q) = r gives us a defeasible reason for expecting that prob(P/Q&R) = r. We can employ principles of statistical independence similarly. For example, prob(P/R) = r & prob(Q/R) = s gives us a defeasible reason for expecting that prob(P&Q/R) = r · s.

11 See chapter six of my (2006a) for further discussion of these mixed physical/epistemic probabilities.
9 Computational inheritance

Suppose we have two seemingly unrelated diagnostic tests for a disease, and Bernard tests positive on both tests. We know that the probability of his having the disease if he tests positive on the first test is .8, and the probability if he tests positive on the second test is .75. But what should we conclude about the probability of his having the disease if he tests positive on both tests? The probability calculus gives us no guidance here. Nor does direct inference. Direct inference gives us one reason for thinking the probability of Bernard having the disease is .8, and it gives us a different reason for drawing the conflicting conclusion that the probability is .75. It gives us no way to combine the information. Intuitively, it seems that the probability of his having the disease should be higher if he tests positive on both tests. But how can we justify this?

This is a general problem for theories of direct inference. When we have some conjunction G1 & ... & Gn of properties and we want to know the value of prob(F/G1 & ... & Gn), if we know that prob(F/G1) = r and we don't know anything else of relevance, we can infer defeasibly that prob(F/G1 & ... & Gn) = r. Similarly, if we know that an object a has the properties G1,...,Gn and we know that prob(F/G1) = r and we don't know anything else of relevance, we can infer defeasibly that prob(Fa) = r. The difficulty is that we usually know more. We typically know the value of prob(F/Gi) for some i ≠ 1. If prob(F/Gi) = s ≠ r, we have defeasible reasons for both prob(F/G1 & ... & Gn) = r and prob(F/G1 & ... & Gn) = s, and also for both prob(Fa) = r and prob(Fa) = s. As these conclusions are incompatible, they all undergo collective defeat. Thus the standard theory of direct inference leaves us without a conclusion to draw. The upshot is that the earlier suggestion that direct inference can solve the computational problem of dealing with singular probabilities without having to have a complete probability distribution was premature. Direct inference will rarely give us the probabilities we need.

Knowledge of generic probabilities would be vastly more useful in real applications if there were a function Y(r, s) such that, in a case like the above, when prob(F/G) = r and prob(F/H) = s, we could defeasibly expect that prob(F/G&H) = Y(r, s), and hence (by nonclassical direct inference) that prob(Fa) = Y(r, s). I call this computational inheritance, because it computes a new value for prob(Fa) from previously known generic probabilities. Direct inference, by contrast, is a kind of "noncomputational inheritance". It is direct in that prob(Fa) simply inherits a value from a known generic probability. I call the function used in computational inheritance "the Y-function" because its behavior would be as diagrammed in Fig. 2.

It has generally been assumed that there is no such function as the Y-function (Reichenbach 1949). Certainly, there is no function Y(r, s) such that we can conclude

Fig. 2 The Y-function: prob(F/G) = r and prob(F/H) = s jointly yield prob(F/G&H) = Y(r, s)
deductively that if prob(F/G) = r and prob(F/H) = s then prob(F/G&H) = Y(r, s). For any r and s that are neither 0 nor 1, prob(F/G&H) can take any value between 0 and 1. However, that is equally true for nonclassical direct inference. That is, if prob(F/G) = r we cannot conclude deductively that prob(F/G&H) = r. Nevertheless, that will tend to be the case, and we can defeasibly expect it to be the case. Might something similar be true of the Y-function? That is, could there be a function Y(r, s) such that we can defeasibly expect prob(F/G&H) to be Y(r, s)? It follows from the Probable Probabilities Theorem that the answer is "Yes". It is more useful to begin by looking at a three-place function rather than a two-place function. Let us define:

Y(r, s | a) = rs(1 − a) / (a(1 − r − s) + rs).

I use the non-standard notation "Y(r, s | a)" rather than "Y(r, s, a)" because the first two variables will turn out to work differently than the last variable. Let us define:

B and C are Y-independent for A relative to U iff A, B, C ⊑ U and (a) prob(C/B & A) = prob(C/A), and (b) prob(C/B & ∼A) = prob(C/U & ∼A).

The key theorem underlying computational inheritance is the following theorem of the probability calculus:

Y-theorem
Let r = prob(A/B), s = prob(A/C), and a = prob(A/U). If B and C are Y-independent for A relative to U then prob(A/B&C) = Y(r, s | a).

In light of the Y-theorem, we can think of Y-independence as formulating an independence condition for B and C which says that they make independent contributions to A—contributions that "add" in accordance with the Y-function, rather than "undermining" each other. By virtue of the principle of statistical independence, we have a defeasible reason for expecting that the independence conditions (a) and (b) hold. Thus the Y-theorem supports the following principle of defeasible reasoning:

Computational inheritance
B, C ⊑ U & prob(A/B) = r & prob(A/C) = s & prob(A/U) = a is a defeasible reason for prob(A/B & C) = Y(r, s | a).

It should be noted that we can prove analogues of Computational Inheritance for finite sets, infinite sets, and probabilities, in essentially the same way we prove the Y-theorem. This yields the following principle of expectable values:
Fig. 3 Y(z, x | .5), holding z constant (for several choices of z as indicated in the key)
Y-principle
If B, C ⊑ U, prob(A/B) = r, prob(A/C) = s, and prob(A/U) = a, then the expectable value of prob(A/B & C) = Y(r, s | a).

In the corresponding quadruple of principles, the Finite Y-Principle can be proven directly, or derived from the Finite Principle of Agreement. Similarly, the Y-Principle is derivable from the Principle of Agreement. Then the Y-Principle for Probabilities is derivable from either the Y-Principle or from the Principle of Agreement for Probabilities. To get a better feel for what the principle of computational inheritance tells us, it is useful to examine plots of the Y-function. Figure 3 illustrates that Y(r, s | .5) is symmetric around the right-leaning diagonal. Varying a has the effect of warping the Y-function up or down relative to the right-leaning diagonal. This is illustrated in Fig. 4 for several choices of a.

The Y-function has a number of important properties.12 In particular, it is important that the Y-function is commutative and associative in the first two variables:

Theorem 1 Y(r, s | a) = Y(s, r | a).

Theorem 2 Y(r, Y(s, t | a) | a) = Y(Y(r, s | a), t | a).

Theorems 1 and 2 are very important for the use of the Y-function in computing probabilities. Suppose we know that prob(A/B) = .6, prob(A/C) = .7, and prob(A/D) =

12 It turns out that the Y-function has been studied for its desirable mathematical properties in the theory of associative compensatory aggregation operators in fuzzy logic (Dombi 1982; Klement et al. 1996; Fodor et al. 1997). Y(r, s | a) is the function Dλ(r, s) for λ = (1 − a)/a (Klement et al. 1996). The Y-theorem may provide further justification for its use in that connection.
Fig. 4 Y(z, x | a) holding z constant (for several choices of z), for a = .7, a = .3, a = .1, and a = .01
.75, where B, C, D ⊑ U and prob(A/U) = .3. In light of Theorems 1 and 2 we can combine the first three probabilities in any order and infer defeasibly that prob(A/B&C&D) = Y(.6, Y(.7, .75 | .3) | .3) = Y(Y(.6, .7 | .3), .75 | .3) = .98. This makes it convenient to extend the Y-function recursively so that it can be applied to an arbitrary number of arguments (greater than or equal to 3):

If n ≥ 3, Y(r1,...,rn | a) = Y(r1, Y(r2,...,rn | a) | a).

We can then strengthen the Y-Principle as follows:

Generalized Y-principle
If B1,...,Bn ⊑ U, prob(A/B1) = r1, ..., prob(A/Bn) = rn, and prob(A/U) = a, the expectable value of prob(A/B1 & ... & Bn) = Y(r1,...,rn | a).

If we know that prob(A/B) = r and prob(A/C) = s, we can also use nonclassical direct inference to infer defeasibly that prob(A/B&C) = r. If s ≠ a, Y(r, s | a) ≠ r,
so this conflicts with the conclusion that prob(A/B&C) = Y(r, s | a). However, as above, the inference described by the Y-principle is based upon a probability with a more inclusive reference property than that underlying Nonclassical Direct Inference (that is, it takes account of more information), so it takes precedence and yields an undercutting defeater for Nonclassical Direct Inference:

Computational defeat for nonclassical direct inference
A, B, C ⊑ U and prob(A/C) ≠ prob(A/U) is an undercutting defeater for the inference from prob(A/B) = r to prob(A/B&C) = r by Nonclassical Direct Inference.

It follows that we have a defeater for the principle of statistical independence:

Computational defeat for statistical independence
A, B, C ⊑ U and prob(A/B) ≠ prob(A/U) is an undercutting defeater for the inference from prob(A/C) = r & prob(B/C) = s to prob(A&B/C) = r · s by Statistical Independence.

The phenomenon of Computational Inheritance makes knowledge of generic probabilities useful in ways it was never previously useful. It tells us how to combine different probabilities that would lead to conflicting direct inferences and still arrive at a univocal value. Consider Bernard again, who has symptoms suggesting a particular disease, and tests positive on two independent tests for the disease. Suppose the probability of a person with those symptoms having the disease is .6. Suppose the probability of such a person having the disease if they test positive on the first test is .7, and the probability of their having the disease if they test positive on the second test is .75. What is the probability of their having the disease if they test positive on both tests? We can infer defeasibly that it is Y(.7, .75 | .6) ≈ .82. We can then apply classical direct inference to conclude that the probability of Bernard's having the disease is approximately .82. This is a result that we could not have gotten from the probability calculus alone. Similar reasoning will have significant practical applications, for example in engineering where we have multiple imperfect sensors sensing some phenomenon and we want to arrive at a joint probability regarding the phenomenon that combines the information from all the sensors.

Again, because singular probabilities are generic probabilities in disguise, we can apply computational inheritance to them as well and infer defeasibly that if prob(P) = a, prob(P/Q) = r, and prob(P/R) = s then prob(P/Q&R) = Y(r, s | a).
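Since the Y-function is a closed-form expression, the computations in this section are easy to reproduce. A minimal Python sketch (mine; the function name and rounding are my choices):

```python
def Y(r, s, a):
    """Expectable value of prob(A/B&C) given prob(A/B) = r,
    prob(A/C) = s, and base rate prob(A/U) = a."""
    return r * s * (1 - a) / (a * (1 - r - s) + r * s)

# Bernard: symptoms alone give .6; the two tests give .7 and .75.
print(round(Y(0.7, 0.75, 0.6), 3))                       # 0.824

# Theorems 1 and 2 (commutativity and associativity), numerically:
r, s, t, a = 0.6, 0.7, 0.75, 0.3
print(abs(Y(r, s, a) - Y(s, r, a)) < 1e-12)                    # True
print(abs(Y(r, Y(s, t, a), a) - Y(Y(r, s, a), t, a)) < 1e-12)  # True
print(round(Y(Y(r, s, a), t, a), 2))                     # 0.98, as in the text
```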
10 Inverse probabilities and the statistical syllogism

All of the principles of probable probabilities that have been discussed so far are related to defeasible assumptions of statistical independence. As we have seen, Nonclassical
Direct Inference is equivalent to a defeasible assumption of statistical independence, and Computational Inheritance follows from a defeasible assumption of Y-independence. This might suggest that all principles of probable probabilities derive ultimately from various defeasible independence assumptions. However, this section turns to a set of principles that do not appear to be related to statistical independence in any way.

Where A, B ⊑ U, suppose we know the value of prob(A/B). If we know the base rates prob(A/U) and prob(B/U), the probability calculus enables us to compute the value of the inverse probability prob(∼B/∼A&U):

Theorem 3
If A, B ⊑ U then prob(∼B/∼A&U) = (1 − prob(A/U) − prob(B/U) + prob(A/B) · prob(B/U)) / (1 − prob(A/U)).
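Theorem 3 is a routine consequence of the probability calculus, and it can be sanity-checked on any finite model where probabilities are proportions. A small Python sketch (my example sets, chosen arbitrarily):

```python
from fractions import Fraction

U = set(range(100))
A = {x for x in U if x % 3 == 0}       # "multiples of 3"
B = {x for x in U if x < 40}

def pr(X, Y):                          # probability as proportion within U
    return Fraction(len(set(X) & set(Y)), len(set(Y)))

a, b, r = pr(A, U), pr(B, U), pr(A, B)
lhs = pr(U - B, U - A)                 # prob(~B/~A&U)
rhs = (1 - a - b + r * b) / (1 - a)
print(lhs == rhs)                      # True
```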
However, if we do not know the base rates then the probability calculus imposes no constraints on the value of the inverse probability. It can nevertheless be shown that there are expectable values for it, and generally, if prob(A/B) is high, so is prob(∼B/∼A&U).

Inverse probabilities I
If A, B ⊑ U and we know that prob(A/B) = r, but we do not know the base rates prob(A/U) and prob(B/U), the following values are expectable:

prob(B/U) = .5 / (r^r(1 − r)^(1−r) + .5);
prob(A/U) = .5 − (.25 − .5r) / (r^r(1 − r)^(1−r) + .5);
prob(∼A/∼B&U) = .5;
prob(∼B/∼A&U) = r^r / ((1 − r)^r + r^r).
These values are plotted in Fig. 5. Note that when prob(A/B) > prob(A/U), we can expect prob(∼B/∼A&U) to be almost as great as prob(A/B). Sometimes we know one of the base rates but not both:

Inverse probabilities II
If A, B ⊑ U and we know that prob(A/B) = r and prob(B/U) = b, but we do not know the base rate prob(A/U), the following values are expectable:

prob(A/U) = .5(1 − (1 − 2r)b);
prob(∼A/∼B&U) = (.5 + b(.5 − r)) / (1 + b(1 − 2r));
prob(∼B/∼A&U) = (1 − b) / (1 + b(1 − 2r)).

Fig. 5 Expectable values of prob(∼B/∼A&U), prob(A/U), and prob(B/U), as a function of prob(A/B), when the base rates are unknown
Figure 6 plots the expectable values of prob(∼B/∼A&U) (when they are greater than .5) as a function of prob(A/B), for fixed values of prob(B/U). The diagonal dashed line indicates the value of prob(A/B), for comparison. The upshot is that for low values of prob(B/U), prob(∼B/∼A&U) can be expected to be higher than prob(A/B), and for all values of prob(B/U), prob(∼B/∼A&U) will be fairly high if prob(A/B) is high. Furthermore, prob(∼B/∼A&U) > .5 iff prob(B/U) < 1/(3 − 2r).

The most complex case occurs when we do know the base rate prob(A/U) but we do not know the base rate prob(B/U):
Inverse probabilities III
If A, B ⊑ U and we know that prob(A/B) = r and prob(A/U) = a, but we do not know the base rate prob(B/U), then:
(a) where b is the expectable value of prob(B/U), (r·b/(a − r·b))^r · ((1 − r)b/(1 − a − (1 − r)b))^(1−r) = 1;
(b) the expectable value of prob(∼B/∼A&U) = 1 − (1 − r)b/(1 − a).
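These expectable values are straightforward to compute. The following Python sketch is mine (the bisection solver is one simple stand-in for the hill-climbing mentioned below); it evaluates the Inverse probabilities I formulas and solves clause (a) of Inverse probabilities III numerically:

```python
def expectables_I(r):
    """Expectable values when only prob(A/B) = r is known."""
    w = r**r * (1 - r)**(1 - r)
    return {"prob(B/U)": 0.5 / (w + 0.5),
            "prob(A/U)": 0.5 - (0.25 - 0.5 * r) / (w + 0.5),
            "prob(~A/~B&U)": 0.5,
            "prob(~B/~A&U)": r**r / ((1 - r)**r + r**r)}

def expectable_b_III(r, a, tol=1e-12):
    """Solve (rb/(a-rb))**r * ((1-r)b/(1-a-(1-r)b))**(1-r) = 1 for b
    by bisection: the left side increases from 0 toward infinity on the
    interval where both denominators stay positive."""
    f = lambda b: (r*b / (a - r*b))**r * ((1-r)*b / (1 - a - (1-r)*b))**(1-r)
    lo, hi = tol, min(a / r, (1 - a) / (1 - r)) - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 1 else (lo, mid)
    return (lo + hi) / 2

print(expectables_I(0.9))                 # prob(~B/~A&U) is about .88
b = expectable_b_III(r=0.9, a=0.5)
print(round(b, 4), round(1 - (1 - 0.9) * b / (1 - 0.5), 4))  # ~0.32, ~0.94
print(round(expectable_b_III(r=0.4, a=0.4), 4))
# When r = a the solver returns b = 0.5, and then prob(~B/~A&U) = .5,
# matching the discussion of Fig. 7 below.
```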
Fig. 6 Expectable values of prob(∼B/∼A&U ) as a function of prob(A/B), when prob(A/U ) is unknown, for fixed values of prob(B/U )
The equation characterizing the expectable value of prob(B/U) does not have a closed-form solution. However, for specific values of a and r, the solutions are easily computed using hill-climbing algorithms. The results are contained in Fig. 7. When prob(A/B) = prob(A/U), the expected value for prob(∼B/∼A&U) is .5, and when prob(A/B) > prob(A/U), prob(∼B/∼A&U) > .5. If prob(A/U) < .5, the expected value of prob(∼B/∼A&U) is greater than prob(A/B). The upshot is that even when we lack knowledge of the base rates, there is an expectable value for the inverse probability prob(∼B/∼A&U), and that expectable value tends to be high when prob(A/B) is high.

11 Meeting some objections

I have argued that mathematical results, coupled with the statistical syllogism, justify defeasible inferences about the values of unknown probabilities. Various worries arise regarding this conclusion. A few people are worried about any defeasible (non-deductive) inference, but I presume that the last 50 years of epistemology has made it amply
Fig. 7 Expectable values of prob(∼B/∼A&U ) as a function of prob(A/B), when prob(B/U ) is unknown, for fixed values of prob(A/U )
clear that, in the real world, cognitive agents cannot confine themselves to conclusions drawn deductively from their evidence. We employ multitudes of defeasible inference schemes in our everyday reasoning, and the statistical syllogism is one of them.

Granted that we have to reason defeasibly, we can still ask what justifies any particular defeasible inference scheme. At least in the case of the statistical syllogism, the answer seems clear. If prob(A/B) is high, then if we reason defeasibly from things being B to their being A, we will generally get it right. That is the most we can require of a defeasible inference scheme. We cannot require that the inference scheme will always lead to true conclusions, because then it would not be defeasible. People sometimes protest at this point that they are not interested in the general case. They are concerned with some inference they are only going to make once. They want to know why they should reason this way in the single case. But all cases are single cases. If you reason in this way in single cases, you will tend to get them right. It does not seem that you can ask for any firmer guarantee than that. You cannot avoid defeasible reasoning.

But we can have a further worry. For any defeasible inference scheme, we know that there will be possible cases in which it gets things wrong. For each principle
of probable probabilities, the possible exceptions constitute a set of measure 0, but it is still an infinite set. The cases that actually interest us tend to be highly structured, and perhaps they also constitute a set of measure 0. How do we know that the latter set is not contained in the former? Again, there can be no logical guarantee that this is not the case. However, the generic probability of an arbitrary set of cases falling in the set of possible exceptions is 0. So without further specification of the structure of the cases that interest us, the probability of the set of those cases all falling in the set of exceptions is 0. Where defeasible reasoning is concerned, we cannot ask for a better guarantee than that.

We should resist the temptation to think of the set of possible exceptions as an amorphous unstructured set about which we cannot reason using principles of probable probabilities. The exceptions are exceptions to a single defeasible inference scheme. Many of the cases in which a particular inference fails will be cases in which there is a general defeater leading us to expect it to fail and leading us to make a different inference in its place. For example, knowing that prob(A/B) = r gives us a defeasible reason to expect that prob(A/B&C) = r. But if we also know that prob(A/C) = s and prob(A/U) = a (with s ≠ a), the original inference is defeated and we should expect instead that prob(A/B&C) = Y(r, s | a). So this is one of the cases in which an inference by nonclassical direct inference fails, but it is a defeasibly expectable case. There will also be cases that are not defeasibly expectable. This follows from the simple fact that there are primitive nomic probabilities representing statistical laws of nature. These laws are novel, and cannot be predicted defeasibly by appealing to other nomic probabilities. Suppose prob(A/B) = r, but prob(A/B&C) = s is a primitive law. The latter is an exception to nonclassical direct inference. Furthermore, we can expect that strengthening the reference property further will result in nomic probabilities like prob(A/B&C&D) = s, and these will also be cases in which the nonclassical direct inference from prob(A/B) = r fails. But, unlike the primitive law, the latter is a defeasibly expectable failure arising from subproperty defeat. So most of the cases in which a particular defeasible inference appealing to principles of probable probabilities fails will be cases in which the failure is defeasibly predictable by appealing to other principles of probable probabilities. This is an observation about how much structure the set of exceptions (of measure 0) must have. The set of exceptions is a set of exceptions just to a single rule, not to all principles of probable probabilities. The Probable Probabilities Theorem implies that even within the set of exceptions to a particular defeasible inference scheme, most inferences that take account of the primitive nomic probabilities will get things right, with probability 1.

12 Conclusions

The problem of sparse probability knowledge results from the fact that in the real world we lack direct knowledge of most probabilities. If probabilities are to be useful, we must have ways of making defeasible estimates of their values even when those values are not computable from known probabilities using the probability calculus. Within the theory of nomic probability, limit theorems from combinatorial mathematics provide the necessary bridge for these inferences. It turns out that in very general circumstances, there will be expectable values for otherwise unknown probabilities.
These are described by principles telling us that although certain inferences from probabilities to probabilities are not deductively valid, nevertheless the second-order probability of their yielding correct results is 1. This makes it defeasibly reasonable to make the inferences. I illustrated this by looking at indifference, statistical independence, classical and nonclassical direct inference, computational inheritance, and inverse probabilities. But these are just illustrations. There are a huge number of useful principles of probable probabilities, some of which I have investigated, but most waiting to be discovered. I proved the first such principles laboriously by hand. It took me six months to find and prove the principle of computational inheritance. But it turns out that there is a uniform way of finding and proving these principles. I have written a computer program (in Common LISP) that analyzes the results of linear constraints and determines what the expectable values of the probabilities are. If desired, it will produce a human-readable proof. This makes it easy to find and investigate new principles.

This profusion of principles of probable probability is reminiscent of Carnap's logical probabilities (Carnap 1950, 1952; Hintikka 1966; Bacchus et al. 1996). Historical theories of objective probability required probabilities to be assessed by empirical methods, and because of the weakness of the probability calculus, they tended to leave us in a badly impoverished epistemic state regarding probabilities. Carnap tried to define a kind of probability for which the values of probabilities were determined by logic alone, thus obviating the need for empirical investigation. However, finding the right probability measure to employ in a theory of logical probabilities proved to be an insurmountable problem. Nomic probability and the theory of probable probabilities lie between these two extremes. This theory still makes the values of probabilities contingent rather than logically necessary, but it makes our limited empirical investigations much more fruitful by giving them the power to license defeasible, non-deductive inferences to a wide range of further probabilities that we have not investigated empirically. Furthermore, unlike logical probability, these defeasible inferences do not depend upon ad hoc postulates. Instead, they derive directly from provable theorems of combinatorial mathematics. So even when we do not have sufficient empirical information to deductively determine the value of a probability, purely mathematical facts may be sufficient to make it reasonable, given what empirical information we do have, to expect the unknown probabilities to have specific and computable values. Where this differs from logical probability is (1) that the empirical values are an essential ingredient in the computation, and (2) that the inferences to these values are defeasible rather than deductive.

Acknowledgment
This work was supported by NSF Grant No. IIS-0412791.
References

Bacchus, F. (1990). Representing and reasoning with probabilistic knowledge. Cambridge, MA: MIT Press.
Bacchus, F., Grove, A. J., Halpern, J. Y., & Koller, D. (1996). From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87, 75–143.
Braithwaite, R. B. (1953). Scientific explanation. Cambridge: Cambridge University Press.
Carnap, R. (1950). The logical foundations of probability. Chicago: University of Chicago Press.
Carnap, R. (1952). The continuum of inductive methods. Chicago: University of Chicago Press.
de Finetti, B. (1974). Theory of probability (Vol. 1). New York: Wiley.
Dombi, J. (1982). Basic concepts for a theory of evaluation: The aggregative operator. European Journal of Operational Research, 10, 282–293.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309–368.
Fodor, J., Yager, R., & Rybalov, A. (1997). Structure of uninorms. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 5, 411–427.
Goodman, N. (1955). Fact, fiction, and forecast. Cambridge, MA: Harvard University Press.
Halpern, J. Y. (1990). An analysis of first-order logics of probability. Artificial Intelligence, 46, 311–350.
Harman, G. (1986). Change in view. Cambridge, MA: MIT Press.
Hintikka, J. (1966). A two-dimensional continuum of inductive methods. In J. Hintikka & P. Suppes (Eds.), Aspects of inductive logic (pp. 113–132). Amsterdam: North Holland.
Jeffrey, R. (1983). The logic of decision (2nd ed.). Chicago: University of Chicago Press.
Klement, E. P., Mesiar, R., & Pap, E. (1996). On the relationship of associative compensatory operators to triangular norms and conorms. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 4, 129–144.
Kneale, W. (1949). Probability and induction. Oxford: Oxford University Press.
Kushmerick, N., Hanks, S., & Weld, D. (1995). An algorithm for probabilistic planning. Artificial Intelligence, 76, 239–286.
Kyburg, H., Jr. (1961). Probability and the logic of rational belief. Middletown, CT: Wesleyan University Press.
Kyburg, H., Jr. (1974a). The logical foundations of statistical inference. Dordrecht: Reidel.
Kyburg, H., Jr. (1974b). Propensities and probabilities. British Journal for the Philosophy of Science, 25, 321–353.
Levi, I. (1980). The enterprise of knowledge. Cambridge, MA: MIT Press.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Pollock, J. L. (1983). A theory of direct inference. Theory and Decision, 15, 29–96.
Pollock, J. L. (1984a). Foundations for direct inference. Theory and Decision, 17, 221–256.
Pollock, J. L. (1984b). Foundations of philosophical semantics. Princeton: Princeton University Press.
Pollock, J. L. (1990). Nomic probability and the foundations of induction. New York: Oxford University Press.
Pollock, J. L. (1995). Cognitive carpentry. Cambridge, MA: Bradford/MIT Press.
Pollock, J. L. (2006a). Thinking about acting: Logical foundations for rational decision making. New York: Oxford University Press.
Pollock, J. L. (2006b). Defeasible reasoning. In J. Adler & L. Rips (Eds.), Reasoning: Studies of human inference and its foundations. Cambridge: Cambridge University Press.
Popper, K. (1938). A set of independent axioms for probability. Mind, 47, 275ff.
Popper, K. (1956). The propensity interpretation of probability. British Journal for the Philosophy of Science, 10, 25–42.
Popper, K. (1957). The propensity interpretation of the calculus of probability, and the quantum theory. In S. Körner (Ed.), Observation and interpretation (pp. 65–70). New York: Academic Press.
Popper, K. (1959). The logic of scientific discovery. New York: Basic Books.
Reichenbach, H. (1949). A theory of probability. Berkeley: University of California Press (original German edition 1935).
Reiter, R., & Criscuolo, G. (1981). On interacting defaults. In IJCAI81 (pp. 94–100).
Rényi, A. (1955). On a new axiomatic theory of probability. Acta Mathematica Academiae Scientiarum Hungaricae, 6, 285–333.
Russell, B. (1948). Human knowledge: Its scope and limits. New York: Simon and Schuster.
Savage, L. (1954). The foundations of statistics. New York: Dover.
Shafer, G. (1976). A mathematical theory of evidence. Princeton: Princeton University Press.
Sklar, L. (1970). Is propensity a dispositional concept? Journal of Philosophy, 67, 355–366.
Sklar, L. (1973). Unfair to frequencies. Journal of Philosophy, 70, 41–52.
Skyrms, B. (1980). Causal necessity. New Haven: Yale University Press.
van Fraassen, B. (1981). The scientific image. Oxford: Oxford University Press.
Venn, J. (1888). The logic of chance (3rd ed.). London: Macmillan.
Synthese (2011) 181:353–365 DOI 10.1007/s11229-010-9796-1
Is logic in the mind or in the world? Gila Sher
Received: 3 December 2009 / Accepted: 26 July 2010 / Published online: 9 October 2010 © The Author(s) 2010. This article is published with open access at Springerlink.com
Abstract  The paper presents an outline of a unified answer to five questions concerning logic: (1) Is logic in the mind or in the world? (2) Does logic need a foundation? What is the main obstacle to a foundation for logic? Can it be overcome? (3) How does logic work? What does logical form represent? Are logical constants referential? (4) Is there a criterion of logicality? (5) What is the relation between logic and mathematics?

Keywords  Logic · Mind · World · Foundation · Logical constants · Criterion of logicality · Mathematics

My goal in this paper is to present an outline of a unified answer to the following questions:

1. Is logic in the mind or in the world?
2. Does logic need a foundation? What is the main obstacle to a foundation for logic? Can it be overcome?
3. How does logic work? What does logical form represent? Are logical constants referential?
4. Is there a criterion of logicality?
5. What is the relation between logic and mathematics?

I will address the first two questions individually, and offer an overall view of my answer to the last three.
This paper was presented at the Jean Nicod Institute (Paris), UC Santa Cruz, the Pacific APA Symposium (Vancouver), and the Society for Exact Philosophy (Edmonton). I would like to thank the audiences, and in particular my commentators at the APA Symposium, Philip Hanson and Marcus Rossberg. Special thanks go to Clinton Tolley and Peter Sher.

G. Sher (B)
University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0119, USA
e-mail: [email protected]
Is logic in the mind or in the world?—Logic, I believe, like all other branches of knowledge, is grounded both in the mind and in the world, and its two grounds are interconnected. My answer is motivated both by general considerations pertaining to all branches of knowledge and by specific considerations pertaining to logic. The general considerations I will not discuss in detail here. The main point is that knowledge qua knowledge must be grounded both in its object, the world in a broad sense, and in its subject, the mind (also in a broad sense). Groundedness in the world is veridicality, i.e., compliance with strict standards of truth, evidence, and factual justification. Groundedness in the mind is conformity with pragmatic, conceptual, transcendental, linguistic, and possibly other extra-veridical norms.

Turning to special considerations, I will concentrate primarily on logic's grounding in the world, although I will say something about its grounding in the mind as well. The reason my study starts with the world is this: Throughout history, most systematic accounts of logic have focused on the mind and few on the world. As a result, many options for grounding logic in the mind are available to us, but only a few for grounding it in reality. It follows that we are more likely to arrive at a unified grounding of logic in both if we attend to its grounding in the world first and use the result to constrain our account of its grounding in the mind (for which we have a rich trove of ideas to work with).

To prevent misunderstandings, let me say right away that by "world" and "reality" (which I use as synonyms) I mean neither "thing in itself" nor "mere appearances". Nor do I restrict reality to empirical experience or identify those aspects of the world that are relevant to logic with conceptual reality. Neither the Kantian duality of noumena and phenomena, nor pure Platonism or extreme empiricism, is compatible with the present outlook. Furthermore, my claim that logic is both in the mind and in the world is not deflationist. Logic is both in the mind and in the world in a substantive sense, a sense that yields significant explanations, solves significant problems, and has significant consequences.

The view that logic is grounded in reality flies in the face of a powerful philosophical tradition: Kant regarded logic as grounded exclusively in the mind, as did the post-Kantian idealists; Carnap regarded logic as grounded in pragmatic conventions; and many contemporary philosophers regard logic as requiring no grounding at all. So, does logic require a grounding in reality? Does it require a grounding at all? And what does logic have to do with reality in any case? I will limit myself to five points:

(i) Logic's Special Standing in Our System of Knowledge. Logic has a special standing in our system of knowledge, both due to its great basicness and generality and due to its normative force. Compare logic with physics, for example.
Physics is bound by the laws of logic, but logic is (at least for the most part) not bound by the laws of physics; logic is conceivable outside of physics, but physics is inconceivable outside of logic; logic abstracts from the content of physical terms, but physics does not abstract from the content of logical terms; logic provides tools for physical theories, but physics (for the most part) does not provide tools for logical theories; a serious error in logic is likely to undermine physics, but (most) serious errors in physics are unlikely to undermine logic; and so on. Logic delineates some of the most basic forms of human thought and its expression, provides us with the most basic tools of valid inference, tells us what combinations of statements are permissible and impermissible, etc. Logical form, logical inference, and logical criteria of consistency are all ingredients that no system of knowledge can do without. Our system of knowledge can survive the removal of many a science, but not of logic. As a result, a grounding or a foundation for logic is crucial for our system.

(ii) The Importance of a Powerful Wholesale Method of Inference. We have much to gain by having a well-founded logical system and much to lose without one. Due to our biological, psychological, intellectual, and other limitations, we, as agents of knowledge, can establish no more than a small part of our knowledge directly (or even relatively directly). Most items of knowledge have to be established through inference, or at least with considerable help from inference. A powerful method of inference is therefore indispensable for a system of knowledge built by and for humans. In particular, a powerful wholesale method—a method that enables us to increase the amount of our knowledge without reduction of truth, evidence, justification, or modal force—is of utmost importance. Not all methods of inference, however, are of this kind. Some powerful methods are narrow in scope, and some general methods are exceedingly weak. "Chemical inference", for example, is powerful but narrow in scope. If we know that water is H2O, we can use this to expand our knowledge of phenomena involving water. But chemical inference is limited to physical knowledge, and indeed to a small portion thereof. Other types of inference are broad in scope but lacking in power. Material inference—mere preservation of truth—is a paradigm of such a method. Material consequences are found in all areas of knowledge, but they are weak, and therefore play no role in expanding our knowledge. Logical inference, in contrast, is both broad in scope and modally powerful. It enables us to draw inferences in all areas of knowledge, and its inferences preserve not just truth, but also evidence and modal status. As a result, logic is of immense value for our system of knowledge. Indeed, not only does logic enable us to extend our knowledge without reducing its veridicality or modal force, it enables us to reduce error by preventing inconsistencies, contradictions, and invalid inferences. All this strongly suggests that logical theory requires a grounding. We have too much to gain from a well-functioning logic, and too much to lose from an ill-functioning one, to afford an unfounded logic.

(iii) Logic Has to "Work" in the World. It is a simple and straightforward observation that logical theory, like physical theory, is correct or incorrect in the sense that it either "works" or "does not work" in the world. In the same way that the use of, say, defective aerodynamical principles can cause an airplane to malfunction, so the use of defective logical principles can result in its malfunctioning. If in designing an airplane we rely on incorrect logical laws—e.g., the law of "affirming the consequent", or the "new Leibniz law" (see (v) below)—we are likely to cause drag when lift is needed, a right turn when a left is intended, etc. A flawed logic can cause havoc in an airplane no less than a flawed physics. This is not to say that we have no latitude in constructing our logical (or physical) theory, but there is a very real sense in which our logical theory (like our physical theory) either works or does not work in the world.
A useful logical theory has to avoid conflict with the world, just like any other theory. Adopting an influential argument from the philosophy of science, we may say that it would be a complete mystery that logic worked in the world if it were not tuned to the world.

(iv) Logic is Connected to Reality Through Truth. One way in which logic is inherently linked to reality is through its connection to truth. Take logical consequence. Consequence relations in general are relations of transmission or preservation of truth. If Γ is a set of sentences and S is a sentence, then:

(1) S is a consequence of Γ iff the truth of the totality of sentences in Γ is transmitted to (or preserved by) S.
Truth, in general, inherently depends on whether things in the world are as given sentences say they are. Hence, whether S is a logical consequence of Γ inherently depends, in nontrivial cases, on whether the world being as the sentences in Γ say it is guarantees (with a force sufficient for logical consequence) its being the way S says it is. To see more clearly how judgments of logical consequence can be constrained by the world, let us consider two sentences, S1 and S2, such that S1 is true and the truth of both S1 and S2 is clearly a matter of whether certain things hold in the world. Let L be a logical theory saying that S2 is a logical consequence of S1. In symbols:

(2) (Level of Consequence)    S1 ⊨L S2.

Then, if L is right, the truth of S1 guarantees the truth of S2. Figuratively:

(3) (Level of Truth)    T(S1) ⇛ T(S2).
Now, let 𝔖1 and 𝔖2 be the situations, or features of the world, on which the truth of S1 and S2 depends:

(4) (Level of Truth)    T(S1)    T(S2)
    (Level of World)    𝔖1       𝔖2.
And suppose that in the world 𝔖1 rules out 𝔖2. Then, clearly, the truth of S1 does not guarantee the truth of S2, and S2 is not a logical consequence of S1. No matter what our logical theory says, the truth of S1 does not guarantee that of S2. Now suppose that 𝔖1 does not rule out 𝔖2 but simply 𝔖1 is the case while 𝔖2 is not the case. Then again S2 is not a logical consequence of S1. Finally, suppose that 𝔖1 and 𝔖2 are both the case, but 𝔖1 does not necessitate (strongly guarantee) 𝔖2. Again, S2 is not a logical consequence of S1. In all these cases our logical theory, L, is factually wrong: the truth of S1 does not guarantee the truth of S2. Figuratively:

(5) (Level of Consequence)    NOT: S1 ⊨L S2
                              ⇑
    (Level of Truth)          NOT: T(S1) ⇛ T(S2)
                              ⇑
    (Level of World)          𝔖1 ➡ not-𝔖2,  or  𝔖1 & not-𝔖2,  or  NOT: 𝔖1 ➡ 𝔖2.
Of course, all this does not rule out the possibility that there is some built-in feature in judgments of logical consequence that protects them from ever being challenged by the world. But in that case we would have to veridically justify the claim that our logical theory is not falsified by the world, i.e., we have to ground logic in the world at least in a negative sense. There are, however, reasons to believe that logic is grounded in the world also in a positive sense, i.e., some things in the world are significantly (if partially) responsible for the transmission of truth, or its failure, by putative logical consequences. This is because the world itself is capable of giving rise to consequences of various kinds, including logical. To see how this comes about, consider two situations in the world, 𝔖1 and 𝔖2, that correspond to two sentences, S1 and S2, respectively, as in (4) above. (Surely, there are such situations!) This case suffices to establish that S2 is a material consequence of S1. Figuratively:

(6) (Level of Consequence Theory)    S1 ⊨M S2
                                     ⇑
    (Level of Truth)                 T(S1) → T(S2)
                                     ⇑
    (Level of World)                 𝔖1, 𝔖2.
That is, the world can give rise to material consequences. Next, let us assume that 𝔖1 nomically necessitates 𝔖2, i.e., the world is governed by a law that positively connects 𝔖1 and 𝔖2 with some modal force (e.g., that of a physical law). Then, this will sanction not just a claim of material consequence, but a claim of nomic consequence. Figuratively:

(7) (Level of Consequence Theory)    S1 ⊨N S2
                                     ⇑
    (Level of Truth)                 T(S1) → T(S2)
                                     ⇑
    (Level of World)                 𝔖1 ⇒ 𝔖2.
That is, the world can give rise to nomic consequences. Now, take any intuitive characterization of logical consequence in objectual terms or in terms that have objectual analogs. Say, "logical consequence is a universal, necessary, and formal consequence", where "formal" is spelled out in objectual terms. Suppose the law connecting 𝔖1 and 𝔖2 is universal, necessary, and formal (in the designated sense). Then, this law sanctions the claim that S2 follows from S1 with the force of a logical consequence. Figuratively:

(8) (Level of Consequence)    S1 ⊨L S2
                              ⇑
    (Level of Truth)          T(S1) ⇛ T(S2)
                              ⇑
    (Level of World)          𝔖1 ➡ 𝔖2.
For example, if the classical laws of meet and cardinality are necessary, universal, and formal, then the logical consequence "(∃x)(Px & Qx); therefore (∃x)Px" is sanctioned by the world:

(9)  (Level of Consequence)    (∃x)(Px & Qx) ⊨L (∃x)Px
                               ⇑
     (Level of Truth)          T[(∃x)(Px & Qx)] ⇛ T[(∃x)Px]
                               ⇑
     (Level of World)          nonemptiness of P∩Q ➡ nonemptiness of P.

So the world can give rise to logical consequences. But if logic is empowered by the world, it is also constrained by it:

(10) (Level of Consequence)    NOT: (∃x)(Px & Qx) ⊨L (∃x)Px
                               ⇑
     (Level of Truth)          NOT: T[(∃x)(Px & Qx)] ⇛ T[(∃x)Px]
                               ⇑
     (Level of World)          NOT: [nonemptiness of P∩Q ➡ nonemptiness of P].
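A toy check of (9) and (10)—my illustration in Python, not the paper's—over a small finite domain: the world-level law behind (9) holds for every choice of extensions of P and Q, while the analogous "law" connecting the join P∪Q to P fails, which is the kind of world-level failure (10) depicts.

```python
from itertools import chain, combinations

def subsets(domain):
    """All subsets of a finite domain."""
    items = list(domain)
    return [set(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]

domain = {0, 1, 2}

# Law behind (9): if P∩Q is nonempty, then P is nonempty -- holds in every case.
meet_law = all((not (P & Q)) or bool(P)
               for P in subsets(domain) for Q in subsets(domain))

# An analogous "law" for the join fails: P∪Q nonempty does not guarantee P nonempty.
join_law = all((not (P | Q)) or bool(P)
               for P in subsets(domain) for Q in subsets(domain))

print(meet_law)  # True: no choice of extensions violates the law
print(join_law)  # False: counterexample P = set(), Q = {0}
```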
Logical consequence, then, is tied to reality through (i) its inherent connection with truth, (ii) the inherent connection between truth and reality, and (iii) the inherent relevance of (ii) to (i).¹ These considerations suggest that logic is grounded in reality not in the traditional sense of being grounded in conceptual reality, but in the more robust and less mysterious sense of being grounded in certain laws governing the behavior of objects and properties in the world, i.e., logic is grounded in certain objectual features of reality.²

¹ I emphasize inherence to warn against drawing conclusions on this issue based on insignificant connections.
² Note: this analysis can in a sense be read off the standard model theory of logic, but to treat it as derived from that theory is to put the cart before the horse. Rather, it is this analysis, or something like it, that explains why (something like) model-theoretic semantics displays, knowingly or unknowingly, a sound understanding of logic.

(v) The Threat of Factual Error in Logic. We have seen how, theoretically, logical judgments can be falsified, and justified, by the world. But the possibility of error in logic, including factual error, is not just a remote theoretical possibility. First, there are many advocates of nonclassical logics—free, fuzzy, intuitionistic, quantum-mechanical, paraconsistent, and others—who actually believe that "standard" (here, "classical") logic leads, at least potentially, to error or inefficacy. Second, it is easy to construct "toy examples" that demonstrate the possibility of error, including factual error, in a logical theory. In a well-known example, Prior (1960) showed that an infelicitous choice of logical constants can introduce error into our system. Error, in logic, is typically inconsistency or contradiction, but a contradiction is, arguably, a factual error. Some contradiction-generating rules have a transparently factual dimension. Consider the following toy rule, which we can call "the new Leibniz law": "Φ(x); x ≠ y ⊢ ∼Φ(y)".
Whatever else is wrong with this rule, one thing that is wrong with it—and arguably responsible for its introducing a contradiction into our system—is that it "says" that no (definable) property is common to multiple objects. And what about the non-contradiction-introducing rule "Water molecule(x) ⊢ H₃O molecule(x)"? More than just violating certain requirements on logical constants, this rule is simply factually wrong. Of course, we can avoid these errors by a variety of restrictions on logical constants and rules of inference, but to determine what restrictions to introduce and to justify their appropriateness is just to provide a foundation for logic (or parts thereof).

The most compelling example of an error in logic, however, is an example of a devastating error in a "real life" logical theory, indeed one of the most distinguished theories in the history of logic: Frege's theory. Russell's paradox conclusively proved that there was a fatal error in Frege's logic, and it is widely believed that the source of this error is factual: Frege's logic is committed to the existence of an object—a class—that does not (and cannot) exist.

These are some of the considerations that lead me to think that logic requires a foundation, including a foundation in reality. But is a foundation for logic possible? Can we, in principle, provide logic with a substantive foundation, be it in the mind, in reality, or in both?

Traditionally, the greatest obstacle to a foundation for logic was thought to be circularity (infinite regress). The crux of the matter is the "'logocentric' predicament": "In order to give an account of logic, we must presuppose and employ logic" (Sheffer 1926, p. 228). In Wittgenstein's idiom: To provide a foundation for logic we have to stand "somewhere outside logic", but there is no cognitive standpoint outside logic (1921, 4.12). Is the logocentric predicament unsurpassable? Does it rule out in advance the possibility of a foundation for logic? I believe it does not. This alleged predicament is a remnant of foundationalism—a widely discarded methodology that still affects philosophers' conception of a foundation for logic. And by consistently eschewing this methodology we avoid this predicament. Let me explain.

Foundationalism purports to provide a grounding for knowledge (in whatever a given foundationalist theory purports to ground it in) in a simple and straightforward manner. Focusing on a foundation in reality, we can describe the foundationalist method succinctly as follows:

(i) Foundational items of knowledge are grounded in reality directly, through direct experience or intuition.
(ii) Derivative items of knowledge are grounded in reality indirectly, through reliable knowledge-extending procedures which are themselves either foundational or derivative.
A salient, and often overlooked, feature of foundationalism is its strict ordering requirement. Foundationalism imposes a strict ordering on our system of knowledge by requiring the grounding relation to (i) be (in paradigmatic cases) irreflexive, anti-symmetric, and transitive, (ii) have an absolute base of minimal elements, and (iii) connect each nonminimal element to one or more minimal elements by a finite chain. This salient feature of foundationalism is both a source of its promise and a cause of its failure. On the one hand, it enables foundationalism to reduce the unmanageable task of providing a grounding for every item of knowledge to the more manageable task of providing a grounding only to the basic units. On the other hand, it leaves foundationalism with no resources for grounding the basic units.

Now, since, due to its special standing, logic is placed at the bottom of the foundationalist hierarchy, foundationalism is incapable of providing a foundation for logic. Placing logic at the base means that while logic can provide (or partake in providing) a foundation for other sciences, no science(s) can provide a foundation for logic. Yet, since a serious error in logic will undermine our entire system of knowledge, a foundation for logic, more than for any other science, must be provided. Must and cannot. Having postulated (i) that any resource for founding logic must be more basic than those produced by logic itself, and (ii) that there are no resources more basic than those produced by logic, foundationalism has no resources for constructing a foundation for logic.

The key to a viable foundation for logic (and knowledge in general) is, I believe, separating the foundational project from the foundationalist methodology, and in particular rejecting the strong ordering requirement. My line of reasoning is the following: If we demand that the grounding relation be strongly ordered, we undermine the foundational project. But why should the grounding relation be required to have this formal feature? It is true that logic itself provides a model for a strongly ordered grounding relation, namely, the deductive method (in its most basic form), but this does not mean that the grounding of this method itself, let alone of all of knowledge, must follow the pattern exemplified by logic. Rather, the grounding relation may follow multiple formal patterns, some exemplifying strict orderings, others not. That is to say, the grounding relation may be strictly ordered in some places and have other features in others.

Relaxing the strong ordering requirement by itself, however, will not necessarily lead to a foundational methodology in our sense, namely, a methodology that mandates the grounding of knowledge in reality. A prime example of a methodology that does not require such a grounding is coherentism. Indeed, foundationalism and coherentism mark two opposite poles of the foundation divide. Using three parameters—(i) strong ordering of the grounding relation, (ii) use of knowledge-based resources in the grounding process, and (iii) grounding of knowledge in reality—we can display their opposition by the following table:
                   Strong ordering of       Use of knowledge-based     Grounding knowledge
                   the grounding relation   resources in the           in reality
                                            grounding process

Foundationalism    Required                 Restricted                 Required
Coherentism        Not required             Unrestricted               Not required
Foundationalism and coherentism, however, do not exhaust all possibilities with respect to these parameters. In particular, one configuration left out is:

Strong ordering    Use of knowledge-based     Grounding knowledge
                   resources                  in reality

Not required       Unrestricted               Required
I will call this configuration foundation without foundationalism (a variation on the title of Shapiro 1991). Foundation without foundationalism shares foundationalism's commitment to grounding all knowledge in reality while renouncing its rigid methods. It says that in grounding our system of knowledge in reality we need not encumber ourselves with unreasonable restrictions, and it grants us maximum freedom in carrying out the grounding project. Unlike foundationalism, it does not determine in advance either the formal structure of, or the resources used in, each stage of the grounding process; and unlike coherentism, it does not give up, or in any way compromise, our veridicality standards.

The foundation-without-foundationalism strategy is a holistic strategy. It is not holistic in the sense, rightly criticized by Dummett (1973), of treating our entire system of knowledge as a single, undifferentiated whole; it is holistic in the sense of emphasizing the existence of a variegated, open-ended network of relations encompassing all units of knowledge. I will call a methodology advocating this strategy "foundational holism".

A characteristic metaphor of foundational holism is "Neurath's boat". To check the soundness of the boat we find a temporary foothold in a relatively sound area of the boat, use available resources to find and repair holes, and continue sailing. Once we have mended one area of the boat, we can use it as a temporary standpoint for checking and repairing other areas. We may create new resources for repatching (better patching) the original holes, and so on. In this way all sections of the boat can be repaired, and different sections can be used as footholds.

Neurath's boat is often viewed as a coherentist metaphor. But it is just as much, and even more so, a metaphor of foundational holism. First, Neurath's boat, as representing our system of knowledge, does not hover in empty space, but travels in a real, external, resistant medium: a sea or ocean. This medium exerts real pressures upon it (buffets it with real waves, etc.). And this means that its sailors must take reality into account. A hole in the boat cannot be patched with (regular) scotch tape, and in steering the boat the sailors must harness the forces of nature while guarding against its dangers. Moreover, not only is the boat not hovering in empty space, it is not aimlessly floating in the water either. Neurath's boat is not a pleasure boat or a dinghy adrift at sea, but a vessel on an expedition, headed in a specific direction and having a mission to accomplish. Its mission is to study the sea and its environs, and its strategy is to use any available resources, internal and external, in any possible way, strictly ordered or not. As such it is invested in the world, directed toward the world, and constrained by the world, though in a holistic rather than a foundationalist manner.

Having identified foundational holism as our non-foundationalist method for grounding logic in reality, our task is to develop an actual grounding for logic using this method. I will pursue this task by starting with an analysis of logic's connection to reality. Then, exploiting the holistic license to employ resources developed by other disciplines, I will identify the features of reality that logic is grounded in and explain how it is grounded in these features. Since I have elaborated on some of the specific issues involved in this grounding elsewhere, I will focus on the overall conception here.
The central idea is that logic is grounded in formal or structural laws governing the world—laws governing the formal (structural) features of objects, or their formal behavior. This notion of a formal law is an objectual notion, and it is closely related to the objectual conception of a mathematical law in the philosophy of mathematics. As a result, our solution forges a close link between logic and mathematics, although a link quite different from those forged by earlier philosophers, be they logicists, intuitionists, or (Hilbertian) formalists.³

One way to arrive at our solution is to start with the basic schema connecting logic to reality through truth:

(11) (Level of Logic)      Γ logically implies S
                           ⇓⇑
     (Level of Truth)      T(Γ) ⇛ T(S)
                           ⇓⇑
     (Level of Reality)    𝔖Γ ➡ 𝔖S.
The upward arrows represent a strong-guarantee relation, and the downward arrows represent dependence. We know that ➡ is a strong necessitation relation, so we can treat it as representing a law. Our question is: What type of law connecting objects and structures of objects in the world is sufficiently and appropriately strong to ground logical consequence?

To answer this question we need to have some idea about the special strength of logical consequence. One way to go about this is to turn to the common attributes of logic. The traits most commonly attributed to logic are necessity, great generality, formality, apriority, analyticity, topic neutrality, strong normative force, and certainty. Not all of these are compatible with our conclusion that logic is substantially (if not exclusively) grounded in reality. In particular, analyticity is incompatible with it. Thus, one significant result of our view is rejection of the claim that logic is (fully) analytic. Another attribute that stands in some tension with our approach is apriority, but here the situation is more nuanced. While traditional or absolute apriority is incompatible with our holistic approach to logic, the so-called new or relative a priori (see, e.g., Boghossian and Peacocke 2000) is compatible with it. In particular, the view that reason, as distinct from sensory experience, plays a central (if not an exclusive) role in logic is concordant with our approach. We are left, then, with the attributes of necessity, formality, relative apriority, topic neutrality, strong normative force, great generality, and certainty.⁴

Our question now becomes: Is there a type of (objectual) law that has all these attributes? Examining the (objectual) laws characteristic of different sciences, we see that there is one type of law that does: mathematical law, and in particular "basic" mathematical law. A few examples of such laws are the laws of complementation, join, meet, identity, and finite cardinality.

³ My objectual view of mathematical law is similar to that of contemporary structuralists, but in viewing mathematics as closely related to logic I am closer to the logicists, intuitionists, and formalists than to most structuralists. One structuralist who allows a close relation between logic and mathematics is Shapiro (1997), but even he prefers to leave this issue an open question.
⁴ In saying that logic is intuitively "certain" I do not mean to say that we cannot err in our logical inferences. Rather, I mean to say that when we do not err, our logical inferences can be relied on without reservation, unlike, say, our inductive inferences.
These laws are necessary, formal, relatively a priori, and highly certain; they are also highly general and topic neutral (they have an especially wide range of applications); and their normative force is stronger than that of most other laws (in the sense that generally laws of other fields—say, biological laws—have to obey them, but not vice versa).

To show that this connection is not ad hoc, let us show how it develops from an independent characterization of the formal. Sharpening our intuitions by means of an example, let us ask: What formal law is the logical consequence

(12) (∃x)(Px & Qx) ⊨L (∃x)Px

grounded in? Our earlier discussion suggested the law:

(13) nonemptiness of P∩Q ➡ nonemptiness of P,

where P and Q are any properties of individuals. Now consider (a correct statement about) the failure of a purported logical consequence:

(14) (∃x)(Px ∨ Qx) ⊭L (∃x)Px.

Where does the failure of this consequence lie?—A natural answer is: In the fact that

(15) NOT: [nonemptiness of P∪Q ➡ nonemptiness of P].

These examples suggest that formal laws are laws governing such features of objects as meet, join, and cardinality. How can we characterize these features? Historically, one way in which philosophers have characterized the formality of logic is in terms of not distinguishing differences between, or particular features of, objects (Kant 1781/1787; Frege 1879), or being unaffected by 1–1 replacements of objects (Tarski 1936). To be more specific, we suggest the following criterion of formality (a generalization of Tarski 1966):

(F) To be formal is not to distinguish any replacement of objects (properties, relations, functions) induced by 1–1 replacements of individuals within and across any domains.

Using the resources of modern mathematics (which we are licensed to do by our holism), we can reformulate (F) as an Invariance Criterion:

(IF) To be formal is to be invariant under isomorphisms of structures,

where a structure is an n+1-tuple, ⟨A, β1, . . . , βn⟩, such that A is a nonempty set of objects viewed as individuals, and for 1 ≤ i ≤ n, βi is either a member/subset/relation/function of (on) A, or a set/relation/function of subsets/relations/functions (plus, possibly, individuals) of A, etc.

To see how this criterion works in a particular case, take the feature of nonemptiness. This feature is (objectually) formal since for any (objectual) structures ⟨A, B⟩ and ⟨A′, B′⟩, where A and A′ are as above and B and B′ are any subsets of A and A′ respectively (representing possible arguments of "X is nonempty" in ⟨A, B⟩ and ⟨A′, B′⟩), the following holds:

(16) [Nonempty(B) & ⟨A, B⟩ ≅ ⟨A′, B′⟩] ➡ Nonempty(B′).
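The invariance criterion (IF) can be checked mechanically on small finite structures. The following sketch is my illustration, not the paper's; for simplicity it considers only isomorphisms of ⟨A, B⟩ onto structures with the same domain (i.e., permutations of A), which already separates a formal feature from a non-formal one, in the spirit of (16).

```python
from itertools import permutations

def images(A, B):
    """All images of B under 1-1 replacements (here: permutations) of A."""
    items = sorted(A)
    for perm in permutations(items):
        f = dict(zip(items, perm))  # a 1-1 replacement of individuals
        yield {f[x] for x in B}

A, B = {0, 1, 2}, {0}

# Nonemptiness is formal: it survives every induced replacement, as in (16).
print(all(bool(B2) == bool(B) for B2 in images(A, B)))    # True

# "Contains the individual 0" is not formal: a replacement can disturb it.
print(all((0 in B2) == (0 in B) for B2 in images(A, B)))  # False
```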
What kind of law is responsible for logical consequences involving the formal property of nonemptiness? (E.g., what kind of law is (13)?) Well, a mathematical law that can be formulated by various theories within mathematics (e.g., ZFC—Zermelo–Fraenkel set theory with the axiom of choice). It is easy to see that all logical consequences of classical logic are supported by laws of this kind, and all failures of consequence according to this logic can be attributed to a violation of such laws. (In the case of logical consequences of classical sentential logic, formality is bivalence, and the relevant formal laws are formulated by a 2-element Boolean algebra; see the sketch following the list below. In the case of nonclassical logic, the formal laws are given by nonclassical mathematical theories.) This gives us a direct connection between logic and mathematics.

Our characterization of logic in terms of formality is further supported by the consideration that formality (in our sense) entails necessity, generality, topic-neutrality, and quasi-apriority, while none of these attributes entails formality. Take, for example, quasi-apriority. Due to their strong invariance, the formal properties, and hence the laws governing them, are indifferent to most variations between structures of objects, including all variations concerning empirical differences between structures. This, in turn, means that formal laws are not affected by most discoveries made in the sciences, hence are quasi-a priori. (Being holists, we do allow that some theoretical considerations concerning science may suggest a change in formal laws. But for the most part these laws are not affected by the vicissitudes of science.) Similar considerations show that formality entails necessity, generality, topic neutrality, and certainty (see Sher 1999, 2001, 2008). Therefore, by characterizing logic in terms of formality we capture many of its traditional attributes.

All these considerations suggest the following picture of the grounding of logic:

A. Logic provides a method of inference based on the formal laws governing the behavior of objects in the world. This it does by taking parts of language that are dedicated to the formal, holding them "fixed", and integrating them into the syntactic structure of our language as logical constants.
B. Logical constants, on this conception, denote formal properties (relations, functions) of objects. Logical form represents the formal structure of whatever given sentences claim to be the case. (For example, the logical form of "(∃x)(Planet x & Frozen x)" represents a nonempty intersection.)
C. There is a division of labor between logic and mathematics concerning the formal: the formal is studied in mathematics and employed as a basis for a wholesale method of inference in logic.
D. Logic and mathematics are developed in tandem. Starting with some basic formal principles, a basic mathematics is used to develop a logic which, in turn, is used to develop a more sophisticated mathematics. This mathematics is used to develop a still more sophisticated logic, and so on. Although our account is functional rather than historical, the mutually sustaining development of modern logic and set theory exemplifies our conception.
E. In developing logic using mathematical resources, several courses of action are open to us. At the two extremes we have opposing methodological principles: (i) Minimize, and (ii) Maximize. The former leads to the construction of minimal logics (e.g., standard 1st-order logic); the latter to the construction of maximal logics (e.g., generalized 1st-order logics and full 2nd-order logic). Taking the Minimize route (and limiting ourselves to classical logics), our criterion of formality—IF (plus the Boolean criterion for sentential operators)—is understood as a constraint on logicality, i.e., as a necessary condition on the choice of logical operators. Taking the Maximize route, this criterion becomes a full criterion of logicality, i.e., a necessary and sufficient condition on the choice of logical operators. (A non-classical conception of the formal structure of reality will lead to appropriate adjustments in IF and the Boolean criterion and, consequently, to a choice of non-classical logical operators.)
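The parenthetical remark about sentential logic above can be made concrete with a small check—my sketch, not the paper's: classical sentential consequence is decided by computation in the 2-element Boolean algebra {0, 1}.

```python
from itertools import product

def consequence(premise, conclusion, atoms):
    """Premise ⊨ conclusion iff no valuation into {0, 1} gives premise 1, conclusion 0."""
    valuations = (dict(zip(atoms, bits))
                  for bits in product([0, 1], repeat=len(atoms)))
    return all(conclusion(v) for v in valuations if premise(v))

atoms = ["p", "q"]
# p & q ⊨ p: sanctioned by the algebra (the meet of x and y is ≤ x).
print(consequence(lambda v: v["p"] and v["q"], lambda v: v["p"], atoms))  # True
# p ∨ q ⊭ p: the algebra refutes it (take p = 0, q = 1).
print(consequence(lambda v: v["p"] or v["q"], lambda v: v["p"], atoms))   # False
```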
Much work remains to be done in developing our conception of logic as grounded both in the mind and in reality. For a further elaboration of (A)–(E), as well as some of the other themes developed in this paper, see Sher (1991, 1996, 1999, 2001, 2002, 2003, 2008).

Open Access  This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References

Boghossian, P., & Peacocke, C. (Eds.). (2000). New essays on the a priori. Oxford: Oxford University Press.
Dummett, M. (1973). Frege: Philosophy of language. New York: Harper & Row.
Frege, G. (1879). Begriffsschrift: A formula language, modeled upon that of arithmetic, for pure thought. In J. van Heijenoort (Ed.), From Frege to Gödel (pp. 5–82). Cambridge, MA: Harvard University Press, 1967.
Kant, I. (1781/1787). Critique of pure reason (1st and 2nd ed., Trans. P. Guyer & A. W. Wood). Cambridge: Cambridge University Press, 1998.
Prior, A. N. (1960). The runabout inference-ticket. Analysis, 21, 38–39.
Shapiro, S. (1991). Foundations without foundationalism: A case for second-order logic. Oxford: Oxford University Press.
Shapiro, S. (1997). Philosophy of mathematics: Structure & ontology. New York: Oxford University Press.
Sheffer, H. M. (1926). Review of Principia mathematica by Whitehead, A. N., & Russell, B. Isis, 8, 226–231.
Sher, G. (1991). The bounds of logic: A generalized viewpoint. Cambridge, MA: MIT Press.
Sher, G. (1996). Did Tarski commit 'Tarski's fallacy'? Journal of Symbolic Logic, 61, 653–686.
Sher, G. (1999). Is logic a theory of the obvious? European Review of Philosophy, 4, 207–238.
Sher, G. (2001). The formal-structural view of logical consequence. Philosophical Review, 110, 241–261.
Sher, G. (2002). Logical consequence: An epistemic outlook. Monist, 85, 555–579.
Sher, G. (2003). A characterization of logical constants is possible. Theoria, 18, 189–197.
Sher, G. (2008). Tarski's thesis. In D. Patterson (Ed.), New essays on Tarski and philosophy (pp. 300–339). Oxford: Oxford University Press.
Tarski, A. (1936). On the concept of logical consequence. In J. Corcoran (Ed.), Logic, semantics, metamathematics (J. H. Woodger, Trans., 2nd ed. 1983, pp. 409–420). Indianapolis: Hackett.
Tarski, A. (1966). What are logical notions? In J. Corcoran (Ed.), History and Philosophy of Logic, 7 (1986), 143–154.
Wittgenstein, L. (1921). Tractatus logico-philosophicus (Pears & McGuinness, Trans., 1961). London: Routledge & Kegan Paul.